In universities, sometimes simple questions aren't simple

By: cks
29 March 2025 at 02:13

Over on the Fediverse I shared a recent learning experience:

Me, an innocent: "So, how many professors are there in our university department?"
Admin person with a thousand yard stare: "Well, it depends on what you mean by 'professor', 'in', and 'department'." <unfolds large and complicated chart>

In many companies and other organizations, the status of people is usually straightforward. In a university, things are quite often not so clear, and in my department all three words in my joke are in fact not a joke (although you could argue that two overlap).

For 'professor', there are a whole collection of potential statuses beyond 'tenured or tenure stream'. Professors may be officially retired but still dropping by to some degree ('emeritus'), appointed only for a limited period (but doing research, not just teaching), hired as sessional instructors for teaching, given a 'status-only' appointment, and other possible situations.

(In my university, there's such a thing as teaching stream faculty, who are entirely distinct from sessional instructors. In other universities, all professors are what we here would call 'research stream' professors and do research work as well as teaching.)

For 'in', even once you have a regular full time tenure stream professor, there's a wide range of possibilities for a professor to be cross appointed (also) between departments (or sometimes 'partially appointed' by two departments). These sorts of multi-department appointments are done for many reasons, including to enable a professor in one department to supervise graduate students in another one. How much of the professor's salary each department pays varies, as does where the professor actually does their research and what facilities they use in each department.

(Sometimes a multi-department professor will be quite active in both departments because their core research is cross-disciplinary, for example.)

For 'department', this is a local peculiarity in my university. We have three campuses, and professors are normally associated with a specific campus. Depending on how you define 'the department', you might or might not consider Computer Science professors at the satellite campuses to be part of the (main campus) department. Sometimes it depends on what the professors opt to do, for example whether or not they will use our main research computing facilities, or whether they'll be supervising graduate students located at our main campus.

Which answers you want for all of these depends on what you're going to use the resulting number (or numbers) for. There is no singular and correct answer for 'how many professors are there in the department'. The corollary to this is that any time we're asked how many professors are in our department, we have to quiz the people asking about what parts matter to them (or guess, or give complicated and conditional answers, or all of the above).

(Asking 'how many professor FTEs do we have' isn't any better.)

PS: If you think this complicates the life of any computer IAM system that's trying to be a comprehensive source of answers, you would be correct. Locally, my group doesn't even attempt to track these complexities and instead has a much simpler view of things that works well enough for our purposes (mostly managing Unix accounts).

US sanctions and your VPN (and certain big US-based cloud providers)

By: cks
28 March 2025 at 02:43

As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which lawyer the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).

Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.

(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)

Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard the resulting failures were to diagnose. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.

The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.

If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).

PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).

(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)

Three ways I know of to authenticate SSH connections with OIDC tokens

By: cks
27 March 2025 at 02:56

Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.

(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)

The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).

Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting up a SSH certificate authority and having the CA software issue signed, short-lived SSH client certificates in exchange for OIDC authentication from your OP. With client side software, these client certificates will be automatically set up for use by ssh, and on servers all you need is to trust your SSH CA. I believe you could even set this up for personal use on servers you SSH to, since you set up a personally trusted SSH CA. On the positive side, this requires minimal server changes and no extra server software, and preserves your ability to directly authenticate with passwords (and perhaps some MFA challenge). On the negative side, you now have a SSH CA you have to trust.

(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)
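
For illustration, the server side of the SSH CA approach is just a line or two of sshd configuration. This is a minimal sketch; the CA key path and the principals file location here are made up, not taken from Smallstep's instructions:

# /etc/ssh/sshd_config (fragment)
# Accept user certificates signed by this CA's public key.
TrustedUserCAKeys /etc/ssh/user_ca.pub
# Optionally control which certificate principals may log in as which user:
#AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u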

The third option, which I've only recently become aware of, is Cloudflare's recently open-sourced 'opkssh' (via, Github). OPKSSH builds on something called OpenPubkey, which uses a clever trick to embed a public key you provide in (signed) OIDC tokens from your OP (for details see here). OPKSSH uses this to put a basically regular SSH public key into such an augmented OIDC token, then smuggles it from the client to the server by embedding the entire token in a SSH (client) certificate; on the server, it uses an AuthorizedKeysCommand to verify the token, extract the public key, and tell the SSH server to use the public key for verification (see How it works for more details). If you want, as far as I can see OPKSSH still supports using regular SSH public keys and also passwords (possibly plus an MFA challenge).

(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)
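
To give a concrete picture of the AuthorizedKeysCommand approach, the server configuration looks roughly like the following sketch. The verifier path, its arguments, and the dedicated user are made up here; check the OPKSSH documentation for the real invocation:

# /etc/ssh/sshd_config (fragment); the command path and arguments are hypothetical
# %u is the login name, %k the base64 key presented by the client, %t the key type
AuthorizedKeysCommand /usr/local/bin/opkssh verify %u %k %t
AuthorizedKeysCommandUser opksshuser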

It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.

(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)

How we handle debconf questions during our Ubuntu installs

By: cks
26 March 2025 at 02:37

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However, in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.
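
For illustration, debconf selections are one per line in 'package question type value' form, and we load the file before any package installs happen. The specific selection and file name here are just examples, not necessarily what's in our real file:

# one selection per line: <package> <question> <type> <value>
libc6 libraries/restart-without-asking boolean true

# feed the whole file to debconf before installing packages:
debconf-set-selections < ubuntu-2404-debconf-selections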

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than what we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks
25 March 2025 at 02:02

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks
24 March 2025 at 02:34

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

How we automate installing extra packages during Ubuntu installs

By: cks
23 March 2025 at 02:52

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to more carefully track which package came from which Ubuntu collection, since only some of them are enabled during the server install process; it would be harder to update the lists; and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks
22 March 2025 at 03:24

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also, which doesn't talk about KDE fonts because I never figured that out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

QT (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these; I think I'm unlikely to run into that, since I don't want to use KDE's desktop components).
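
As a concrete illustration (the binary name and the scale factor are assumptions; pick whatever looks right on your display):

# scale all of NeoChat's Qt UI, fonts included, up by 1.5x for this run
QT_SCALE_FACTOR=1.5 neochat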

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Go's choice of multiple return values was the simpler option

By: cks
21 March 2025 at 02:56

Yesterday I wrote about Go's use of multiple return values and Go types, in reaction to Mond's Were multiple return values Go's biggest mistake?. One of the things that I forgot to mention in that entry is that I think Go's choice to have multiple values for function returns and a few other things was the simpler and more conservative approach in its overall language design.

In a statically typed language that expects to routinely use multiple return values, as Go was designed to with the 'result, error' pattern, returning multiple values as a typed tuple means that tuple-based types are pervasive. This creates pressures on both the language design and the API of the standard library, especially if you start out (as Go did) being a fairly strongly nominally typed language, where different names for the same concrete type can't be casually interchanged. Or to put it another way, having a frequently used tuple container (meta-)type significantly interacts with and affects the rest of the language.

(For example, if Go had handled multiple values through tuples as explicit typed entities, it might have had to start out with something like type aliases (added only in Go 1.9) and it might have been pushed toward some degree of structural typing, because that probably makes it easier to interact with all of the return value tuples flying around.)

Having multiple values as a special case for function returns, range, and so on doesn't create anywhere near this additional influence and pressure on the rest of the language. There are a whole bunch of questions and issues you don't face because multiple values aren't types and can't be stored or manipulated as single entities. Of course you have to be careful in the language specification and it's not trivial, but it's simpler and more contained than going the tuple type route. I also feel it's the more conservative approach, since it doesn't affect the rest of the language as much as a widely used tuple container type would.

(As Mond criticizes, it does create special cases. But Go is a pragmatic language that's willing to live with special cases.)

Go's multiple return values and (Go) types

By: cks
20 March 2025 at 03:31

Recently I read Were multiple return values Go's biggest mistake? (via), which wishes that Go had full blown tuple types (to put my spin on it). One of the things that struck me about Go's situation when I read the article is exactly the inverse of what the article is complaining about, which is that because Go allows multiple values for function return types (and in a few other places), it doesn't have to have tuple types.

One problem with tuple types in a statically typed language is that they must exist as types, whether declared explicitly or implicitly. In a language like Go, where type definitions create new distinct types even if the structure is the same, it isn't particularly difficult to wind up with an ergonomics problem. Suppose that you want to return a tuple that is a net.Conn and an error, a common pair of return values in the net package today. If that tuple is given a named type, everyone must use that type in various places; merely returning or storing an implicitly declared type that's structurally the same is not acceptable under Go's current type rules. Conversely, if that tuple is not given a type name in the net package, everyone is forced to stick to an anonymous tuple type. In addition, this up front choice is now an API; it's not API compatible to give your previously anonymous tuple type a name or vice versa, even if the types are structurally compatible.

(Since returning something and error is so common an idiom in Go, we're also looking at either a lot of anonymous types or a lot more named types. Consider how many different combinations of multiple return values you find in the net package alone.)

One advantage of multiple return values (and the other forms of tuple assignment, and for range clauses) is that they don't require actual formal types. Functions have a 'result type', which doesn't exist as an actual type, but Go already needed to handle the same sort of 'not an actual type' thing for their 'parameter type'. My guess is that this let Go's designers skip a certain amount of complexity in Go's type system, because they didn't have to define an actual tuple (meta-)type or alternately expand how structs worked to cover the tuple usage case.

(Looked at from the right angle, structs are tuples with named fields, although then you get into questions of how nested structs act in tuple-like contexts.)

A dynamically typed language like Python doesn't have this problem because there are no explicit static types, so there's no need to have different types for different combinations of (return) values. There's simply a general tuple container type that can be any shape you want or need, and can be created and destructured on demand.

(I assume that some statically typed languages have worked out how to handle tuples as a data type within their type system. Rust has tuples, for example; I haven't looked into how they work in Rust's type system, for reasons.)

How ZFS knows and tracks the space usage of datasets

By: cks
19 March 2025 at 02:44

Anyone who's ever had to spend much time with 'zfs list -t all -o space' knows the basics of ZFS space usage accounting, with space used by the datasets, data unique to a particular snapshot (the 'USED' value for a snapshot), data used by snapshots in total, and so on. But today I discovered that I didn't really know how it all worked under the hood, so I went digging in the source code. The answer is that ZFS tracks all of these types of space usage directly as numbers, and updates them as blocks are logically freed.

(Although all of these are accessed from user space as ZFS properties, they're not conventional dataset properties; instead, ZFS materializes the property version any time you ask, from fields in its internal data structures. Some of these fields are different and accessed differently for snapshots and regular datasets, for example what 'zfs list' presents as 'USED'.)

All changes to a ZFS dataset happen in a ZFS transaction (group), which are assigned ever increasing numbers, the 'transaction group number(s)' (txg). This includes allocating blocks, which remember their 'birth txg', and making snapshots, which carry the txg they were made in and necessarily don't contain any blocks that were born after that txg. When ZFS wants to free a block in the live filesystem (either because you deleted the object or because you're writing new data and ZFS is doing its copy on write thing), it looks at the block's birth txg and the txg of the most recent snapshot; if the block is old enough that it has to be in that snapshot, then the block is not actually freed and the space for the block is transferred from 'USED' (by the filesystem) to 'USEDSNAP' (used only in snapshots). ZFS will then further check the block's txg against the txgs of snapshots to see if the block is unique to a particular snapshot, in which case its space will be added to that snapshot's 'USED'.

ZFS goes through a similar process when you delete a snapshot. As it runs around trying to free up the snapshot's space, it may discover that a block it's trying to free is now used only by one other snapshot, based on the relevant txgs. If so, the block's space is added to that snapshot's 'USED'. If the block is freed entirely, ZFS will decrease the 'USEDSNAP' number for the entire dataset. If the block is still used by several snapshots, no usage numbers need to be adjusted.

(Determining if a block is unique in the previous snapshot is fairly easy, since you can look at the birth txgs of the two previous snapshots. Determining if a block is now unique in the next snapshot (or for that matter is still in use in the dataset) is more complex and I don't understand the code involved; presumably it involves somehow looking at what blocks were freed and when. Interested parties can look into the OpenZFS code themselves, where there are some surprises.)

PS: One consequence of this is that there's no way after the fact to find out when space shifted from being used by the filesystem to used by snapshots (for example, when something large gets deleted in the filesystem and is now present only in snapshots). All you can do is capture the various numbers over time and then look at your historical data to see when they changed. The removal of snapshots is captured by ZFS pool history, but as far as I know this doesn't capture how the deletion affected the various space usage numbers.

I don't think error handling is a solved problem in language design

By: cks
18 March 2025 at 02:53

There are certain things about programming language design that are more or less solved problems, where we generally know what the good and bad approaches are. For example, over time we've wound up agreeing on various common control structures like for and while loops, if statements, and multi-option switch/case/etc statements. The syntax may vary (sometimes very much, as for example in Lisp), but the approach is more or less the same because we've come up with good approaches.

I don't believe this is the case with handling errors. One way to see this is to look at the wide variety of approaches and patterns that languages today take to error handling. There is at least 'errors as exceptions' (for example, Python), 'errors as values' (Go and C), and 'errors instead of results and you have to check' combined with 'if errors happen, panic' (both Rust). Even in Rust there are multiple idioms for dealing with errors; some Rust code will explicitly check its Result types, while other Rust code sprinkles '?' around and accepts that if the program sails off the happy path, it simply dies.

If you were creating a new programming language from scratch, there's no clear agreed answer to what error handling approach you should pick, not the way we have more or less agreed on how for, while, and so on should work. You'd be left to evaluate trade offs in language design and language ergonomics and to make (and justify) your choices, and there probably would always be people who think you should have chosen differently. The same is true of changing or evolving existing languages, where there's no generally agreed on 'good error handling' to move toward.

(The obvious corollary of this is that there's no generally agreed on keywords or other syntax for error handling, the way 'for' and 'while' are widely accepted as keywords as well as concepts. The closest we've come is that some forms of error handling have generally accepted keywords, such as try/catch for exception handling.)

I like to think that this will change at some point in the future. Surely there actually is a good pattern for error handling out there and at some point we will find it (if it hasn't already been found) and then converge on it, as we've converged on programming language things before. But I feel it's clear that we're not there yet today.

OIDC claim scopes and their interactions with OIDC token authentication

By: cks
17 March 2025 at 02:31

When I wrote about how SAML and OIDC differed in sharing information, where SAML shares every SAML 'attribute' by default and OIDC has 'scopes' for its 'claims', I said that the SAML approach was probably easier within an organization, where you already have trust in the clients. It turns out that there's an important exception to this I didn't realize at the time, and that's when programs (like mail clients) are using tokens to authenticate to servers (like IMAP servers).

In OIDC/OAuth2 (and probably in SAML as well), programs that obtain tokens can open them up and see all of the information that they contain, either inspecting them directly or using a public OIDC endpoint that allows them to 'introspect' the token for additional information (this is the same endpoint that will be used by your IMAP server or whatever). Unless you enjoy making a bespoke collection of (for example) IMAP clients, the information that programs need to obtain tokens is going to be more or less public within your organization and will probably (or even necessarily) leak outside of it.

(For example, you can readily discover all of the OIDC client IDs used by Thunderbird for the various large providers it supports. There's nothing stopping you from using those client IDs and client secrets yourself, although large providers may require your target to have specifically approved using Thunderbird with your target's accounts.)

This means that anyone who can persuade your people to authenticate through a program's usual flow can probably extract all of the information available in the token. They can do this either on the person's computer (capturing the token locally) or by persuading people that they need to 'authenticate to this service with IMAP OAuth2' or the like and then extracting the information from the token.

In the SAML world, this will by default be all of the information contained in the token. In the OIDC world, you can restrict the information made available through tokens issued through programs by restricting the scopes that you allow programs to ask for (and possibly different scopes for different programs, although this is a bit fragile; attackers may get to choose which program's client ID and so on they use).

(Realizing this is going to change what scopes we allow in our OIDC IdP for program client registrations. So far I had reflexively been giving them access to everything, just like our internal websites; now I think I'm going to narrow it down to almost nothing.)

Sidebar: How your token-consuming server knows what created them

When your server verifies OAuth2/OIDC tokens presented to it, the minimum thing you want to know is that they come from the expected OIDC identity provider, which is normally achieved automatically because you'll ask that OIDC IdP to verify that the token is good. However, you may also want to know that the token was specifically issued for use with your server, or through a program that's expected to be used for your server. The normal way to do this is through the 'aud' OIDC claim, which has at least the client ID (and in theory your OIDC IdP could add additional entries). If your OIDC IdP can issue tokens through multiple identities (perhaps to multiple parties, such as the major IdPs of, for example, Google and Microsoft), you may also want to verify the 'iss' (issuer) field instead of or in addition to 'aud'.

Some notes on the OpenID Connect (OIDC) 'redirect uri'

By: cks
16 March 2025 at 02:57

The normal authentication process for OIDC is web-based and involves a series of HTTP redirects, interspersed with web pages that you interact with. Something that wants to authenticate you will redirect you to the OIDC identity server's website, which will ask you for your login and password and maybe MFA authentication, check them, and then HTTP redirect you back to a 'callback' or 'redirect' URL that will transfer a magic code from the OIDC server to the OIDC client (generally as a URL query parameter). All of this happens in your browser, which means that the OIDC client and server don't need to be able to directly talk to each other, allowing you to use an external cloud/SaaS OIDC IdP to authenticate to a high-security internal website that isn't reachable from the outside world and maybe isn't allowed to make random outgoing HTTP connections.

(The magic code transferred in the final HTTP redirect is apparently often not the authentication token itself but instead something the client can use for a short time to obtain the real authentication token. This does require the client to be able to make an outgoing HTTP connection, which is usually okay.)
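
As a rough sketch of what passes through your browser, with all of the hostnames, paths, and values made up, the two redirects look something like this:

# the client sends your browser to the IdP's authorization endpoint:
https://idp.example.org/authorize?response_type=code&client_id=CLIENT_ID&redirect_uri=https%3A%2F%2Fapp.example.org%2Fcallback&scope=openid+email&state=RANDOM

# after you log in, the IdP sends your browser back to the client's redirect uri:
https://app.example.org/callback?code=MAGIC_CODE&state=RANDOM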

When the OIDC client initiates the HTTP redirection to the OIDC IdP server, one of the parameters it passes along is the 'redirect uri' it wants the OIDC server to use to pass the magic code back to it. A malicious client (or something that's gotten a client's ID and secret) could do some mischief by manipulating this redirect URL, so the standard specifically requires that OIDC IdP have a list of allowed redirect uris for each registered client. The standard also says that in theory, the client's provided redirect uri and the configured redirect uris are compared as literal string values. So, for example, 'https://example.org/callback' doesn't match 'https://example.org/callback/'.

This is straightforward when it comes to websites as OIDC clients, since they should have well defined callback urls that you can configure directly into your OIDC IdP when you set up each of them. It gets more hairy when what you're dealing with is programs as OIDC clients, where they are (for example) trying to get an OIDC token so they can authenticate to your IMAP server with OAuth2, since these programs don't normally have a website. Historically, there are several approaches that people have taken for programs (or seem to have, based on my reading so far).

Very early on in OAuth2's history, people apparently defined the special redirect uri value 'urn:ietf:wg:oauth:2.0:oob' (which is now hard to find or identify documentation on). An OAuth2 IdP that saw this redirect uri (and maybe had it allowed for the client) was supposed to not redirect you but instead show you a HTML page with the magic OIDC code displayed on it, so you could copy and paste the code into your local program. This value is now obsolete but it may still be accepted by some IdPs (you can find it listed for Google in mutt_oauth2.py, and I spotted an OIDC IdP server that handles it).

Another option is that the IdP can provide an actual website that does the same thing; if you get HTTP redirected to it with a valid code, it will show you the code on a HTML page and you can copy and paste it. Based on mutt_oauth2.py again, it appears that Microsoft may have at one point done this, using https://login.microsoftonline.com/common/oauth2/nativeclient as the page. You can do this too with your own IdP (or your own website in general), although it's not recommended for all sorts of reasons.

The final broad approach is to use 'localhost' as the target host for the redirect. There are several ways to make this work, and one of them runs into complications with the IdP's redirect uri handling.

The obvious general approach is for your program to run a little HTTP server that listens on some port on localhost, and capture the code when the (local) browser gets the HTTP redirect to localhost and visits the server. The problem here is that you can't necessarily listen on port 80, so your redirect uri needs to include the port you're listening on (eg 'http://localhost:7000'), and if your OIDC IdP is following the standard it must be configured not just with 'http://localhost' as the allowed redirect uri but the specific port you'll use. Also, because of string matching, if the OIDC IdP lists 'http://localhost:7000', you can't send 'http://localhost:7000/' despite them being the same URL.

(And your program has to use 'localhost', not '127.0.0.1' or the IPv6 loopback address; although the two have the same effect, they're obviously not string-identical.)
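
To make the literal string comparison concrete, under a strictly standards-following IdP every one of the following is a different redirect uri, and only the exact string registered for the client will be accepted:

http://localhost:7000
http://localhost:7000/
http://localhost:8080
http://127.0.0.1:7000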

Based on experimental evidence from OIDC/OAuth2 client configurations, I strongly suspect that some large IdP providers have non-standard, relaxed handling of 'localhost' redirect uris such that their client configuration lists 'http://localhost' and the IdP will accept some random port glued on in the actual redirect uri (or maybe this behavior has been standardized now). I suspect that the IdPs may also accept the trailing slash case. Honestly, it's hard to see how you get out of this if you want to handle real client programs out in the wild.

(Some OIDC IdP software definitely does the standard compliant string comparison. The one I know of for sure is SimpleSAMLphp's OIDC module. Meanwhile, based on reading the source code, Dex uses a relaxed matching for localhost in its matching function, provided that there are no redirect uris registered for the client. Dex also still accepts the urn:ietf:wg:oauth:2.0:oob redirect uri, so I suspect that there are still uses out there in the field.)

If the program has its own embedded web browser that it's in full control of, it can do what Thunderbird appears to do (based on reading its source code). As far as I can tell, Thunderbird doesn't run a local listening server; instead it intercepts the HTTP redirection to 'http://localhost' itself. When the IdP sends the final HTTP redirect to localhost with the code embedded in the URL, Thunderbird effectively just grabs the code from the redirect URL in the HTTP reply and never actually issues a HTTP request to the redirect target.

The final option is to not run a localhost HTTP server and to tell people running your program that when their browser gives them an 'unable to connect' error at the end of the OIDC authentication process, they need to go to the URL bar and copy the 'code' query parameter into the program (or if you're being friendly, let them copy and paste the entire URL and you extract the code parameter). This allows your program to use a fixed redirect uri, including just 'http://localhost', because it doesn't have to be able to listen on it or on any fixed port.

(This is effectively a more secure but less user friendly version of the old 'copy a code that the website displayed' OAuth2 approach, and that approach wasn't all that user friendly to start with.)

PS: An OIDC redirect uri apparently allows things other than http:// and https:// URLs; there is, for example, the 'openid-credential-offer' scheme. I believe that the OIDC IdP doesn't particularly do anything with those redirect uris other than accept them and issue a HTTP redirect to them with the appropriate code attached. It's up to your local program or system to intercept HTTP requests for those schemes and react appropriately, much like Thunderbird does, but perhaps easier because you can probably register the program as handling all 'whatever-special://' URLs so the redirect is automatically handed off to it.

(I suspect that there are more complexities in the whole OIDC and OAuth2 redirect uri area, since I'm new to the whole thing.)

Some notes on configuring Dovecot to authenticate via OIDC/OAuth2

By: cks
15 March 2025 at 03:01

Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.

If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
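
For example, with a made up IdP hostname, fetching and pretty-printing the discovery document is just:

curl -s https://idp.example.org/.well-known/openid-configuration | jq .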

You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.

(Since I've just peered deep into the Dovecot source code, it appears that setting 'introspection_mode = post' actually performs an (unauthenticated) token introspection request. The 'get' mode seems to be the same as setting tokeninfo_url. I think that if you set the 'post' mode, you also want to set active_attribute and perhaps active_value, but I don't know what to, because otherwise you aren't necessarily fully validating that the token is still active. Does my head hurt? Yes. The moral here is that you should use an OIDC IdP that supports OpenID Connect UserInfo.)

If your IdP serves different groups and provides different 'issuer' ('iss') values to them, you may want to set the Dovecot 'issuers =' to the specific issuer that applies to you. You'll also want to set 'username_attribute' to whatever OIDC claim is where your IdP puts what you consider the Dovecot username, which might be the email address or something else.

It would be nice if Dovecot could discover all of this for itself when you set openid_configuration_url, but in the current Dovecot, all this does is put that URL in the JSON of the error response that's sent to IMAP clients when they fail OAUTHBEARER authentication. IMAP clients may or may not do anything useful with it.

As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes, it doesn't verify the presence of specific OIDC claims.

So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:

# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>

# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}

# optionally:
openid_configuration_url = <stick in the URL>

# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt

The OIDC scopes that IMAP clients should request when getting tokens should include a scope that gives the username_attribute claim, which is 'email' if the claim is 'email', and also apparently the requested scopes should include the offline_access scope.

If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.

(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)

Doing multi-tag matching through URLs on the modern web

By: cks
14 March 2025 at 02:46

So what happened is that Mike Hoye had a question about a perfectly reasonable idea:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just as valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking whether the tags existed or anything, just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.
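
To make this concrete, here's a minimal sketch of the canonicalization and redirect in a Flask-style handler (the URL layout, the is_tag() check, and the page rendering are all made up for illustration):

# A sketch only; assumes Flask and a hypothetical is_tag() check.
from flask import Flask, abort, redirect

app = Flask(__name__)

def is_tag(name):
    # Placeholder test for 'this looks like a tag'.
    return name.isalnum()

def render_results(tags):
    # Placeholder for doing the search and generating the page.
    return "articles tagged " + " and ".join(tags)

@app.route("/wiki/<path:tags>")
def tag_search(tags):
    parts = tags.split("/")
    if not parts or not all(is_tag(t) for t in parts):
        abort(404)
    canonical = sorted(set(parts))      # one possible canonical order
    if parts != canonical:
        # Permanently redirect non-canonical orderings (eg bar/foo).
        return redirect("/wiki/" + "/".join(canonical), code=301)
    return render_results(canonical)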

(And if search engines still matter, the search engines will see only your canonical version.)

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

The commodification of desktop GUI behavior

By: cks
13 March 2025 at 03:08

Over on the Fediverse, I tried out a thesis:

Thesis: most desktop GUIs are not opinionated about how you interact with things, and this is why there are so many GUI toolkits and they make so little difference to programs, and also why the browser is a perfectly good cross-platform GUI (and why cross-platform GUIs in general).

Some GUIs are quite opinionated (eg Plan 9's Acme) but most are basically the same. Which isn't necessarily a bad thing but it creates a sameness.

(Custom GUIs are good for frequent users, bad for occasional ones.)

Desktop GUIs differ in how they look and to some extent in how you do certain things and how you expect 'native' programs to behave; I'm sure the fans of any particular platform can tell you all about little behaviors that they expect from native applications that imported ones lack. But I think we've pretty much converged on a set of fundamental behaviors for how to interact with GUI programs, or at least how to deal with basic ones, so in a lot of cases the question about GUIs is how things look, not how you do things at all.

(Complex programs have for some time been coming up with their own bespoke alternatives to, for example, huge cascades of menus. If these are successful they tend to get more broadly adopted by programs facing the same problems; consider the 'ribbon', which got what could be called a somewhat mixed reaction on its modern introduction.)

On the desktop, changing the GUI toolkit that a program uses (either on the same platform or on a different one) may require changing the structure of your code (in addition to ordinary code changes), but it probably won't change how your program operates. Things will look a bit different, maybe some standard platform features will appear or disappear, but it's not a completely different experience. This often includes moving your application from the desktop into the browser (a popular and useful 'cross-platform' environment in itself).

This is less true on mobile platforms, where my sense is that the two dominant platforms have evolved somewhat different idioms for how you interact with applications. A proper 'native' application behaves differently on the two platforms even if it's using mostly the same code base.

GUIs such as Plan 9's Acme show that this doesn't have to be the case; for that matter, so does GNU Emacs. GNU Emacs has a vague shell of a standard looking GUI but it's a thin layer over a much different and stranger vastness, and I believe that experienced Emacs people do very little interaction with it.

Some views on the common Apache modules for SAML or OIDC authentication

By: cks
12 March 2025 at 03:01

Suppose that you want to restrict access to parts of your Apache based website but you want something more sophisticated and modern than Apache Basic HTTP authentication. The traditional reason for this was to support 'single sign on' across all your (internal) websites; the modern reason is that a central authentication server is the easiest place to add full multi-factor authentication. The two dominant protocols for this are SAML and OIDC. There are commonly available Apache authentication modules for both protocols, in the form of Mellon (also) for SAML and OpenIDC for OIDC.

I've now used or at least tested the Ubuntu 24.04 version of both modules against the same SAML/OIDC identity provider, primarily because when you're setting up a SAML/OIDC IdP you need to be able to test it with something. Both modules work fine, but after my experiences I'm more likely to use OpenIDC than Mellon in most situations.

Mellon has two drawbacks and two potential advantages. The first drawback is that setting up a Mellon client ('SP') is more involved. Most of the annoying stuff is automated for you with the mellon_create_metadata script (which you can get from the Mellon repository if it's not in your Mellon package), but you still have to give your IdP your XML blob and get their XML blob. The other drawback is that Mellon isn't integrated into the Apache 'Require' framework for authorization decisions; instead you have to make do with Mellon-specific directives.

The first potential advantage is that Mellon has a straightforward story for protecting two different areas of your website with two different IdPs, if you need to do that for some reason; you can just configure them in separate <Location> or <Directory> blocks and everything works out. If anything, it's a bit non-obvious how to protect various disconnected bits of your URL space with the same IdP without having to configure multiple SPs, one for each protected section of URL space. The second potential advantage is that in general SAML has an easier story for your IdP giving you random information, and Mellon will happily export every SAML attribute it gets into the environment your CGI or web application gets.

The first advantage of OpenIDC is that it's straightforward to configure when you have a single IdP, with no XML and generally low complexity. The second is that it's also straightforward to protect multiple disconnected URL areas with the same IdP but possibly different access restrictions. A third advantage is that OpenIDC is integrated into Apache's 'Require' system, although you have to use OpenIDC specific syntax like 'Require claim groups:agroup' (see the OpenIDC wiki on authorization).

In exchange for this, it seems to be quite involved to use OpenIDC if you need to use multiple OIDC identity providers to protect different bits of your website. It's apparently possible to do this in the same virtual host but it seems quite complex and requires a lot of parts, so if I was confronted with this problem I would try very hard to confine each web thing that needed a different IdP into a different virtual host. And OpenIDC has the general OIDC problem that it's harder to expose random information.

(All of the important OpenIDC Apache directives about picking an IdP can't be put in <Location> or <Directory> blocks, only in a virtual host as a whole. If you care about this, see the wiki on Multiple Providers and also access to different URL paths on a per-provider basis.)

We're very likely to only ever be working with a single IdP, so for us OpenIDC is likely to be easier, although not hugely so.

Sidebar: The easy approach for group based access control with either

Both Mellon and OpenIDC work fine together with the traditional Apache AuthGroupFile directive, provided (of course) that you have or build an Apache format group file using what you've told Mellon or OpenIDC to use as the 'user' for Apache authentication. If your IdP is using the same user (and group) information as your regular system is, then you may well already have this information around.

(This is especially likely if you're migrating from Apache Basic HTTP authentication, where you already needed to build this sort of stuff.)
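
As a concrete illustration with OpenIDC (the file paths, group name, and URL area here are all made up), the Apache side might look something like this; Mellon works similarly, with its own authentication directives in place of the OpenIDC ones:

# /etc/apache2/web-groups contains lines like:
#   sysstaff: alice bob carol

<Location /internal>
    AuthType openid-connect
    AuthGroupFile /etc/apache2/web-groups
    Require group sysstaff
</Location>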

Building your own Apache group file has the additional benefit that you can augment and manipulate group information in ways that might not fit well into your IdP. Your IdP has the drawback that it has to be general; your generated Apache group file can be narrowly specific for the needs of a particular web area.

The web browser as an enabler of minority platforms

By: cks
11 March 2025 at 03:35

Recently, I got involved in a discussion on the Fediverse over what I will simplify to the desirability (or lack of it) of cross platform toolkits, including the browser, and how they erase platform personality and opinions. This caused me to have a realization about what web browser based applications are doing for me, which is that being browser based is what lets me use them at all.

My environment is pretty far from being a significant platform; I think Unix desktop share is in the low single digits (as a percentage) under the best of circumstances. If people had to develop platform specific versions of things like Grafana (which is a great application), they'd probably exist for Windows, maybe macOS, and at the outside, tablets (some applications would definitely exist on phones, but Grafana is a bit of a stretch). They probably wouldn't exist on Linux, especially not for free.

That the web browser is a cross platform environment means that I get these applications (including the Fediverse itself) essentially 'for free' (which is to say, it's because of the efforts of web browsers to support my platform and then give me their work for free). Developers of web applications don't have to do anything to make them work for me, not even so far as making it possible to build their software on Linux; it just happens for them without them even having to think about it.

Although I don't work in the browser as much as some people do, looking back, the existence of implicitly cross platform web applications has been a reasonably important thing in letting me stick with Linux.

This applies to any minority platform, not just Linux. All you need is a sufficiently capable browser and you have access to a huge range of (web) applications.

(Getting that sufficiently capable browser can be a challenge on a sufficiently minority platform, especially if you're not on a major architecture. I'm lucky in that x86 Linux is a majority minority platform; people on FreeBSD or people on architectures other than x86 and 64-bit ARM may be less happy with the situation.)

PS: I don't know if what we have used the web for really counts as 'applications', since they're mostly HTML form based things once you peel a few covers off. But if they do count, the web has been critical in letting us provide them to people. We definitely couldn't have built local application versions of them for all of the platforms that people here use.

(I'm sure this isn't a novel thought, but the realization struck (or re-struck) me recently so I'm writing it down.)

How I got my nose rubbed in my screens having 'bad' areas for me

By: cks
10 March 2025 at 02:50

I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.

In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.

(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)

When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.

(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)

In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.

(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)

Sidebar: Why xload turns green under high load

Xload draws a horizontal tick line for every integer load average step it needs in order to display the maximum load average that fits in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.

This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.

How SAML and OIDC differ in sharing information, and perhaps why

By: cks
9 March 2025 at 04:39

In practice, SAML and OIDC are two ways of doing third party web-based authentication (and thus a Single Sign On (SSO)) system; the web site you want to use sends you off to a SAML or OIDC server to authenticate, and then the server sends authentication information back to the 'client' web site. Both protocols send additional information about you along with the bare fact of an authentication, but they differ in how they do this.

In SAML, the SAML server sends a collection of 'attributes' back to the SAML client. There are some standard SAML attributes that client websites will expect, but the server is free to throw in any other attributes it feels like, and I believe that servers do things like turn every LDAP attribute they get from a LDAP user lookup into a SAML attribute (certainly SimpleSAMLphp does this). As far as I know, any filtering of what SAML attributes are provided by the server to any particular client is a server side feature, and SAML clients don't necessarily have any way of telling the SAML server what attributes they want or don't want.

In OIDC, the equivalent way of returning information is 'claims', which are grouped into 'scopes', along with basic claims that you get without asking for a scope. The expectation in OIDC is that clients that want more than the basic claims will request specific scopes and then get back (only) the claims for those scopes. There are standard scopes with standard claims (not all of which are necessarily returned by any given OIDC server). If you want to add additional information in the form of more claims, I believe that it's generally expected that you'll create one or more custom scopes for those claims and then have your OIDC clients request them (although not all OIDC clients are willing and able to handle custom scopes).

(I think in theory an OIDC server may be free to shove whatever claims it wants to into information for clients regardless of what scopes the client requested, but an OIDC client may ignore any information it didn't request and doesn't understand rather than pass it through to other software.)
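
As a purely illustrative example (not taken from any particular IdP), a client that asked for the standard 'email' scope plus a made-up custom 'our-ldap' scope might get back claims along these lines:

{
  "sub": "YXVzZXIxMi5zb21laWRw",
  "email": "someone@example.org",
  "email_verified": true,
  "ldap_uid": "someone",
  "ldap_groups": ["staff", "sysadmins"]
}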

The SAML approach is more convenient for server and client administrators who are working within the same organization. The server administrator can add whatever information to SAML responses that's useful and convenient, and SAML clients will generally automatically pick it up and often make it available to other software. The OIDC approach is less convenient, since you need to create one or more additional scopes on the server and define what claims go in them, and then get your OIDC clients to request the new scopes; if an OIDC client doesn't update, it doesn't get the new information. However, the OIDC approach makes it easier for both clients and servers to be more selective and thus potentially for people to control how much information they give to who. An OIDC client can ask for only minimal information by only asking for a basic scope (such as 'email') and then the OIDC server can tell the person exactly what information they're approving being passed to the client, without the OIDC server administrators having to get involved to add client-specific attribute filtering.

(In practice, OIDC probably also encourages giving less information to even trusted clients in general since you have to go through these extra steps, so you're less likely to do things like expose all LDAP information as OIDC claims in some new 'our-ldap' scope or the like.)

My guess is that OIDC was deliberately designed this way partly in order to make it better for use with third party clients. Within an organization, SAML's broad sharing of information may make sense, but it makes much less sense in a cross-organization context, where you may be using OIDC-based 'sign in with <large provider>' on some unrelated website. In that sort of case, you certainly don't want that website to get every scrap of information that the large provider has on you, but instead only ask for (and get) what it needs, and for it to not get much by default.

The OpenID Connect (OIDC) 'sub' claim is surprisingly load-bearing

By: cks
8 March 2025 at 04:24

OIDC (OpenID Connect) is today's better or best regarded standard for (web-based) authentication. When a website (or something) authenticates you through an OpenID (identity) Provider (OP), one of the things it gets back is a bunch of 'claims', which is to say information about the authenticated person. One of the core claims is 'sub', which is vaguely described as a string that is 'subject - identifier for the end-user at the issuer'. As I discovered today, this claim is what I could call 'load bearing' in a surprising way or two.

In theory, 'sub' has no meaning beyond identifying the user in some opaque way. The first way it's load bearing is that some OIDC client software (a 'Relying Party (RP)') will assume that the 'sub' claim has a human useful meaning. For example, the Apache OpenIDC module defaults to putting the 'sub' claim into Apache's REMOTE_USER environment variable. This is fine if your OIDC IdP software puts, say, a login name into it; it is less fine if your OIDC IdP software wants to create 'sub' claims that look like 'YXVzZXIxMi5zb21laWRw'. These claims mean something to your server software but not necessarily to you and the software you want to use on (or behind) OIDC RPs.
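
If you're using the Apache OpenIDC module and your IdP's 'sub' values are opaque, the module can be told to use a different claim for REMOTE_USER. Check the exact directive against your module version's documentation, but it's roughly:

# Use the 'email' claim instead of 'sub' for REMOTE_USER.
OIDCRemoteUserClaim email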

The second and more surprising way that the 'sub' claim is load bearing involves how external consumers of your OIDC IdP keep track of your people. In common situations your people will be identified and authorized by their email address (using some additional protocols), which they enter into the outside OIDC RP that's authenticating against your OIDC IdP, and this looks like the identifier that RP uses to keep track of them. However, at least one such OIDC RP assumes that the 'sub' claim for a given email address will never change, and I suspect that there are more people who either quietly use the 'sub' claim as the master key for accounts or who require 'sub' and the email address to be locked together this way.

This second issue makes the details of how your OIDC IdP software generates its 'sub' claim values quite important. You want it to be able to generate those 'sub' values in a clear and documented way that other OIDC IdP software can readily duplicate to create the same 'sub' values, and that won't change if you change some aspect of the OIDC IdP configuration for your current software. Otherwise you're at least stuck with your current OIDC IdP software, and perhaps with its exact current configuration (for authentication sources, internal names of things, and so on).

(If you have to change 'sub' values, for example because you have to migrate to different OIDC IdP software, this could go as far as the outside OIDC RP basically deleting all of their local account data for your people and requiring all of it to be entered back from scratch. But hopefully those outside parties have a better procedure than this.)

The problem facing MFA-enabled IMAP at the moment (in early 2025)

By: cks
7 March 2025 at 04:32

Suppose that you have an IMAP server and you would like to add MFA (Multi-Factor Authentication) protection to it. I believe that in theory the IMAP protocol supports multi-step 'challenge and response' style authentication, so again in theory you could implement MFA this way, but in practice this is unworkable because people would be constantly facing challenges. Modern IMAP clients (and servers) expect to be able to open and close connections more or less on demand, rather than opening one connection, holding it open, and doing everything over it. To make IMAP MFA practical, you need to do it with some kind of 'Single Sign On' (SSO) system. The current approach for this uses an OIDC identity provider for the SSO part and SASL OAUTHBEARER authentication between the IMAP client and the IMAP server, using information from the OIDC IdP.

So in theory, your IMAP client talks to your OIDC IdP to get a magic bearer token, provides this token to the IMAP server, the IMAP server verifies that it comes from a configured and trusted IdP, and everything is good. You only have to go through authenticating to your OIDC IdP SSO system every so often (based on whatever timeout it's configured with); the rest of the time the aggregate system does any necessary token refreshes behind the scenes. And because OIDC has a discovery process that can more or less start from your email address (as I found out), it looks like IMAP clients like Thunderbird could let you more or less automatically use any OIDC IdP if people had set up the right web server information.

If you actually try this right now, you'll find that Thunderbird, apparently along with basically all significant IMAP client programs, will only let you use a few large identity providers; here is Thunderbird's list (via). If you read through that Thunderbird source file, you'll find one reason for this limitation, which is that each provider has one or two magic values (the 'client ID' and usually the 'client secret', which is obviously not so secret here), in addition to URLs that Thunderbird could theoretically autodiscover if everyone supported the current OIDC autodiscovery protocols (my understanding is that not everyone does). In most current OIDC identity provider software, these magic values are either given to the IdP software or generated by it when you set up a given OIDC client program (a 'Relying Party (RP)' in the OIDC jargon).

This means that in order for Thunderbird (or any other IMAP client) to work with your own local OIDC IdP, there would have to be some process where people could load this information into Thunderbird. Alternately, Thunderbird could publish default values for these and anyone who wanted their OIDC IdP to work with Thunderbird would have to add these values to it. To date, creators of IMAP client software have mostly not supported either option and instead hard code a list of big providers who they've arranged more or less explicit OIDC support with.

(Honestly it's not hard to see why IMAP client authors have chosen this approach. Unless you're targeting a very technically inclined audience, walking people through the process of either setting this up in the IMAP client or verifying if a given OIDC IdP supports the client is daunting. I believe some IMAP clients can be configured for OIDC IdPs through 'enterprise policy' systems, but there the people provisioning the policies are supposed to be fairly technical.)

PS: Potential additional references on this mess include David North's article and this FOSDEM 2024 presentation (which I haven't yet watched, I only just stumbled into this mess).

A Prometheus gotcha with alerts based on counting things

By: cks
6 March 2025 at 04:39

Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:

count(
  node_filesystem_size_bytes{
         host = "backupserv",
         mountpoint =~ "/dumps/tapes/slot.*" }
) != <some number>

This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.

Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:

count( ... ) by (host)
  or vector(0, "host", "bserv1")
  or vector(0, "host", "bserv2")
  ....

(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
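
For completeness, the single-host version of the original alert rule with a vector() fallback might look like this (an untested sketch):

(
  count(
    node_filesystem_size_bytes{
           host = "backupserv",
           mountpoint =~ "/dumps/tapes/slot.*" }
  )
  or vector(0)
) != <some number>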

In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:

count(
  node_filesystem_size_bytes{
         host =~ "bserv1|bserv2|...",
         fstype =~ "ext.*" }
) by (host) != <other number>

In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.

(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)

PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.

What SimpleSAMLphp's core:AttributeAlter does with creating new attributes

By: cks
5 March 2025 at 03:41

SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:

 // the 65 is where this is ordered
 65 => [
    'class' => 'core:AttributeAlter',
    'subject' => 'gecos',
    'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
    'target' => 'mfaid',
    'replacement' => '\\1',
 ],

If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.

(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)

The short version of what happens is that when the target is a different attribute and the pattern doesn't match, the target will wind up set but empty. Any previous value is lost. How this happens (and what happens) starts with that 'attributes' here are actually arrays of values under the covers (this is '$attributes'). When core:AttributeAlter has a different target attribute than the source attribute, it takes all of the source attribute's values, passes each of them through a regular expression search and replace (using your replacement), and then gathers up anything that changed and sets the target attribute to this gathered collection. If the pattern doesn't match any values of the attribute (in the normal case, a single value), the array of changed things is empty and your target attribute is set to an empty PHP array.

(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)

My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).

For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.
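
For illustration, a rough sketch of doing the extraction through core:PHP might look like the following (untested, and the 'gecos' and 'mfaid' attribute names are just the ones from this example):

 60 => [
    'class' => 'core:PHP',
    'code' => '
        if (isset($attributes["gecos"][0])) {
            $fields = explode(",", $attributes["gecos"][0]);
            if (isset($fields[4]) && $fields[4] !== "") {
                $attributes["mfaid"] = [$fields[4]];
            }
        }
        // If there is no usable GECOS field, mfaid is simply not set.
    ',
 ],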

(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)

If you get the chance, always run more extra network fiber cabling

By: cks
4 March 2025 at 04:22

Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is such a thing as an unreasonable amount of fiber to run, but mostly that comes up when you'd have to put in giant fiber patch panels.

The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.

The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.

(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)

Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).

PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.

(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)

Updating local commits with more changes in Git (the harder way)

By: cks
3 March 2025 at 03:34

One of the things I do with Git is maintain personal changes locally on top of the upstream version, with my changes updated via rebasing every time I pull upstream to update it. In the simple case, I have only a single local change and commit, but in more complex cases I split my changes into multiple local commits; my local version of Firefox currently carries 12 separate personal commits. Every so often, upstream changes something that causes one of those personal changes to need an update, without actually breaking the rebase of that change. When this happens I need to update my local commit with more changes, and often it's not the 'top' local commit (which can be updated simply).

In theory, the third party tool git-absorb should be ideal for this, and I believe I've used it successfully for this purpose in the past. In my most recent instance, though, git-absorb frustratingly refused to do anything in a situation where it felt it should work fine. I had an additional change to a file that was changed in exactly one of my local commits, which feels like an easy case.

(Reading the git-absorb readme carefully suggests that I may be running into a situation where my new change doesn't clash with any existing change. This makes git-absorb more limited than I'd like, but so it goes.)

In Git, what I want is called a 'fixup commit', and how to use it is covered in this Stackoverflow answer. The sequence of commands is basically:

# modify some/file with new changes, then
git add some/file

# Use this to find your existing commit ID
git log some/file

# with the existing commit ID
git commit --fixup=<commit ID>
git rebase --interactive --autosquash <commit ID>^

This will open an editor buffer with what 'git rebase' is about to do, which I can immediately exit out of because the defaults are exactly what I want (assuming I don't want to shuffle around the order of my local commits, which I probably don't, especially as part of a fixup).

I can probably also use 'origin/main' instead of '<commit ID>^', but that will rebase more things than is strictly necessary. And I need the commit ID for the 'git commit --fixup' invocation anyway.

(Sufficiently experienced Git people can probably put together a script that would do this automatically. It would get all of the files staged in the index, find the most recent commit that modified each of them, abort if they're not all the same commit, make a fixup commit to that most recent commit, and then potentially run the 'git rebase' for you.)
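
As a rough, untested sketch, such a script might look something like this (it assumes simple file names and keeps error handling to a minimum):

#!/bin/sh
# Make a fixup commit for the staged changes and autosquash it into
# the single local commit that last modified the staged files.
set -e

files=$(git diff --cached --name-only)
[ -n "$files" ] || { echo "nothing staged" >&2; exit 1; }

commit=""
for f in $files; do
    c=$(git log -n 1 --pretty=format:%H -- "$f")
    if [ -z "$commit" ]; then
        commit="$c"
    elif [ "$commit" != "$c" ]; then
        echo "staged files were last modified by different commits" >&2
        exit 1
    fi
done

git commit --fixup="$commit"
git rebase --interactive --autosquash "${commit}^"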

Using PyPy (or thinking about it) exposed a bug in closing files

By: cks
2 March 2025 at 03:20

Over on the Fediverse, I said:

A fun Python error some code can make and not notice until you run it under PyPy is a function that has 'f.close' at the end instead of 'f.close()' where f is an open()'d file.

(Normal CPython will immediately close the file when the function returns due to refcounted GC. PyPy uses non-refcounted GC so the file remains open until GC happens, and so you can get too many files open at once. Not explicitly closing files is a classic PyPy-only Python bug.)

When a Python file object is garbage collected, Python arranges to close the underlying C level file descriptor if you didn't already call .close(). In CPython, garbage collection is deterministic and generally prompt; for example, when a function returns, all of its otherwise unreferenced local variables will be garbage collected as their reference counts drop to zero. However, PyPy doesn't use reference counting for its garbage collection; instead, like Go, it only collects garbage periodically, and so will only close files as a side effect some time later. This can make it easy to build up a lot of open files that aren't doing anything, and possibly run your program out of available file descriptors, something I've run into in the past.

I recently wanted to run a hacked up version of a NFS monitoring program written in Python under PyPy instead of CPython, so it would run faster and use less CPU on the systems I was interested in. Since I remembered this PyPy issue, I found myself wondering if it properly handled closing the file(s) it had to open, or if it left it to CPython garbage collection. When I looked at the code, what I found can be summarized as 'yes and no':

def parse_stats_file(filename):
  [...]
  f = open(filename)
  [...]
  f.close

  return ms_dict

Because I was specifically looking for uses of .close(), the lack of the '()' immediately jumped out at me (and got fixed in my hacked version).

It's easy to see how this typo could linger undetected in CPython. The line 'f.close' itself does nothing but isn't an error, and then 'f' is implicitly closed in the next line, as part of the 'return', so even if you look at this program's file descriptor usage while it's running you won't see any leaks.
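
The straightforward way to avoid this whole class of bug is a 'with' statement, which closes the file deterministically on both CPython and PyPy. A minimal sketch, with the actual parsing stood in for by a hypothetical helper:

def parse_stats_file(filename):
    with open(filename) as f:
        # parse_stats() stands in for the real parsing code.
        ms_dict = parse_stats(f)
    return ms_dict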

(I'm not entirely a fan of nondeterministic garbage collection, at least in the context of Python, where deterministic GC was a long standing feature of the language in practice.)

Always sync your log or journal files when you open them

By: cks
1 March 2025 at 03:10

Today I learned of a new way to accidentally lose data 'written' to disk, courtesy of this Fediverse post summarizing a longer article about CouchDB and this issue. Because this was so nifty and startling when I encountered it, yet so simple, I'm going to re-explain the issue in my own words and explain how it leads to the title of this entry.

Suppose that you have a program that makes data it writes to disk durable through some form of journal, write ahead log (WAL), or the like. As we all know, data that you simply write() to the operating system isn't yet on disk; the operating system is likely buffering the data in memory before writing it out at the OS's own convenience. To make the data durable, you must explicitly flush it to disk (well, ask the OS to), for example with fsync(). Your program is a good program, so of course it does this; when it updates the WAL, it write()s then fsync()s.

Now suppose that your program is terminated after the write but before the fsync. At this point you have a theoretically incomplete and improperly written journal or WAL, since it hasn't been fsync'd. However, when your program restarts and goes through its crash recovery process, it has no way to discover this. Since the data was written (into the OS's disk cache), the OS will happily give the data back to you even though it's not yet on disk. Now assume that your program takes further actions (such as updating its main files) based on the belief that the WAL is fully intact, and then the system crashes, losing that buffered and not yet written WAL data. Oops. You (potentially) have a problem.

(These days, programs can get terminated for all sorts of reasons other than a program bug that causes a crash. If you're operating in a modern containerized environment, your management system can decide that your program or its entire container ought to shut down abruptly right now. Or something else might have run the entire system out of memory and now some OOM handler is killing your program.)

To avoid the possibility of this problem, you need to always force a disk flush when you open your journal, WAL, or whatever; on Unix, you'd immediately fsync() it. If there's no unwritten data, this will generally be more or less instant. If there is unwritten data because you're restarting after the program was terminated by surprise, this might take a bit of time but ensures that the on-disk state matches the state that you're about to observe through the OS.
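
In Python terms, the idea looks something like this minimal sketch:

import os

def open_wal(path):
    # Open (or create) the write-ahead log, then immediately flush it
    # so the on-disk state matches what we're about to read from it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.fsync(fd)
    return fd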

(CouchDB's article points to another article, Justin Jaffray’s NULL BITMAP Builds a Database #2: Enter the Memtable, which has a somewhat different way for this failure to bite you. I'm not going to try to summarize it here but you might find the article interesting reading.)

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks
28 February 2025 at 04:07

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.
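
In other words, on Ubuntu 22.04 the practical procedure after editing your WireGuard .yaml is roughly:

# after changing the WireGuard parts of your netplan .yaml:
ip link del dev our-wg0
netplan try          # or 'netplan apply'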

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks
27 February 2025 at 04:15

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

MFA's "push notification" authentication method can be easier to integrate

By: cks
26 February 2025 at 03:59

For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.

A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.
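
As an illustration of the shape of this, here's a minimal sketch of such a check function against an entirely hypothetical push API (the URLs, JSON fields, and states are all made up; real vendor APIs differ):

import time
import requests   # assumed to be available

PUSH_API = "https://mfa.example.com/api/v1"   # hypothetical endpoint

def mfa_push_check(username, timeout=60):
    # Start a push authentication for the user and poll for the result.
    # Returns True on approval, False on denial, expiry, or timeout.
    r = requests.post(PUSH_API + "/push", json={"user": username}, timeout=10)
    r.raise_for_status()
    txid = r.json()["transaction_id"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        s = requests.get(PUSH_API + "/push/" + txid, timeout=10)
        state = s.json().get("state")
        if state == "approved":
            return True
        if state in ("denied", "expired"):
            return False
        time.sleep(2)
    return False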

(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)

By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.

(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)

Go's behavior for zero value channels and maps is partly a choice

By: cks
25 February 2025 at 04:30

How Go behaves if you have a zero value channel or map (a 'nil' channel or map) is somewhat confusing (cf, via). When we talk about it, it's worth remembering that this behavior is a somewhat arbitrary choice on Go's part, not a fundamental set of requirements that stems from, for example, other language semantics. Go has reasons to have channels and maps behave as they do, but some of those reasons have to do with how channel and map values are implemented and some are about what's convenient for programming.

As hinted at by how their zero value is called a 'nil' value, channel and map values are both implemented as pointers to runtime data structures. A nil channel or map has no such runtime data structure allocated for it (and the pointer value is nil); these structures are allocated by make(). However, this doesn't entirely allow us to predict what happens when you use nil values of either type. It's not unreasonable for an attempt to assign an element to a nil map to panic, since the nil map has no runtime data structure allocated to hold anything we try to put in it. But you don't have to say that a nil map is empty and looking up elements in it gives you a zero value; I think you could have this panic instead, just as assigning an element does. However, this would probably result in less safe code that panicked more (and probably had more checks for nil maps, too).

Then there's nil channels, which don't behave like nil maps. It would make sense for receiving from a nil channel to yield the zero value, much like looking up an element in a nil map, and for sending to a nil channel to panic, again like assigning to an element in a nil map (although in the channel case it would be because there's no runtime data structure where your goroutine could metaphorically hang its hat waiting for a receiver). Instead Go chooses to make both operations (permanently) block your goroutine, with panicking on send reserved for sending to a non-nil but closed channel.

The current semantics of sending on a closed channel combined with select statements (and to a lesser extent receiving from a closed channel) means that Go needs a channel zero value that is never ready to send or receive. However, I believe that Go could readily make actual sends or receives on nil channels panic without any language problems. As a practical matter, sending or receiving on a nil channel is a bug that will leak your goroutine even if your program doesn't deadlock.

Similarly, Go could choose to allocate an empty map runtime data structure for zero value maps, and then let you assign to elements in the resulting map rather than panicking. If desired, I think you could preserve a distinction between empty maps and nil maps. There would be some drawbacks to this that cut against Go's general philosophy of being relatively explicit about (heap) allocations and you'd want a clever compiler that didn't bother creating those zero value runtime map data structures when they'd just be overwritten by 'make()' or a return value from a function call or the like.

(I can certainly imagine a quite Go like language where maps don't have to be explicitly set up any more than slices do, although you might still use 'make()' if you wanted to provide size hints to the runtime.)

Sidebar: why you need something like nil channels

We all know that sometimes you want to stop sending or receiving on a channel in a select statement. On first impression it looks like closing a channel (instead of setting the channel to nil) could be made to work for this (it doesn't currently). The problem is that closing a channel is a global thing, while you may only want a local effect; you want to remove the channel from your select, but not close down other uses of it by other goroutines.

This need for a local effect pretty much requires a special, distinct channel value that is never ready for sending or receiving, so you can overwrite the old channel value with this special value, which we might as well call a 'nil channel'. Without a channel value that serves this purpose you'd have to complicate select statements with some other way to disable specific channels.
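
Here's a minimal sketch of that pattern: a function that merges two channels and locally disables each one in its select by setting its own copy of the channel variable to nil once that channel is closed, without closing anything that other goroutines might still be using.

package main

import "fmt"

// merge drains two channels until both are closed. Setting a local
// channel variable to nil effectively removes its case from the
// select, because a nil channel is never ready.
func merge(a, b <-chan int) []int {
    var out []int
    for a != nil || b != nil {
        select {
        case v, ok := <-a:
            if !ok {
                a = nil // this case will never fire again
                continue
            }
            out = append(out, v)
        case v, ok := <-b:
            if !ok {
                b = nil
                continue
            }
            out = append(out, v)
        }
    }
    return out
}

func main() {
    a := make(chan int)
    b := make(chan int)
    go func() { a <- 1; close(a) }()
    go func() { b <- 2; b <- 3; close(b) }()
    fmt.Println(merge(a, b)) // some ordering of 1, 2, 3
}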

(I had to work this out in my head as part of writing this entry so I might as well write it down for my future self.)

JSON has become today's machine-readable output format (on Unix)

By: cks
24 February 2025 at 04:26

Recently, I needed to delete about 1,200 email messages to a particular destination from the mail queue on one of our systems. This turned out to be trivial, because this system was using Postfix and modern versions of Postfix can output mail queue status information in JSON format. So I could dump the mail queue status, select the relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the messages. This experience has left me with the definite view that everything should have the option to output JSON for 'machine-readable' output, rather than some bespoke format. For new programs, I think that you should only bother producing JSON as your machine readable output format.

(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)

This isn't because JSON is the world's best format (JSON is at best the least bad format). Instead it's because JSON has a bunch of pragmatic virtues on a modern Unix system. In general, JSON provides a clear and basically unambiguous way to represent text data and much numeric data, even if it has relatively strange characters in it (ie, JSON has escaping rules that everyone knows and all tools can deal with); it's also generally extensible to add additional data without causing heartburn in tools that are dealing with older versions of a program's output. And on Unix there's an increasingly rich collection of tools to deal with and process JSON, starting with jq itself (and hopefully soon GNU Awk in common configurations). Plus, JSON can generally be transformed to various other formats if you need them.

(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)
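
To illustrate how little work the JSON option takes in a new program, here's a minimal sketch in Go; the record type, its field names, and the '-json' flag are all inventions for this example rather than any real program's interface.

package main

import (
    "encoding/json"
    "flag"
    "fmt"
    "os"
)

// queueEntry is a made-up record type for this sketch.
type queueEntry struct {
    QueueID   string `json:"queue_id"`
    Sender    string `json:"sender"`
    Recipient string `json:"recipient"`
}

func main() {
    asJSON := flag.Bool("json", false, "emit machine-readable JSON, one object per line")
    flag.Parse()

    entries := []queueEntry{
        {"4XyzAbC", "a@example.org", "b@example.net"},
        {"4AbCdEf", "c@example.org", "d@example.com"},
    }

    if *asJSON {
        enc := json.NewEncoder(os.Stdout)
        for _, e := range entries {
            enc.Encode(e) // one object per line plays nicely with jq
        }
        return
    }
    // The human-readable format can stay as casual as you like.
    for _, e := range entries {
        fmt.Printf("%s: %s -> %s\n", e.QueueID, e.Sender, e.Recipient)
    }
}

With output like that, picking out (say) the queue IDs for one destination is a jq one-liner along the general lines of: prog -json | jq -r 'select(.recipient | endswith("@example.net")) | .queue_id'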

There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.

(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)

PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.

(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)

Institutions care about their security threats, not your security threats

By: cks
23 February 2025 at 03:45

Recently I was part of a conversation on the Fediverse that sparked an obvious-in-retrospect realization about computer security and how we look at and talk about security measures. To put it succinctly, your institution cares about threats to it, not about threats to you. It cares about threats to you only so far as they're threats to it through you. Some of the security threats and sensible responses to them overlap between you and your institution, but some of them don't.

One of the areas where I think this especially shows up is in issues around MFA (Multi-Factor Authentication). For example, it's a not infrequently observed thing that if all of your factors live on a single device, such as your phone, then you actually have single factor authentication (this can happen with many of the different ways to do MFA). But for many organizations, this is relatively fine (for them). Their largest risk is that Internet attackers are constantly trying to (remotely) phish their people, often in moderately sophisticated ways that involve some prior research (which is worth it for the attackers because they can target many people with the same research). Ignoring MFA alert fatigue for a moment, even a single factor physical device will cut off all of this, because Internet attackers don't have people's smartphones.

For individual people, of course, this is potentially a problem. If someone can gain access to your phone, they get everything, and probably across all of the online services you use. If you care about security as an individual person, you want attackers to need more than one thing to get all of your accounts. Conversely, for organizations, compromising all of their systems at once is sort of a given, because that's what it means to have a Single Sign On system and global authentication. Only a few organizational systems will be separated from the general SSO (and organizations have to hope that their people cooperate by using different access passwords).

Organizations also have obvious solutions to things like MFA account recovery. They can establish and confirm the identities of people associated with them, and a process to establish MFA in the first place, so if you lose whatever lets you do MFA (perhaps your work phone's battery has gotten spicy), they can just run you through the enrollment process again. Maybe there will be a delay, but if so, the organization has broadly decided to tolerate it.

(And I just recently wrote about the difference between 'internal' accounts and 'external' accounts, where people generally know who is in an organization and so has an account, so allowing this information to leak in your authentication isn't usually a serious problem.)

Another area where I think this difference in the view of threats shows up is in the tradeoffs involved in disk encryption on laptops and desktops used by people. For an organization, choosing non-disclosure over availability on employee devices makes a lot of sense. The biggest threat as the organization sees it isn't data loss on a laptop or desktop (especially if they write policies about backups and where data is supposed to be stored), it's an attacker making off with one and having the data disclosed, which is at least bad publicity and makes the executives unhappy. You may feel differently about your own data, depending on how good your backups are.

HTTP connections are part of the web's long tail

By: cks
22 February 2025 at 03:32

I recently read an article that, among other things, was apparently seriously urging browser vendors to deprecate and disable plain text HTTP connections by the end of October of this year (via, and I'm deliberately not linking directly to the article). While I am a strong fan of HTTPS in general, I have some feelings about a rapid deprecation of HTTP. One of my views is that plain text HTTP is part of the web's long tail.

As I'm using the term here, the web's long tail (also) is the huge mass of less popular things that are individually less frequently visited but which in aggregate amount to a substantial part of the web. The web's popular, busy sites are frequently updated and can handle transitions without problems. They can readily switch to using modern HTML, modern CSS, modern JavaScript, and so on (although they don't necessarily do so), and along with that update all of their content to HTTPS. In fact they mostly or entirely have done so over the last ten to fifteen years. The web's long tail doesn't work like that. Parts of it use old JavaScript, old CSS, old HTML, and these days, plain HTTP (in addition to the people who have objections to HTTPS and deliberately stick to HTTP).

The aggregate size and value of the long tail are part of why browsers have maintained painstaking compatibility back to old HTML so far, including things like HTML Image Maps. There are plenty of parts of the long tail that will never be updated to have HTTPS or work properly with it. For browsers to discard HTTP anyway would be to discard that part of the long tail, which would be a striking break with browser tradition. I don't think this is very likely and I certainly hope that it never comes to pass, because that long tail is part of what gives the web its value.

(It would be an especially striking break since a visible percentage of page loads still happen with HTTP instead of HTTPS. For example, Google's stats say that globally 5% of Windows Chrome page loads apparently still use HTTP. That's roughly one in twenty page loads, and the absolute number is going to be very large given how many page loads happen with Chrome on Windows. This large number is one reason I don't think this is at all a serious proposal; as usual with this sort of thing, it ignores that social problems are the ones that matter.)

PS: Of course, not all of the HTTP connections are part of the web's long tail as such. Some of them are to, for example, manage local devices via little built in web servers that simply don't have HTTPS. The people with these devices aren't in any rush to replace them just because some people don't like HTTP, and the vendors who made them aren't going to update their software to support (modern) HTTPS even for the devices which support firmware updates and where the vendor is still in business.

(You can view them as part of the long tail of 'the web' as a broad idea and interface, even though they're not exposed to the world the way that the (public) web is.)

It's good to have offline contact information for your upstream networking

By: cks
21 February 2025 at 03:42

So I said something on the Fediverse:

Current status: it's all fun and games until the building's backbone router disappears.

A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.

(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)

(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)

The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.

(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)

In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.

(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)

Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.

Shared (Unix) hosting and the problem of managing resource limits

By: cks
20 February 2025 at 03:14

Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resource limits to your customers.

(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)

The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use much resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.

(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)

The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).

Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').

It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.

One problem with "shared Unix hosting" was the lack of resource limits

By: cks
19 February 2025 at 04:04

I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.

The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or the like) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.

Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.

(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)

My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.

(How much resources to give each customer is also a problem, but that's another entry.)

More potential problems for people with older browsers

By: cks
18 February 2025 at 03:40

I've written before that keeping your site accessible to very old browsers is non-trivial because of issues like them not necessarily supporting modern TLS. However, there's another problem that people with older browsers are likely to be facing, unless circumstances on the modern web change. I said on the Fediverse:

Today in unfortunate web browser developments: I think people using older versions of browsers, especially Chrome, are going to have increasing problems accessing websites. There are a lot of (bad) crawlers out there forging old Chrome versions, perhaps due to everyone accumulating AI training data, and I think websites are going to be less and less tolerant of them.

(Mine sure is currently, as an experiment.)

(By 'AI' I actually mean LLM.)

I covered some request volume information yesterday and it (and things I've seen today) strongly suggest that there is a lot of undercover scraping activity going on. Much of that scraping activity uses older browser User-Agents, often very old, which means that people who don't like it are probably increasingly going to put roadblocks in the way of anything presenting those old User-Agent values (there are already open source projects designed to frustrate LLM scraping and there will probably be more in the future).

(Apparently some LLM scrapers start out with honest User-Agents but then switch to faking them if you block their honest versions.)

There's no particular reason why scraping software can't use current User-Agent values, but it probably has to be updated every so often when new browser versions come out and people haven't done that so far. Much like email anti-spam efforts changing email spammer behavior, this may change if enough websites start reacting to old User-Agents, but I suspect that it will take a while for that to come to pass. Instead I expect it to be a smaller scale, distributed effort from 'unimportant' websites that are getting overwhelmed, like LWN (see the mention of this in their 'what we haven't added' section).

Major websites probably won't outright reject old browsers, but I suspect that they'll start throwing an increased amount of blocks in the way of 'suspicious' browser sessions with those User-Agents. This is probably likely to include CAPTCHAs and other such measures that they already use some of the time. CAPTCHAs aren't particularly effective at stopping bad actors in practice but they're the hammer that websites already have, so I'm sure they'll be used on this nail.

Another thing that I suspect will start happening is that more sites will start insisting that you run some JavaScript to pass a test in order to access them (whether this is an explicit CAPTCHA or just passive JavaScript that has to execute). This will stop LLM scrapers that don't run JavaScript, which is not all of them, and force the others to spend a certain amount of CPU and memory, driving up the aggregate cost of scraping your site dry. This will of course adversely affect people without JavaScript in their browser and those of us who choose to disable it for most sites, but that will be seen as the lesser evil by people who do this. As with anti-scraper efforts, there are already open source projects for this.

(This is especially likely to happen if LLM scrapers modernize their claimed User-Agent values to be exactly like current browser versions. People are going to find some defense.)

PS: I've belatedly made the Wandering Thoughts blocks for old browsers now redirect people to a page about the situation. I've also added a similar page for my current block of most HTTP/1.0 requests.

The HTTP status codes of responses from about 21 hours of traffic to here

By: cks
17 February 2025 at 04:06

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got in the past roughly 21 hours and a bit. I think this is about typical, although there may be more blocked things than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, repeat high volume request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got a HTTP 304 response) and 8.5% actually fetched feed data.)

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

  5499 403
   510 200
    39 301
     3 304

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the breakdown of traffic is 21183 normal requests (42%), 45% feed requests, and the remainder for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence, and it's possible that part of what is going on is that I have mis-firing blocking rules (alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source). However, some of the ASNs that show up highly are definitely ones I recognize from other contexts, such as attempted comment spam.

Update: Well that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions that were temporarily unable to read the blog because of my overly-aggressive blocks.

Why I have a little C program to filter a $PATH (more or less)

By: cks
16 February 2025 at 02:07

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)
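
For illustration, here's a rough sketch of the same idea in Go rather than the original C (so this is not that program): keep only the arguments that are directories, and drop duplicates that are really the same directory under another name, identified by device and inode number on Unix.

package main

import (
    "fmt"
    "os"
    "strings"
    "syscall"
)

func main() {
    type dirid struct{ dev, ino uint64 }
    seen := make(map[dirid]bool)
    var kept []string

    for _, dir := range os.Args[1:] {
        fi, err := os.Stat(dir)
        if err != nil || !fi.IsDir() {
            continue // missing or not a directory: drop it
        }
        if st, ok := fi.Sys().(*syscall.Stat_t); ok {
            id := dirid{uint64(st.Dev), uint64(st.Ino)}
            if seen[id] {
                continue // same directory seen under another name
            }
            seen[id] = true
        }
        kept = append(kept, dir)
    }
    fmt.Println(strings.Join(kept, " "))
}

As with the real program, rc would pick up the whitespace-separated result through backquote substitution into $path.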

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only way a builtin will tell you whether a directory exists is to try to cd to it. That we change directories is harmless because everything is running inside the equivalent of a Bourne shell $(...).

Keen eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.

The profusion of things that could be in your $PATH on old Unixes

By: cks
15 February 2025 at 03:43

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad things for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new to you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

'Internal' accounts and their difference from 'external' accounts

By: cks
14 February 2025 at 03:22

In the comments on my entry on how you should respond to authentication failures depends on the circumstances, sapphirepaw said something that triggered a belated realization in my mind:

Probably less of a concern for IMAP, but in a web app, one must take care to hide the information completely. I was recently at a site that wouldn't say whether the provided email was valid for password reset, but would reveal it was in use when trying to create a new account.

The realization this sparked is that we can divide accounts and systems into two sorts, which I will call internal and external, and how you want to treat things around these accounts is possibly quite different.

An internal account is one that's held by people within your organization, and generally is pretty universal. If you know that someone is a member of the organization you can predict that they have an account on the system, and not infrequently what the account name is. For example, if you know that someone is a graduate student here it's a fairly good bet that they have an account with us and you may even be able to find and work out their login name. The existence of these accounts and even specifics about who has what login name (mostly) isn't particularly secret or sensitive.

(Internal accounts don't have to be on systems that the organization runs; they could be, for example, 'enterprise' accounts on someone else's SaaS service. Once you know that the organization uses a particular SaaS offering or whatever, you're usually a lot of the way to identifying all of their accounts.)

An external account is one that's potentially held by people from all over, far outside the bounds of a single organization (including the one running the systems the account is used with). A lot of online accounts with websites are like this, because most websites are used by lots of people from all over. Who has such an account may be potentially sensitive information, depending on the website and the feelings of the people involved, and the account identity may be even more sensitive (it's one thing to know that a particular email address has a Fediverse account on mastodon.social, but it may be quite different to know which account that is, depending on various factors).

There's a spectrum of potential secrecy between these two categories. For example, the organization might not want to openly reveal which external SaaS products they use, what entity name the organization uses on them, and the specific names people use for authentication, all in the name of making it harder to break into their environment at the SaaS product. And some purely internal systems might have a very restricted access list that is kept at least somewhat secret so attackers don't know who to target. But I think the broad division between internal and external is useful because it does a lot to point out where any secrecy is.

When I wrote my entry, I was primarily thinking about internal accounts, because internal accounts are what we deal with (and what many internal system administration groups handle). As sapphirepaw noted, the concerns and thus the rules are quite different for external accounts.

(There may be better labels for these two sorts of accounts; I'm not great with naming.)

How you should respond to authentication failures isn't universal

By: cks
13 February 2025 at 02:55

A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.

The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.

(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)

If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.

(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)

There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.

(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)

A surprise with rspamd's spam scoring and a workaround

By: cks
12 February 2025 at 03:41

Over on the Fediverse, I shared a discovery:

This is my face when rspamd will apparently pattern-match a mention of 'test@test' in the body of an email, extract 'test', try that against the multi.surbl.org DNS blocklist (which includes it), and decide that incoming email is spam as a result.

Although I didn't mention it in the post, I assume that rspamd's goal is to extract the domain from email addresses and see if the domain is 'bad'. This handles a not uncommon pattern of spammer behavior where they send email from a throwaway setup but direct your further email to their long term address. One sees similar things with URLs, and I believe that rspamd will extract domains from URLs in messages as well.

(Rspamd is what we currently use for scoring email for spam, for various reasons beyond the scope of this entry.)

The sign of this problem happening was message summary lines in the rspamd log that included annotations like (with a line split and spacing for clarity):

[...] MW_SURBL_MULTI(7.50){test:email;},
PH_SURBL_MULTI(5.00){test:email;} [...]

As I understand it, the 'test:email' bit means that the thing being looked up in multi.surbl.org was 'test' and it came from the email message (I don't know if it's specifically the body of the email message or this could also have been in the headers). The SURBL reasonably lists 'test' for, presumably, testing purposes, much like many IP based DNSBLs list various 127.0.0.* IPs. Extracting a dot-less 'domain' from a plain text email message is a bit aggressive, but we get the rspamd that we get.

(You might wonder where 'test@test' comes from; the answer is that in Toronto it's a special DSL realm that's potentially useful for troubleshooting your DSL (also).)

Fortunately rspamd allows exceptions. If your rspamd configuration directory is /etc/rspamd as normal, you can put a 'map' file of SURBL exceptions at /etc/rspamd/local.d/map.d/surbl-whitelist.inc.local. You can discover this location by reading modules.d/rbl.conf, which you can find by grep'ing the entire /etc/rspamd tree for 'surbl' (yes, sometimes I use brute force). The best documentation on what you put into maps that I could find is "Maps content" in the multimap module documentation; the simple version is that you appear to put one domain per line and comment lines are allowed, starting with '#'.
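
For concreteness, such a map file can be as simple as a comment line plus the name to exempt, something like:

# local exceptions to SURBL DNS blocklist lookups
test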

(As far as I could tell from our experience, rspamd noticed the existence of our new surbl-whitelist.inc.local file all on its own, with no restart or reload necessary.)

Everything should be able to ratelimit sources of authentication failures

By: cks
11 February 2025 at 03:54

One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.

(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)

I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.

The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs have very little limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.

Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.
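
As a sketch of how little the core mechanism needs to be, here's a minimal in-memory failure rate limiter in Go, keyed by an arbitrary string in the spirit of Exim's ratelimit keys (a source IP, a target login, or some combination). This is an illustration of the idea, not production code; a real version would also want things like expiry of idle keys.

package main

import (
    "fmt"
    "sync"
    "time"
)

// FailureLimiter allows at most max authentication failures per key
// within a sliding window; once a key is over the limit, further
// attempts can be refused cheaply before doing any real work.
type FailureLimiter struct {
    mu     sync.Mutex
    max    int
    window time.Duration
    seen   map[string][]time.Time
}

func NewFailureLimiter(max int, window time.Duration) *FailureLimiter {
    return &FailureLimiter{max: max, window: window, seen: make(map[string][]time.Time)}
}

// Blocked reports whether key has hit the failure limit.
func (f *FailureLimiter) Blocked(key string) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    return len(f.prune(key)) >= f.max
}

// Failure records one more authentication failure for key.
func (f *FailureLimiter) Failure(key string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.seen[key] = append(f.prune(key), time.Now())
}

// prune drops failures that have aged out of the window; callers
// must hold f.mu.
func (f *FailureLimiter) prune(key string) []time.Time {
    cutoff := time.Now().Add(-f.window)
    var kept []time.Time
    for _, t := range f.seen[key] {
        if t.After(cutoff) {
            kept = append(kept, t)
        }
    }
    f.seen[key] = kept
    return kept
}

func main() {
    lim := NewFailureLimiter(3, time.Minute)
    for i := 1; i <= 5; i++ {
        if lim.Blocked("192.0.2.10") {
            fmt.Println("attempt", i, "rejected before authenticating")
            continue
        }
        // ... the real authentication would happen here; pretend it failed ...
        lim.Failure("192.0.2.10")
        fmt.Println("attempt", i, "failed authentication")
    }
}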

(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)

PS: Having rate limits on failed authentications is also reassuring, at least for me.

Providing pseudo-tags in DWiki through a simple hack

By: cks
10 February 2025 at 03:56

DWiki is the general filesystem based wiki engine that underlies this blog, and for various reasons having to do with how old it is, it lacks a number of features. One of the features that I've wanted for more than a decade has been some kind of support for attaching tags to entries and then navigating around using them (although doing this well isn't entirely easy). However, it was always a big feature, both in implementing external files of tags and in tagging entries, and so I never did anything about it.

Astute observers of Wandering Thoughts may have noticed that some years ago, it acquired some topic indexes. You might wonder how this was implemented if DWiki still doesn't have tags (and the answer isn't that I manually curate the lists of entries for each topic, because I'm not that energetic). What happened is that when the issue was raised in a comment on an entry, I realized that I sort of already had tags for some topics because of how I formed the 'URL slugs' of entries (which are their file names). When I wrote about some topics, such as Prometheus, ZFS, or Go, I'd almost always put that word in the wikiword that became the entry's file name. This meant that I could implement a low rent version of tags simply by searching the (file) names of entries for words that matched certain patterns. This was made easier because I already had code to obtain the general list of file names of entries since that's used for all sorts of things in a blog (syndication feeds, the front page, and so on).

That this works as well as it does is a result of multiple quirks coming together. DWiki is a wiki so I try to make entry file names be wikiwords, and because I have an alphabetical listing of all entries that I look at regularly, I try to put relevant things in the file name of entries so I can find them again and all of the entries about a given topic sort together. Even in a file based blog engine, people don't necessarily form their file names to put a topic in them; you might make the file name be a slug-ized version of the title, for example.

(The actual implementation allows for both positive and negative exceptions. Not all of my entries about Go have 'Go' as a word, and some entries with 'Go' in their file name aren't about Go the language, eg.)
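
A sketch of the idea, with made up entry names (and in Go purely for illustration; this isn't DWiki's actual code):

package main

import (
    "fmt"
    "regexp"
)

// topicEntries picks out the entries for a topic by looking for the
// topic word in the entry file name, with explicit include and
// exclude lists to handle the exceptions.
func topicEntries(names []string, word string, include, exclude map[string]bool) []string {
    // Require the word to be followed by an uppercase letter, a
    // digit, or the end of the name, so 'Go' doesn't match 'Google'.
    pat := regexp.MustCompile(regexp.QuoteMeta(word) + `([A-Z0-9]|$)`)
    var out []string
    for _, name := range names {
        if exclude[name] {
            continue
        }
        if include[name] || pat.MatchString(name) {
            out = append(out, name)
        }
    }
    return out
}

func main() {
    names := []string{"GoNilChannelsAndMaps", "LetsGoShopping", "SomeGolangishTopic"}
    include := map[string]bool{}
    exclude := map[string]bool{"LetsGoShopping": true} // has 'Go' but isn't about Go
    fmt.Println(topicEntries(names, "Go", include, exclude)) // [GoNilChannelsAndMaps]
}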

Since the implementation is a hack that doesn't sit cleanly within DWiki's general model of the world, it has some unfortunate limitations (so far, although fixing them would require more hacks). One big one is that as far as the rest of DWiki is concerned, these 'topic' indexes are plain pages with opaque text that's materialized through internal DWikiText rendering. As such, they don't (and can't) have Atom syndication feeds, the way proper fully supported tags would (and you can't ask for 'the most recent N Go entries', and so on; basically there are no blog-like features, because they all require directories).

One of the lessons I took from the experience of hacking pseudo-tag support together was that as usual, sometimes the perfect (my image of nice, generalized tags) is the enemy of the good enough. My solution for Prometheus, ZFS, and Go as topics isn't at all general, but it works for these specific needs and it was easy to put together once I had the idea. Another lesson is that sometimes you have more data than you think, and you can do a surprising amount with it once you realize this. I could have implemented these simple tags years before I did, but until the comment gave me the necessary push I just hadn't thought about using the information that was already in entry names (and that I myself used when scanning the list).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks
9 February 2025 at 03:51

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Web application design and the question of what is a "route"

By: cks
8 February 2025 at 04:16

So what happened is that Leah Neukirchen ran a Fediverse poll on how many routes your most complex web app had. I said that I wasn't going to try to count how many DWiki had, and then gave an example of combining two things in a way that I felt was a 'route' (partly because 'I'm still optimizing the router' was one poll answer). This resulted in a discussion, and one of the questions I draw from it is "what is a route, exactly?"

At one level counting up routes in your web application seems simple. For instance, in our Django application I could count up the URL patterns listed in our 'urlpatterns' setting (which gives me a larger number than I expected for what I think of as a simple Django application). Pattern delegation may make this a bit tedious, but it's entirely tractable. However, I think that this only works for certain sorts of web applications that are designed in a particular way, and as it happens I have an excellent example of where the concept of "route" gets fuzzy.
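
As an illustration of the straightforward case, counting routes in a Django-style application looks something like this; the URL patterns and view names here are made up for the sketch, not our application's actual ones:

# urls.py for a hypothetical Django application; none of these names are
# real. Each path() entry is unambiguously one route.
from django.urls import include, path

from . import views

urlpatterns = [
    path("", views.frontpage),
    path("requests/", views.request_list),
    path("requests/<int:reqid>/", views.request_detail),
    # Pattern delegation: counting routes now means recursing into accounts.urls.
    path("accounts/", include("accounts.urls")),
]
# len(urlpatterns) == 4 at this level, plus whatever accounts.urls contributes.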

DWiki, the engine behind this blog, is actually a general filesystem based wiki (engine). As a filesystem based wiki, what it started out doing was to map any URL path to a filesystem object and then render the filesystem object in some appropriate way; for example, directories turn into a listing of their contents. With some hand-waving you could say that this is one route, or two once we throw in an optional system for handling static assets. Alternately you could argue that this is two (or three) routes, one route for directories and one route for files, because the two are rendered differently (although that's actually implemented in templates, not in code, so maybe they're one route after all).

Later I added virtual directories, which are added to the end of directory paths and are used to restrict what things are visible within the directory (or directory tree). Both the URL paths involved and the actual matching against them look like normal routing (although they're not handled through a traditional router approach), so I should probably count them as "routes", adding four or so more routes, so you could say that DWiki has somewhere between five and seven routes (if you count files and directories separately and throw in a third route for static asset files).

However, I've left out a significant detail, which is visible in how both the blog's front page and the Atom syndication feed of the blog use the same path in their URLs, and the blog's front page looks nothing like a regular directory listing. What's going on is that how DWiki presents both files and especially directories depends on the view they're shown in, and DWiki has a bunch of views; all of the above differences are because of different views being used. Standard blog entry files can be presented in (if I'm counting right) five different views. Directories have a whole menagerie of views that they support, including a 'blog' view. Because views are alternate presentations of a given filesystem object and thus URL path, they're provided as a query parameter, not as part of the URL's path.

Are DWiki's views routes, and if they are, how do we count them? Is each unique combination of a page type (including virtual directories) and a view a new route? One thing that may affect your opinion of this is that a lot of the implementation of views is actually handled in DWiki's extremely baroque templates, not code. However, DWiki's code knows a full list of what views exist (and templates have to be provided or you'll get various failures).

(I've also left out a certain amount of complications, like redirections and invalid page names.)

The broad moral I draw from this exercise is that the model of distinct 'routes' is one that only works for certain sorts of web application design. When and where it works well, it's a quite useful model and I think it pushes you toward making good decisions about how to structure your URLs. But in any strong form, it's not a universal pattern and there are ways to go well outside it.

(Interested parties can see a somewhat out of date version of DWiki's code and many templates, although note that both contain horrors. At some point I'll probably update both to reflect my recent burst of hacking on DWiki.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks
7 February 2025 at 02:59

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)
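
To make the layout concrete, here's a small sketch (my own, not from any existing tool) that turns the 'proc4ops' line into a mapping from NFS v4 operation number to its count:

# Turn the 'proc4ops' line from /proc/net/rpc/nfsd into a mapping from
# NFS v4 operation number (its nfs_opnum4 value) to its count. Slots 0,
# 1, and 2 are the placeholder zeros for operations that don't exist.
counts = {}
with open("/proc/net/rpc/nfsd") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "proc4ops":
            counts = {opnum: int(v) for opnum, v in enumerate(fields[2:])}
            break

print(counts.get(3, 0))    # operation 3 is OP_ACCESS, the first real one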

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks
6 February 2025 at 02:39

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

The practical (Unix) problems with .cache and its friends

By: cks
5 February 2025 at 03:53

Over on the Fediverse, I said:

Dear everyone writing Unix programs that cache things in dot-directories (.cache, .local, etc): please don't. Create a non-dot directory for it. Because all of your giant cache (sub)directories are functionally invisible to many people using your programs, who wind up not understanding where their disk space has gone because almost nothing tells them about .cache, .local, and so on.

A corollary: if you're making a disk space usage tool, it should explicitly show ~/.cache, ~/.local, etc.

If you haven't noticed, there are an ever increasing number of programs that will cache a bunch of data, sometimes a very large amount of it, in various dot-directories in people's home directories. If you're lucky, these programs put their cache somewhere under ~/.cache; if you're semi-lucky, they use ~/.local, and if you're not lucky they invent their own directory, like ~/.cargo (used by Rust's standard build tool because it wants to be special). It's my view that this is a mistake and that everyone should put their big caches in a clearly visible directory or directory hierarchy, one that people can actually find in practice.

I will freely admit that we are in a somewhat unusual environment where we have shared fileservers, a now very atypical general multi-user environment, a compute cluster, and a bunch of people who are doing various sorts of modern GPU-based 'AI' research and learning (both AI datasets and AI software packages can get very big). In our environment, with our graduate students, it's routine for people to wind up with tens or even hundreds of GBytes of disk space used up for caches that they don't even realize are there because they don't show up in conventional ways to look for space usage.

As noted by Haelwenn /элвэн/, a plain 'du' will find such dotfiles. The problem is that plain 'du' is more or less useless for most people; to really take advantage of it, you have to know the right trick (not just the -h argument but feeding it to sort to find things). How I think most people use 'du' to find space hogs is they start in their home directory with 'du -s *' (or maybe 'du -hs *') and then they look at whatever big things show up. This will completely miss things in dot-directories in normal usage. And on Linux desktops, I believe that common GUI file browsers will omit dot-directories by default and may not even have a particularly accessible option to change that (this is certainly the behavior of Cinnamon's 'Files' application and I can't imagine that GNOME is different, considering their attitude).
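
As an illustration of one way around this, here's a little sketch (not a standard tool) that reports top-level space usage in a home directory with dot-directories included, largest first; it counts apparent file sizes rather than allocated blocks, so it's only an approximation of what 'du' would say:

#!/usr/bin/env python3
# Report apparent disk usage of every top-level entry in $HOME,
# including dot-directories, largest first. This is a sketch; it counts
# file sizes (not allocated blocks) and doesn't de-duplicate hard links.
import os

def tree_size(path):
    total = 0
    for dirpath, dirnames, filenames in os.walk(path, onerror=lambda e: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass
    return total

home = os.path.expanduser("~")
sizes = []
for entry in os.scandir(home):
    if entry.is_dir(follow_symlinks=False):
        size = tree_size(entry.path)
    else:
        size = entry.stat(follow_symlinks=False).st_size
    sizes.append((size, entry.name))

for size, name in sorted(sizes, reverse=True)[:20]:
    print(f"{size / 1e9:8.2f} GB  {name}")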

(I'm not sure what our graduate students use to try to explore their disk usage, but I know that multiple graduate students have been unable to find space being eaten up in dot-directories and have been surprised that their home directory was using so much.)

Why writes to disk generally wind up in your OS's disk read cache

By: cks
4 February 2025 at 03:44

Recently, someone was surprised to find out that ZFS puts disk writes in its version of a disk (read) cache, the ARC ('Adaptive Replacement Cache'). In fact this is quite common, as almost every operating system and filesystem puts ordinary writes to disk into their disk (read) cache. In thinking about the specific issue of the ZFS ARC and write data, I realized that there's a general broad reason for this and then a narrower technical one.

The broad reason that you'll most often hear about is that it's not uncommon for your system to read things back after you've written them to disk. It would be wasteful to have something in RAM, write it to disk, remove it from RAM, and then have to more or less immediately read it back from disk. If you're dealing with spinning HDDs, this is quite bad since HDDs can only do a relatively small amount of IO a second; in this day of high performance, low latency NVMe SSDs, it might not be so terrible any more, but it still costs you something. Of course you have to worry about writes flooding the disk cache and evicting more useful data, but this is also an issue with certain sorts of reads.

The narrower technical reason is dealing with issues that come up once you add write buffering to the picture. In practice a lot of ordinary writes to files aren't synchronously written out to disk on the spot; instead they're buffered in memory for some amount of time. This requires some pool of (OS) memory to hold these pending writes, which might as well be your regular disk (read) cache. Putting not yet written out data in the disk read cache also deals with the issue of coherence, where you want programs that are reading data to see the most recently written data even if it hasn't been flushed out to disk yet. Since reading data from the filesystem already looks in the disk cache, you'll automatically find the pending write data there (and you'll automatically replace an already cached version of the old data). If you put pending writes into a different pool of memory, you have to specifically manage it and tune its size, and you have to add extra code to potentially get data from it on reads.
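
The coherence behavior is easy to demonstrate: a buffered write is immediately visible to readers even though nothing has been forced out to disk yet. A minimal sketch on an ordinary local filesystem:

# A buffered write is visible to a subsequent read immediately, even
# though nothing has been fsync()'d to disk yet; both the pending write
# and the read go through the same cached copy of the data.
import os, tempfile

fd, path = tempfile.mkstemp()
with open(path, "w") as w:
    w.write("hello from the page cache\n")
    w.flush()              # the data is now in the kernel's hands, but no
                           # fsync() has happened, so it may not be on disk
    with open(path) as r:
        print(r.read())    # sees the new data anyway
os.close(fd)
os.unlink(path)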

(I'm going to skip considering memory mapped IO in this picture because it only makes things even more complicated, and how OSes and filesystems handle it potentially varies a lot. For example, I'm not sure if Linux or ZFS normally directly use pages in the disk cache, or if even shared memory maps get copies of the disk cache pages.)

PS: Before I started thinking about the whole issue as a result of the person's surprise, I would have probably only given you the broad reason off the top of my head. I hadn't thought about the technical issues of not putting writes in the read cache before now.

Web spiders (or people) can invent unfortunate URLs for your website

By: cks
3 February 2025 at 00:55

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.

If you visit the blog front page or similar things, the obvious and discoverable range links you get are for ten entries. You can in some situations get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to create URLs by hand you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges, and the blog engine provided them. Based on some additional log information, it looks like it's not the first time that giant ranges have been requested. One of those IPs was an AWS IP, for which my default assumption is that this is a web spider of some sort. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so it's most likely being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because it puts a big load on DWiki, I've now hacked in the feature mentioned above to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)
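
The actual change is buried in DWiki's virtual directory handling, but the shape of the idea is simple enough to sketch (this is not DWiki's real code, and the maximum range size here is made up):

# Sketch of restricting how large a '.../range/N-M' request can be.
# Too-large or malformed ranges return None, which the caller turns
# into a HTTP 404 response, matching the behaviour described above.
MAX_RANGE = 50        # made-up limit for this sketch

def parse_range(spec):
    # spec is something like "20-30" from a '<whatever>/range/20-30' URL
    try:
        start, end = (int(x) for x in spec.split("-", 1))
    except ValueError:
        return None
    if start < 1 or end < start or (end - start + 1) > MAX_RANGE:
        return None
    return start, end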

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page' and in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See as Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog, instead of a directory, and the answer is a hack.

Build systems and their effects on versioning and API changes

By: cks
2 February 2025 at 21:52

In a comment on my entry on modern languages and bad packaging outcomes at scale, sapphirepaw said (about backward and forward compatibility within language ecologies), well, I'm going to quote from it because it's good (but go read the whole comment):

I think there’s a social contract that has broken down somewhere.

[...]

If a library version did break things, it was generally considered a bug, and developers assumed it would be fixed in short order. Then, for the most part, only distributions had to worry about specific package/library-version incompatibilities.

This all falls apart if a developer, or the ecosystem of libraries/language they depend on, ends up discarding that compatibility-across-time. That was the part that made it feasible to build a distribution from a collection of projects that were, themselves, released across time.

I have a somewhat different view. I think that the way it was in the old days was less a social contract and more an effect of the environment that software was released into and built in, and now that the environment has changed, the effects have too.

C famously has a terrible story around its (lack of a) build system and dependency management, and for much of its life you couldn't assume pervasive and inexpensive Internet connectivity (well, you still can't assume the latter globally, but people have stopped caring about such places). This gave authors of open source software a strong incentive to be both backward and forward compatible. If you released a program that required the features of a very recent version of a library, you reduced your audience to people who already had the recent version (or better) or who were willing to go through the significant manual effort to get and build that version of the library, and then perhaps make all of their other programs work with it, since C environments often more or less forced global installation of libraries. If you were a library author releasing a new minor version or patch level that had incompatibilities, people would be very slow to actually install and adopt that version because of those incompatibilities; most of their programs using your libraries wouldn't update on the spot, and there was no good mechanism to use the old version of the library for some programs.

(Technically you could make this work with static linking, but static linking was out of favour for a long time.)

All of this creates a quite strong practical and social push toward stability. If you wanted your program or its new version to be used widely (and you usually did), it had better work with the old versions of libraries that people already had; requiring new APIs or new library behavior was dangerous. If you wanted the new version of your library to be used widely, it had better be compatible with old programs using the old API, and if you wanted a brand new library to be used by people in programs, it had better demonstrate that it was going to be stable.

Much of this spilled over into other languages like Perl and Python. Although both of these developed central package repositories and dependency management schemes, for a long time these mostly worked globally, just like the C library and header ecology, and so they faced similar pressures. Python only added fully supported virtual environments in 2012, for example (in Python 3.3).

Modern languages like Go and Rust (and the Node.js/NPM ecosystem, and modern Python venv based operation) don't work like that. Modern languages mostly use static linking instead of shared libraries (or the equivalent of static linking for dynamic languages, such as Python venvs), and they have build systems that explicitly support automatically fetching and using specific versions of dependencies (or version ranges; most build systems are optimistic about forward compatibility). This has created an ecology where it's much easier to use a recent version of something than it was in C, and where API changes in dependencies often have much less effect because it's much easier (and sometimes even the default) to build old programs with old dependency versions.

(In some languages this has resulted in a lot of programs and packages implicitly requiring relatively recent versions of their dependencies, even if they don't say so and claim wide backward compatibility. This happens because people would have to take explicit steps to test with their stated minimum version requirements and often people don't, with predictable results. Go is an exception here because of its choice of 'minimum version selection' for dependencies over 'maximum version selection', but even then it's easy to drift into using new language features or new standard library APIs without specifically requiring that version of Go.)

One of the things about technology is that technology absolutely affects social issues, so different technology creates different social expectations. I think that's what's happened with social expectations around modern languages. Because they have standard build systems that make it easy to do it, people feel free to have their programs require specific version ranges of dependencies (modern as well as old), and package authors feel free to break things and then maybe fix them later, because programs can opt in or not and aren't stuck with the package's choices for a particular version. There are still forces pushing towards compatibility, but they're weaker than they used to be and more often violated.

Or to put it another way, there was a social contract of sorts for C libraries in the old days but the social contract was a consequence of the restrictions of the technology. When the technology changed, the 'social contract' also changed, with unfortunate effects at scale, which most developers don't care about (most developers aren't operating at scale, they're scratching their own itch). The new technology and the new social expectations are probably better for the developers of programs, who can now easily use new features of dependencies (or alternately not have to update their code to the latest upstream whims), and for the developers of libraries and packages, who can change things more easily and who generally see their new work being used faster than before.

(In one perspective, the entire 'semantic versioning' movement is a reaction to developers not following the expected compatibility that semver people want. If developers were already doing semver, there would be no need for a movement for it; the semver movement exists precisely because people weren't. We didn't have a 'semver' movement for C libraries in the 1990s because no one needed to ask for it, it simply happened.)

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks
2 February 2025 at 03:53

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results and casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.

Modern languages and bad packaging outcomes at scale

By: cks
1 February 2025 at 03:30

Recently I read Steinar H. Gunderson's Migrating away from bcachefs (via), where one of the mentioned issues was a strong disagreement between the author of bcachefs and the Debian Linux distribution about how to package and distribute some Rust-based tools that are necessary to work with bcachefs. In the technology circles that I follow, there's a certain amount of disdain for the Debian approach, so today I want to write up how I see the general problem from a system administrator's point of view.

(Saying that Debian shouldn't package the bcachefs tools if they can't follow the wishes of upstream is equivalent to saying that Debian shouldn't support bcachefs. Among other things, this isn't viable for something that's intended to be a serious mainstream Linux filesystem.)

If you're serious about building software under controlled circumstances (and Linux distributions certainly are, as are an increasing number of organizations in general), you want the software build to be both isolated and repeatable. You want to be able to recreate the same software (ideally exactly binary identical, a 'reproducible build') on a machine that's completely disconnected from the Internet and the outside world, and if you build the software again later you want to get the same result. This means that the build process can't download things from the Internet, and if you run it three months from now you should get the same result even if things out there on the Internet have changed (such as third party dependencies releasing updated versions).

Unfortunately a lot of the standard build tooling for modern languages is not built to do this. Instead it's optimized for building software on Internet connected machines where you want the latest patchlevel or even entire minor version of your third party dependencies, whatever that happens to be today. You can sometimes lock down specific versions of all third party dependencies, but this isn't necessarily the default and so programs may not be set up this way from the start; you have to patch it in as part of your build customizations.

(Some languages are less optimistic about updating dependencies, but developers tend not to like that. For example, Go is controversial for its approach of 'minimum version selection' instead of 'maximum version selection'.)

The minimum thing that any serious packaging environment needs to do is contain all of the dependencies for any top level artifact, and to force the build process to use these (and only these), without reaching out to the Internet to fetch other things (well, you're going to block all external access from the build environment). How you do this depends on the build system, but it's usually possible; in Go you might 'vendor' all dependencies to give yourself a self-contained source tree artifact. This artifact never changes the dependency versions used in a build even if they change upstream because you've frozen them as part of the artifact creation process.

(Even if you're not a distribution but an organization building your own software using third-party dependencies, you do very much want to capture local copies of them. Upstream things go away or get damaged every so often, and it can be rather bad to not be able to build a new release of some important internal tool because an upstream decided to retire to goat farming rather than deal with the EU CRA. For that matter, you might want to have local copies of important but uncommon third party open source tools you use, assuming you can reasonably rebuild them.)

If you're doing this on a small scale for individual programs you care a lot about, you can stop there. If you're doing this on a distribution's scale you have an additional decision to make: do you allow each top level thing to have its own version of dependencies, or do you try to freeze a common version? If you allow each top level thing to have its own version, you get two problems. First, you're using up more disk space for at least your source artifacts. Second and worse, now you're on the hook for maintaining, checking, and patching multiple versions of a given dependency if it turns out to have a security issue (or a serious bug).

Suppose that you have program A using version 1.2.3 of a dependency, program B using 1.2.7, the current version is 1.2.12, and the upstream releases 1.2.13 to fix a security issue. You may have to investigate both 1.2.3 and 1.2.7 to see if they have the bug and then either patch both with backported fixes or force both program A and program B to be built with 1.2.13, even if the version of these programs that you're using weren't tested and validated with this version (and people routinely break things in patchlevel releases).

If you have a lot of such programs it's certainly tempting to put your foot down and say 'every program that uses dependency X will be set to use a single version of it so we only have to worry about that version'. Even if you don't start out this way you may wind up with it after a few security releases from the dependency and the packagers of programs A and B deciding that they will just force the use of 1.2.13 (or 1.2.15 or whatever) so that they can skip the repeated checking and backporting (especially if both programs are packaged by the same person, who has only so much time to deal with all of this). If you do this inside an organization, probably no one in the outside world knows. If you do this as a distribution, people yell at you.

(Within an organization you may also have more flexibility to update program A and program B themselves to versions that might officially support version 1.2.15 of that dependency, even if the program version updates are a little risky and change some behavior. In a distribution that advertises stability and has no way of contacting people using it to warn them or coordinate changes, things aren't so flexible.)

The tradeoffs of having an internal unauthenticated SMTP server

By: cks
31 January 2025 at 04:08

One of the reactions I saw to my story of being hit by an alarming well prepared phish spammer was surprise that we had an unauthenticated SMTP server, even if it was only available to our internal networks. Part of the reason we have such a server is historical, but I also feel that the tradeoffs involved are not as clear cut as you might think.

One fundamental problem is that people (actual humans) aren't the only thing that needs to be able to send email. Unless you enjoy building your own system problem notification system from scratch, a whole lot of things will try to send you email to tell you about problems. Cron jobs will email you output, you may want to get similar email about systemd units, both Linux software RAID and smartd will want to use email to tell you about failures, you may have home-grown management systems, and so on. In addition to these programs on your servers, you may have inconvenient devices like networked multi-function photocopiers that have scan to email functionality (and the people who bought them and need to use them have feelings about being able to do so). In a university environment such as ours, some of the machines involved will be run by research groups, graduate students, and so on, not your core system administrators (and it's a very good idea if these machines can tell their owners about failed disks and the like).

Most of these programs will submit their email through the local mailer facilities (whatever they are), and most local mail systems ('MTAs') can be configured to use authentication when they talk to whatever SMTP gateway you point them at. So in theory you could insist on authenticated SMTP for everything. However, this gives you a different problem, because now you must manage this authentication. Do you give each machine its own authentication identity and password, or have some degree of shared authentication? How do you distribute and update this authentication information? How much manual work are you going to need to do as research groups add and remove machines (and as your servers come and go)? Are you going to try to build a system that restricts where a given authentication identity can be used from, so that someone can't make off with the photocopier's SMTP authorization and reuse it from their desktop?

(If you instead authorize IP addresses without requiring SMTP authentication, you've simply removed the requirement for handling and distributing passwords; you're still going to be updating some form of access list. Also, this has issues if people can use your servers.)
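
To make the difference concrete, here's roughly what a small program-level notification looks like against an unauthenticated internal gateway versus an authenticating one; the hostnames, addresses, and credentials are all made up for this sketch:

# What a program-level mail notification looks like with and without
# SMTP authentication. Hostnames, addresses, and credentials are made up.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "root@server.example.org"
msg["To"] = "sysadmins@example.org"
msg["Subject"] = "RAID failure on server"
msg.set_content("md0 has a failed disk.")

# Unauthenticated internal gateway: nothing to manage or distribute.
with smtplib.SMTP("smtp-internal.example.org") as s:
    s.send_message(msg)

# Authenticated submission: every sending machine now needs credentials.
with smtplib.SMTP("smtp-auth.example.org", 587) as s:
    s.starttls()
    s.login("server-identity", "per-machine-password")
    s.send_message(msg)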

You can solve all of these problems if you want to. But there is no current general, easily deployed solution for them, partly because we don't currently have any general system of secure machine and service identity that programs like MTAs can sit on top of. So system administrators have to build such things ourselves to let one MTA prove to another MTA who and what it is.

(There are various ways to do this other than SMTP authentication and some of them are generally used in some environments; I understand that mutual TLS is common in some places. And I believe that in theory Kerberos could solve this, if everything used it.)

Every custom piece of software or piece of your environment that you build is an overhead; it has to be developed, maintained, updated, documented, and so on. It's not wrong to look at the amount of work it would require in your environment to have only authenticated SMTP and conclude that the practical risks of having unauthenticated SMTP are low enough that you'll just do that.

PS: requiring explicit authentication or authorization for notifications is itself a risk, because it means that a machine that's in a sufficiently bad or surprising state can't necessarily tell you about it. Your emergency notification system should ideally fail open, not fail closed.

PPS: In general, there are ways to make an unauthenticated SMTP server less risky, depending on what you need it to do. For example, in many environments there's no need to directly send such system notification email to arbitrary addresses outside the organization, so you could restrict what destinations the server accepts, and maybe what sending addresses can be used with it.

Our well-prepared phish spammer may have been chasing lucrative prey

By: cks
30 January 2025 at 03:19

Yesterday I wrote about how we got hit by an alarmingly well-prepared phish spammer. This spammer sent a moderate amount of spam through us, in two batches; most of it was immediately delivered or bounced (and was effectively lost), but we managed to capture one message due to delivery problems. We can't be definite from a single captured spam message (although our logs suggest that the other messages were similar to it), but it's at least suggestive.

The single captured email message has two PDFs and a text portion; as far as I can tell the PDFs are harmless (apart from their text contents), with no links or other embedded things. The text portion claims to be a series of (top replying) email messages about the nominal sender of the message getting an invoice paid, and the PDFs are an invoice for vague professional services for $49,700 (US dollars, implicitly), with a bank's name, a bank routing number and an account number, and a US IRS W-9 form for the person supposedly asking for their invoice to be paid, complete with an address and a US Social Security number. The PDF requests that you 'send a copy of the remittance to <email address>', where the domain has no website and its mail is hosted by Google. Based on some Internet searches, the PDF's bank routing number is correct for the bank, although of course who knows who the account number goes to.

The very obvious thing to say is that if even a single recipient out of the just under three hundred this spam was sent to follows the directions and sends an invoice payment, this will have been a decently lucrative phish spam (assuming that all of the spam messages were pushing the same scam, and the spammer can extract the money). If several of them did, this could be extremely lucrative, more than lucrative enough to justify dozens or hundreds of hours of research on both the ultimate targets (to determine who at various domains to send email to, what names of bosses to put in the email, and so on) and access methods (ie, how to use our VPNs).

Further, it seems possible that the person whose name was on the invoice, the email, and the W-9 is real and had their identity stolen, complete with their current address and US social security number. If this is the case, the person may receive an unpleasant surprise the next time they have to interact with the US IRS, since the IRS may well have data from companies claiming that this person was paid income that, well, they weren't. I can imagine a more advanced version of the scam where the spammer actually opened an account in this person's name at the bank in the invoice, and is now routing their fraudulently obtained invoice payments through it.

(There are likely all sorts of other possibilities for how the spammer might be extracting invoice payment money, and all of this assumes that the PDFs themselves don't contain undetected malware that is simply inactive in my Linux command line based PDF viewing environment.)

We got hit by an alarmingly well-prepared phish spammer

By: cks
29 January 2025 at 04:24

Yesterday evening, we were hit by a run of phish spam that I would call 'vaguely customized' for us; for example, the display name in the From: header was "U of T | CS Dept" (but then the actual email address was that of the compromised account elsewhere that was used to send the phish spam). The destination addresses here weren't particularly well chosen, and some of them didn't even exist. So far, so normal. One person here fell for the phish spam that evening but realized it almost immediately and promptly changed their password. Today that person got in touch with us because they'd started receiving email bounces for (spam) email that they hadn't sent. Investigation showed that the messages were being sent through us, but in an alarmingly clever way.

We have a local VPN service for people, and this VPN service requires a different password from your regular (Unix and IMAP and etc) password. People connecting through our VPN have access to an internal-only SMTP gateway machine that doesn't require SMTP authentication. As far as we can tell, in the quite short interval between when the person fell for the phish and then changed their password, the phish spam attacker used the main password they'd just stolen to register the person for our VPN and obtain a VPN password (which we don't reset on Unix password changes). They then connected to the VPN using their stolen credentials and used the VPN to send spam email through our internal-only SMTP gateway (initially last evening and then again today, at which point they were detected).

Based on some log evidence, I think that the phish spammer first tried to use authenticated SMTP but failed due to the password change, then fell back on the VPN access. Even if VPN access hadn't been their primary plan, they worked very fast to secure themselves an additional access method. It seems extremely likely that the attacker had already researched our mail and VPN environment before they sent their initial phish spam, since they knew exactly where to go and what to do.

If phish spammers are increasingly going to be this well prepared and clever, we're going to have to be prepared for that on our side. Until now, we hadn't really thought about the possibility of phish spammers gaining VPN access; previous phish spammers have exploited some combination of webmail and authenticated SMTP.

(We're also going to need to be more concerned about other methods of obtaining persistent account access, such as adding new SSH authorized keys to the Unix login. This attacker didn't attempt any sort of SSH access.)

How to accidentally get yourself with 'find ... -name something*'

By: cks
28 January 2025 at 03:43

Suppose that you're in some subdirectory /a/b/c, and you want to search all of /a for the presence of files for any version of some program:

u@h:/a/b/c$ find /a -name program* -print

This reports '/a/b/c/program-1.2.tar' and '/a/b/f/program-1.2.tar', but you happen to know that there are other versions of the program under /a. What happened to a command that normally works fine?

As you may have already spotted, what happened is the shell's wildcard expansion. Because you ran your find in a directory that contained exactly one match for 'program*', the shell expanded it before you ran find, and what you actually ran was:

find /a -name program-1.2.tar -print

This reported the two instances of program-1.2.tar in the /a tree, but not the program-1.4.1.tar that was also in the /a tree.

If you'd run your find command in a directory without a shell match for the -name wildcard, the shell would (normally) pass the unexpanded wildcard through to find, which would do what you want. And if there had been only one instance of 'program-1.2.tar' in the tree, in your current directory, it might have been more obvious what went wrong; instead, the find returning more than one result made it look like it was working normally apart from inexplicably not finding and reporting 'program-1.4.1.tar'.

(If there were multiple matches for the wildcard in the current directory, 'find' would probably have complained and you'd have realized what was going on.)

Some shells have options to cause failed wildcard expansions to be considered an error; Bash has the 'failglob' shopt, for example. People who turn these options on are probably not going to stumble into this because they've already been conditioned to quote wildcards for 'find -name' and other similar tools. Possibly this Bash option or its equivalent in other shells should be the default for new Unix accounts, just so everyone gets used to quoting wildcards that are supposed to be passed through to programs.

(Although I don't use a shell that makes failed wildcard expansions an error, I somehow long ago internalized the idea that I should quote all wildcards I want to pass to programs.)

Some learning experiences with HTTP cookies in practice

By: cks
27 January 2025 at 03:29

Suppose, not hypothetically, that you have a dynamic web site that makes minor use of HTTP cookies in a way that varies the output, and also this site has a caching layer. Naturally you need your caching layer to only serve 'standard' requests from cache, not requests that should get something non-standard. One obvious and simple approach is to skip your cache layer for any request that has a HTTP cookie. If you (I) do this, I have bad news about HTTP requests in practice, at least for syndication feed fetchers.

(One thing you might do with HTTP cookies is deliberately bypass your own cache, for example to ensure that someone who posts a new comment can immediately see their own comment, even if an older version of the page is in the cache.)

The thing about HTTP cookies is that the HTTP client can send you anything it likes as a HTTP cookie and unfortunately some clients will. For example, one feed reader fetcher deliberately attempts to bypass Varnish caches by sending a cookie with all fetch requests, so if the presence of any HTTP cookie causes you to skip your own cache (and other things you do that use the same logic), well, feeder.co is bypassing your caching layer too. Another thing that happens is that some syndication feed fetching clients appear to sometimes leak unrelated cookies into their HTTP requests.

(And of course if your software is hosted along side other software that might set unrestricted cookies for the entire website, those cookies may leak into requests made to your software. For feed fetching specifically, this is probably most likely in feed readers that are browser addons.)

The other little gotcha is that you shouldn't rely on merely the presence or absence of a 'Cookie:' header in the request to tell you if the request has cookies, because a certain number of HTTP clients appear to send a blank Cookie: header (ie, just 'Cookie:'). You might be doing this directly in a CGI by checking for the presence of $HTTP_COOKIE, or you might be doing this indirectly by parsing any Cookie: header in the request into a 'Cookies' object of some sort (even if the value is blank), in which case you'll wind up with an empty Cookies object.

(You can also receive cookies with a blank value in a Cookie: header, eg 'JSESSIONID=', which appears to be a deliberate decision by the software involved, and seems to be there to deal with a bad feed source.)

If you actually care about all of this, as I do now that I've discovered it all, you'll want to specifically check for the presence of your own cookies and ignore any other cookies you see, as well as a blank 'Cookie:' HTTP header. Doing extra special things if you see a 'bypass_varnish=1' cookie is up to you.
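
In concrete terms, the check I now believe in looks something like this sketch (written for a Python CGI-like environment; the first cookie name is purely illustrative, while 'bypass_varnish' is the one discussed above):

# Only treat a request as 'has cookies that matter' if it carries one of
# our own cookies; ignore other people's cookies, blank Cookie: headers,
# and outright garbage in the header.
from http import cookies

OUR_COOKIES = {"dwiki-comments-name", "bypass_varnish"}   # first name is made up

def our_cookies(environ):
    raw = environ.get("HTTP_COOKIE", "").strip()
    if not raw:
        return {}                      # absent or blank Cookie: header
    jar = cookies.SimpleCookie()
    try:
        jar.load(raw)
    except cookies.CookieError:
        return {}                      # garbage in the header; ignore it
    return {name: morsel.value for name, morsel in jar.items()
            if name in OUR_COOKIES and morsel.value != ""}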

(In theory I knew that the HTTP Cookies: header was untrusted client data and shouldn't be trusted, and sometimes even contained bad garbage (which got noted every so often in my logs). In practice I didn't think about the implications of that for some of my own code until now.)

Syndication feeds here are now rate-limited on a per-IP basis

By: cks
26 January 2025 at 03:30

For a long time I didn't look very much at the server traffic logs for Wandering Thoughts, including what was fetching my syndication feeds and how, partly because I knew that looking at web server logs invariably turns over a rock or two. In the past few months I started looking at my feed logs, and then I spent some time trying to get some high traffic sources to slow down on an ad-hoc basis, which didn't have much success (partly because browser feed reader addons seem bad at this). Today I finally gave in to temptation and added general per-IP rate limiting for feed requests. A single IP that requests a particular syndication feed too soon after its last successful request will receive a HTTP 429 response.

(The actual implementation is a hack, which is one reason I didn't do it before now; DWiki, the engine behind Wandering Thoughts, doesn't have an easy place for dynamically updated shared state.)
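
The underlying idea is simple even if the real implementation is a hack. This sketch (not DWiki's actual code, with a made-up interval, and ignoring the conditional versus unconditional distinction mentioned below) shows the shape of it:

# The idea of per-IP feed rate limiting, as a sketch: remember when each
# IP last successfully fetched each feed and answer 429 if it comes back
# too soon. This is not DWiki's actual implementation.
import time

MIN_INTERVAL = 60 * 60          # illustrative: one hour between fetches
last_fetch = {}                 # (ip, feed) -> time of last successful fetch

def check_feed_fetch(ip, feed, now=None):
    now = time.time() if now is None else now
    key = (ip, feed)
    last = last_fetch.get(key)
    if last is not None and now - last < MIN_INTERVAL:
        return 429              # too soon; tell the client to back off
    last_fetch[key] = now
    return 200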

This rate-limiting will probably only moderately reduce the load on Wandering Thoughts, for various reasons, but it will make me happier. I'm also looking forward to having a better picture of what I consider 'actual traffic' to Wandering Thoughts, including actual User-Agent usage, without the distortions added by badly behaved browser addons (I'm pretty sure that my casual view of Firefox's popularity for visitors has been significantly distorted by syndication feed over-fetching).

In applying this rate limiting, I've deliberately decided not to exempt various feed reader providers like NewsBlur, Feedbin, Feedly, and so on. Hopefully all of these places will react properly to receiving periodic HTTP 429 responses and not, say, entirely give up fetching my feeds after a while because they're experiencing 'too many errors'. However, time will tell if this is correct (and if my HTTP 429 responses cause them to slow down their often quite frequent syndication feed requests).

In general I'm going to have to see how things develop, and that's a decent part of why I'm doing this at all. I'm genuinely curious how clients will change their behavior (if they do) and what will emerge, so I'm doing a little experiment (one that's nowhere as serious and careful as rachelbythebay's ongoing work).

PS: The actual rate limiting applies a much higher minimum interval for unconditional HTTP syndication feed requests than for conditional ones, for the usual reason that I feel repeated unconditional requests for syndication feeds is rather antisocial, and if a feed fetcher is going to be antisocial I'm not going to talk to it very often.

Languages don't version themselves using semantic versioning

By: cks
25 January 2025 at 03:46

A number of modern languages have effectively a single official compiler or interpreter, and they version this toolchain with what looks like a semantic version (semver). So we have (C)Python 3.12.8, Go 1.23.5, Rust(c) 1.84.0, and so on, which certainly look like a semver major.minor.patchlevel triplet. In practice, this is not how languages think of their version numbers.

In practice, the version number triplets of things like Go, Rust, and CPython have a meaning that's more like '<dialect>.<release>.<patchlevel>'. The first number is the language dialect and it changes extremely infrequently, because it's a very big deal to significantly break backward compatibility or even to make major changes in language semantics that are sort of backward compatible. Python 1, Python 2, and Python 3 are all in effect different but closely related languages.

(Python 2 is much closer to Python 1 than Python 3 is to Python 2, which is part of why you don't read about a painful and protracted transition from Python 1 to Python 2.)

The second number is somewhere between a major and a minor version number. It's typically increased when the language or the toolchain (or both) do something significant, or when enough changes have built up since the last time the second number was increased and people want to get them out in the world. Languages can and do make major additions with only a change in the second number; Go added generics, CPython added and improved an asynchronous processing system, and Rust has stabilized a whole series of features and improvements, all in Go 1.x, CPython 3.x, and Rust 1.x.

The third number is a patchlevel (or if you prefer, a 'point release'). It's increased when a new version of an X.Y release must be made to fix bugs or security problems, and generally contains minimal code changes and no new language features. I think people would look at the language's developers funny if they landed new language features in a patchlevel instead of an actual release, and they'd definitely be unhappy if something was broken or removed in a patchlevel. It's supposed to be basically completely safe to upgrade to a new patchlevel of the language's toolchain.
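
As a concrete CPython illustration, the interpreter exposes its own triplet, and reading it with the dialect/release/patchlevel interpretation looks like this:

# CPython's own version triplet, read as dialect / release / patchlevel
# rather than semver's major / minor / patch.
import sys

dialect, release, patchlevel = sys.version_info[:3]
print(f"Python dialect {dialect}, release {dialect}.{release}, patchlevel {patchlevel}")
# e.g. for 3.12.8: dialect 3, release 3.12, patchlevel 8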

Both Go and CPython will break, remove, or change things in new 'release' versions. CPython has deprecated a number of things over the course of the 3.x releases so far, and Go has changed how its toolchain behaves and turned off some old behavior (the toolchain's behavior is not covered by Go's language and standard library compatibility guarantee). In this regard these Go and CPython releases are closer to major releases than minor releases.

(Go uses the term 'major release' and 'minor release' for, eg, 'Go 1.23' and 'Go 1.23.3'; see here. Python often calls each '3.x' a 'series', and '3.x.y' a 'maintenance release' within that series, as seen in the Python 3.13.1 release note.)

The corollary of this is that you can't apply semver expectations about stability to language versioning. Languages with this sort of versioning are 'less stable' than they should be by semver standards, since they make significant and not necessarily backward compatible changes in what semver would call a 'minor' release. This isn't a violation of semver because these languages never claimed or promised to be following semver. Language versioning is different (and basically has to be).

(I've used CPython, Go, and Rust here because they're the three languages where I'm most familiar with the release versioning policies. I suspect that many other languages follow similar approaches.)

Sometimes you need to (or have to) run old binaries of programs

By: cks
24 January 2025 at 03:52

Something that is probably not news to system administrators who've been doing this long enough is that sometimes, you need to or have to run old binaries of programs. I don't mean that you need to run old versions of things (although since the program binaries are old, they will be old versions); I mean that you literally need to run old binaries, ones that were built years ago.

The obvious situation where this can happen is if you have commercial software and the vendor either goes out of business or stops providing updates for the software. In some situations this can result in you needing to keep extremely old systems alive simply to run this old software, and there are lots of stories about 'business critical' software in this situation.

(One possibly apocryphal local story is that the central IT people had to keep a SPARC Solaris machine running for more than a decade past its feasible end of life because it was the only environment that ran a very special printer driver that was used to print payroll checks.)

However, you can get into this situation with open source software too. Increasingly, rebuilding complex open source software projects is not for the faint of heart and requires complex build environments. Not infrequently, these build environments are 'fragile', in the sense that in practice they depend on and require specific versions of tools, supporting language interpreters and compilers, and so on. If you're trying to (re)build them on a modern version of the OS, you may find some issues (also). You can try to get and run the version of the tools they need, but this can rapidly send you down a difficult rabbit hole.

(If you go back far enough, you can run into 32-bit versus 64-bit issues. This isn't just compilation problems, where code isn't 64-bit safe; you can also have code that produces different results when built as a 64-bit binary.)

This can create two problems. First, historically, it complicates moving between CPU architectures. For a couple of decades that's been a non-issue for most Unix environments, because x86 was so dominant, but now ARM systems are starting to become more and more available and even attractive, and they generally don't run old x86 binaries very well. Second, there are some operating systems that don't promise long term binary compatibility to older versions of themselves; they will update system ABIs, removing the old version of the ABI after a while, and require you to rebuild software to use the new ABIs if you want to run it on the current version of the OS. If you have to use old binaries you're stuck with old versions of the OS and generally no security updates.

(If you think that this is absurd and no one would possibly do that, I will point you to OpenBSD, which does it regularly to help maintain and improve the security of the system. OpenBSD is neither wrong nor right to take their approach; they're making a different set of tradeoffs than, say, Linux, because they have different priorities.)

More features for web page generation systems doing URL remapping

By: cks
23 January 2025 at 04:08

A few years ago I wrote about how web page generation systems should support remapping external URLs (this includes systems that convert some form of wikitext to HTML). At the time I was mostly thinking about remapping single URLs and mentioned things like remapping prefixes (so you could remap an entire domain into web.archive.org) as something for a fancier version. Well, the world turns and things happen and I now think that such prefix remapping is essential; even if you don't start out with it, you're going to wind up with it in the longer term.

(To put it one way, the reality of modern life is that sometimes you no longer want to be associated with some places. And some day, my Fediverse presence may also move.)

In light of a couple of years of churn in my website landscape (after what was in hindsight a long period of stability), I now have revised views on the features I want in a (still theoretical) URL remapping system for Wandering Thoughts. The system I want should be able to remap individual URLs, entire prefixes, and perhaps regular expressions with full scale rewrites (or maybe some scheme with wildcard matching), although I don't currently have a use for full scale regular expression rewrites. As part of this, there needs to be some kind of priority or hierarchy between different remappings that can all potentially match the same URL, because there's definitely at least one case today where I want to remap 'asite/a/*' somewhere and all other 'asite/*' URLs to something else. While it's tempting to do something like 'most specific thing matches', working out what is most specific from a collection of different sorts of remapping rules seems a bit hard, so I'd probably just implement it as 'first match wins' and manage things by ordering matches in the configuration file.

('Most specific match wins' is a common feature in web application frameworks for various reasons, but I think it's harder to implement here, especially if I allow arbitrary regular expression matches.)
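
To make this concrete, here's a minimal sketch in Python of the sort of 'first match wins' remapping I have in mind. The rule format, names, and URLs are all made up for illustration; none of this is real code from anything.

import re

# Rules are checked in order and the first match wins, so priority is
# managed simply by how you order them in the configuration.
RULES = [
    ("exact",  "https://asite.example/a/one-page",
     "https://elsewhere.example/that-page"),
    ("prefix", "https://asite.example/a/", "https://newhome.example/a/"),
    ("prefix", "https://asite.example/",
     "https://web.archive.org/web/2023/https://asite.example/"),
    ("regexp", r"^https://old\.example/(\d+)\.html$",
     r"https://new.example/posts/\1"),
]

def remap(url):
    for kind, match, target in RULES:
        if kind == "exact" and url == match:
            return target
        if kind == "prefix" and url.startswith(match):
            return target + url[len(match):]
        if kind == "regexp":
            new, count = re.subn(match, target, url)
            if count:
                return new
    return url   # no rule matched, so leave the URL alone

print(remap("https://asite.example/a/some-page"))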

Obviously the remapping configuration file should support comments (every configuration system needs to). Less obviously, I'd support file inclusion or the now common pattern of a '<whatever>.d' directory for drop in files, so that remapping rules can be split up by things like the original domain rather than having to all be dumped into an ever-growing single configuration file.

(Since more and more links rot as time passes, we can pretty much guarantee that the number of our remappings is going to keep growing.)

Along with the remapping, I may want something (ie, a tiny web application) that dynamically generates some form of 'we don't know where you can find this now but here is what the URL used to be' page for any URL I feed it. The obvious general reason for this is that sometimes old domain names get taken over by malicious parties and the old content is nowhere to be found, not even on web.archive.org. In that case you don't want to keep a link to what's now a malicious site, but you also don't have any other valid target for your old link. You could rewrite the link to some invalid domain name and leave it to the person visiting you and following the link to work out what happened, but it's better to be friendly.

(This is where you want to be careful about XSS and other hazards of operating what is basically an open 'put text in and we generate a HTML page with it shown in some way' service.)

A change in the handling of PYTHONPATH between Python 3.10 and 3.12

By: cks
22 January 2025 at 03:40

Our long time custom for installing Django for our Django based web application was to install it with 'python3 setup.py install --prefix /some/where', and then set a PYTHONPATH environment variable that pointed to /some/where/lib/python<ver>/site-packages. Up through at least Python 3.10 (in Ubuntu 22.04), you could start Python 3 and then successfully do 'import django' with this; in fact, it worked on different Python versions if you were pointing at the same directory tree (in our case, this directory tree lives on our NFS fileservers). In our Ubuntu 24.04 version of Python 3.12 (which also has the Ubuntu packaged setuptools installed), this no longer works, which is inconvenient to us.

(It also doesn't seem to work in Fedora 40's 3.12.8, so this probably isn't something that Ubuntu 24.04 broke by using an old version of Python 3.12, unlike last time.)

The installed site-packages directory contains a number of '<package>.egg' directories, a site.py file that I believe is generic, and an easy-install.pth that lists the .egg directories. In Python 3.10, strace says that Python 3 opens site.py and then easy-install.pth during startup, and then in a running interpreter, 'sys.path' contains the .egg directories. In Python 3.12, none of this happens, although CPython does appear to look at the overall 'site-packages' directory and 'sys.path' contains it, as you'd expect. Manually adding the .egg directories to a 3.12 sys.path appears to let 'import django' work, although I don't know if everything is working correctly.
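
To illustrate what I mean by manually adding the .egg directories, here's a hedged sketch; the prefix is this entry's example path rather than a real one, and as I said, I only know that 'import django' then appears to work, not that everything does.

import sys, glob

# roughly what easy-install.pth used to arrange for us
site_packages = "/some/where/lib/python3.12/site-packages"
for egg in sorted(glob.glob(site_packages + "/*.egg")):
    if egg not in sys.path:
        sys.path.append(egg)

import django   # appears to work once the .egg directories are on sys.path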

I looked through the 3.11 and 3.12 "what's new" documentation (3.11, 3.12) but couldn't find anything obvious. I suspect that this is related to the removal of distutils in 3.12, but I don't know enough to say for sure.

(Also, if I use our usual Django install process, the Ubuntu 24.04 Python 3.12 installs Django in a completely different directory setup than in 3.10; it now winds up in <top level>/local/lib/python3.12/dist-packages. Using 'pip install --prefix ...' does create something where pointing PYTHONPATH at the 'dist-packages' subdirectory appears to work. There's also 'pip install --target', which I'd forgotten about until I stumbled over my old entry.)

All of this makes it even more obvious to me than before that the Python developers expect everyone to use venvs and anything else is probably going to be less and less well supported in the future. Installing system-wide is probably always going to work, and most likely also 'pip install --user', but I'm not going to hold my breath for anything else.

(On Ubuntu 24.04, obviously we'll have to move to a venv based Django installation. Fortunately you can use venvs with programs that are outside the venv.)

The (potential) complexity of good runqueue latency measurement in Linux

By: cks
21 January 2025 at 04:16

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this was different from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the time that multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

Sometimes print-based debugging is your only choice

By: cks
20 January 2025 at 04:20

Recently I had to investigate a mysterious issue in our Django based Python web application. This issue happened only when the application was actually running as part of the web server (using mod_wsgi, which effectively runs as an Apache process). The only particularly feasible way to dig into what was going on was everyone's stand-by, print based debugging (because I could print into Apache's error log; I could have used any form of logging that would surface the information). Even if I might have somehow been able to attach a debugger to things to debug a HTTP request in flight, using print based debugging was a lot easier and faster in practice.
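
In this case 'print' mostly meant writing to standard error, since with mod_wsgi that output normally winds up in Apache's error log. A hedged sketch of the kind of quick and dirty helper I mean (the name and format are just made up):

import sys, time

def debug(msg):
    # flush so the message shows up in Apache's error log right away
    print("DEBUG [%s]: %s" % (time.strftime("%H:%M:%S"), msg),
          file=sys.stderr, flush=True)

debug("reached the form rendering code")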

I'm a long time fan of print based debugging. Sometimes this is because print based debugging is easier if you only dip into a language every so often, but that points to a deeper issue, which is that almost every environment can print or log. Print or log based 'debugging' is an almost universal way to extract information from a system, and sometimes you have no other practical way to do that.

(The low level programming people sometimes can't even print things out, but there are other very basic ways to communicate things.)

As in my example, one of the general cases where you have very little access other than logs is when your issue only shows up in some sort of isolated or encapsulated environment (a 'production' environment). We have a lot of ways of isolating things these days, things like daemon processes, containers, 'cattle' (virtual) servers, and so on, but they all share the common trait that they deliberately detach themselves away from you. There are good reasons for this (which often can be boiled down to wanting to run in a controlled and repeatable environment), but it has its downsides.

Should print based debugging be the first thing you reach for? Maybe not; some sorts of bugs cause me to reach for a debugger, and in general if you're a regular user of your chosen debugger you can probably get a lot of information with it quite easily, easier than sprinkling print statements all over. But I think that you probably should build up some print debugging capabilities, because sooner or later you'll probably need them.

Some ways to restrict who can log in via OpenSSH and how they authenticate

By: cks
19 January 2025 at 04:20

In yesterday's entry on allowing password authentication from the Internet for SSH, I mentioned that there were ways to restrict who this was enabled for or who could log in through SSH. Today I want to cover some of them, using settings in /etc/ssh/sshd_config.

The simplest way is to globally restrict logins with AllowUsers, listing only specific accounts you want to be accessed over SSH. If there are too many such accounts or they change too often, you can switch to AllowGroups and allow only people in a specific group that you maintain, call it 'sshlogins'.

If you want to allow logins generally but restrict, say, password based authentication to only people that you expect, what you want is a Match block and setting AuthenticationMethods within it. You would set it up something like this:

AuthenticationMethods publickey
Match User cks
  AuthenticationMethods any

If you want to be able to log in using a password from your local networks but not remotely, you could extend this with an additional Match directive that looks at the origin IP address:

Match Address 127.0.0.0/8,<your networks here>
  AuthenticationMethods any

In general, Match directives are your tool for doing relatively complex restrictions. You could, for example, arrange that accounts in a certain Unix group can only log in from the local network, never remotely. Or reverse this so that only logins in some Unix group can log in remotely, and everyone else is only allowed to use SSH within the local network.

However, any time you're doing complex things with Match blocks, you should make sure to test your configuration to make sure it's working the way you want. OpenSSH's sshd_config is a configuration file with some additional capabilities, not a programming language, and there are undoubtedly some subtle interactions and traps you can fall into.

(This is one reason I'm not giving a lot of examples here; I'd have to carefully test them.)

Sidebar: Restricting root logins via OpenSSH

If you permit root logins via OpenSSH at all, one fun thing to do is to restrict where you'll accept them from:

PermitRootLogin no
Match Address 127.0.0.0/8,<your networks here>
  PermitRootLogin prohibit-password
  # or 'yes' for some places

A lot of Internet SSH probers direct most of their effort against the root account. With this setting you're assured that all of them will fail no matter what.

(This has come up before but I feel like repeating it.)

Thoughts on having SSH allow password authentication from the Internet

By: cks
18 January 2025 at 03:42

On the Fediverse, I recently saw a poll about whether people left SSH generally accessible on its normal port or if they moved it; one of the replies was that the person left SSH on the normal port but disallowed password based authentication and only allowed public key authentication. This almost led to me posting a hot take, but then I decided that things were a bit more nuanced than my first reaction.

As everyone with an Internet-exposed SSH daemon knows, attackers are constantly attempting password guesses against various accounts. But if you're using a strong password, the odds of an attacker guessing it are extremely low, since doing 'password cracking via SSH' has an extremely low guesses per second number (enforced by your SSH daemon). In this sense, not accepting passwords over the Internet is at most a tiny practical increase in security (with some potential downsides in unusual situations).

Not accepting passwords from the Internet protects you against three other risks, two relatively obvious and one subtle one. First, it stops an attacker that can steal and then crack your encrypted passwords; this risk should be very low if you use strong passwords. Second, you're not exposed if your SSH server turns out to have a general vulnerability in password authentication that can be remotely exploited before a successful authentication. This might not be an authentication bypass; it might be some sort of corruption that leads to memory leaks, code execution, or the like. In practice, (OpenSSH) password authentication is a complex piece of code that interacts with things like your system's random set of PAM modules.

The third risk is that some piece of software will create a generic account with a predictable login name and known default password. These seem to be not uncommon, based on the fact that attackers probe incessantly for them, checking login names like 'ubuntu', 'debian', 'admin', 'testftp', 'mongodb', 'gitlab', and so on. Of course software shouldn't do this, but if something does, not allowing password authenticated SSH from the Internet will block access to these bad accounts. You can mitigate this risk by only accepting password authentication for specific, known accounts, for example only your own account.

The potential downside of only accepting keypair authentication for access to your account is that you might need to log in to your account in a situation where you don't have your keypair available (or can't use it). This is something that I probably care about more than most people, because as a system administrator I want to be able to log in to my desktop even in quite unusual situations. As long as I can use password authentication, I can use anything trustworthy that has a keyboard. Most people probably will only log in to their desktops (or servers) from other machines that they own and control, like laptops, tablets, or phones.

(You can opt to completely disallow password authentication from all other machines, even local ones. This is an even stronger and potentially more limiting restriction, since now you can't even log in from another one of your machines unless that machine has a suitable keypair set up. As a sysadmin, I'd never do that on my work desktop, since I very much want to be able to log in to my regular account from the console of one of our servers if I need to.)

Some stuff about how Apache's mod_wsgi runs your Python apps (as of 5.0)

By: cks
17 January 2025 at 04:13

We use mod_wsgi to host our Django application, but if I understood the various mod_wsgi settings for how to run your Python WSGI application when I originally set it up, I've forgotten it all since then. Due to recent events, exactly how mod_wsgi runs our application and what we can control about that is now quite relevant, so I spent some time looking into things and trying to understand the settings. Now it's time to write all of this down before I forget it (again).

Mod_wsgi can run your WSGI application in two modes, as covered in the quick configuration guide part of its documentation: embedded mode, which runs a Python interpreter inside a regular Apache process, and daemon mode, where one or more Apache processes are taken over by mod_wsgi and used exclusively to run WSGI applications. Normally you want to use daemon mode, and you have to use daemon mode if you want to do things like run your WSGI application as a Unix user other than the web server's normal user or use packages installed into a Python virtual environment.

(Running as a separate Unix user puts some barriers between your application's data and a general vulnerability that gives the attacker read and/or write access to anything the web server has access to.)

To use daemon mode, you need to configure one or more daemon processes with WSGIDaemonProcess. If you're putting packages (such as Django) into a virtual environment, you give an appropriate 'python-home=' setting here. Your application itself doesn't have to be in this venv. If your application lives outside your venv, you will probably want to set either or both of 'home=' and 'python-path=' to, for example, its root directory (especially if it's a Django application). The corollary to this is that any WSGI application that uses a different virtual environment, or 'home' (starting current directory), or Python path needs to be in a different daemon process group. Everything that uses the same process group shares all of those.
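
As a hedged illustration (the names, paths, and numbers here are all made up, and there are plenty of other WSGIDaemonProcess options I'm not showing), a daemon process group for an application whose packages live in a venv might be defined like this:

WSGIDaemonProcess exampleapp python-home=/srv/venv/exampleapp \
    home=/srv/apps/exampleapp python-path=/srv/apps/exampleapp \
    user=appuser group=appgroup processes=2 threads=15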

To associate a WSGI application or a group of them with a particular daemon process, you use WSGIProcessGroup. In simple configurations you'll have WSGIDaemonProcess and WSGIProcessGroup right next to each other, because you're defining a daemon process group and then immediately specifying that it's used for your application.

Within a daemon process, WSGI applications can run in either the main Python interpreter or a sub-interpreter (assuming that you don't have sub-interpreter specific problems). If you don't set any special configuration directive, each WSGI application will run in its own sub-interpreter and the main interpreter will be unused. To change this, you need to set something for WSGIApplicationGroup, for instance 'WSGIApplicationGroup %{GLOBAL}' to run your WSGI application in the main interpreter.

Some WSGI applications can cohabit with each other in the same interpreter (where they will potentially share various bits of global state). Other WSGI applications are one to an interpreter, and apparently Django is one of them. If you need your WSGI application to have its own interpreter, there are two ways to achieve this; you can either give it a sub-interpreter within a shared daemon process, or you can give it its own daemon process and have it use the main interpreter in that process. If you need different virtual environments for each of your WSGI applications (or different Unix users), then you'll have to use different daemon processes and you might as well have everything run in their respective main interpreters.

(After recent experiences, my feeling is that processes are probably cheap and sub-interpreters are a somewhat dark corner of Python that you're probably better off avoiding unless you have a strong reason to use them.)

You normally specify your WSGI application to run (and what URL it's on) with WSGIScriptAlias. WSGIScriptAlias normally infers both the daemon process group and the (sub-interpreter) 'application group' from its context, but you can explicitly set either or both. As the documentation notes (now that I'm reading it):

If both process-group and application-group options are set, the WSGI script file will be pre-loaded when the process it is to run in is started, rather than being lazily loaded on the first request.

I'm tempted to deliberately set these to their inferred values simply so that we don't get any sort of initial load delay the first time someone hits one of the exposed URLs of our little application.

For our Django application, we wind up with a collection of directives like this (in its virtual host):

WSGIDaemonProcess accounts ....
WSGIProcessGroup accounts
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias ...

(This also needs a <Directory> block to allow access to the Unix directory that the WSGIScriptAlias 'wsgi.py' file is in.)

If we added another Django application in the same virtual host, I believe that the simple update to this would be to add:

WSGIDaemonProcess secondapp ...
WSGIScriptAlias ... process-group=secondapp application-group=%{GLOBAL}

(Plus the <Directory> permissions stuff.)

Otherwise we'd have to mess around with setting the WSGIProcessGroup and WSGIApplicationGroup on a per-directory basis for at least the new application. If we specify them directly in WSGIScriptAlias we can skip that hassle.

(We didn't used to put Django in a venv, but as of Ubuntu 24.04, using a venv seems the easiest way to get a particular Django version into some spot where you can use it. Our Django application doesn't live inside the venv, but we need to point mod_wsgi at the venv so that our application can do 'import django.<...>' and have it work. Multiple Django applications could all share the venv, although they'd have to use different WSGIDaemonProcess settings, or at least different names with the same other settings.)

(Multiple) inheritance in Python and implicit APIs

By: cks
16 January 2025 at 04:16

The ultimate cause of our mystery with Django on Ubuntu 24.04 is that versions of Python 3.12 before 3.12.5 have a bug where builtin types in sub-interpreters get unexpected additional slot wrappers (also), and Ubuntu 24.04 has 3.12.3. Under normal circumstances, 'list' itself doesn't have a '__str__' method but instead inherits it from 'object', so if you have a class that inherits from '(list,YourClass)' and YourClass defines a __str__, the YourClass.__str__ is what gets used. In a sub-interpreter, there is a list.__str__ and suddenly YourClass.__str__ isn't used any more.

(mod_wsgi triggers this issue because in a straightforward configuration, it runs everything in sub-interpreters.)

This was an interesting bug, and one of the things it made me realize is that the absence of a __str__ method on 'list' itself had implicitly become part of list's API. Django had set up class definitions that were 'class Something(..., list, AMixin)', where the 'AMixin' had a direct __str__ method, and Django expected that to work. This only works as long as 'list' doesn't have its own __str__ method and instead gets it through inheritance from object.__str__. Adding such a method to 'list' would break Django and anyone else counting on this behavior, making the lack of the method an implicit API.

(You can get this behavior with more or less any method that people might want to override in such a mixin class, but Python's special methods are probably especially prone to it.)
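
To make the shape of this concrete, here's a stripped down illustration; these are made up classes, not Django's actual ones.

class AMixin:
    def __str__(self):
        return "rendered by AMixin"

class Something(list, AMixin):
    pass

s = Something([1, 2, 3])
# list normally has no __str__ of its own, so the lookup goes
# Something -> list -> AMixin and finds AMixin.__str__:
print(str(s))                  # "rendered by AMixin"
# If list grows its own __str__ (as in the affected sub-interpreters),
# it wins instead, because list comes before AMixin in the MRO:
print([c.__name__ for c in Something.__mro__])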

Before I ran into this issue, I probably would have assumed that where in the class tree a special method like __str__ was implemented was simply an implementation detail, not something that was visible as part of a class's API. Obviously, I would have been wrong. In Python, you can tell the difference and quite easily write code that depends on it, code that was presumably natural to experienced Python programmers.

(Possibly the existence of this implicit API was obvious to experienced Python programmers, along with the implication that various builtin types that currently don't have their own __str__ can't be given one in the future.)

My bug reports are mostly done for work these days

By: cks
15 January 2025 at 03:33

These days, I almost entirely report bugs in open source software as part of my work. A significant part of this is that most of what I stumble over bugs in are things that work uses (such as Ubuntu or OpenBSD), or at least things that I mostly use as part of work. There are some consequences of this that I feel like noting today.

The first is that I do bug investigation and bug reporting on work time during work hours, and I don't work on "work bugs" outside of that, on evenings, weekends, and holidays. This sometimes meshes awkwardly with the time open source projects have available for dealing with bugs (which is often in people's personal time outside of work hours), so sometimes I will reply to things and do additional followup investigation out of hours to keep a bug report moving along, but I mostly avoid it. Certainly the initial investigation and filing of a work bug is a working hours activity.

(I'm not always successful in keeping it to that because there is always the temptation to spend a few more minutes digging a bit more into the problem. This is especially acute when working from home.)

The second thing is that bug filing work is merely one of the claims on my work time. I have a finite amount of work time and a variety of things to get done with varying urgency, and filing and updating bugs is not always the top of the list. And just like other work activity, filing a particular bug has to convince me that it's worth spending some of my limited work time on this particular activity. Work does not pay me to file bugs and make open source better; they pay me to make our stuff work. Sometimes filing a bug is a good way to do this but some of the time it's not, for example because the organization in question doesn't respond to most bug reports.

(Even when it's useful in general to file a bug report because it will result in the issue being fixed at some point in the future, we generally need to deal with the problem today, so filing the bug report may take a back seat to things like developing workarounds.)

Another consequence is that it's much easier for me to make informal Fediverse posts about bugs (often as I discover more and more disconcerting things) or write Wandering Thoughts posts about work bugs than it is to make an actual bug report. Writing for Wandering Thoughts is a personal thing that I do outside of work hours, although I write about stuff from work (and I can often use something to write about, so interesting work bugs are good grist).

(There is also that making bug reports is not necessarily pleasant, and making bad bug reports can be bad. This interacts unpleasantly with the open source valorization of public work. To be blunt, I'm more willing to do unpleasant things when work is paying me than when it's not, although often the bug reports that are unpleasant to make are also the ones that aren't very useful to make.)

PS: All of this leads to a surprisingly common pattern where I'll spend much of a work day running down a bug to the point where I feel I understand it reasonably well, come home after work, write the bug up as a Wandering Thoughts entry (often clarifying my understanding of the bug in the process), and then file a bug report at work the next work day.

A mystery with Django under Apache's mod_wsgi on Ubuntu 24.04

By: cks
14 January 2025 at 04:10

We have a long standing Django web application that these days runs under Python 3 and a more modern version of Django. For as long as it has existed, it's had some forms that were rendered to HTML through templates, and it has rendered errors in those forms in what I think of as the standard way:

{{ form.non_field_errors }}
{% for field in form %}
  [...]
  {{ field.errors }}
  [...]
{% endfor %}

This web application runs in Apache using mod_wsgi, and I've recently been working on moving the host this web application runs on to Ubuntu 24.04 (still using mod_wsgi). When I stood up a test virtual machine and looked at some of these HTML forms, what I saw was that when there were no errors, each place that errors would be reported was '[]' instead of blank. This did not happen if I ran the web application on the same test machine in Django's 'runserver' development testing mode.

At first I thought that this was something to do with locales, but the underlying cause is much more bizarre and inexplicable to me. The template operation for form.non_field_errors results in calling Form.non_field_errors(), which returns a django.forms.utils.ErrorList object (which is also what field.errors winds up being). This class is a multiple-inheritance subclass of UserList, list, and django.forms.utils.RenderableErrorMixin. The latter is itself a subclass of django.forms.utils.RenderableMixin, which defines its __str__() special method to be RenderableMixin.render(), which renders the error list properly, including rendering it as a blank if the error list is empty.

In every environment except under Ubuntu 24.04's mod_wsgi, ErrorList.__str__ is RenderableMixin.render and everything works right for things like 'form.non_field_errors' and 'field.errors'. When running under Ubuntu 24.04's mod_wsgi, and only then, ErrorList.__str__ is actually the standard list.__str__, so empty lists render as '[]' (and had I tried to render any forms with actual error reports, worse probably would have happened, especially since list.__str__ isn't carefully escaping special HTML characters).

I have no idea why this is happening in the 24.04 mod_wsgi. As far as I can tell, the method resolution order (MRO) for ErrorList is the same under mod_wsgi as outside it, and sys.path is the same. The RenderableErrorMixin class is getting included as a parent of ErrorList, which I can tell because RenderableMixin also provides a __html__ definition, and ErrorList.__html__ exists and is correct.
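
If you want to check this sort of thing yourself, here's a hedged diagnostic sketch. It assumes a reasonably recent Django, and that you run it (or log its output) inside the environment you actually care about, for example from within the WSGI application under mod_wsgi.

from django.forms.utils import ErrorList, RenderableMixin

# In a working environment the first line prints True; in our broken
# mod_wsgi case, __str__ resolved to list's slot wrapper instead.
print(ErrorList.__str__ is RenderableMixin.render)
print(ErrorList.__str__)
print([c.__name__ for c in ErrorList.__mro__])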

The workaround for this specific situation is to explicitly render errors to some format instead of counting on the defaults; I picked .as_ul(), because this is what we've normally gotten so far. However the whole thing makes me nervous since I don't understand what's special about the Ubuntu 24.04 mod_wsgi and who knows if other parts of Django are affected by this.

(The current Django and mod_wsgi setup is running from a venv, so it should also be fully isolated from any Ubuntu 24.04 system Python packages.)

(This elaborates on a grumpy Fediverse post of mine.)

The history and use of /etc/glob in early Unixes

By: cks
13 January 2025 at 04:41

One of the innovations that the V7 Bourne shell introduced was built in shell wildcard globbing, which is to say expanding things like *, ?, and so on. Of course Unix had shell wildcards well before V7, but in V6 and earlier, the shell didn't implement globbing itself; instead this was delegated to an external program, /etc/glob (this affects things like looking into the history of Unix shell wildcards, because you have to know to look at the glob source, not the shell).

As covered in places like the V6 glob(8) manual page, the glob program was passed a command and its arguments (already split up by the shell), and went through the arguments to expand any wildcards it found, then exec()'d the command with the now expanded arguments. The shell operated by scanning all of the arguments for (unescaped) wildcard characters. If any were found, the shell exec'd /etc/glob with the whole show; otherwise, it directly exec()'d the command with its arguments. Quoting wildcards used a hack that will be discussed later.

This basic /etc/glob behavior goes all the way back to Unix V1, where we have sh.s and in it we can see that invocation of /etc/glob. In V2, glob is one of the programs that have been rewritten in C (glob.c), and in V3 we have a sh.1 that mentions /etc/glob and has an interesting BUGS note about it:

If any argument contains a quoted "*", "?", or "[", then all instances of these characters must be quoted. This is because sh calls the glob routine whenever an unquoted "*", "?", or "[" is noticed; the fact that other instances of these characters occurred quoted is not noticed by glob.

This section has disappeared in the V4 sh.1 manual page, which suggests that the V4 shell and /etc/glob had acquired the hack they use in V5 and V6 to avoid this particular problem.

How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell (when it's deciding whether or not it needs to run /etc/glob) or by /etc/glob itself (when it's deciding what to expand). However, obviously neither the shell nor /etc/glob can pass such 'marked as quoted' characters to actual commands, so each of them strips the high bit from all characters before exec()'ing actual commands.

(This is clearer in the V5 glob.c source; look for how cat() ands every character with octal 0177 (0x7f) to drop the high bit. You can also see it in the V5 sh.c source, where you want to look at trim(), and also the #define for 'quote' at the start of sh.c and how it's used later.)
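
As a toy illustration of the 8th bit trick (in Python rather than the real assembly and C, and obviously not the actual code):

def quote(ch):
    # a quoted character gets the 8th bit set, so it no longer compares
    # equal to the unquoted wildcard character
    return chr(ord(ch) | 0o200)

def trim(arg):
    # strip the 8th bit from everything before exec()'ing the command
    return "".join(chr(ord(c) & 0o177) for c in arg)

qstar = quote("*")
print(qstar == "*")        # False: neither sh nor glob sees a wildcard here
print(trim("a" + qstar))   # "a*": the command gets the real character back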

PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

PPS: See also Stephen R. Bourne's 2015 presentation from BSDCan [PDF], which has a bunch of interesting things on the V7 shell and confirms that /etc/glob was there from V1.

IMAP clients can vary in their reactions to IMAP errors

By: cks
12 January 2025 at 03:55

For reasons outside of the scope of this entry, we recently modified our IMAP server so that it would only return 20,000 results from an IMAP LIST command (technically 20,001 results). In our environment, an IMAP LIST operation only generates this many results when one of the people who can hit this limit has run into our IMAP server backward compatibility problem. When we made this change, we had a choice for what would happen when the limit was hit, and specifically we had a choice of whether to claim that the IMAP LIST operation had succeeded or had failed. In the end we decided it was better to report that the IMAP LIST operation had failed, which also allowed us to include a text message explaining what had happened (in IMAP these are relatively free form).

(The specifics of the situation are that the IMAP LIST command will report a stream of IMAP folders back to the client and then end the stream after 20,001 entries, with either an 'ok' result or an error result with text. So in the latter case, the IMAP client gets 20,001 folder entries and an error at the end.)

Unsurprisingly, after deploying this change we've seen that IMAP clients (both mail readers and things like server webmail code) vary in their behavior when this limit is hit. The behavior we'd like to see is that the client considers itself to have a partial result and uses it as much as possible, while also telling the person using it that something went wrong. I'm not sure any IMAP client actually does this. One webmail system that we use reports the entire output from the IMAP LIST command as an 'error' (or tries to); since the error message is the last part of the output, this means it's never visible. One mail client appears to throw away all of the LIST results and not report an error to the person using it, which in practice means that all of your folders disappear (apart from your inbox).

(Other mail clients appear to ignore the error and probably show the partial results they've received.)

Since the IMAP server streams the folder list from IMAP LIST to the client as it traverses the folders (ie, Unix directories), we don't immediately know if there are going to be too many results; we only find that out after we've already reported those 20,000 folders. But in hindsight, what we could have done is reported a final synthetic folder with a prominent explanatory name and then claimed that the command succeeded (and stopped). In practice this seems more likely to show something to the person using the mail client, since actually reporting the error text we provide is apparently not anywhere near as common as we might hope.

The problem with combining DNS CNAME records and anything else

By: cks
11 January 2025 at 03:55

A famous issue when setting up DNS records for domains is that you can't combine a CNAME record with any other type, such as a MX record or a SOA (which is required at the top level of a domain). One modern reason that you would want such a CNAME record is that you're hosting your domain's web site at some provider and the provider wants to be able to change what IP addresses it uses for this, so from the provider's perspective they want you to CNAME your 'web site' name to 'something.provider.com'.
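
In zone file terms, the combination that isn't allowed looks like this (an illustrative snippet with made up names; DNS servers will refuse to load or serve it):

; not allowed: a CNAME plus any other record type at the same name
www.example.org.   IN CNAME   something.provider.example.
www.example.org.   IN MX 10   mail.example.org.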

The obvious reason for 'no CNAME and anything else' is 'because the RFCs say so', but this is unsatisfying. Recently I wondered why the RFCs couldn't have said that when a CNAME is combined with other records, you return the other records when asked for them but provide the CNAME otherwise (or maybe you return the CNAME only when asked for the IP address if there are other records). But when I thought about it more, I realized the answer, the short version of which is caching resolvers.

If you're the authoritative DNS server for a zone, you know for sure what DNS records are and aren't present. This means that if someone asks you for an MX record and the zone has a CNAME, a SOA, and an MX, you can give them the MX record, and if someone asks for the A record, you can give them the CNAME, and everything works fine. But a DNS server that is a caching resolver doesn't have this full knowledge of the zone; it only knows what's in its cache. If such a DNS server has a CNAME for a domain in its cache (perhaps because someone asked for the A record) and it's now asked for the MX records of that domain, what is it supposed to do? The correct answer could be either the CNAME record the DNS server has or the MX records it would have to query an authoritative server for. At a minimum combining CNAME plus other records this way would require caching resolvers to query the upstream DNS server and then remember that they got a CNAME answer for a specific query.

In theory this could have been written into DNS originally, at the cost of complicating caching DNS servers and causing them to make more queries to upstream DNS servers (which is to say, making their caching less effective). Once DNS existed with the CNAME behavior such that caching DNS resolvers could cache CNAME responses and serve them, the CNAME behavior was fixed.

(This is probably obvious to experienced DNS people, but since I had to work it out in my head I'm going to write it down.)

Sidebar: The pseudo-CNAME behavior offered by some DNS providers

Some DNS providers and DNS servers offer an 'ANAME' or 'ALIAS' record type. This isn't really a DNS record; instead it's a processing instruction to the provider's DNS software that it should look up the A and AAAA records of the target name and insert them into your zone in place of the ANAME/ALIAS record (and redo the lookup every so often in case the target name's IP addresses change). In theory any changes in the A or AAAA records should trigger a change in the zone serial number; in practice I don't know if providers actually do this.

(If your DNS provider doesn't have ANAME/ALIAS 'records' but does have an API, you can build this functionality yourself.)

Realizing why Go reflection restricts what struct fields can be modified

By: cks
10 January 2025 at 04:19

Recently I read Rust, reflection and access rules. Among other things, it describes how a hypothetical Rust reflection system couldn't safely allow access to private fields of things, and especially how it couldn't allow code to set them through reflection. My short paraphrase of the article's discussion is that in Rust, private fields can be in use as part of invariants that allow unsafe operations to be done safely through suitable public APIs. This brought into clarity what had previously been a somewhat odd seeming restriction in Go's reflect package.

Famously (for people who've dabbled in reflect), you can only set exported struct fields. This is covered in both the Value.CanSet() package documentation and The Laws of Reflection (in passing). Since one of the uses of reflection is for going between JSON and structs, encoding/json only works on exported struct fields and you'll find a lot of such fields in lots of code. This requirement can be a bit annoying. Wouldn't it be nice if you didn't have to make your fields public just to serialize them easily?

(You can use encoding/json and still serialize non-exported struct fields, but you have to write some custom methods instead of just marking struct fields the way you could if they were exported.)

Go has this reflect restriction, presumably, for the same reason that reflection in Rust wouldn't be able to modify private fields. Since private fields in a Go struct may be used by functions and methods in the package to properly manage the struct, modifying those fields yourself is unsafe (in the general sense). The reflect package will let you see the fields (and their values) but not change their values. You're allowed to change exported fields because (in theory) arbitrary Go code can already change the value of those fields, and so code in the struct's package can't count on them having any particular value. It can at least sort of count on private fields having approved values (or the zero value, I believe).

(I understand why the reflect documentation doesn't explain the logic of not being able to modify private fields, since package documentation isn't necessarily the right place for a rationale. Also, perhaps it was considered obvious.)

Using tcpdump to see only incoming or outgoing traffic

By: cks
9 January 2025 at 03:13

In the normal course of events, implementations of 'tcpdump' report on packets going in both directions, which is to say they report both packets received and packets sent. Normally this isn't confusing and you can readily tell one from the other, but sometimes situations aren't normal and you want to see only incoming packets or only outgoing packets (this has come up before). Modern versions of tcpdump can do this, but you have to know where to look.

If you're monitoring regular network interfaces on Linux, FreeBSD, or OpenBSD, this behavior is controlled by a tcpdump command line switch. On modern Linux and on FreeBSD, this is '-Q in' or '-Q out', as covered in the Linux manpage and the FreeBSD manpage. On OpenBSD, you use a different command line switch, '-D in' or '-D out', per the OpenBSD manpage.

(The Linux and FreeBSD tcpdump use '-D' to mean 'list all interfaces'.)

There are network types where the in or out direction can be matched by tcpdump pcap filter rules, but plain Ethernet is not one of them. This implies that you can't write a pcap filter rule that matches some packets only inbound and some packets only outbound at the same time; instead you have to run two tcpdumps.
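
For example, to watch each direction of (say) SMTP traffic on Linux or FreeBSD, you'd run something like the following two commands, one per terminal (the interface name and filter are just illustrative):

tcpdump -n -Q in  -i eth0 'tcp port 25'
tcpdump -n -Q out -i eth0 'tcp port 25'

(On OpenBSD, substitute '-D in' and '-D out'.)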

If you have a (software) bridge interface or bridged collection of interfaces, as far as I know on both OpenBSD and FreeBSD the 'in' and 'out' directions on the underlying physical interfaces work the way you expect. Which is to say, if you have ix0 and ix1 bridged together as bridge0, 'tcpdump -Q in -i ix0' shows packets that ix0 is receiving from the physical network and doesn't include packets forwarded out through ix0 by the bridge interface (which in some sense you could say are 'sent' to ix0 by the bridge).

The PF packet filter system on both OpenBSD and FreeBSD can log packets to a special network interface, normally 'pflog0'. When you tcpdump this interface, both OpenBSD and FreeBSD accept an 'on <interface>' (which these days is a synonym for 'ifname <interface>') clause in pcap filters, which I believe means that the packet was received on the specific interface (per my entry on various filtering options for OpenBSD). Both also have 'inbound' and 'outbound', which I believe match based on whether the particular PF rule that caused them to match was an 'in' or an 'out' rule.

(See the OpenBSD pcap-filter and the FreeBSD pcap-filter manual pages.)

What a FreeBSD kernel message about your bridge means

By: cks
8 January 2025 at 03:58

Suppose, not hypothetically, that you're operating a FreeBSD based bridging firewall (or some other bridge situation) and you see something like the following kernel message:

kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix0 to ix1
kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix1 to ix0

The bad news is that this message means what you think it means. Your FreeBSD bridge between ix0 and ix1 first saw this MAC address as the source address on a packet it received on the ix0 interface of the bridge, and then it saw the same MAC address as the source address of a packet received on ix1, and then it received another packet on ix0 with that MAC address as the source address. Either you have something echoing those packets back on one side, or there is a network path between the two sides that bypasses your bridge.

(If you're lucky this happens regularly. If you're not lucky it happens only some of the time.)

This particular message comes from bridge_rtupdate() in sys/net/if_bridge.c, which is called to update the bridge's 'routing entries', which here means MAC addresses, not IP addresses. This function is called from bridge_forward(), which forwards packets, which is itself called from bridge_input(), which handles received packets. All of this only happens if the underlying interfaces are in 'learning' mode, but this is the default.

As covered in the ifconfig manual page, you can inspect what MAC addresses have been learned on which device with 'ifconfig bridge0 addr' (covered in the 'Bridge Interface Parameters' section of the manual page). This may be useful to see if your bridge normally has a certain MAC address (perhaps the one that's moving) on the interface it should be on. If you want to go further, it's possible to set a static mapping for some MAC addresses, which will make them stick to one interface even if seen on another one.
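
Concretely, and as a hedged sketch using this entry's example names, that looks something like:

# show the bridge's learned MAC address table
ifconfig bridge0 addr

# pin a MAC address to one member interface so it can't 'move'
ifconfig bridge0 static ix0 01:02:03:04:05:06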

Logging this message is controlled by the net.link.bridge.log_mac_flap sysctl, and it's rate limited to only being reported five times a second in general (using ppsratecheck()). That's five times total, even if each time is a different MAC address or even a different bridge. This 'five times a second' log count isn't controllable through a sysctl.

(I'm writing all of this down because I looked much of it up today. Sometimes I'm a system programmer who goes digging in the (FreeBSD) kernel source just to be sure.)

The issue with DNF 5 and script output in Fedora 41

By: cks
7 January 2025 at 04:45

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM and I can't see anything else that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks
6 January 2025 at 04:35

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)
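To make the 'work VPN' case concrete, here is a minimal sketch (with made up addresses, names, and key placeholder, not an actual configuration) of a peer whose AllowedIPs accepts everything while only specific subnets are actually routed over the tunnel:

  # In the peer's section of wg0.conf: accept (and permit sending)
  # anything over the tunnel.
  [Peer]
  PublicKey = <peer public key>
  Endpoint = vpn.example.org:51820
  AllowedIPs = 0.0.0.0/0

  # In the routing table: only send the work subnets to wg0, rather
  # than a default route.
  ip route add 10.10.0.0/16 dev wg0
  ip route add 172.16.5.0/24 dev wg0

(If you manage the interface with wg-quick, I believe setting 'Table = off' is how you tell it not to add routes from AllowedIPs itself.)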

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

There are different sorts of WireGuard setups with different difficulties

By: cks
5 January 2025 at 04:37

I've now set up WireGuard in a number of different ways, some of which were easy and some of which weren't. So here are my current views on WireGuard setups, starting with the easiest and going to the most challenging.

The easiest WireGuard setup is where the 'within WireGuard' internal IP address space is completely distinct from the outside space, with no overlap. This makes routing completely straightforward; internal IPs reachable over WireGuard aren't reachable in any other way, and external IPs aren't reachable over WireGuard. You can do this as a mesh or use the WireGuard 'router' pattern (or some mixture). If you allocate all internal IP addresses from the same network range, you can set a single route to your WireGuard interface and let AllowedIPs sort it out.

(An extreme version of this would be to configure the inside part of WireGuard with only link local IPv6 addresses, although this would probably be quite inconvenient in practice.)

A slightly more difficult setup is where some WireGuard endpoints are gateways to additional internal networks, networks that aren't otherwise reachable. This setup potentially requires more routing entries but it remains straightforward in that there's no conflict on how to route a given IP address.

The next most difficult setup is using different IP address types inside WireGuard than outside it, where the inside IP address type isn't otherwise usable for at least one of the ends. For example, you have an IPv4 only machine that you're giving a public IPv6 address through an IPv6 tunnel. This is still not too difficult because the inside IP addresses associated with each WireGuard peer aren't otherwise reachable, so you never have a recursive routing problem.

The most difficult type of WireGuard setup I've had to do so far is a true 'VPN' setup, where some or many of the WireGuard endpoints you're talking to are reachable both outside WireGuard and through WireGuard (or at least there are routes that try to send traffic to those IPs through WireGuard, such as a VPN 'route all traffic through my WireGuard link' default route). Since your system could plausibly recursively route your encrypted WireGuard traffic over WireGuard, you need some sort of additional setup to solve this. On Linux, this will often be done using a fwmark (also) and some policy based routing rules.
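For illustration, here is roughly the shape of the fwmark trick (a sketch along the lines of what wg-quick does for a 'route everything' configuration; the mark and table numbers here are arbitrary):

  # Mark WireGuard's own encrypted UDP traffic so it can be exempted:
  wg set wg0 fwmark 51820
  # Put the 'everything over WireGuard' default route in its own table:
  ip route add default dev wg0 table 51820
  # Route everything except WireGuard's own marked traffic via that table:
  ip rule add not fwmark 51820 table 51820
  # But still honor more specific routes in the main table, so the local
  # network keeps working:
  ip rule add table main suppress_prefixlength 0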

One of the reasons I find it useful to explicitly think about these different types of setups is to better know what to expect and what I'll need to do when I'm planning a new WireGuard environment. Either I will be prepared for what I'm going to have to do, or I may rethink my design in order to move it up the hierarchy, for example deciding that we can configure services to talk to special internal IPs (over WireGuard) so that we don't have to set up fwmark-based routing on everything.

(Some services built on top of WireGuard handle this for you, for example Tailscale, although Tailscale can have routing challenges of its own depending on your configuration.)

The programmable web browser was and is inevitable

By: cks
4 January 2025 at 03:40

In a comment on my entry on why the modern web is why web browsers can't have nice things, superkuh wrote in part:

In the past it was seen as crazy to open every executable file someone might send you over the internet (be it email, ftp, web, or whatever). But sometime in the 2010s it became not only acceptable, but standard practice to automatically run every executable sent to you by any random endpoint on the internet.

For 'every executable' you should read 'every piece of JavaScript', which is executable code that is run by your browser as a free and relatively unlimited service provided to every web page you visit. The dominant thing restraining the executables that web pages send you is the limited APIs that browsers provide, which is why they provide such limited APIs. This comment sparked a chain of thoughts that led to a thesis.

I believe that the programmable web browser was (and is) inevitable. I don't mean this just in the narrow sense that if it hadn't been JavaScript it would have been Flash or Java applets or Lua or WASM or some other relatively general purpose language that the browser would wind up providing. Instead, I mean it in a broad and general sense, because 'programmability' of the browser is driven by a general and real problem.

For almost as long as the web has existed, people have wanted to create web pages that had relatively complex features and interactions. They had excellent reasons for this; they wanted drop-down or fold-out menus to save screen space so that they could maximize the amount of space given to important stuff instead of navigation, and they wanted to interactively validate form contents before submission for fast feedback to the people filling them in, and so on. At the same time, browser developers didn't want to (and couldn't) program every single specific complex feature that web page authors wanted, complete with bespoke HTML markup for it and so on. To enable as many of these complex features as possible with as little work on their part as possible, browser developers created primitives that could be assembled together to create more sophisticated features, interactions, layouts, and so on.

When you have a collection of primitives that people are expected to use to create their specific features, interactions, and so on, you have a programming language and a programming environment. It doesn't really matter if this programming language is entirely declarative (and isn't necessarily Turing complete), as in the case of CSS; people have to program the web browser to get what they want.

So my view is that we were always going to wind up with at least one programming language in our web browsers, because a programming language is the meeting point between what web page authors want to have and what browser developers want to provide. The only question was (and is) how good of a programming language (or languages) we were going to get. Or perhaps an additional question was whether the people designing the 'programming language' were going to realize that they were doing so, or if they were going to create one through an accretion of features.

(My view is that CSS absolutely is a programming language in this sense, in that you must design and 'program' it in order to achieve the effects you want, especially if you want sophisticated ones like drop down menus. Modern CSS has thankfully moved beyond the days when I called it an assembly language.)

(This elaborates on a Fediverse post.)

Rejecting email at SMTP time based on the From: header address

By: cks
3 January 2025 at 04:14

Once upon a time (a long time ago), filtering and rejecting email based on the SMTP envelope sender (the SMTP MAIL FROM) was a generally sufficient mechanism to deal with many repeat spam sources. It didn't deal with all of them, but many used their own domain in the envelope sender, even if they sent from a variety of different IP addresses. Unfortunately, the rise of (certain) mail service providers has increasingly limited the usefulness of envelope sender address filtering, because an increasing number of the big providers use their own domains for the envelope sender addresses of all outgoing email. Unless you feel like blocking the provider entirely (often this isn't feasible, even on an individual basis), rejecting based on the envelope sender doesn't do you any good here.

This has made it increasingly useful to be able to do SMTP time rejection (and general filtering) based on the 'From:' header address. Many mail sending services will put the real spam source's email address in the From: and at least the top level domain of this will be consistent for a particular source, which means that you can use it to reject some of their customers but accept others. These days, MTAs (mail transfer agents) generally give you an opportunity to reject messages at the SMTP DATA phase, after you've received the headers and message body, so you can use this to check the From: header address.
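As a sketch of what this can look like in practice (here in Exim, which is one MTA that can do it; the domain is made up and other MTAs have their own equivalents), a DATA time ACL can extract the From: address and match it against a blocked domain:

  # The ACL that the main configuration's acl_smtp_data points at.
  check_message_data:
    deny  condition = ${if match_address{${address:$h_From:}}{*@unwanted-sender.example}}
          message   = From: header address is at a locally blocked domain

    accept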

(If you're applying per-destination filtering, you have the SMTP DATA error problem and may only be able to do this filtering if the incoming email has only a single recipient. Conveniently, the mail service providers that commonly obfuscate the envelope sender address usually send messages with only a single recipient for various reasons, including VERP or at least something that looks like it.)

I feel that From: address filtering works best on pseudo-legitimate sources of repeat spam, such as companies that are sending you marketing email without consent. These are the senders that are least likely to vary their top level domain, because they have a business and want to look legitimate, be found at a consistent address, and build up reputation. These are also the sources of unwanted email that are the least likely to be dropped as customers by mail service providers (for a collection of likely reasons that are beyond the scope of this entry).

There are plenty of potential limitations on From: header address filtering. Bad actors can put various sorts of badly formed garbage in the From:, you definitely have to parse it (ideally your MTA will provide this as a built-in), and I believe that it still technically might have multiple addresses. But as a heuristic for rejecting unwanted mail, all of this is not a serious problem. Most From: addresses are well formed and good, especially now that DMARC and DKIM are increasingly required if you want the large providers to accept your email.

(DKIM signing in 'alignment' with the From: header is increasingly mandatory in practice, which requires that the From: header has to be well formed. I don't know how Google and company react to badly formed or peculiar From: headers, but I doubt it helps your email appear in people's inboxes.)

PS: While you can filter or discard email based on the From: header in a variety of places, I like rejecting at SMTP time and it's possible that SMTP rejections at DATA time will trigger anti-spam precautions in the mail service providers (it's a possible signal of badness in the message).

The modern web is why web browsers don't have "nice things" (platform APIs)

By: cks
2 January 2025 at 04:00

Every so often I read something that says or suggests that the big combined browser and platform vendors (Google, Apple, and to a lesser extent Microsoft) have deliberately limited their browser's access to platform APIs that would put "progressive web applications" on par with native applications. While I don't necessarily want to say that these vendors are without sin, in my view this vastly misses the core reason web browsers have limited and slow moving access to platform APIs. To put it simply, it's because of what the modern web has turned into, namely "a hive of scum and villainy" to sort of quote a famous movie.

Any API the browser exposes to web pages is guaranteed to be used by bad actors, and this has been true for a long time. Bad actors will use these APIs to track people, to (try to) compromise their systems, to spy on them, or basically for anything that can make money or gain information. Many years ago I said this was why native applications weren't doomed and basically nothing has changed since then. In particular, browsers are no better at designing APIs that can't be abused or blocking web pages that abuse these APIs, and they probably never will be.

(One of the problems is the usual one in security; there are a lot more attackers than there are browser developers designing APIs, and the attackers only have to find one oversight or vulnerability. In effect attackers are endlessly ingenious while browser API designers have finite time they can spend if they want to ship anything.)

The result of this is that announcements of new browser APIs are greeted not with joy but with dread, because in practice they will mostly be yet another privacy exposure and threat vector (Chrome will often ship these APIs anyway because in practice, as demonstrated by their actions, Google mostly doesn't care). Certainly there are some web sites and in-browser applications that will use them well, but generally they'll be vastly outnumbered by attackers that are exploiting these APIs. Browser vendors (even Google with Chrome) are well aware of these issues, which is part of why they create and ship so few APIs and often don't give them very much power.

(Even native APIs are increasingly restricted, especially on mobile devices, because there are similar issues on those. Every operating system vendor is more and more conscious of security issues and the exposures that are created for malicious applications.)

You might be tempted to say that the answer is forcing web pages to ask for permission to use these APIs. This is a terrible idea for at least two reasons. The first reason is alert (or question) fatigue; at a certain point this becomes overwhelming and people stop paying attention. The second reason is that people generally want to use websites that they're visiting, and if faced with a choice between denying a permission and being unable to use the website or granting the permission and being able to use the website, they will take the second choice a lot of the time.

(We can see both issues in effect in mobile applications, which have similar permissions requests and create similar permissions fatigue. And mobile applications ask for permissions far less often than web pages often would, because most people visit a lot more web pages than they install applications.)

My unusual X desktop wasn't made 'from scratch' in a conventional sense

By: cks
1 January 2025 at 04:10

There are people out there who set up unusual (Unix) environments for themselves from scratch; for example, Mike Hoye recently wrote Idiosyncra. While I have an unusual desktop, I haven't built it from scratch in quite the same way that Mike Hoye and other people have; instead I've wound up with my desktop through a rather easier process.

It would be technically accurate to say that my current desktop environment has been built up gradually over time (including over the time I've been writing Wandering Thoughts, such as my addition of dmenu). But this isn't really how it happened, in that I didn't start from a normal desktop and slowly change it into my current one. The real story is that the core of my desktop dates from the days when everyone's X desktops looked like mine does. Technically there were what we would call full desktops back in those days, if you had licensed the necessary software from your Unix vendor and chose to run it, but hardware was sufficiently slow back then that people at universities almost always chose to run more lightweight environments (especially since they were often already using the inexpensive and slow versions of workstations).

(Depending on how much work your local university system administrators had done, your new Unix account might start out with the Unix vendor's X setup, or it could start out with what X11R<whatever> defaulted to when built from source, or it might be some locally customized setup. In all cases you often were left to learn about the local tastes in X desktops and how to improve yours from people around you.)

To show how far back this goes (which is to say how little of it has been built 'from scratch' recently), my 1996 SGI Indy desktop has much of the look and the behavior of my current desktop, and its look and behavior wasn't new then; it was an evolution of my desktop from earlier Unix workstations. When I started using Linux, I migrated my Indy X environment to my new (and better) x86 hardware, and then as Linux has evolved and added more and more things you have to run to have a usable desktop with things like volume control, your SSH agent, and automatically mounted removable media, I've added them piece by piece (and sometimes updated them as how you do this keeps changing).

(At some point I moved from twm as my window manager to fvwm, but that was merely redoing my twm configuration in fvwm, not designing a new configuration from scratch.)

I wouldn't want to start from scratch today to create a new custom desktop environment; it would be a lot of work (and the one time I looked at it I wound up giving up). Someday I will have to move from X, fvwm, dmenu, and so on to some sort of Wayland based environment, but even when I do I expect to make the result as similar to my current X setup as I can, rather than starting from a clean sheet design. I know what I want because I'm very used to my current environment and I've been using variants of it for a very long time now.

(This entry was sparked by Ian Z aka nobrowser's comment on my entry from yesterday.)

PS: Part of the long lineage and longevity of my X desktop is that I've been lucky and determined enough to use Unix and X continuously at work, and for a long time at home as well. So I've never had a time when I moved away from X on my desktop(s) and then had to come back to reconstruct an environment and catch it up to date.

PPS: This is one of the roots of my xdm heresy, where my desktops boot into a text console and I log in there to manually start X with a personal script that's a derivative of the ancient startx command.

I'm firmly attached to a mouse and (overlapping) windows

By: cks
31 December 2024 at 04:45

In the tech circles I follow, there are a number of people who are firmly in what I could call a 'text mode' camp (eg, also). Over on the Fediverse, I said something in an aside about my personal tastes:

(Having used Unix through serial terminals or modems+emulators thereof back in the days, I am not personally interested in going back to a single text console/window experience, but it is certainly an option for simplicity.)

(Although I didn't put it in my Fediverse post, my experience with this 'single text console' environment extends beyond Unix. Similarly, I've lived without a mouse and now I want one (although I have particular tastes in mice).)

On the surface I might seem like someone who is a good candidate for the single pane of text experience, since I do much of my work in text windows, either terminals or environments (like GNU Emacs) that ape them, and I routinely do odd things like read email from the command line. But under the surface, I'm very much not. I very much like having multiple separate blocks of text around, being able to organize these blocks spatially, having a core area where I mostly work from with peripheral areas for additional things, and being able to overlap these blocks and apply a stacking order to control what is completely visible and what's partly visible.

In one view, you could say that this works partly because I have enough screen space. In another view, it would be better to say that I've organized my computing environment to have this screen space (and the other aspects). I've chosen to use desktop computers instead of portable ones, partly for increased screen space, and I've consistently opted for relatively large screens when I could reasonably get them, steadily moving up in screen size (both physical and resolution wise) over time.

(Over the years I've gone out of my way to have this sort of environment, including using unusual window systems.)

The core reason I reach for windows and a mouse is simple: I find the pure text alternative to be too confining. I can work in it if I have to but I don't like to. Using finer grained graphical windows instead of text based ones (text windowing environments do exist), and being able to use a mouse to manipulate things instead of always having to use keyboard commands, is nicer for me. This extends beyond shell sessions to other things as well; for example, generally I would rather start new (X) windows for additional Emacs or vim activities rather than try to do everything through the text based multi-window features that each has. Similarly, I almost never use screen (or tmux) within my graphical desktop; the only time I reach for either is when I'm doing something critical that I might be disconnected from.

(This doesn't mean that I use a standard Unix desktop environment for my main desktops; I have a quite different desktop environment. I've also written a number of tools to make various aspects of this multi-window environment be easy to use in a work environment that involves routine access to and use of a bunch of different machines.)

If I liked tiling based window environments, it would be easier to switch to a text (console) based environment with text based tiling of 'windows', and I would probably be less strongly attached to the mouse (although it's hard to beat the mouse for selecting text). However, tiling window environments don't appeal to me (also), either in graphical or in text form. I'll use tiling in environments where it's the natural choice (for example, in vim and emacs), but I consider it merely okay.

My screens now have areas that are 'good' and 'bad' for me

By: cks
30 December 2024 at 04:23

Once upon a time, I'm sure that everywhere on my screen (because it would have been a single screen at that time) was equally 'good' for me; all spots were immediately visible, clearly readable, didn't require turning my head, and so on. As the number of screens I use has risen, as the size of the screens has increased (for example when I moved from 24" non-HiDPI 3:2 LCD panels to 27" HiDPI 16:9 panels), and as my eyes have gotten older, this has changed. More and more, there is a 'good' area that I've set things up to look straight at, and then increasingly peripheral areas that are not as good.

(This good area is not necessarily the center of the screen; it depends on how I sit relative to the screen, the height of the monitor, and so on. If I adjust these I can change what the good spot is, and I sometimes will do so for particular purposes.)

Calling the peripheral areas 'bad' is a relative term. I can see them, but especially on my office desktop (which has dual 27" 16:9 displays), these days the worst spots can be so far off to the side that I don't really notice things there much of the time. If I want to really look, I have to turn my head, which means I have to have a reason to look over there at whatever I put there. Hopefully it's not too important.

For a long time I didn't really notice this change or think about its implications. As the physical area covered by my 'display surface' expanded, I carried over much the same desktop layout that I had used (in some form) for a long time. It didn't register that some things were effectively being exiled into the outskirts where I would never notice them, or that my actual usage was increasingly concentrated in one specific area of the screen. Now that I have consciously noticed this shift (which is a story for another entry), I may want to rethink some of how I lay things out on my office desktop (and maybe my home one too) and what I put where.

(One thing I've vaguely considered is if I should turn my office displays sideways, so the long axis is vertical, although I don't know if that's feasible with their current stands. I have what is in practice too much horizontal space today, so that would be one way to deal with it. But probably this would give me two screens that each are a bit too narrow to be comfortable for me. And sadly there are no ideal LCD panels these days; I would ideally like a HiDPI 24" or 25" 3:2 panel but vendors don't do those.)

In an unconfigured Vim, I want to do ':set paste' right away

By: cks
29 December 2024 at 03:53

Recently I wound up using a FreeBSD machine, where I promptly installed vim for my traditional reason. When I started modifying some files, I had contents to paste in from another xterm window, so I tapped my middle mouse button while in insert mode (ie, I did the standard xterm 'paste text' thing). You may imagine the 'this is my face' meme when what vim inserted was the last thing I'd deleted in vim on that FreeBSD machine, instead of my X text selection.

For my future use, the cure for this is ':set paste', which turns off basically all of vim's special handling of pasted text. I've traditionally used this to override things like vim auto-indenting or auto-commenting the text I'm pasting in, but it also turns off vim's special mouse handling, which is generally active in terminal windows, including over SSH.

(The defaults for ':set mouse' seem to vary from system to system and probably vim build to vim build. For whatever reason, this FreeBSD system and its vim defaulted to 'mouse=a', ie special mouse handling was active all the time. I've run into mouse handling limits in vim before, although things may have changed since then.)

In theory, as covered in Vim's X11 selection mechanism, I might be able to paste from another xterm (or whatever) using "*p (to use the '"*' register, which is the primary selection or the cut buffer if there's no primary selection). In practice I think this only works under limited circumstances (although I'm not sure what they are) and the Vim manual itself tells you to get used to using Shift with your middle mouse button. I would rather set paste mode, because that gets everything; a vim that has the mouse active probably has other things I don't want turned on too.

(Some day I'll put together a complete but minimal collection of vim settings to disable everything I want disabled, but that day isn't today.)

PS: If I'm reading various things correctly, I think vim has to be built with the 'xterm_clipboard' option in order to pull out selection information from xterm. Xterm itself must have 'Window Ops' allowed, which is not a normal setting; with this turned on, vim (or any other program) can use the selection manipulation escape sequences that xterm documents in "Operating System Commands". These escape sequences don't require that vim have direct access to your X display, so they can be used over plain SSH connections. Support for these escape sequences is probably available in other terminal emulators too, and these terminal emulators may have them always enabled.

(Note that access to your selection is a potential security risk, which is probably part of why xterm doesn't allow it by default.)

Cgroup V2 memory limits and their potential for thrashing

By: cks
28 December 2024 at 04:10

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that activity allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
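In systemd terms, the resulting setting is just a hard cap. Here's a sketch with an arbitrary size, for a hypothetical unit's drop-in file:

  # /etc/systemd/system/something.service.d/memory.conf (hypothetical)
  [Service]
  # A hard limit; the kernel will reclaim and then kill rather than
  # leaving the cgroup to thrash under reclaim pressure indefinitely.
  MemoryMax=2G
  # Deliberately no MemoryHigh=, to avoid a heavy-reclaim zone just
  # below the hard limit.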

WireGuard on OpenBSD just works (at least as a VPN server)

By: cks
27 December 2024 at 04:12

A year or so ago I mentioned that I'd set up WireGuard on an Android and an iOS device in a straightforward VPN configuration. What I didn't mention in that entry is that the other end of the VPN was not on a Linux machine, but on one of our OpenBSD VPN servers. At the time it was running whatever was the then-current OpenBSD version, and today it's running OpenBSD 7.6, which is the current version at the moment. Over that time (and before it, since the smartphones weren't its first WireGuard clients), WireGuard on OpenBSD has been trouble free and has just worked.

In our configuration, OpenBSD WireGuard requires installing the 'wireguard-tools' package, setting up an /etc/wireguard/wg0.conf (perhaps plus additional files for generated keys), and creating an appropriate /etc/hostname.wg0. I believe that all of these are covered as part of the standard OpenBSD documentation for setting up WireGuard. For this VPN server I allocated a /24 inside the RFC 1918 range we use for VPN service to be used for WireGuard, since I don't expect too many clients on this server. The server NATs WireGuard connections just as it NATs connections from the other VPNs it supports, which requires nothing special for WireGuard in its /etc/pf.conf.

(I did have to remember to allow incoming traffic to the WireGuard UDP port. For this server, we allow WireGuard clients to send traffic to each other through the VPN server if they really want to, but in another one we might want to restrict that with additional pf rules.)
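To give an idea of the shape of the setup (a sketch with made up addresses and the conventional WireGuard port, not our actual configuration):

  # /etc/hostname.wg0
  inet 10.70.0.1 255.255.255.0
  up
  !/usr/local/bin/wg setconf wg0 /etc/wireguard/wg0.conf

  # /etc/pf.conf: allow incoming WireGuard traffic
  pass in on egress inet proto udp to (egress) port 51820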

Everything I'd expect to work does work, both in terms of the WireGuard tools (I believe the information 'wg' prints is identical between Linux and OpenBSD, for example) and for basic system metrics (as read out by, for example, the OpenBSD version of the Prometheus host agent, which has overall metrics for the 'wg0' interface). If we wanted per-client statistics, I believe we could probably get them through this third party WireGuard Prometheus exporter, which uses an underlying package to talk to WireGuard that does apparently work on OpenBSD (although this particular exporter can potentially have label cardinality issues), or generate them ourselves by parsing 'wg' output (likely from 'wg show all dump').

This particular OpenBSD VPN server is sufficiently low usage that I haven't tried to measure either the possible bandwidth we can achieve with WireGuard or the CPU usage of WireGuard. Historically, neither are particularly critical for our VPNs in general, which have generally not been capable of particularly high bandwidth (with either OpenVPN or L2TP, our two general usage VPN types so far; our WireGuard VPN is for system staff only).

(In an ideal world, none of this should count as surprising. In this world, I like to note when things that are a bit out of the mainstream just work for me, with a straightforward setup process and trouble free operation.)

x86 servers, ATX power supply control, and reboots, resets, and power cycles

By: cks
26 December 2024 at 04:15

I mentioned recently a case when power cycling an (x86) server wasn't enough to recover it, although perhaps I should have put quotes around "power cycling". The reason for the scare quotes is that I was doing this through the server's BMC, which means that what was actually happening was not clear because there are a variety of ways the BMC could be doing power control and the BMC may have done something different for what it described as a 'power cycle'. In fact, to make it less clear, this particular server's BMC offers both a "Power Cycle" and a "Power Reset" option.

(According to the BMC's manual, a "power cycle" turns the system off and then back on again, while a "power reset" performs a 'warm restart'. I may have done a 'power reset' instead of a 'power cycle', it's not clear from what logs we have.)

There is a spectrum of ways to restart an x86 server, and they (probably) vary in their effects on peripherals, PCIe devices, and motherboard components. The most straightforward-looking one is to ask the Linux kernel to reboot the system, although in practice I believe that actually getting the hardware to do the reboot is somewhat complex (and in the past Linux sometimes had problems where it couldn't persuade the hardware, so your 'reboot' would hang). Looking at the Linux kernel code suggests that there are multiple ways to invoke a reboot, involving ACPI, UEFI firmware, old fashioned BIOS firmware, a PCIe configuration register, the keyboard controller, and so on (for a fun time, look at the 'reboot=' kernel parameter). In general, a reboot can only be initiated by the server's host OS, not by the BMC; if the host OS is hung you can't 'reboot' the server as such.

Your x86 desktop probably has a 'reset' button on the front panel. These days the wire from this is probably tied into the platform chipset (on Intel, the ICH, which came up for desktop motherboard power control) and is interpreted by it. Server platforms probably also have a (conceptual) wire and that wire may well be connected to the BMC, which can then control it to implement, for example, a 'reset' operation. I believe that a server reboot can also trigger the same platform chipset reset handling that the reset button does, although I'm not certain of this. If I'm reading Intel ICH chipset documentation correctly, triggering a reset this way will or may signal PCIe devices and so on that a reset has happened, although I don't think it cuts power to them; in theory anything getting this signal should reset its state.

(The CF9 PCI "Reset Control Register" (also) can be used to initiate a 'soft' or 'hard' CPU reset, or a full reset in which the (Intel) chipset will do various things to signals to peripherals, not just the CPU. I don't believe that Linux directly exposes these options to user space (partly because it may not be rebooting through direct use of PCI CF9 in the first place), although some of them can be controlled through kernel command line parameters. I think this may also control whether the 'reset' button and line do a CPU reset or a full reset. It seems possible that the warm restart of this server's BMC's "power reset" works by triggering the reset line and assuming that CF9 is left in its default state to make this a CPU reset instead of a full reset.)

Finally, the BMC can choose to actually cycle the power off and then back on again. As discussed, 'off' is probably not really off, because standby power and BMC power will remain available, but this should put both the CPU and the platform chipset through a full power-on sequence. However, it likely won't leave power off long enough for various lingering currents to dissipate and capacitors to drain. And nothing you do through the BMC can completely remove power from the system; as long as a server is connected to AC power, it's supplying standby power and BMC power. If you want a total reset, you must either disconnect its power cords or turn its outlet or outlets off in your remote controllable PDU (which may not work great if it's on a UPS). And as we've seen, sometimes a short power cycle isn't good enough and you need to give the server a time out.

(While the server's OS can ask for the server to be powered down instead of rebooted, I don't think it can ask for the server to be power cycled, not unless it talks to the BMC instead of doing a conventional reboot or power down.)

One of the things I've learned from this is that if I want to be really certain I understand what a BMC is doing, I probably shouldn't rely on any option to do a power cycle or power reset. Instead I should explicitly turn power off, wait until that's taken effect, and then turn power on. Asking a BMC to do a 'power cycle' is a bit optimistic, although it will probably work most of the time.
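(For a BMC that speaks IPMI, the explicit version of this is something like the following, run from another machine; this is a sketch, with a made up BMC hostname and credential placeholders.)

  ipmitool -I lanplus -H server-bmc -U admin -P '...' chassis power off
  ipmitool -I lanplus -H server-bmc -U admin -P '...' chassis power status
  # wait until this reports 'Chassis Power is off', then:
  ipmitool -I lanplus -H server-bmc -U admin -P '...' chassis power on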

(If our specific 'reset is not enough' hang happens again, I will definitely make sure to use at least the BMC's 'power cycle' and perhaps the full brief off then on approach.)

The TLS certificate multi-file problem (for automatic updates)

By: cks
25 December 2024 at 03:25

In a recent entry on short lived TLS certificates and graceful certificate rollover in web servers, I mentioned that one issue with software automatically reloading TLS certificates was that TLS certificates are almost always stored in multiple files. Typically this is either two files (the TLS certificate's key and a 'fullchain' file with the TLS certificate and intermediate certificates together) or three files (the key, the signed certificate, and a third file with the intermediate chain). The core problem this creates is the same one you have any time information is split across multiple files, namely making 'atomic' changes to the set of files, so that software never sees an inconsistent state with some updated files and some not.

With TLS certificates, a mismatch between the key and the signed certificate will cause the server to be unable to properly prove that it controls the private key for the TLS certificate it presented. Either it will load the new key and the old certificate or the old key and the new certificate, and in both cases they won't be able to generate the correct proof (assuming the secure case where your TLS certificate software generates a new key for each TLS certificate renewal, which you want to do since you want to guard against your private key having been compromised).

The potential for a mismatch is obvious if the file with the TLS key and the file with the TLS certificate are updated separately (or a new version is written out and swapped into place separately). At this point your mind might turn to clever tricks like writing all of the new files to a new directory and somehow swapping the whole directory in at once (this is certainly where mine went). Unfortunately, even this isn't good enough because the program has to open the two (or three) files separately, and the time gap between the opens creates an opportunity for a mismatch more or less no matter what we do.

(If the low level TLS software operates by, for example, first loading and parsing the TLS certificate, then loading the private key to verify that it matches, the time window may be bigger than you expect because the parsing may take a bit of time. The minimal time window comes about if you open the two files as close to each other as possible and defer all loading and processing until after both are opened.)

The only completely sure way to get around this is to put everything in one file (and then use an appropriate way to update the file atomically). Short of that, I believe that software could try to compensate by checking that the private key and the TLS certificate match after they're automatically reloaded, and if they don't, it should reload both.
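The check itself is cheap; for example, at the shell level you can compare the public keys of the two files (a sketch, with hypothetical file names):

  # If the private key and the (leaf) certificate go together, both
  # commands print the same public key and the diff is empty.
  diff <(openssl x509 -noout -pubkey -in fullchain.pem) \
       <(openssl pkey -pubout -in privkey.pem)

Software doing this internally would use its TLS library's equivalent, but the principle is the same: reload, compare, and reload again if the two don't match.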

(If you control both the software that will use the TLS certificates and the renewal software, you can do other things. For example, you can always update the files in a specific order and then make the server software trigger an automatic reload only when the timestamp changes on the last file to be updated. That way you know the update is 'done' by the time you're loading anything.)

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks
24 December 2024 at 03:41

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.
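You can see the effect directly (a sketch; the host, path, and pool name are made up):

  # Add an export by hand, outside /etc/exports:
  exportfs -o rw somehost:/srv/ha/data
  exportfs -v      # the manual export for /srv/ha/data is listed
  # Import (or export) any pool, even one with no sharenfs anywhere:
  zpool import tank
  exportfs -v      # the manual export is now gone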

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

Two views of Python type hints and catching bugs

By: cks
23 December 2024 at 04:03

I recently wrote a little Python program where I ended up adding type hints, an experience that I eventually concluded was worth it overall even if it was sometimes frustrating. I recently fixed a small bug in the program; like many of my bugs, it was a subtle logic bug that wasn't caught by typing (and I don't think it would have been caught by any reasonable typing).

One view you could take of type hints is that they often don't catch any actual bugs, and so you can question their worth (when viewed only from a bug catching perspective). Another view, one that I'm more inclined to, is that type hints sweep away the low hanging fruit of bugs. A type confusion bug is almost always found pretty fast when you try to use the code, because your code usually doesn't work at all. However, using type hints and checking them provides early and precise detection of these obvious bugs, so you get rid of them right away, before they take up your time as you try to work out why this object doesn't have the methods or fields that you expect.

("Type hints", which is to say documenting what types are used where for what, also have additional benefits, such as accurate documentation and enabling type based things in IDEs, LSP servers, and so on.)

So although my use of type hints and mypy didn't catch this particular logic oversight, my view of them remains positive. And type hints did help me make sure I wasn't adding an obvious bug when I fixed this issue (my fix required passing an extra argument to something, creating an opportunity for a bit of type confusion if I got the arguments wrong).

Sidebar: my particular non-type bug

This program reports the current, interesting alerts from our Prometheus metrics system. For various reasons, it supports getting the alerts as of some specific time, not just 'now', and it also filters out some alerts when they aren't old enough. My logic bug was with the filtering; in order to compute the age of an alert, I did:

age = time.time() - alert_started_at

The logic problem is that when I'm getting the alerts at a particular time instead of 'now', I also want to compute the age of the alert as of that time, not as of 'right now'. So I don't want 'time.time()', I want 'as of the logical time when we're obtaining this information'.
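The fix is to thread the logical 'as of' time through to the age computation. With type hints it looks something like this (a sketch with made up names, not the program's actual code):

  def alert_age(alert_started_at: float, as_of: float) -> float:
      # 'as_of' is the logical query time: time.time() when we're asking
      # about 'now', or the specific timestamp the alerts were requested
      # as of.
      return as_of - alert_started_at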

(This sort of logic oversight is typical for non-obvious bugs that linger in my programs after they're basically working. I only noticed it because I was adding a new filter, and needed to get the alerts as of a time when what I wanted to filter out was happening.)

When power cycling your (x86) server isn't enough to recover it

By: cks
22 December 2024 at 03:43

We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model), where something different has happened once.

Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server was running through the system firmware ('BIOS'), both of them started printing an apparently endless series of error dumps to their serial consoles (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:

!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
RIP  - 000000006DABA5A5, CS  - 0000000000000038, RFLAGS - 0000000000010087
RAX  - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX  - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI  - 0000000000000008, RDI - 000000006AB1B1B0
R8   - 000000005DCCF524, R9  - 000000005D29E850, R10 - 000000005D29E8E4
R11  - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14  - 0000000000000028, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!

(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)

Some of the register values varied between reports, others didn't after the first one (for example, from the second onward the RIP appears to have always been 6DAB14D1, which suggests maybe it's an exception handler).

In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, where neither of them have had problems since.

I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there may be a different mechanism in action). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool down period works.

(Since we weren't cutting external power to the entire system, this also left standby power (also) available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)

PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.

Remembering to make my local changes emit log messages when they act

By: cks
21 December 2024 at 03:48

Over on the Fediverse, I said something:

Current status: respinning an Ubuntu package build (... painfully) because I forgot the golden rule that when I add a hack to something, I should always make it log when my hack was triggered. Even if I can observe the side effects in testing, we'll want to know it happened in production.

(Okay, this isn't applicable to all hacks, but.)

Every so often we change or augment some standard piece of software or standard part of the system to do something special under specific circumstances. A rule I keep forgetting and then either re-learning or reminding myself of is that even if the effects of my change triggering are visible to the person using the system, I want to make it log as well. There are at least two reasons for this.

The first reason is that my change may wind up causing some problem for people, even if we don't think it's going to. Should it cause such problems, it's very useful to have a log message (perhaps shortly before the problem happens) to the effect of 'I did this new thing'. This can save a bunch of troubleshooting, both at the time when we deploy this change and long afterward.

The second reason is that we may turn out to be wrong about how often our change triggers, which is to say how common the specific circumstances are. This can go either way. Our change can trigger a lot more than we expected, which may mean that it's overly aggressive and is affecting people more than we want, and cause us to look for other options. Or it could be that the issue we're trying to deal with is more significant than we expect and justifies us doing even more. Alternately, our change can trigger a lot less than we expect, which may mean we want to take the change out rather than have to maintain a local modification that doesn't actually do much (one that almost invariably makes the system more complex and harder to understand).

In the log message itself, I want to be clear and specific, although probably not as verbose as I would be for an infrequent error message. Especially for things I expect to trigger relatively infrequently, I should probably put as many details about the special circumstances as possible into the log message, because the log message is what my co-workers and I may have to work from in six months when we've forgotten the details.

Unix's buffered IO in assembly and in C

By: cks
9 December 2024 at 02:44

Recently on the Fediverse, I said something related to Unix's pre-V7 situation with buffered IO:

[...]

(I think the V1 approach is right for an assembly based minimal OS, while the stdio approach kind of wants malloc() and friends.)

The V1 approach, as documented in its putc.3 and getw.3 manual pages, is that the caller to the buffered IO routines supplies the data area used for buffering, and the library functions merely initialize it and later use it. How you get the data area is up to you and your program; you might, for example, simply have a static block of memory in your BSS segment. You can dynamically allocate this area if you want to, but you don't have to. The V2 and later putchar have a similar approach but this time they contain a static buffer area and you just have to do a bit of initialization (possibly putchar was in V1 too, I don't know for sure).

Stdio of course has a completely different API. In stdio, you don't provide the data area; instead, stdio provides you an opaque reference (a 'FILE *') to the information and buffers it maintains internally. This is an interface that definitely wants some degree of dynamic memory allocation, for example for the actual buffers themselves, and in modern usage most of the FILE objects will be dynamically allocated too.

(The V7 stdio implementation had a fixed set of FILE structs and so would error out if you used too many of them. However, it did use malloc() for the buffer associated with them, in filbuf.c and flsbuf.c.)

You can certainly do dynamic memory allocation in assembly, but I think it's much more natural in C, and certainly the C standard library is more heavyweight than the relatively small and minimal assembly language stuff early Unix programs (written in assembly) seem to have required. So I think it makes a lot of sense that Unix started with a buffering approach where the caller supplies the buffer (and probably doesn't dynamically allocate it), then moved to one where the library does at least some allocation and supplies the buffer (and other data) itself.

PCIe cards we use and have used in our servers

By: cks
8 December 2024 at 03:00

In a comment on my entry on how common (desktop) motherboards are supporting more M.2 NVMe slots but fewer PCIe cards, jmassey was curious about what PCIe cards we needed and used. This is a good and interesting question, especially since some number of our 'servers' are actually built using desktop motherboards for various reasons (for example, a certain number of the GPU nodes in our SLURM cluster, and some of our older compute servers, which we put together ourselves using early generation AMD Threadrippers and desktop motherboards for them).

Today, we have three dominant patterns of PCIe cards. Our SLURM GPU nodes obviously have a GPU card (x16 PCIe lanes) and we've added a single port 10G-T card (which I believe are all PCIe x4) so they can pull data from our fileservers as fast as possible. Most of our firewalls have an extra dual-port 10G card (mostly 10G-T but a few use SFPs). And a number of machines have dual-port 1G cards because they need to be on more networks; our current stock of these cards are physically x4 PCIe, although I haven't looked to see if they use all the lanes.

(We also have single-port 1G cards lying around that sometimes get used in various machines; these are x1 cards. The dual-port 10G cards are probably some mix of x4 and x8, since online checks say they come in both varieties. We have and use a few quad-port 1G cards for semi-exotic situations, but I'm not sure how many PCIe lanes they want, physically or otherwise. In theory they could reasonably be x4, since a single 1G is fine at x1.)

In the past, one generation of our fileserver setup had some machines that needed to use a PCIe SAS controller in order to be able to talk to all of the drives in their chassis, and I believe these cards were PCIe x8; these machines also used a dual 10G-T card. The current generation handles all of their drives through motherboard controllers, but we might need to move back to cards in future hardware configurations (depending on what the available server motherboards handle on the motherboard). The good news, for fileservers, is that modern server motherboards increasingly have at least one onboard 10G port. But in a worst case situation, a large fileserver might need two SAS controller cards and a 10G card.

It's possible that we'll want to add NVMe drives to some servers (parts of our backup system may be limited by SATA write and read speeds today). Since I don't believe any of our current servers support PCIe bifurcation, this would require one or two PCIe x4 cards and slots (two if we want to mirror this fast storage, one if we decide we don't care). Such a server would likely also want 10G; if it didn't have a motherboard 10G port, that would require another x4 card (or possibly a dual-port 10G card at x8).

The good news for us is that servers tend to make all of their available slots be physically large (generally large enough for x8 cards, and maybe even x16 these days), so you can fit in all these cards even if some of them don't get all the PCIe lanes they'd like. And modern server CPUs are also coming with more and more PCIe lanes, so probably we can actually drive many of those slots at their full width.

(I was going to say that modern server motherboards mostly don't design in M.2 slots that reduce the available PCIe lanes, but that seems to depend on what vendor you look at. A random sampling of Supermicro server motherboards suggests that two M.2 slots are not uncommon, while our Dell R350s have none.)

Common motherboards are supporting more and more M.2 NVMe drive slots

By: cks
7 December 2024 at 04:27

Back at the start of 2020, I wondered if common (x86 desktop) motherboards would ever have very many M.2 NVMe drive slots, where by 'very many' I meant four or so, which even back then was a common number of SATA ports for desktop motherboards to provide. At the time I thought the answer was probably no. As I recently discovered from investigating a related issue, I was wrong, and it's now fairly straightforward to find x86 desktop motherboards that have as many as four M.2 NVMe slots (although not all four may be able to run at x4 PCIe lanes, especially if you have things like a GPU).

For example, right now it's relatively easy to find a page full of AMD AM5-based motherboards that have four M.2 NVMe slots. Most of these seem to be based on the high end X series AMD chipsets (such as the X670 or the X870), but I found a few that were based on the B650 chipset. On the Intel side, should you still be interested in an Intel CPU in your desktop at this point, there's also a number of them based primarily on the Z790 chipset (and some on the older Z690). There's even a B760 based motherboard with four M.2 NVMe slots (although two of them are only x1 lanes and PCIe 3.0), and an H770 based one that manages to (theoretically) support all four M.2 slots at x4 lanes.

One of the things that I think has happened on the way to this large supply of M.2 slots is that these desktop motherboards have dropped most of their PCIe slots. These days, you seem to commonly get three slots in total on the kind of motherboard that has four M.2 slots. There's always one x16 slot, often two, and sometimes three (although that's physical x16; don't count on getting all 16 PCIe lanes in every slot). It's not uncommon to see the third PCIe slot be physically x4, or a little x1 slot tucked away at the bottom of the motherboard. It also isn't necessarily the case that lower end desktops have more PCIe slots to go with their fewer M.2 slots; they too seem to have mostly gone with two or three PCIe slots, generally with a limited number of lanes even if they're physically x16.

(I appreciate having physical x16 slots even if they're only PCIe x1, because that means you can use any card that doesn't require PCIe bifurcation and it should work, although slowly.)

As noted by commentators on my entry on PCIe bifurcation and its uses for NVMe drives, a certain amount of what we used to need PCIe slots for can now be provided through high speed USB-C and similar things. And of course there are only so many PCIe lanes to go around from the CPU and the chipset, so those USB-C ports and other high-speed motherboard devices consume a certain amount of them; the more onboard devices the motherboard has the fewer PCIe lanes there are left for PCIe slots, whether or not you have any use for those onboard devices and connectors.

(Having four M.2 NVMe slots is useful for me because I use my drives in mirrored pairs, so four M.2 slots means I can run my full old pair in parallel with a full new pair, either in a four way mirror or doing some form of migration from one mirrored pair to the other. Three slots is okay, since that lets me add a new drive to a mirrored pair for gradual migration to a new pair of drives.)

Buffered IO in Unix before V7 introduced stdio

By: cks
6 December 2024 at 04:16

I recently read Julia Evans' Why pipes sometimes get "stuck": buffering. Part of the reason is that almost every Unix program does some amount of buffering for what it prints (or writes) to standard output and standard error. For C programs, this buffering is built into the standard library, specifically into stdio, which includes familiar functions like printf(). Stdio is one of the many things that appeared first in Research Unix V7. This might leave you wondering if this sort of IO was buffered in earlier versions of Research Unix and if it was, how it was done.

The very earliest version of Research Unix is V1, and in V1 there is putc.3 (at that point entirely about assembly, since C was yet to come). This set of routines allows you to set up and then use a 'struct' to implement IO buffering for output. There is a similar set of buffered functions for input, in getw.3, and I believe the memory blocks the two sets of functions use are compatible with each other. The V1 manual pages note it as a bug that the buffer wasn't 512 bytes, but also notes that several programs would break if the size was changed; the buffer size will be increased to 512 bytes by V3.

In V2, I believe we still have putc and getw, but we see the first appearance of another approach, in putchr.s. This implements putchar(), which is used by printf() and which (from later evidence) uses an internal buffer (under some circumstances) that has to be explicitly flush()'d by programs. In V3, there's manual pages for putc.3 and getc.3 that are very similar to the V1 versions, which is why I expect these were there in V2 as well. In V4, we have manual pages for both putc.3 (plus getc.3) and putch[a]r.3, and there is also a getch[a]r.3 that's the input version of putchar(). Since we have a V4 manual page for putchar(), we can finally see the somewhat tangled way it works, rather than having to read the PDP-11 assembly. I don't have links to V5 manuals, but the V5 library source says that we still have both approaches to buffered IO.

(If you want to see how the putchar() approach was used, you can look at, for example, the V6 grep.c, which starts out with the 'fout = dup(1);' that the manual page suggests for buffered putchar() usage, and then periodically calls flush().)

In V6, a third approach was added, in /usr/source/iolib, although I don't know if any programs used it. Iolib has a global array of structs that were statically associated with a limited number of low-numbered file descriptors; an iolib function such as cflush() would be passed a file descriptor and use that to look up the corresponding struct. One innovation iolib implicitly adds is that its copen() effectively 'allocates' the struct for you, in contrast to putc() and getc(), where you supply the memory area and fopen()/fcreate() merely initialize it with the correct information.

Finally V7 introduces stdio and sorts all of this out, at the cost of some code changes. There's still getc() and putc(), but now they take a FILE *, instead of their own structure, and you get the FILE * from things like fopen() instead of supplying it yourself and having a stdio function initialize it. Putchar() (and getchar()) still exist but are now redone to work with stdio buffering instead of their own buffering, and 'flush()' has become fflush() and takes an explicit FILE * argument instead of implicitly flushing putchar()'s buffer, and generally it's not necessary any more. The V7 grep.c still uses printf(), but now it doesn't explicitly flush anything by calling fflush(); it just trusts in stdio.

Sorting out 'PCIe bifurcation' and how it interacts with NVMe drives

By: cks
5 December 2024 at 03:01

Suppose, not hypothetically, that you're switching from one mirrored set of M.2 NVMe drives to another mirrored set of M.2 NVMe drives, and so would like to have three or four NVMe drives in your desktop at the same time. Sadly, you already have one of your two NVMe drives on a PCIe card, so you'd like to get a single PCIe card that handles two or more NVMe drives. If you look around today, you'll find two sorts of cards for this; ones that are very expensive, and ones that are relatively inexpensive but require that your system supports a feature that is generally called PCIe bifurcation.

NVMe drives are PCIe devices, so a PCIe card that supports a single NVMe drive is a simple, more or less passive thing that wires four PCIe lanes and some other stuff through to the M.2 slot. I believe that in theory, a card could be built that only required x2 or even x1 PCIe lanes, but in practice I think all such single drive cards are physically PCIe x4 and so require a physical x4 or better PCIe slot, even if you'd be willing to (temporarily) run the drive much slower.

A PCIe card that supports more than one M.2 NVMe drive has two options. The expensive option is to put a PCIe bridge on the card, with the bridge (probably) providing a full set of PCIe lanes to the M.2 NVMe drives locally on one side and doing x4, x8, or x16 PCIe with the motherboard on the other. In theory, such a card will work even at x4 or x2 PCIe lanes, because PCIe cards are supposed to do that if the system says 'actually you only get this many lanes' (although obviously you can't drive four x4 NVMe drives at full speed through a single x4 or x2 PCIe connection).

The cheap option is to require that the system be able to split a single PCIe slot into multiple independent groups of PCIe lanes (I believe these are usually called links); this is PCIe bifurcation. In PCIe bifurcation, the system takes what is physically and PCIe-wise an x16 slot (for example) and splits it into four separate x4 links (I've seen this sometimes labeled as 'x4/x4/x4/x4'). This is cheap for the card because it can basically be four single M.2 NVMe PCIe cards jammed together, with each set of x4 lanes wired through to a single M.2 NVMe slot. A PCIe card for two M.2 NVMe drives will require an x8 PCIe slot bifurcated to two x4 links; if you stick this card in an x16 slot, the upper 8 PCIe lanes just get ignored (which means that you can still set your BIOS to x4/x4/x4/x4).

As covered in, for example, this Synopsys page, PCIe bifurcation isn't something that's negotiated as part of bringing up PCIe connections; a PCIe device can't ask for bifurcation and can't be asked whether or not it supports it. Instead, the decision is made as part of configuring the PCIe root device or bridge, which in practice means it's a firmware ('BIOS') decision. However, I believe that bifurcation may also require hardware support in the 'chipset' and perhaps the physical motherboard.

I put chipset into quotes because for quite some time now, some PCIe lanes have come directly from the CPU and only the others come through the chipset as such. For example, in desktop motherboards, the x16 GPU slot is almost always driven directly by CPU PCIe lanes, so it's up to the CPU to have support (or not have support) for PCIe bifurcation of that slot. I don't know if common desktop chipsets support bifurcation on the chipset PCIe slots and PCIe lanes, and of course you need chipset-driven PCIe slots that have enough lanes to be bifurcated in the first place. If the PCIe slots driven by the chipset are a mix of x4 and x1 slots, there's no really useful bifurcation that can be done (at least for NVMe drives).

If you have a limited number of PCIe slots that can actually support x16 or x8 and you need a GPU card, you may not be able to use PCIe bifurcation in practice even if it's available for your system. If you have only one PCIe slot your GPU card can go in and it's the only slot that supports bifurcation, you're stuck; you can't have both a bifurcated set of NVMe drives and a GPU (at least not without a bifurcated PCIe riser card that you can use).

(This is where I would start exploring USB NVMe drive enclosures, although on old desktops you'll probably need one that doesn't require USB-C, and I don't know if a NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)

(This is one of the entries I write to get this straight in my head.)

Sidebar: Generic PCIe riser cards and other weird things

The traditional 'riser card' I'm used to is a special proprietary server 'card' (ie, a chunk of PCB with connectors and other bits) that plugs into a likely custom server motherboard connector and makes a right angle turn that lets it provide one or two horizontal PCIe slots (often half-height ones) in a 1U or 2U server case, which aren't tall enough to handle PCIe cards vertically. However, the existence of PCIe bifurcation opens up an exciting world of general, generic PCIe riser cards that bifurcate a single x16 GPU slot to, say, two x8 PCIe slots. These will work (in some sense) in any x16 PCIe slot that supports bifurcation, and of course you don't have to restrict yourself to x16 slots. I believe there are also PCIe riser cards that bifurcate an x8 slot into two x4 slots.

Now, you are perhaps thinking that such a riser card puts those bifurcated PCIe slots at right angles to the slots in your case, and probably leaves any cards inserted into them with at least their tops unsupported. If you have light PCIe cards, maybe this works out. If you don't have light PCIe cards, one option is another terrifying thing, a PCIe ribbon cable with a little PCB that is just a PCIe slot on one end (the other end plugs into your real PCIe slot, such as one of the slots on the riser card). Sometimes these are even called 'riser card extenders' (or perhaps those are a sub-type of the general PCIe extender ribbon cables).

Another PCIe adapter device you can get is an x1 to x16 slot extension adapter, which plugs into an x1 slot on your motherboard and has an x16 slot (with only one PCIe lane wired through, of course). This is less crazy than it sounds; you might only have an x1 slot available, want to plug in a x4, x8, or x16 card that's short enough, and be willing to settle for x1 speeds. In theory PCIe cards are supposed to still work when their lanes are choked down this way.

The modern world of server serial ports, BMCs, and IPMI Serial over LAN

By: cks
4 December 2024 at 04:30

Once upon a time, life was relatively simple in the x86 world. Most x86 compatible PCs theoretically had one or two UARTs, which were called COM1 and COM2 by MS-DOS and Windows, ttyS0 and ttyS1 by Linux, 'ttyu0' and 'ttyu1' by FreeBSD, and so on, based on standard x86 IO port addresses for them. Servers had a physical serial port on the back and wired the connector to COM1 (some servers might have two connectors). Then life became more complicated when servers implemented BMCs (Baseboard management controllers) and the IPMI specification added Serial over LAN, to let you talk to your server through what the server believed was a serial port but was actually a connection through the BMC, coming over your management network.

Early BMCs could take very brute force approaches to making this work. The circa 2008 era Sunfire X2200s we used in our first ZFS fileservers wired the motherboard serial port to the BMC and connected the BMC to the physical serial port on the back of the server. When you talked to the serial port after the machine powered on, you were actually talking to the BMC; to get to the server serial port, you had to log in to the BMC and do an arcane sequence to 'connect' to the server serial port. The BMC didn't save or buffer up server serial output from before you connected; such output was just lost.

(Given our long standing console server, we had feelings about having to manually do things to get the real server serial console to show up so we could start logging kernel console output.)

Modern servers and their BMCs are quite intertwined, so I suspect that often both server serial ports are basically implemented by the BMC (cf), or at least are wired to it. The BMC passes one serial port through to the physical connector (if your server has one) and handles the other itself to implement Serial over LAN. There are variants on this design possible; for example, we have one set of Supermicro hardware with no external physical serial connector, just one serial header on the motherboard and a BMC Serial over LAN port. To be unhelpful, the motherboard serial header is ttyS0 and the BMC SOL port is ttyS1.

When the BMC handles both server serial ports and passes one of them through to the physical serial port, it can decide which one to pass through and which one to use as the Serial over LAN port. Being able to change this in the BMC is convenient if you want to have a common server operating system configuration but use a physical serial port on some machines and use Serial over LAN on others. With the BMC switching which server serial port comes out on the external serial connector, you can tell all of the server OS installs to use 'ttyS0' as their serial console, then connect ttyS0 to either Serial over LAN or the physical serial port as you need.

Some BMCs (I'm looking at you, Dell) go to an extra level of indirection. In these, the BMC has an idea of 'serial device 1' and 'serial device 2', with you controlling which of the server's ttyS0 and ttyS1 maps to which 'serial device', and then it has a separate setting for which 'serial device' is mapped to the physical serial connector on the back. This helpfully requires you to look at two separate settings to know if your ttyS0 will be appearing on the physical connector or as a Serial over LAN console (and gives you two settings that can be wrong).

In theory a BMC could share a single server serial port between the physical serial connector and an IPMI Serial over LAN connection, sending output to both and accepting input from each. In practice I don't think most BMCs do this and there are obvious issues of two people interfering with each other that BMCs may not want to get involved in.

PS: I expect more and more servers to drop external serial ports over time, retaining at most an internal serial header on the motherboard. That might simplify BMC and BIOS settings.

Good union types in Go would probably need types without a zero value

By: cks
3 December 2024 at 04:00

One of the classic big reasons to want union types in Go is so that one can implement the general pattern of an option type, in order to force people to deal explicitly with null values. Except this is not quite true on both sides. The compiler can enforce null value checks before use already, and union and option types by themselves don't fully protect you against null values. Much like people ignore error returns (and the Go compiler allows this), people can skip over that they can't extract an underlying value from their Result value and return a zero value from their 'get a result' function.

My view is that the power of option types is what they do in the rest of the language, but they can only do this if you can express their guarantees in the type system. The important thing you need for this is non-nullable types. This is what lets you guarantee that something is a proper value extracted from an error-free Result or whatever. If you can't express this in your types, everyone has to check, one way or another, or you risk a null sneaking in.

Go doesn't currently have a type concept for 'something that can't be null', or for that matter a concept that is exactly 'null'. The closest Go equivalent is the general idea of zero values, of which nil pointers (and nil interfaces) are a special case (but you can also have zero value maps and channels, which also have special semantics; the zero value of slices is more normal). If you want to make Result and similar types particularly useful in Go, I believe that you need to change this, somehow introducing types that don't have a zero value.

(Such types would likely be a variation of existing types with zero values, and presumably you could only use values or assign to variables of that type if the compiler could prove that what you were using or assigning wasn't a zero value.)

As noted in a comment by loreb on my entry on how union types would be complicated, these 'union' or 'sum' types in Go also run into issues with their zero value, and as Ian Lance Taylor's issue comment says, zero values are built quite deeply into Go. You can define semantics for union types that allow zero values, but I don't think they're really particularly useful for anything except cramming some data structures into a few less bytes in a somewhat opaque way, and I'm not sure that's something Go should be caring about.

Given that zero values are a deep part of Go and the Go developers don't seem particularly interested in trying to change this, I doubt that we're ever going to get the powerful form of union types in Go. If anything like union types appears, it will probably be merely to save memory, and even then union types are complicated in Go's runtime.

Sidebar: the simple zero value allowed union type semantics

If you allow union types to have a zero value, the obvious meaning of a zero value is something that can't have a value of any type successfully extracted from it. If you try the union type equivalent of a type assertion you get a zero value and 'false' for all possible options. Of course this completely gives up on the 'no zero value' type side of things, but at least you have a meaning.

This makes a zero value union very similar to a nil interface, which will also fail all type assertions. At this point my feeling is that Go might as well stick with interfaces and not attempt to provide union types.

Union types ('enum types') would be complicated in Go

By: cks
2 December 2024 at 04:31

Every so often, people wish that Go had enough features to build some equivalent of Rust's Result type or Option type, often so that Go programmers could have more ergonomic error handling. One core requirement for this is what Rust calls an Enum and what is broadly known as a Union type. Unfortunately, doing a real enum or union type in Go is not particularly simple, and it definitely requires significant support by the Go compiler and the runtime.

At one level we can easily do something that looks like a Result type in Go, especially now that we have generics. You make a generic struct that has private fields for an error, a value of type T, and a flag that says which is valid, and then give it some methods to set and get values and ask it which it currently contains. If you ask for a sort of value that's not valid, it panics. However, this struct necessarily has space for three fields, where the Rust enums (and generally union types) act more like C unions, only needing space for the largest type possible in them and sometimes a marker of what type is in the union right now.

(The Rust compiler plays all sorts of clever tricks to elide the enum marker if it can store this information in some other way.)

To understand why we need deep compiler and runtime support, let's ask why we can't implement such a union type today using Go's unsafe package to perform suitable manipulation of a suitable memory region. Because it will make the discussion easier, let's say that we're on a 64-bit platform and our made up Result type will contain either an error (which is an interface value) or a two-element int64 array (a '[2]int64'). On a 64-bit platform, both of these types occupy 16 bytes, since an interface value is two pointers in a trenchcoat, so it looks like we should be able to use the same suitably-aligned 16-byte memory area for each of them.

However, now imagine that Go is performing garbage collection. How does the Go runtime know whether or not our 16-byte memory area contains two live pointers, which it must follow as part of garbage collection, or two 64-bit integers, which it definitely cannot treat as pointers and follow? If we've implemented our Result type outside of the compiler and runtime, the answer is that garbage collection has no idea which it currently is. In the Go garbage collector, it's not values that have types, but storage locations, and Go doesn't provide an API for changing the type of a storage location.

(Internally the runtime can set and change information about what pieces of memory contain pointers, but this is not exposed to the outside world; it's part of the deep integration of runtime memory allocation and the runtime garbage collector.)

In Go, without support from the runtime and the compiler the best you can do is store an interface value or perhaps an unsafe.Pointer to the actual value involved. However, this probably forces a separate heap allocation for the value, which is less efficient in several ways than the compiler supported version that Rust has. On the positive side, if you store an interface value you don't need to have any marker for what's stored in your Result type, since you can always extract that from the interface with a suitable type assertion.

The corollary to all of this is that adding union types to Go as a language feature wouldn't be merely a modest change in the compiler. It would also require a bunch of work in how such types interact with garbage collection, Go's memory allocation systems (which in the normal Go toolchain allocate things with pointers into separate memory arenas than things without them), and likely other places in the runtime.

(I suspect that Go is pretty unlikely to add union types given this, since you can have much of the API that union types present with interface types and generics. And in my view, union types like Result wouldn't be really useful without other changes to Go's type system, although that's another entry.)

PS: Something like this has come up before in generic type sets.

Using systemd-run to limit something's memory usage in cgroups v2

By: cks
1 December 2024 at 03:59

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)
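
If you do script it, a minimal sketch of a wrapper (in Python here, with the limits copied from above and everything else a placeholder) might be:

import subprocess
import sys

# Hypothetical wrapper: run a command in a user scope with our memory limits.
LIMITS = ["-p", "MemoryHigh=7G", "-p", "MemoryMax=8G", "-p", "MemoryZSwapMax=1G"]

def main():
  if len(sys.argv) < 2:
    sys.exit("usage: limited-run command [args ...]")
  cmd = ["systemd-run", "--user", "--scope"] + LIMITS + sys.argv[1:]
  # systemd-run --scope stays in the foreground, so just pass its exit status on.
  sys.exit(subprocess.run(cmd).returncode)

if __name__ == "__main__":
  main()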

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
mkswap swap.img
swapon /swap.img

(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex way, which is more legible to programs, is to restrict what CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.

Python type hints are probably "worth it" in the large for me

By: cks
30 November 2024 at 04:07

I recently added type hints to a little program, and that experience wasn't entirely positive, leaving me feeling that maybe I shouldn't bother. Because I don't promise to be consistent, I went back and re-added type hints to the program all over again, starting from the non-hinted version. This time I did the type hints rather differently and the result came out well enough that I'm going to keep it.

Perhaps my biggest change was to entirely abandon NewType(). Instead I set up two NamedTuples and used type aliases for everything else, which amounts to three type aliases in total. Since I was using type aliases anyway, I only added them when it was annoying to enter the real type (and I was doing it often enough). I skipped doing a type alias for 'list[namedTupleType]' because I couldn't come up with a name that I liked well enough, and the fact that it's a list is fundamental to how it's interacted with in the code involved, so I didn't feel like obscuring that.

Adding type hints 'for real' had the positive aspect of encouraging me to write a bunch of comments about what things were and how they worked, which will undoubtedly help future me when I want to change something in six months. Since I was using NamedTuples, I changed to accessing the elements of the tuples through the names instead of the indexes, which improved the code. I had to give up 'list(adict.items())' in favour of a list comprehension that explicitly created the named tuple, but this is probably a good thing for the overall code quality.

(I also changed the type of one thing I had as 'int' to a float, which is what it really should have been all along even if all of the normal values were integers.)
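
As an illustrative sketch of the general shape this wound up with (the names are made up, not my program's real ones, and I'm only showing one of the type aliases):

from typing import NamedTuple

class Alert(NamedTuple):
  name: str
  start: float

class HostAlerts(NamedTuple):
  host: str
  alerts: list[Alert]

type AlertDict = dict[str, list[Alert]]

def to_alert_list(hosts: AlertDict) -> list[HostAlerts]:
  # The explicit list comprehension that replaced 'list(hosts.items())'.
  return [HostAlerts(h, alerts) for h, alerts in hosts.items()]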

Overall, I think I've come around to the view that doing all of this is good for me in the same way that using shellcheck is good for my shell scripts, even if I sometimes roll my eyes at things it says. I also think that just making mypy silent isn't the goal I should be aiming for. Instead, I should be aiming for what I did to my program on this second pass, doing things like introducing named tuples (in some form), adding comments, and so on. Adding final type hints should be a prompt for a general cleanup.

(Perhaps I'll someday get to a point where I add basic type hints as I write the code initially, just to codify my belief about the shape of what I'm returning and passing in, and use them to find my mistakes. But that day is probably not today, and I'll probably want better LSP integration for it in my GNU Emacs environment.)

My life has been improved by my quiet Prometheus alert status monitor

By: cks
29 November 2024 at 04:48

I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.

(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)

After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I actually get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).

Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).
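
(For illustration, the core of a terse 'current alerts by host' summary doesn't have to be big. This isn't my actual code, but a minimal sketch that asks Prometheus directly through its HTTP API, and that assumes our alerts carry a 'host' label, might look like:)

import collections
import json
import urllib.request

PROMETHEUS = "http://prometheus.example.org:9090"   # hypothetical server URL

def alerts_by_host():
  with urllib.request.urlopen(PROMETHEUS + "/api/v1/alerts") as resp:
    data = json.load(resp)
  byhost = collections.defaultdict(list)
  for alert in data["data"]["alerts"]:
    if alert.get("state") != "firing":
      continue
    labels = alert["labels"]
    # The 'host' label is an assumption about our local labeling.
    byhost[labels.get("host", labels.get("instance", "?"))].append(labels["alertname"])
  return byhost

for host, names in sorted(alerts_by_host().items()):
  print(f"{host}: {', '.join(sorted(names))}")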

My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.

(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)

Some notes on my experiences with Python type hints and mypy

By: cks
28 November 2024 at 04:35

As I thought I might, today I spent some time adding full and relatively honest type hints to my recent Python program. The experience didn't go entirely smoothly and it left me with a number of learning experiences and things I want to note down in case I ever do this again. The starting point is that my normal style of coding small programs is to not make classes to represent different sorts of things and instead use only basic built in collection types, like lists, tuples, dictionaries, and so on. When you use basic types this way, it's very easy to pass or return the wrong 'shape' of thing (I did it once in the process of writing my program), and I'd like Python type hints to be able to tell me about this.

(The first note I want to remember is that mypy becomes very irate at you in obscure ways if you ever accidentally reuse the same (local) variable name for two different purposes with two different types. I accidentally reused the name 'data', using it first for a str and second for a dict that came from an 'Any' typed object, and the mypy complaints were hard to decode; I believe it complained that I couldn't index a str with a str on a line where I did 'data["key"]'.)

When you work with data structures created from built in collections, you can wind up with long, tangled compound type names, like 'tuple[str, list[tuple[str, int]]]' (which is a real type in my program). These are annoying to keep typing and easy to make mistakes with, so Python type hints provide two ways of giving them short names, in type aliases and typing.NewType. These look almost the same:

# type alias:
type hostAlertsA = tuple[str, list[tuple[str, int]]]

# NewType() (this needs 'from typing import NewType'):
hostAlertsT = NewType('hostAlertsT', tuple[str, list[tuple[str, int]]])

The problem with type aliases is that they are aliases. All aliases for a type are considered to be the same, and mypy won't warn if you call a function that expects one with a value that was declared to be another. Suppose you have two sorts of strings, ones that are a host name and ones that are an alert name, and you would like to keep them straight. Suppose that you write:

# simple type aliases
type alertName = str
type hostName = str

def manglehost(hname: hostName) -> hostName:
  [....]

Because these are only type aliases and because all type aliases are treated as the same, you have not achieved your goal of keeping yourself from confusing host and alert names when you call 'manglehost()'. In order to do this, you need to use NewType(), at which point mypy will complain (and also often force you to explicitly mark bare strings as one or the other, with 'alertName(yourstr)' or 'hostName(yourstr)').

If I want as much protection as possible against this sort of type confusion, I want to make as many things as possible be NewType()s instead of type aliases. Unfortunately NewType()s have some drawbacks in mypy for my sort of usage, as far as I can see.

The first drawback is that you cannot create a NewType of 'Any':

error: Argument 2 to NewType(...) must be subclassable (got "Any")  [valid-newtype]

In order to use NewType, I must specify concrete details of my actual (current) implementation, rather than saying just 'this is a distinct type but anything can be done with it'.

The second drawback is that this distinct typing is actually a problem when you do certain sorts of transformations of collections. Let's say we have alerts, which have a name and a start time, and hosts, which have a hostname and a list of alerts:

alertT  = NewType('alertT',  tuple[str, int])
hostAlT = NewType('hostAlT', tuple[str, list[alertT]])

We have a function that receives a dictionary where the keys are hosts and the values are their alerts and turns it into a sorted list of hosts and their alerts, which is to say a list[hostAlT]. The following Python code looks sensible on the surface:

def toAlertList(hosts: dict[str, list[alertT]]) -> list[hostAlT]:
  linear = list(hosts.items())
  # Don't worry about the sorting for now
  return linear

If you try to check this, mypy will declare:

error: Incompatible return value type (got "list[tuple[str, list[alertT]]]", expected "list[hostAlT]")  [return-value]

Initially I thought this was mypy being limited, but in writing this entry I've realized that mypy is correct. Our .items() returns a tuple[str, list[alertT]], but while it has the same shape as our hostAlT, it is not the same thing; that's what it means for hostAlT to be a distinct type.

However, it is a problem that as far as I know, there is no type checked way to get mypy to convert the list we have into a list[hostAlT]. If you create a new NewType to be the list type, call it 'aListT', and try to convert 'linear' to it with 'l2 = aListT(linear)', you will get more or less the same complaint:

error: Argument 1 to "aListT" has incompatible type "list[tuple[str, list[alertT]]]"; expected "list[hostAlT]"  [arg-type]

This is a case where as far as I can see I must use a type alias for 'hostAlT' in order to get the structural equivalence conversion, or alternately use the wordier and as far as I know less efficient list comprehension version of list() so that I can tell mypy that I'm transforming each key/value pair into a hostAlT value:

linear = [hostAlT(x) for x in hosts.items()]

I'd have the same problem in the actual code (instead of in the type hint checking) if I was using, for example, a namedtuple to represent a host and its alerts. Calling hosts.items() wouldn't generate objects of my named tuple type, just unnamed standard tuples.

Possibly this is a sign that I should go back through my small programs after I more or less finish them and convert this sort of casual use of tuples into namedtuple (or the type hinted version) and dataclass types. If nothing else, this would serve as more explicit documentation for future me about what those tuple fields are. I would have to give up those clever 'list(hosts.items())' conversion tricks in favour of the more explicit list comprehension version, but that's not necessarily a bad thing.

Sidebar: aNewType(...) versus typing.cast(typ, ....)

If you have a distinct NewType() and mypy is happy enough with you, both of these will cause mypy to consider your value to now be of the new type. However, they have different safety levels and restrictions. With cast(), there are no type hint checking guardrails at all; you can cast() an integer literal into an alleged string and mypy won't make a peep. With, for example, 'hostAlT(...)', mypy will apply a certain amount of compatibility checking. However, as we saw above in the 'aListT' example, mypy may still report a problem on the type change and there are certain type changes you can't get it to accept.

As far as I know, there's no way to get mypy to temporarily switch to structural compatibility checking here. Perhaps there are deep type safety reasons to disallow that.

Python type hints may not be for me in practice

By: cks
27 November 2024 at 03:58

Python 3 has optional type hints (and has had them for some time), and some time ago I was a bit tempted to start using some of them; more recently, I wrote a small amount of code using them. Recently I needed to write a little Python program and as I started, I was briefly tempted to try type hints. Then I decided not to, and I suspect that this is how it's going to go in the future.

The practical problem of type hints for me when writing the kind of (small) Python programs that I do today is that they necessarily force me to think about the types involved. Well, that's wrong, or at least incomplete; in practice, they force me to come up with types. When I'm putting together a small program, generally I'm not building any actual data structures, records, or the like (things that have a natural type); instead I'm passing around dictionaries and lists and sets and other basic Python types, and I'm revising how I use them as I write more of the program and evolve it. Adding type hints requires me to navigate assigning concrete types to all of those things, and then updating them if I change my mind as I come to a better understanding of the problem and how I want to approach it.

(In writing this it occurs to me that I do often know that I have distinct types (for example, for what functions return) and I shouldn't mix them, but I don't want to specify their concrete shape as dicts, tuples, or whatever. In looking through the typing documentation and trying some things, it doesn't seem like there's an obvious way to do this. Type aliases are explicitly equivalent to their underlying thing, so I can't create a bunch of different names for eg typing.Any and then expect type checkers to complain if I mix them.)

After the code has stabilized I can probably go back to write type hints (at least until I get into apparently tricky things like JSON), but I'm not sure that this would provide very much value. I may try it with my recent little Python thing just to see how much work it is. One possible source of value is if I come back to this code in six months or a year and want to make changes; typing hints could give me both documentation and guardrails given that I'll have forgotten about a lot of the code and structure by then.

(I think the usual advice is that you should write type hints as you write the program, rather than go back after the fact and try to add them, because incrementally writing them during development is easier. But my new Python programs tend to be sufficiently short that doing all of the type hints afterward isn't too much work, and if it gets me to do it at all it may be an improvement.)

PS: It might be easier to do type hints on the fly if I practiced with them, but on the other hand I write new Python programs relatively infrequently these days, making typing hints yet another Python thing I'd have to try to keep in my mind despite it being months since I used them last.

PPS: I think my ideal type hint situation would be if I could create distinct but otherwise unconstrained types for things like function arguments and function returns, have mypy or other typing tools complain when I mixed them, and then later go back to fill in the concrete implementation details of each type hint (eg, 'this is a list where each element is a ...').

What NFS server threads do in the Linux kernel

By: cks
26 November 2024 at 03:40

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.

(We're probably just going to set it to the number of system CPUs.)
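
(I'm not energetic enough to write the real thing, but a rough sketch of the skeleton of such an auto-tuner, using only the system-wide IO pressure numbers and the /proc/fs/nfsd/threads knob, and with completely made up thresholds and policy, might look like this:)

import os
import time

THREADS_FILE = "/proc/fs/nfsd/threads"
NCPUS = os.cpu_count() or 8

def io_stall_avg10():
  # First line of /proc/pressure/io: 'some avg10=N.NN avg60=N.NN avg300=N.NN total=N'
  with open("/proc/pressure/io") as f:
    fields = f.readline().split()
  return float(fields[1].split("=")[1])

def set_nfsd_threads(n):
  with open(THREADS_FILE, "w") as f:
    f.write("%d\n" % n)

while True:
  if io_stall_avg10() > 10.0:
    # Threads are blocked on IO; run extra ones so RAM-only requests still get served.
    set_nfsd_threads(NCPUS * 2)
  else:
    # CPU bound or idle; more threads than CPUs just adds contention.
    set_nfsd_threads(NCPUS)
  time.sleep(60)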

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)

The question of how many NFS server threads you should use (on Linux)

By: cks
25 November 2024 at 04:48

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near than many NFS kernel threads. Although we probably do want to run more than eight.

The general issue of terminal programs and the Alt key

By: cks
23 November 2024 at 23:26

When you're using a terminal program (something that provides a terminal window in a GUI environment, which is now the dominant form of 'terminals'), there's a fairly straightforward answer for what should happen when you hold down the Ctrl key while typing another key. For upper and lower case letters, the terminal program generates ASCII bytes 1 through 26, for Ctrl-[ you get byte 27 (ESC), and there are relatively standard versions of some other characters. For other characters, your specific terminal program may treat them as aliases for some of the ASCII control characters or ignore the Ctrl. All of this behavior is relatively standard from the days of serial terminals, and none of it helps terminal programs decide what should be generated when you hold down the Alt key while typing another key.

(A terminal program can hijack Alt-<key> to control its behavior, but people will generally find this hostile because they want to use Alt-<key> with things running inside the terminal program. In general, terminal programs are restricted to generating things at the character layer, where what they send has to fit in a sequence of bytes and be generally comprehensible to whatever is reading those bytes.)

Historically and even currently there have been three answers. The simplest answer is that Alt sets the 8th bit on what would otherwise be a seven-bit ASCII character. This behavior is basically a relic of the days when things actually were seven bit ASCII (at least in North America) and doing this wouldn't mangle things horribly (provided that the program inside the terminal understood this signal). As a result it's not too popular any more and I think it's basically died out.

The second answer is what I'll call the Emacs answer, which is that Alt plus another key generates ESC (Escape) and then the other key. This matches how Emacs handled its Meta key binding modifier (written 'M-...' in Emacs terminology) in the days of serial terminals; if an Emacs keybinding was M-a, you typed 'ESC a' to invoke it. Even today when we have real Alt keys and some programs could see a real Meta modifier (cf), basically every Emacs or Emacs-compatible system will accept ESC as the Meta prefix even if they're not running in a terminal.

(I started with Emacs sufficiently long ago that ESC-<key> is an ingrained reflex that I still sometimes use even though Alt is right there on my keyboard.)

The third answer is that Alt-<key> generates various accented or special characters in the terminal program's current locale (or in UTF-8, because that's increasingly hard-coded). Once upon a time this was the same as the first answer, because accented and special characters were whatever was found in the upper half of single-byte 'extended ASCII' character sets (bytes 128 to 255). These days, with people using UTF-8, it's generally different; for example, your Alt-a might generate 'á', but the actual UTF-8 representation of this single Unicode codepoint is actually two bytes, 0xc3 0xa1.

Some terminal programs still allow you to switch between the second and the third answers (Unix xterm is one such program and can even be switched on the fly, see the 'Meta sends Escape' option in the menu you get with Ctrl-<mouse button 1>). Others are hard-coded with the second answer, where Alt-<key> sends ESC <key>. My impression is that the second answer is basically the dominant one these days and only a few terminal programs even potentially support the third option.

PS: How xterm behaves can be host specific due to different default X resources settings on different hosts. Fedora makes xterm default to Alt-<key> sending ESC-<key>, while Ubuntu leaves it with the xterm code default of Alt creating accented characters.

My new solution for quiet monitoring of our Prometheus alerts

By: cks
23 November 2024 at 03:25

Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't doing so. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.

At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.

(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)

The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.

Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.

To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).

(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)

To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:

ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS

I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
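
To make this more concrete, here is a stripped down sketch of the state number computation. This is not my actual program; the Prometheus URL is a stand-in and I've left out all of the filtering of flaky pending alerts and the text formatting:

#!/usr/bin/python3
# Sketch of boiling the current alerts down to a single number for xlbiff's
# 'check' program. Not the real thing: the URL is a stand-in and all of the
# filtering and output formatting is left out.
import json
import urllib.parse
import urllib.request

PROMURL = "http://prometheus.example.org:9090"   # stand-in URL
EPOCH = 1_700_000_000                            # arbitrary starting epoch
QUERY = 'ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS'

def state_number():
    url = PROMURL + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    total = 0
    for sample in data["data"]["result"]:
        # the value of the joined query is ALERTS_FOR_STATE's value,
        # which is the alert's start time as a Unix timestamp
        start = int(float(sample["value"][1]))
        if sample["metric"].get("alertstate") == "firing":
            start += 1          # so pending -> firing changes the number
        total += start - EPOCH
    return total

print(state_number())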

PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).

Our Prometheus alerting problem if our central mail server isn't working

By: cks
22 November 2024 at 04:04

Over on the Fediverse, I said something:

Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?

(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)

There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.

The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problems even without an alert actively being triggered. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooting, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.

(In that old version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)

One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email to SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access and we'd build some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).

For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).

(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)

Thinking about how to tame the interaction of conditional GET and caching

By: cks
21 November 2024 at 03:41

Due to how I do caching here, Wandering Thoughts has a long standing weird HTTP behavioral quirk where a non-conditional GET for a syndication feed here can get a different answer than a conditional GET. One (technical) way to explain this issue is that the cache validity interval for non-conditional GETs is longer than the cache validity interval for conditional GETs. In theory this could be the complete explanation of the issue, but in practice there's another part to it, which is that DWiki doesn't automatically insert responses into the cache on a cache miss.

(The cache is normally only filled for responses that were slow to generate, either due to load or because they're expensive. Otherwise I would rather dynamically generate the latest version of something and not clutter up cache space.)
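
As a schematic illustration only (the intervals, names, and structure here are made up for the example, and this is not DWiki's actual code), the combination of two validity intervals and no automatic cache filling looks something like this:

import time

UNCOND_VALIDITY = 4 * 3600    # made-up number: how stale unconditional GETs tolerate
COND_VALIDITY = 15 * 60       # made-up number: the shorter window for conditional GETs

_cache = {}                   # name -> (stored_at, response)

def generate_feed():
    return "<feed>...</feed>"  # stand-in for the real dynamic generation

def serve_feed(is_conditional, now=None):
    now = time.time() if now is None else now
    validity = COND_VALIDITY if is_conditional else UNCOND_VALIDITY
    entry = _cache.get("feed")
    if entry is not None and now - entry[0] <= validity:
        return entry[1]        # the cached copy is still considered valid
    response = generate_feed()
    # deliberately not stored in _cache here; the cache only gets filled
    # when generating the response was slow or expensive
    return response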

There are various paths that I could take, but which ones I want to take depends on what my goals are and I'm actually not entirely certain about that. If my goal is to serve responses to unconditional GETs that are as fresh as possible but come from cache for as long as possible, what I should probably do is make conditional GETs update the cache when the cached version of the feed exists and would still have been served to an unconditional GET. I've already paid the cost to dynamically generate the feed, so I might as well serve it to unconditional GET requests. However, in my current cache architecture this would have the side effect of causing conditional GETs to get that newly updated cached copy for the conditional GET cache validity period, instead of generating the very latest feed dynamically (what would happen today).

(A sleazy approach would be to backdate the newly updated cache entry by the conditional GET validity interval. My current code architecture doesn't allow for that, so I can avoid the temptation.)

On the other hand, the entire reason I have a different (and longer) cache validity interval for unconditional GET requests is that in some sense I want to punish them. It's a deliberate feature that unconditional GETs receive stale responses, and in some sense the more stale the response the better. Even though updating the cache with a current response I've already generated is in some sense free, doing it cuts against this goal, both in general and in specific. In practice, Wandering Thoughts sees frequent enough conditional GETs for syndication feeds that making conditional GETs refresh the cached feed would effectively collapse the two cache validity intervals into one, which I can already do without any code changes. So if this is my main goal for cache handling of unconditional GETs of my syndication feed, the current state is probably fine and there's nothing to fix.

(A very approximate number is that about 15% of the syndication feed requests to Wandering Thoughts are unconditional GETs. Some of the offenders should definitely know and do better, such as 'Slackbot 1.0'.)

Two API styles of doing special things involving text in UIs

By: cks
20 November 2024 at 04:43

A lot of programs (or applications) that have a 'user interface' mostly don't have a strongly graphical one; instead, they mostly have text, although with special presentation (fonts, colours, underlines, etc) and perhaps controls and meaning attached to interacting with it (including things like buttons that are rendered as text with a border around it). All of these are not just plain text, so programs have to create and manipulate all of them through some API or collection of APIs. Over time, there have sprung up at least two styles of APIs, which I will call external and inline, after how they approach the problem.

The external style API is the older of the two. In the external API, the program makes distinct API calls to do anything other than plain text (well, it makes API calls for plain text, but you have to do something there). If you want to make some text italic or underlined, you have a special API call (or perhaps you modify the context of a 'display this text' API). If you want to attach special actions to things like clicking on a piece of text or hovering the mouse pointer over it, again, more API calls. This leads to programs that make a lot of API calls in their code and are very explicit about what they're doing in their UI. Sometimes this is bundled together with a layout model in the API, where the underlying UI library will flexibly lay out a set of controls so that they accommodate your variously sized and styled text, your buttons, your dividers, and so on.

In the inline style API, you primarily communicate all of this by passing in text that is in some way marked up, instead of plain text that is rendered literally. One form of such inline markup is HTML (and it is popularly styled by CSS). However, there have been other forms, such as XML markup, and even with HTML, you and the UI library will cooperate to attach special meanings and actions to various DOM nodes. Inline style APIs are less efficient at runtime because they have to parse the text you pass in to determine all of this, instead of your program telling the UI library directly through API calls. At the same time, inline style APIs are quite popular at a number of levels. For example, it's popular in UI toolkits to use textual formats to describe your program's UI layout (sometimes this is then compiled into a direct form of UI API calls, and sometimes you hand the textual version to the UI library for it to interpret).
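
As a small illustration of the two styles (using Python's tkinter Text widget for the external style; the 'inline' half is just markup that some UI library would be expected to parse, not any specific real API):

# External style: styling and behavior are attached through separate API calls
# (Python's tkinter Text widget used as the example).
import tkinter as tk

root = tk.Tk()
text = tk.Text(root)
text.tag_configure("title", font=("Helvetica", 14, "bold"))
text.tag_configure("link", foreground="blue", underline=True)
text.tag_bind("link", "<Button-1>", lambda event: print("clicked"))
text.insert("end", "Status\n", ("title",))
text.insert("end", "show details", ("link",))
text.pack()

# Inline style: the same intent expressed as marked-up text that a UI
# library would parse, attaching behavior to the resulting nodes afterwards.
markup = '<h1>Status</h1> <a href="#" id="details">show details</a>'
# hypothetical_ui_library.render(markup)

root.mainloop()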

Despite it being potentially less efficient at runtime, my impression is that plenty of programmers prefer the inline style to the external style for text focused applications, where styled text and text based controls are almost all of the UI. My belief is also that an inline style API is probably what's needed for an attractive text focused programming environment.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks
19 November 2024 at 04:15

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be built on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

(Some) spammers will keep trying old, no longer in DNS IPv6 addresses

By: cks
18 November 2024 at 03:57

As I mentioned the other day, in late September my home ISP changed my IPv6 allocation from a /64 to a different /56, but kept the old /64 still routing to me. I promptly changed all DNS entries that referred to the old IPv6 address to the new IPv6 address. One of the things that my home machine runs is my 'sinkhole' SMTP server, which has a DNS MX entry pointing to it. This server tracks which local IP address was connected to, and it does periodically receive spam and see probes.

Since this server was most recently restarted on November 10th, it's seen about the same volume of connections to each IPv6 address, the old one (which hasn't been present in DNS for more than a month) and the new one (present in DNS). Some of this activity appears to be from Internet scanning efforts, which I will charitably assume are intending to do good and which have arguable reasons to keep scanning any IPv6 address that they've seen respond. Other connections seem less likely to be innocent.

I'm pretty certain I've seen this behavior for IPv4 addresses long ago (I might even have written it up here, although I can't find an entry right now), so in a sense it doesn't surprise me. Some spammers and other systems apparently do DNS lookups only infrequently and save the IP addresses (both IPv4 and apparently IPv6) that they see, then use them for a long time. Still, it's a more modern world, so I'd sort of hoped that any spammer with software that could deal with IPv6 would handle DNS lookups better.

On the one hand, it's not like holding on to the IP addresses of old mail servers is likely to do spammers much good. If the IP address of a mail server changes, it's very likely that the old IP address will stop working before too long. On the other hand, presumably this mostly doesn't hurt because most mail servers don't change IP addresses very often. Usually the IP address you looked up two months ago (or more) is still good.

The missing text focused programming environment

By: cks
17 November 2024 at 03:59

On the Fediverse, I had a hot take:

Hot take: the enduring popularity of writing applications in a list of environments that starts with Emacs Lisp and goes on to encompass things like Electron shows that we've persistently failed to create a good high level programming system for writing text-focused applications.

(Plan 9's Acme had some good ideas but it never caught on, partly because Plan 9 didn't.)

(By 'text focused' here I mean things that want primarily to display text and have some controls and user interface elements; this is somewhat of a superset of 'TUI' ideas.)

People famously have written a variety of what are effectively applications inside GNU Emacs; there are multiple mail readers, the Magit Git client, at least one news reader, at least one syndication feed reader, and so on. Some of this might be explained by the 'I want to do everything in GNU Emacs' crowd writing things to scratch their itch even if the result is merely functional enough, but several of these applications are best in class, such as Magit (among the best Git clients as far as I know) and MH-E (the best NMH based mail reading environment, although there isn't much competition, and a pretty good Unix mail reading environment in general). Many of these applications could in theory be stand alone programs, but instead they've been written in GNU Emacs Lisp to run inside an editor even if they don't have much to do with Emacs in any regular sense.

(In GNU Emacs, many of these applications extensively rebind regular keys to effectively create their own set of keyboard commands that have nothing to do with how regular Emacs behaves. They sometimes still do take advantage of regular Emacs key bindings for things like making selections, jumping to the start and end of displayed text, or searching.)

A similar thing goes on with Electron-based applications, a fair number of which are fairly text-focused things (especially if you extend text focused things to cover emojis, a certain amount of images, and so on). For a prominent example, VSCode is a GUI text editor and IDE, so much of what it deals with is text, although sometimes somewhat fancied up text (with colours, font choices, various line markings, and so on).

On the Internet, you can find a certain amount of people mocking these applications for the heavy-weight things that they use as host environments. It's my hot take that this is an unproductive and backward view. Programmers don't necessarily like using such big, complex host environments and turn to them by preference; instead, that they turn to them shows that we've collectively failed to create better, more attractive alternatives.

It's possible that this use of heavy weight environments is partly because parts of what modern applications want and need to do are intrinsically complex. For example, a lot of text focused applications want to lay out text in somewhat complex, HTML-like ways and also provide the ability to have interactive controls attached to various text elements. Some of them need to handle and render actual HTML. Using an environment like GNU Emacs or Electron gets you a lot of support for this right away (effectively you get a lot of standard libraries to make use of), and that support is itself complex to implement (so the standard libraries are substantial).

However, I also think we're lacking text focused environments for smaller scale programs, the equivalent of shell scripts or BASIC programs. There have been some past efforts toward things that could be used for this, such as Acme and Tcl/Tk, but they didn't catch on for various reasons.

(At this point I think any viable version of this probably needs to be based around HTML and CSS, although hopefully we don't need a full sized browser rendering engine for it, and I certainly hope we can use a different language than JavaScript. Not necessarily because JavaScript is a bad language or reasonably performing JavaScript engines are themselves big, but partly because using JavaScript raises expectations about the API surface, the performance, the features, and so on, all of which push toward a big environment.)

IPv6 networks do apparently get probed (and implications for address assignment)

By: cks
16 November 2024 at 03:30

For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.

On my old /64, I used ::1/64 and ::2/64 for static IP addresses, of which the first was in DNS, and the other IPv6 addresses in it were the usual SLAAC assignments. The first thing I discovered in my tcpdump was a surprisingly large number of cloud-based IPv6 addresses that were pinging my ::1 address. Once I excluded that traffic, I was left with enough volume of port probes that I could easily see them in a casual tcpdump.

The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.

(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)

As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.

One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.

PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.

Your options for displaying status over time in Grafana 11

By: cks
15 November 2024 at 03:41

A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.

The generally best panel type you want to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.

You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.

(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)

As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the function of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).

Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.

If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.

Implementing some Git aliases indirectly, in shell scripts

By: cks
14 November 2024 at 04:10

Recently I wrote about two ways to (maybe) skip 'Dependabot' commits when using git log, and said at the end that I was probably going to set up Git aliases for both approaches. I've now both done that and failed to do that, at the same time. While I have Git aliases for both approaches, the actual git aliases just shell out to shell scripts.

The simpler and more frustrating case is for only seeing authors that aren't Dependabot:

git log --perl-regexp --author='^((?!dependabot\[bot]).*)$'

This looks like it should be straightforward as an alias, but I was unable to get the alias quoting right in my .gitconfig. No matter what I did it either produced syntax errors from Git or didn't work. So I punted by putting the 'git log ...' bit in a shell script (where I can definitely understand the quoting requirements and get them right) and making the actual alias be in the magic git-config format that runs an external program:

[alias]
  ....
  ndlog = !gitndeplog

The reason this case works as a simple alias is that all of the arguments I'd supply (such as a commit range) come after the initial arguments to 'git log'. This isn't the case for the second approach, with attempts to exclude go.mod and go.sum from file paths:

git log -- ':!:go.mod' ':!:go.sum'

The moment I started thinking about how to use this alias, I realized that I'd sometimes want to supply a range of commits (for example, because I just did a 'git pull' and want to see what the newly pulled commits changed). This range has to go in the middle of the command line, which means that a Git alias doesn't really work. And sometimes I might want to supply additional 'git log' switches, like '-p', or maybe supply a file or path (okay, probably I'll never do that). There are probably some sophisticated ways to make this work as an alias, especially if I assume that all of the arguments I supply will go before the '--', but the simple approach was to write a shell script that did the argument handling and invoke it via an alias in the same way as 'git ndlog' does.
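
My actual wrappers are shell scripts, but the argument handling is simple enough to sketch; the whole point is that whatever you pass on the command line winds up before the '--'. A rough equivalent of the second wrapper:

#!/usr/bin/python3
# Sketch of the second wrapper's argument handling (the real one is a shell
# script): everything passed on the command line goes before the '--'.
import os
import sys

os.execvp("git", ["git", "log"] + sys.argv[1:] + ["--", ":!:go.mod", ":!:go.sum"])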

Right now the scripts are named in a terse way as if I might want to run them by hand someday, but I should probably rename them both to 'git-<something>'. In practice I'm probably always going to run them as 'git ...', and a git-<something> name makes it clearer what's going on, and easier to find by command completion in my shell if I forget.

Finding a good use for keep_firing_for in our Prometheus alerts

By: cks
13 November 2024 at 04:06

A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.

Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clean out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drop below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).

The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.

Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another useful short number to de-bounce our large scale alert trigger conditions.

(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)

I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).

Prometheus makes it annoyingly difficult to add more information to alerts

By: cks
12 November 2024 at 03:58

Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.

While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.

I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.

A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to find a label, I think you need to use a more complex template with a 'with' or 'range' operation; on the positive side, this may let you format a message fragment with multiple labels and even the value.

More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alerts between Alertmanager and Prometheus alert rules. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information out again with Go template functions.

What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.

(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels and etc.)

Syndication feed fetchers and their behavior on HTTP 429 status responses

By: cks
11 November 2024 at 04:09

For reasons outside of the scope of this entry, recently I've been looking at the behavior of syndication feed fetchers here on Wandering Thoughts (which are generally from syndication feed readers), and in the process I discovered some that were making repeated requests at a quite aggressive rate, such as every five minutes. Until recently there was some excuse for this, because I wasn't setting a 'Cache-Control: max-age=...' header (also), which is (theoretically) used to tell Atom feed fetchers how soon they should re-fetch. I feel there was not much of an excuse because no feed reader should default to fetching every five minutes, or even every fifteen, but after I set my max-age to an hour there definitely should be no excuse.

Since sometimes I get irritated with people like this, I arranged to start replying to such aggressive feed fetchers with a HTTP 429 "Too Many Requests" status response (the actual implementation is a hack because my entire software is more or less stateless, which makes true rate limiting hard). What I was hoping for is that most syndication feed fetching software would take this as a signal to slow down how often it tried to fetch the feed, and I'd see excessive sources move from one attempt every five minutes to (much) slower rates.

That basically didn't happen (perhaps this is no surprise). I'm sure there's good syndication feed fetching software that probably would behave that way on HTTP 429 responses, but whatever syndication feed software was poking me did not react that way. As far as I can tell from casually monitoring web access logs, almost no mis-behaving feed software paid any attention to the fact that it was specifically getting a response that normally means "you're doing this too fast". In some cases, it seems to have caused programs to try to fetch even more than before.

(Perhaps some of this is because I didn't add a 'Retry-After' header to my HTTP 429 responses until just now, but even without that, I'd expect clients to back off on their own, especially after they keep getting 429s when they retry.)
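
For what it's worth, the polite client behavior I was hoping for isn't complicated. A sketch of it (not any particular feed reader's code, with a stand-in feed URL and made-up intervals) is just:

#!/usr/bin/python3
# Sketch of a polite feed fetcher's handling of HTTP 429: back off, and
# honor Retry-After if the server provides one. Stand-in URL and intervals.
import time
import urllib.error
import urllib.request

FEED_URL = "https://example.org/atom.xml"
delay = 3600                    # the normal polling interval

def fetch_once():
    global delay
    try:
        with urllib.request.urlopen(FEED_URL) as resp:
            delay = 3600        # success: go back to the normal interval
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 429:
            retry_after = err.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = max(delay, int(retry_after))
            else:
                delay = delay * 2   # no hint from the server, so just back off
        return None

while True:
    fetch_once()
    time.sleep(delay)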

Given the HTTP User-Agents presented by feed fetchers, some of this is more or less expected, for two reasons. First, some of the User-Agents are almost certainly deliberate lies, and if a feed crawler is going to actively lie about what it is there's no reason for it to respect HTTP 429s either. Second, some of the feed fetching is being done by stateless programs like curl, where the people building ad-hoc feed fetching systems around them would have to go (well) out of their way to do the right thing. However, a bunch of the aggressive feed fetching is being done by either real feed fetching software with a real user-agent (such as "RSS Bot" or the Universal Feed Parser) or by what look like browser addons running in basically current versions of Firefox. I'd expect both of these to respect HTTP 429s if they're programmed decently. But then, if they were programmed decently they probably wouldn't be trying every five minutes in the first place.

(Hopefully the ongoing feed reader behavior project by rachelbythebay will fix some of this in the long run; there are encouraging signs, as covered in eg the October 25th score report.)

A rough guess at how much IPv6 address space we might need

By: cks
10 November 2024 at 03:54

One of the reactions I saw to my entry on why NAT might be inevitable (at least for us) even with IPv6 was to ask if there really was a problem with being generous with IPv6 allocations, since they are (nominally) so large. Today I want to do some rough calculations on this, working backward from what we might reasonably assign to end user devices. There's a lot of hand-waving and assumptions here, and you can question a lot of them.

I'll start with the assumption that the minimum acceptable network size is a /64, for various reasons including SLAAC. As discussed, end devices presenting themselves on our network may need some number of /64s for internal use. Let's assume that we'll allocate sixteen /64s to each device, meaning that we give out /60s to each device on each of our subnets.

I think it's unlikely we'll want to ever have a subnet with more than 2048 devices on it (and even that's generous). That many /60s is a /49. However, some internal groups have more than one IPv4 subnet today, so for future expansion let's say that each group gets eight IPv6 subnets, so we give out /46s to research groups (or we could trim some of these sizes and give out /48s, which seems to be a semi-standard allocation size that various software may be more happy with).

We have a number of IPv4 subnets (and of research groups). If we want to allow for growth, various internal uses, and so on, we want some extra room, so I think we'd want space for at least 128 of these /46 allocations, which gets us to an overall allocation for our department of a /39 (a /38 if we want 256 just to be sure). The University of Toronto currently has a /32, so we actually have some allocation problems. For a start, the university has three campuses and it might reasonably want to split its /32 allocation into four and give one /34 to each campus. At a /34 for the campus, there's only 32 /39s and the university has many more departments and groups than that.
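
To make the bit arithmetic explicit, here is the same chain of numbers as a quick computation (it just re-derives the /39 from the assumptions above):

# Re-doing the prefix arithmetic above: every factor of 2^n of fan-out
# costs n bits off the prefix length.
def sub_bits(prefix, count):
    bits = 0
    while (1 << bits) < count:
        bits += 1
    return prefix - bits

p = 64                   # one /64, the minimum acceptable network size
p = sub_bits(p, 16)      # sixteen /64s per device            -> /60 per device
p = sub_bits(p, 2048)    # up to 2048 devices on a subnet     -> /49 per subnet
p = sub_bits(p, 8)       # eight subnets per research group   -> /46 per group
p = sub_bits(p, 128)     # at least 128 group allocations     -> /39 for the department
print(p)                 # prints 39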

If the university starts with a /32, splits it to /34s for campuses, and wants to have room for 1024 or 2048 allocations within a campus, each department or group can get only a /44 or a /45 and all of our sizes would have to shrink accordingly; we'd need to drop at least five or six bits somewhere (say four subnets per group, eight or even four /64s per device, maybe 1024 devices maximum per subnet, etc).

If my understanding of how you're supposed to do IPv6 is correct, what makes all of this more painful in a purist IPv6 model is that you're not supposed to allocate multiple, completely separate IPv6 subnets to someone, unlike in the IPv4 world. Instead, everything is supposed to live under one IPv6 prefix. This means that the IPv6 prefix absolutely has to have enough room for future growth, because otherwise you have to go through a very painful renumbering to move to another prefix.

(For instance, today the department has multiple IPv4 /24s allocated to it, not all of them contiguous. We also work this way with our internal use of RFC 1918 address space, where we just allocate /16s as we need them.)

Being able to allocate multiple subnets of some size (possibly a not that large one) to departments and groups would make it easier to not over-allocate to deal with future growth. We might still have problems with the 'give every device eight /64s' plan, though.

(Of course we could do this multiple subnets allocation internally even if the university gives us only a single IPv6 prefix. Probably everything can deal with IPv6 used this way, and it would certainly reduce the number of bits we need to consume.)

Maybe skipping 'Dependabot' commits when using 'git log'

By: cks
9 November 2024 at 04:15

I follow a number of projects written in Go that are hosted on Github. Many of these projects enable Github's "Dependabot" feature (also). This use of Dependabot, coupled with the overall Go ecology's habit of relatively frequent small updates to packages, creates a constant stream of Dependabot commits that update the project's go.mod and go.sum files with small version updates of some dependency, sometimes intermixed with people merging those commits (for example, the Cloudflare eBPF Prometheus exporter).

As someone who reads the commit logs of these repositories to stay on top of significant changes, these Dependabot dependency version bumps are uninteresting to me and, like any noise, they make it harder to see what I'm interested in (and more likely that I'll accidentally miss a commit I want to read about that's stuck between two Dependabot updates I'm skipping with my eyes glazed over). What I'd like to be able to do is to exclude these commits from what 'git log' or some equivalent is showing me.

There are two broad approaches. The straightforward and more or less workable approach is to exclude commits from specific authors, as covered in this Stack Overflow question and answer:

git log --perl-regexp --author='^((?!dependabot\[bot]).*)$'

However, this doesn't exclude the commits of people merging these Dependabot commits into the repository, which happens in some (but not all) of the repositories I track. A better approach would be to get 'git log' to ignore all commits that don't change anything other than go.mod and go.sum. I don't think Git can quite do this, at least not without side effects, but we can get close with some pathspecs:

git log -- ':!:go.mod' ':!:go.sum'

(I think this might want to be '!/' for full correctness instead of just '!'.)

For using plain 'git log', this is okay, but it has the side effect that if you use, eg, 'git log -p' to see the changes, any changes a listed commit makes to go.mod or go.sum will be excluded.

The approach of excluding paths can be broadened beyond go.mod and go.sum to include things like commits that update various administrative files, such as things that control various automated continuous integration actions. In repositories with a lot of churn and updates to these, this could be useful; I care even less about a project's use of CI infrastructure than I care about their Dependabot go.mod and go.sum updates.

(I suspect I'll set up Git aliases for both approaches, since they each have their own virtues.)

Complications in supporting 'append to a file' in a NFS server

By: cks
8 November 2024 at 04:14

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more than I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.
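
For concreteness, this is the local Unix behavior in question, sketched with Python's os-level interface (nothing NFS specific):

#!/usr/bin/python3
# Local Unix O_APPEND semantics: the kernel positions each write() at the
# current end of file atomically, so data from a single successful write()
# doesn't wind up intermingled with another appender's data.
import os

fd = os.open("/tmp/append-demo.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"one whole record\n")   # lands at whatever the end of file is now
os.close(fd)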

This basic semantics is already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations and that size may be (much) smaller than the user level write(). Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a NFS write operation; either we fail those 128 Mbyte writes, or we have the possibility that data from you and I will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be, servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. If two clients do append writes to the same file, first A and then B after it, the server discards both, and then client B is the first one to go through the 'COMMIT, fail, resend' process, where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file will see inconsistent results, where first it sees A's data then B's and then later either it sees B's data before A's or B's data without anything from A (not even a zero-filled gap in the file, the way you'd get with ordinary writes).

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. If A and B do append writes, C does a manual append write, all writes are lost before they're committed, B redoes, C redoes, and then A redoes, a natural implementation could easily wind up with B's data, an A data sized hole, C's data, and then A's data appended after C's.

This also creates server side ordering dependencies for potentially discarding uncommitted asynchronous write data, ones that a NFS server can normally make independently. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.

(You could require all append writes to be synchronous, but then this would do terrible things to NFS server performance for potentially common use of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)

Losing NFS locks and the SunOS SIGLOST signal

By: cks
7 November 2024 at 02:48

NFS is a network filesystem that famously also has a network locking protocol associated with it (or part of it, for NFSv4). This means that NFS has to consider the issue of the NFS client losing a lock that it thinks it holds. In NFS, clients losing locks normally happens as part of NFS(v3) lock recovery, triggered when a NFS server reboots. On server reboot, clients are told to re-acquire all of their locks, and this re-acquisition can explicitly fail (as well as going wrong in various ways that are one way to get stuck NFS locks). When a NFS client's kernel attempts to reclaim a lock and this attempt fails, it has a problem. Some process on the local machine thinks that it holds a (NFS) lock, but as far as the NFS server and other NFS clients are concerned, it doesn't.

Sun's original version of NFS dealt with this problem with a special signal, SIGLOST. When the NFS client's kernel detected that a NFS lock had been lost, it sent SIGLOST to whatever process held the lock. SIGLOST was a regular signal, so by default the process would exit abruptly; a process that wanted to do something special could register a signal handler for SIGLOST and then do whatever it could. SIGLOST appeared no later than SunOS 3.4 (cf) and still lives on today in Illumos, where you can find this discussed in uts/common/klm/nlm_client.c and uts/common/fs/nfs/nfs4_recovery.c (and it's also mentioned in fcntl(2)). The popularity of actually handling SIGLOST may be indicated by the fact that no program in the Illumos source tree seems to set a signal handler for it.

Other versions of Unix mainly ignore the situation. The Linux kernel has a specific comment about this in fs/lockd/clntproc.c, which very briefly talks about the issue and picks ignoring it (apart from logging the kernel message "lockd: failed to reclaim lock for ..."). As far as I can tell from reading FreeBSD's sys/nlm/nlm_advlock.c, FreeBSD silently ignores any problems when it goes through the NFS client process of reclaiming locks.

(As far as I can see, NetBSD and OpenBSD don't support NFS locks on clients at all, rendering the issue moot. I don't know if POSIX locks fail on NFS mounted filesystems or if they work but create purely local locks on that particular NFS client, although I think it's the latter.)

On the surface this seems rather bad, and certainly worse than the Sun approach of SIGLOST. However, I'm not sure that SIGLOST is all that great either, because it has some problems. First, what you can do in a signal handler is very constrained; basically all that a SIGLOST handler can do is set a variable and hope that the rest of the code will check it before it does anything dangerous. Second, programs may hold multiple (NFS) locks and SIGLOST doesn't tell you which lock you lost; as far as I know, there's no way of telling. If your program gets a SIGLOST, all you can do is assume that you lost all of your locks. Third, file locking may quite reasonably be used inside libraries in a way that is hidden from callers by the library's API, but signals and handling signals is global to the entire program. If taking a file lock inside a library exposes the entire program to SIGLOST, you have a collection of problems (which ones depend on whether the program has its own file locks and whether or not it has installed a SIGLOST handler).

This collection of problems may go part of the way to explain why no Illumos programs actually set a SIGLOST handler and why other Unixes simply ignore the issue. A kernel that uses SIGLOST essentially means 'your program dies if it loses a lock', and it's not clear that this is better than 'your program optimistically continues', especially in an environment where a NFS client losing a NFS lock is rare (and letting the program continue is certainly simpler for the kernel).

The general problem of losing network based locks

By: cks
6 November 2024 at 03:38

There are many situations and protocols where you want to hold some sort of lock across a network between, generically, a client (who 'owns' the lock) and a server (who manages the locks on behalf of clients and maintains the locking rules). Because a network is involved, one of the broad problems that can happen in such a protocol is that the client can have a lock abruptly taken away from it by the server. This can happen because the server was instructed to break the lock, or the server restarted in some way and notified the clients that they had lost some or all of their locks, or perhaps there was a network partition that led to a lock timeout.

When the locking protocol and the overall environment is specifically designed with this in mind, you can try to require clients to specifically think about the possibility. For example, you can have an API that requires clients to register a callback for 'you lost a lock', or you can have specific error returns to signal this situation, or at the very least you can have a 'is this lock still valid' operation (or 'I'm doing this operation on something that I think I hold a lock for, give me an error if I'm wrong'). People writing clients can still ignore the possibility, just as they can ignore the possibility of other network errors, but at least you tried.

However, network locking is sometimes added to things that weren't originally designed for it. One example is (network) filesystems. The basic 'filesystem API' doesn't really contemplate locking and especially it doesn't consider that you can suddenly have access to a 'file' taken away from you in mid-flight. If you add network locking you don't have a natural answer to handling losing locks and there's no obvious point in the API to add it, especially if you want to pretend that your network filesystem is the same as a local filesystem. This makes it much easier for people writing programs to not even think about the possibility of losing a network lock during operation.

(If you're designing a purely networked filesystem-like API, you have more freedom; for example, you can make locking operations turn a regular 'file descriptor' into a special 'locked file descriptor' that you have to do subsequent IO through and that will generate errors if the lock is lost.)

One of the meta-problems with handling losing a network lock is that there's no single answer for what you should do about it. In some programs, you've violated an invariant and the only safe move for the program is to exit or crash. In some programs, you can pause operations until you can re-acquire the lock. In other programs you need to bail out to some sort of emergency handler that persists things in another way or logs what should have been done if you still held the lock. And when designing your API (or APIs) for losing locks, how likely you think each option is will influence what features you offer (and it will also influence how interested programs are in handling losing locks).

PS: A contributing factor to programmers and programs not being interested in handling losing network locks is that they're generally somewhere between uncommon and rare. If lots of people are writing code to deal with your protocol and losing locks are uncommon enough, some amount of those people will just ignore the possibility, just like some amount of programmers ignore the possibility of IO errors.

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks
5 November 2024 at 04:13

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu has a /etc/default/libvirt-guests that has commented out default values, while Fedora has no /etc/sysconfig/libvirt-guests so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).
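
The settings themselves are shell-style variables in whichever of those files your distribution uses. As a sketch of what turning on 'restore what was running' looks like (the variable names come from the libvirt-guests script; check the comments in your own version before trusting them):

# /etc/default/libvirt-guests on Ubuntu/Debian,
# /etc/sysconfig/libvirt-guests on Fedora
ON_BOOT=start          # start guests that were running when the host went down
ON_SHUTDOWN=shutdown   # cleanly shut guests down at host shutdown instead of suspending them

On Fedora you would also need to 'systemctl enable libvirt-guests.service', since it isn't enabled by default there.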

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

The history of Unix's ioctl and signal about window sizes

By: cks
4 November 2024 at 03:38

One of the somewhat obscure features of Unix is that the kernel has a specific interface to get (and set) the 'window size' of your terminal, and can also send a Unix signal to your process when that size changes. The official POSIX interface for the former is tcgetwinsize(), but in practice actual Unixes have a standard tty ioctl for this, TIOCGWINSZ (see eg Linux ioctl_tty(2) (also) or FreeBSD tty(4)). The signal is officially standardized by POSIX as SIGWINCH, which is the name it always has had. Due to a Fediverse conversation, I looked into the history of this today, and it turns out to be more interesting than I expected.

(The inclusion of these interfaces in POSIX turns out to be fairly recent.)
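
As an aside, you can see both pieces of this from an ordinary shell today. 'stty size' prints your terminal's current rows and columns (it gets them through TIOCGWINSZ under the hood), and shells such as Bash let you trap the signal. This is only an illustration; in an interactive shell the trap may not fire until the shell next redraws its prompt:

stty size
# prints 'rows columns' for the current terminal

trap 'echo "window is now: $(stty size)"' WINCH
# resize the terminal window and the message will (eventually) appear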

As far as I can tell, 4.2 BSD did not have either TIOCGWINSZ or SIGWINCH (based on its sigvec(2) and tty(4) manual pages). Both of these appear in the main BSD line in 4.3 BSD, where sigvec(2) has added SIGWINCH (as the first new signal along with some others) and tty(4) has TIOCGWINSZ. This timing makes a certain amount of sense in Unix history. At the time of 4.2 BSD's development and release, people were connecting to Unix systems using serial terminals, which had more or less fixed sizes that were covered by termcap's basic size information. By the time of 4.3 BSD in 1986, Unix workstations existed and with them, terminal windows that could have their size changed on the fly; a way of finding out (and changing) this size was an obvious need, along with a way for full-screen programs like vi to get notified if their terminal window was resized on the fly.

However, as far as I can tell 4.3 BSD itself did not originate SIGWINCH, although it may be the source of TIOCGWINSZ. The FreeBSD project has manual pages for a variety of Unixes, including 'Sun OS 0.4', which seems to be an extremely early release from early 1983. This release has a signal(2) with a SIGWINCH signal (using signal number 28, which is what 4.3 BSD will use for it), but no (documented) TIOCGWINSZ. However, it does have some programs that generate custom $TERMCAP values with the right current window sizes.

The Internet Archive has a variety of historical material from Sun Microsystems, including (some) documentation for both SunOS 2.0 and SunOS 3.0. This documentation makes it clear that the primary purpose of SIGWINCH was to tell graphical programs that their window (or one of them) had been changed, and they should repaint the window or otherwise refresh the contents (a program with multiple windows didn't get any indication of which window was damaged; the programming advice is to repaint them all). The SunOS 2.0 tgetent() termcap function will specifically update what it gives you with the current size of your window, but as far as I can tell there's no other documented support for getting window sizes; it's not mentioned in tty(4) or pty(4). Similar wording appears in the SunOS 3.0 Unix Interface Reference Manual.

(There are PDFs of some SunOS documentation online (eg), and up through SunOS 3.5 I can't find any mention of directly getting the 'window size'. In SunOS 4.0, we finally get a TIOCGWINSZ, documented in termio(4). However, I have access to SunOS 3.5 source, and it does have a TIOCGWINSZ ioctl, although that ioctl isn't documented. It's entirely likely that TIOCGWINSZ was added (well) before SunOS 3.5.)

According to this Git version of the original BSD development history, BSD itself added both SIGWINCH and TIOCGWINSZ at the end of 1984. The early SunOS had SIGWINCH and it may well have had TIOCGWINSZ as well, so it's possible that BSD got both from SunOS. It's also possible that early SunOS had a different (terminal) window size mechanism than TIOCGWINSZ, one more specific to their window system, and the UCB CSRG decided to create a more general mechanism that Sun then copied back by the time of SunOS 3.5 (possibly before the official release of 4.3 BSD, since I suspect everyone in the BSD world was talking to each other at that time).

PS: SunOS also appears to be the source of the mysteriously missing signal 29 in 4.3 BSD (mentioned in my entry on how old various Unix signals are). As described in the SunOS 3.4 sigvec() manual page, signal 29 is 'SIGLOST', "resource lost (see lockd(8C))". This appears to have been added at some point between the initial SunOS 3.0 release and SunOS 3.4, but I don't know exactly when.

I feel that NAT is inevitable even with IPv6

By: cks
3 November 2024 at 02:23

Over on the Fediverse, I said something unpopular about IPv6 and NAT:

Hot take: NAT is good even in IPv6, because otherwise you get into recursive routing and allocation problems that have been made quite thorny by the insistence of so many things that a /64 is the smallest block they will work with (SLAAC, I'm looking at you).

Consider someone's laptop running multiple VMs and/or containers on multiple virtual subnets, maybe playing around with (virtual) IPv6 routers too.

(Partly in re <other Fediverse post>.)

The basic problem is straightforward. Imagine that you're running a general use wired or wireless network, where people connect their devices. One day, someone shows up with a (beefy) laptop that they've got some virtual machines (or container images) with a local (IPv6) network that is 'inside' their laptop. What IPv6 network addresses do these virtual machines get when the laptop is connected to your network and how do you make this work?

In a world where IPv6 devices and software reliably worked on subnet sizes smaller than a /64, this would be sort of straightforward. Your overall subnet might be a /64, and you would give each device connecting to it a /96 via some form of prefix delegation. This would allow a large number of devices on your network and also for each device to sub-divide its own /96 for local needs, with lots of room for multiple internal subnets for virtual machines, containers, or whatever else.

(And if a device didn't signal a need for a prefix delegation, you could give it a single IPv6 address from the /64, which would probably be the common case.)

In a world where lots of things insist on being on an IPv6 /64, this is extremely not trivial. Hosts will show up that want zero, one, or several /64s delegated to them, and both you and they may need those multiple /64s to fit into the same larger allocation of a /63, a /62, or so on. Worse, if more hosts than you expected show up asking for more delegations than you budgeted for, you'll need to expand the overall allocation to the entire network and everything under it, which at a minimum may be disruptive. Also, the IPv6 address space is large, but if you chop off half of it, it's not that large, especially when you need to consume large blocks of it for contiguous delegations and sub-delegations and sub-sub delegations and so on.

I've described this as a laptop but there are other scenarios that are also perfectly reasonable. For example, suppose that you're setting up a subnet for a university research group that currently operates zero containers, virtual machine hosts, and the like (each of which would require at least one /64). Considering that research groups can and do change their mind on what they're running, how many additional /64s should you budget for them eventually needing, and what do you do when it turns out that they want to operate more than that?

IPv6 NAT gets you out of all of this. You assign an IPv6 address on your subnet's /64 to that laptop or server (or it SLAAC's one for itself), and everything else is its problem, not yours. Its containers and virtual machines get IPv6 addresses from some address space that's not your problem, and the laptop (or server) NATs all of their traffic back and forth. You don't have to know or care about how many internal networks the laptop (or server) is hiding, if it's got some sort of internal routing hierarchy, or anything.
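
For concreteness, here's a minimal sketch of what the laptop side of this could look like with nftables, assuming a made-up ULA subnet (fd00:c0de:1::/64) for the VMs and 'wlan0' as the laptop's uplink; treat it as an illustration, not a recipe:

table ip6 vmnat {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # NAT everything from the internal VM subnet out the uplink
        oifname "wlan0" ip6 saddr fd00:c0de:1::/64 masquerade
    }
}

You'd load this with 'nft -f <file>', give the VMs addresses from that ULA range, and point their default route at the laptop; nothing upstream has to know.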

I expect this use of IPv6 NAT to primarily be driven by the people with these laptops and servers, not by the people in charge of IPv6 network design. If you're someone with a laptop that has some containers or VMs that you need to work with, and you plug in to a network that isn't already specifically designed to accommodate you (for example it's just a /64), your practical choices are either IPv6 NAT or containers that can't talk to anything. The people running the network are pretty unlikely to redesign it for you (often their answer will be 'that's not supported on this network'), and if they do, the new network design is unlikely to be deployed immediately (or even very soon).

(I don't believe that delegating a single /64 to each machine is a particularly workable solution. It still leaves you with problems if any machine wants multiple internal IPv6 subnets, and it consumes your IPv6 address space at a prodigious rate if you're designing for a reasonable number of machines on each subnet. I'm also not sure how everyone on the subnet is supposed to know how to talk to each other, which is something that people often do on subnets.)

Notes on the compatibility of crypted passwords across Unixes in late 2024

By: cks
2 November 2024 at 02:31

For years now, all sorts of Unixes have been able to support better password 'encryption' schemes than the basic old crypt(3) salted-mutant-DES approach that Unix started with (these days it's usually called 'password hashing'). However, the support for specific alternate schemes varies from Unix to Unix, and has for many years. Back in 2010 I wrote some notes on the situation at the time; today I want to look at the situation again, since password hashing is on my mind right now.

The most useful resource for cross-Unix password hash compatibility is Wikipedia's comparison table. For Linux, support varies by distribution based on their choice of C library and what version of libxcrypt they use, and you can usually see a list in crypt(5), and pam_unix may not support using all of them for new passwords. For FreeBSD, their support is documented in crypt(3). In OpenBSD, this is documented in crypt(3) and crypt_newhash(3), although there isn't much to read since current OpenBSD only lists support for 'Blowfish', which for password hashing is also known as bcrypt. On Illumos, things are more or less documented in crypt(3), crypt.conf(5), and crypt_unix(7) and associated manual pages; the Illumos section 7 index provides one way to see what seems to be supported.

System administrators not infrequently wind up wanting cross-Unix compatibility of their local encrypted passwords. If you don't care about your shared passwords working on OpenBSD (or NetBSD), then the 'sha512' scheme is your best bet; it basically works everywhere these days. If you do need to include OpenBSD or NetBSD, you're stuck with bcrypt but even then there may be problems because bcrypt is actually several schemes, as Wikipedia covers.
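
If you want to see what these hashes actually look like, a reasonably modern 'openssl passwd' (1.1.1 or later) can generate sha512-crypt ('$6$...') strings; the password and salt here are obviously just examples:

openssl passwd -6 'examplepassword'
# produces a '$6$<salt>$<hash>' string, ie the 'sha512' scheme

openssl passwd -6 -salt 'abcdefgh' 'examplepassword'
# a fixed salt makes the output reproducible, handy for comparing machines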

Some recent Linux distributions seem to be switching to 'yescrypt' by default (including Debian, which means downstream distributions like Ubuntu have also switched). Yescrypt in Ubuntu is now old enough that it's probably safe to use in an all-Ubuntu environment, although your distance may vary if you have 18.04 or earlier systems. Yescrypt is not yet available in FreeBSD and may never be added to OpenBSD or NetBSD (my impression is that OpenBSD is not a fan of having lots of different password hashing algorithms and prefers to focus on one that they consider secure).

(Compared to my old entry, I no longer particularly care about the non-free Unixes, including macOS. Even Wikipedia doesn't bother trying to cover AIX. For our local situation, we may someday want to share passwords to FreeBSD machines, but we're very unlikely to care about sharing passwords to OpenBSD machines since we currently only use them in situations where having their own stand-alone passwords is a feature, not a bug.)

Pam_unix and your system's supported password algorithms

By: cks
1 November 2024 at 03:15

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.

You might innocently think, as I did, that all of the algorithms your system supports will also be supported by pam_unix, or more exactly will be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not always the case (or if it actually is, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should) be able to accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix password and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.
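
For concreteness, the algorithm choice is a single option word on the pam_unix 'password' line. A sketch of what this looks like on a Debian-style system (the exact file varies, and whether your pam_unix accepts 'yescrypt' at all is exactly the question at issue here):

# in /etc/pam.d/common-password or your distribution's equivalent
password   [success=1 default=ignore]   pam_unix.so obscure use_authtok try_first_pass yescrypt
# an older pam_unix might only accept, say, 'sha512' in that last position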

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.

Keeping your site accessible to old browsers is non-trivial

By: cks
31 October 2024 at 03:13

One of the questions you could ask about whether or not to block HTTP/1.0 requests is what this does to old browsers and your site's accessibility to (or from) them (see eg the lobste.rs comments on my entry). The reason one might care about this is that old systems can usually only use old browsers, so to keep it possible to still use old systems you want to accommodate old browsers. Unfortunately the news there is not really great, and taking old browsers and old systems seriously has a lot of additional effects.

The first issue is that old systems generally can't handle modern TLS and don't recognize modern certificate authorities, like Let's Encrypt. This situation is only going to get worse over time, as websites increasingly require TLS 1.2 or better (and then in the future, TLS 1.3 or better). If you seriously care about keeping your site accessible to old browsers, you need to have a fully functional HTTP version. Increasingly, it seems that modern browsers won't like this, but so far they're willing to put up with it. I don't know if there's any good way to steer modern visitors to your HTTPS version instead of your HTTP version.

(This is one area where modern browsers preemptively trying HTTPS may help you.)

Next, old browsers obviously only support old versions of CSS, if they have very much CSS support at all (very old browsers probably won't). This can present a real conflict; you can have an increasingly basic site design that sticks within the bounds of what will render well on old browsers, or you can have one that looks good to what's probably the vast majority of your visitors and may or may not degrade gracefully on old browsers. Your CSS, if any, will probably also be harder to write, and it may be hard to test how well it actually works on old browsers. Some modern accessibility features, such as adjusting to screen sizes, may be (much) harder to get. If you want a multi-column layout or a sidebar, you're going to be back in the era of table based layouts (which this blog has never left, mostly because I'm lazy). And old browsers also mean old fonts, although with fonts it may be easier to degrade gracefully down to whatever default fonts the browser has.

(If you use images, there's the issue of image sizes and image formats. Old browsers are generally used on low resolution screens and aren't going to be the fastest or the best at scaling images down, if you can get them to do it as well. And you need to stick to image formats that they support.)

It's probably not impossible to do all of this, and you can test some of it by seeing how your site looks in text mode browsers like Lynx (which only really supports HTTP/1.0, as it turns out). But it's certainly constraining; you have to really care, and it will cut you off from some things that are important and useful.

PS: I'm assuming that if you intend to be as fully usable as possible by old browsers, you're not even going to try to have JavaScript on your site.

Doing general address matching against varying address lists in Exim

By: cks
30 October 2024 at 02:23

In various Exim setups, you sometimes want to match an email address against a file (or in general a list) of addresses and some sort of address patterns; for example, you might have a file of addresses and so on that you will never accept as sender addresses. Exim has two different mechanisms for doing this, address lists and nwildlsearch lookups in files that are performed through the '${lookup}' string expansion item. Generally it's better to use address lists, because they have a wildcard syntax that's specifically focused on email addresses, instead of the less useful nwildlsearch lookup wildcarding.

Exim has specific features for matching address lists (including in file form) against certain addresses associated with the email message; for example, both ACLs and routers can match against the envelope sender address (the SMTP MAIL FROM) using 'senders = ...'. If you want to match against message addresses that are not available this way, you must use a generic 'condition =' operation and either '${lookup}' or '${if match_address {..}{...}}', depending on whether you want to use a nwildlsearch lookup or an actual address list (likely in a file). As mentioned, normally you'd prefer to use an actual address list.

Now suppose that your file of addresses is, for example, per-user. In a straight 'senders =' match this is no problem, you can just write 'senders = /some/where/$local_part_data/addrs'. Life is not as easy if you want to match a message address that is not directly supported, for example the email address of the 'From:' header. If you have the user (or whatever other varying thing) in $acl_m0_var, you would like to write:

condition = ${if match_address {${address:$h_from:}} {/a/dir/$acl_m0_var/fromaddrs} }

However, match_address (and its friends) have a deliberate limitation, which is that in common Exim build configurations they don't perform string expansion on their second argument.

The way around this turns out to be to use an explicitly defined and named 'addresslist' that has the string expansion:

addresslist badfromaddrs = /a/dir/$acl_m0_var/fromaddrs
[...]
  condition = ${if match_address {${address:$h_from:}} {+badfromaddrs} }

This looks weird, since at the point we're setting up badfromaddrs the $acl_m0_var is not even vaguely defined, but it works. The important thing that makes this go is a little sentence at the start of the Exim documentation's Expansion of lists:

Each list is expanded as a single string before it is used. [...]

Although the second argument of match_address is not string-expanded when used, if it specifies a named address list, that address list is string expanded when used and so our $acl_m0_var variable is substituted in and everything works.

Speaking from personal experience, it's easy to miss this sentence and its importance, especially if you normally use address lists (and domain lists and so on) without any string expansion, with fixed arguments.

(Probably the only reason I found it was that I was in the process of writing a question to the Exim mailing list, which of course got me to look really closely at the documentation to make sure I wasn't asking a stupid question.)

The question of whether to still allow HTTP/1.0 requests or block them

By: cks
29 October 2024 at 02:28

Recently, I discovered something and noted it on the Fediverse:

There are still a small number of things making HTTP/1.0 requests to my techblog. Many of them claim to be 'Chrome/124.<something>'. You know, I don't think I believe you, and I'm not sure my techblog should still accept HTTP/1.0 requests if all or almost all of them are malicious and/or forged.

The pure, standards-compliant answer to this is that of course you should still allow HTTP/1.0 requests. It remains a valid standard, and apparently some things may still default to it, and one part of the web's strength is its backward compatibility.

The pragmatic answer starts with the observation that HTTP/1.1 is now 25 years old, and any software that is talking HTTPS to you is demonstrably able to deal with standards that are more recent than that (generally much more recent, as sites require TLS 1.2 or better). And as a practical matter, pure HTTP/1.0 clients can't talk to many websites because such websites are name-based virtual hosts where the web server software absolutely requires a HTTP Host header before it will serve the website to you. If you leave out the Host header, at best you will get some random default site, perhaps a stub site.

(In a HTTPS context, web servers will also require TLS SNI and some will give you errors if the HTTP Host doesn't match the TLS SNI or is missing entirely. These days this causes HTTP/0.9 requests to be not very useful.)

If HTTP/1.0 requests were merely somewhere between a partial lie (in that everything that worked was actually supplying a Host header too) and useless (for things that didn't supply a Host), you could simply leave them be, especially if the volume was low. But my examination suggests strongly that approximately everything that is making HTTP/1.0 requests to Wandering Thoughts is actually up to no good; at a minimum they're some form of badly coded stealth spiders, quite possibly from would-be comment spammers that are trawling for targets. On a spot check, this seems to be true of another web server as well.

(A lot of the IPs making HTTP/1.0 requests provide claimed User-Agent headers that include ' Not-A.Brand/99 ', which appears to have been a Chrome experiment in putting random stuff in the User-Agent header. I don't see that in modern real Chrome user-agent strings, so I believe it's been dropped or de-activated since then.)

My own answer is that for now at least, I've blocked HTTP/1.0 requests to Wandering Thoughts. I'm monitoring what User-Agents get blocked, partly so I can perhaps exempt some if I need to, and it's possible I'll rethink the block entirely.
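
How you implement such a block depends on your setup. As a generic sketch (not how Wandering Thoughts does it, and assuming an Apache with mod_rewrite enabled; nginx and others have their own ways to match the request protocol):

RewriteEngine On
# reject anything that claims to be speaking HTTP/1.0
RewriteCond "%{SERVER_PROTOCOL}" "=HTTP/1.0"
RewriteRule "^" "-" [F]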

(Before you do this, you should certainly look at your own logs. I wouldn't expect there to be very many real HTTP/1.0 clients still out there, but the web has surprised me before.)

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks
28 October 2024 at 03:24

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-OATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.
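
It's easy enough to check what your own disks offer (the device name here is a placeholder):

ls -l /dev/disk/by-id/
# wwn-* and plain scsi-* symlinks, if present, don't embed the transport

udevadm info --query=property --name=/dev/sdX | grep -E '^(ID_WWN|ID_SERIAL)'
# the udev properties that the more stable names are generally built from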

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

The importance of name-based virtual hosts (websites)

By: cks
27 October 2024 at 03:25

I recently read Geoff Huston's The IPv6 Transition, which is actually about why that transition isn't happening. A large reason for that is that we've found ways to cope with the shortage of IPv4 addresses, and one of the things Huston points to here is the introduction of the TLS Server Name Indicator (SNI) as drastically reducing the demand for IPv4 addresses for web servers. This is a nice story, but in actuality, TLS SNI was late to the party. The real hero (or villain) in taming what would otherwise have been a voracious demand for IPv4 addresses for websites is the HTTP Host header and the accompanying idea of name-based virtual hosts. TLS SNI only became important much later, when a mass movement to HTTPS hosts started to happen, partly due to various revelations about pervasive Internet surveillance.

In what is effectively the pre-history of the web, each website had to have its own IP(v4) address (an 'IP-based virtual host', or just your web server). If a single web server was going to support multiple websites, it needed a bunch of IP aliases, one per website. You can still do this today in web servers like Apache, but it has long since been superseded with name-based virtual hosts, which require the browser to send a Host: header with the other HTTP headers in the request. HTTP Host was officially added in HTTP/1.1, but I believe that back in the days basically everything accepted it even for HTTP 1.0 requests and various people patched it into otherwise HTTP/1.0 libraries and clients, possibly even before HTTP/1.1 was officially standardized.

(Since HTTP/1.1 dates from 1999 or so, all of this is ancient history by now.)
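
You can see the mechanism directly from a shell by speaking minimal HTTP/1.1 at a server; the IP address and names here are made up for illustration:

printf 'GET / HTTP/1.1\r\nHost: site-a.example.org\r\nConnection: close\r\n\r\n' | nc 203.0.113.10 80
printf 'GET / HTTP/1.1\r\nHost: site-b.example.org\r\nConnection: close\r\n\r\n' | nc 203.0.113.10 80
# the same IP address serves two different websites purely because of the Host header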

TLS SNI only came along much later. The Wikipedia timeline suggests the earliest you might have reasonably been able to use it was in 2009, and that would have required you to use a bleeding edge Apache; if you were using an Apache provided by your 'Long Term Support' Unix distribution, it would have taken years more. At the time that TLS SNI was introduced this was okay, because HTTPS (still) wasn't really seen as something that should be pervasive; instead, it was for occasional high-importance sites.

One result of this long delay for TLS SNI is that for years, you were forced to allocate extra IPv4 addresses and put extra IP aliases on your web servers in order to support multiple HTTPS websites, while you could support all of your plain-HTTP websites from a single IP. Naturally this served as a subtle extra disincentive to supporting HTTPS on what would otherwise be simple name-based virtual hosts; the only websites that it was really easy to support were ones that already had their own IPs (sometimes because they were on separate web servers, and sometimes for historical reasons if you'd been around long enough, as we had been).

(For years we had a mixed tangle of name-based and ip-based virtual hosts, and it was often difficult to recover the history of just why something was ip-based instead of name-based. We eventually managed to reform it down to only a few web servers and a few IP addresses, but it took a while. And even today we have a few virtual hosts that are deliberately ip-based for reasons.)

Using pam_access to sometimes not use another PAM module

By: cks
26 October 2024 at 02:40

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Ensure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information ``pam_access' gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)

Having an emergency backup DNS resolver with systemd-resolved

By: cks
25 October 2024 at 03:08

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.
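
Mechanically, both the initial setup and the later fiddling can be done with resolvectl; the interface name, server IPs, and domain here are stand-ins for the real ones, and whatever manages your network may re-apply its own settings over them:

# normal state: internal resolvers first, 127.0.0.1 as the emergency backup
resolvectl dns eth0 192.0.2.11 192.0.2.12 127.0.0.1
resolvectl domain eth0 '~example.org'

# after the emergency is over, re-set the list without 127.0.0.1 to push
# resolved back to the internal resolvers
resolvectl dns eth0 192.0.2.11 192.0.2.12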

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

Doing basic policy based routing on FreeBSD with PF rules

By: cks
24 October 2024 at 03:26

Suppose, not hypothetically, that you have a FreeBSD machine that has two interfaces and these two interfaces are reached through different firewalls. You would like to ping both of the interfaces from your monitoring server because both of them matter for the machine's proper operation, but to make this work you need replies to your pings to be routed out the right interface on the FreeBSD machine. This is broadly known as policy based routing and is often complicated to set up. Fortunately FreeBSD's version of PF supports a basic version of this, although it's not well explained in the FreeBSD pf.conf manual page.

To make our FreeBSD machine reply properly to our monitoring machine's ICMP pings, or in general to its traffic, we need a stateful 'pass' rule with a 'reply-to':

B_IF="emX"
B_IP="10.x.x.x"
B_GW="10.x.x.254"
B_SUBNET="10.x.x.0/24"

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet from ! $B_SUBNET to $B_IP \
  keep state

(Here $B_IP is the machine's IP on this second interface, and we also need the second interface, the gateway for the second interface's subnet, and the subnet itself.)

As I discovered, you must put the 'reply-to' where it is here, although as far as I can tell the FreeBSD pf.conf manual page will only tell you that if you read the full BNF. If you put it at the end the way you might read the text description, you will get only opaque syntax errors.

We must specifically exclude traffic from the subnet itself to us, because otherwise this rule will faithfully send replies to other machines on the same subnet off to the gateway, which either won't work well or won't work at all. You can restrict the PF rule more narrowly, for example 'from { IP1 IP2 IP3 }' if those are the only off-subnet IPs that are supposed to be talking to your secondary interface.

(You may also want to match only some ports here, unless you want to give all incoming traffic on that interface the ability to talk to everything on the machine. This may require several versions of this rule, basically sticking the 'reply-to ...' bit into every 'pass in quick on ...' rule you have for that interface.)

This PF rule only handles incoming connections (including implicit ones from ICMP and UDP traffic). If we want to be able to route our outgoing traffic over our secondary interface by selecting a source address when you do things, we need a second PF rule:

pass out quick \
  route-to ($B_IF $B_GW) \
  inet from $B_IP to ! $B_SUBNET \
  keep state

Again we must specifically exclude traffic to our local network, because otherwise it will go flying off to our gateway, and also you can be more specific if you only want this machine to be able to connect to certain things using this gateway and firewall (eg 'to { IP1 IP2 SUBNET3/24 }', or you could use a port-based restriction).

(The PF rule can't be qualified with 'on $B_IF', because the situation where you need this rule is where the packet would not normally be going out that interface. Using 'on <the interface with your default route's gateway>' has some subtle differences in the semantics if you have more than two interfaces.)

Although you might innocently think otherwise, the second rule by itself isn't sufficient to make incoming connections to the second interface work correctly. If you want both incoming and outgoing connections to work, you need both rules. Possibly it would work if you matched incoming traffic on $B_IF without keeping state.

Having rate-limits on failed authentication attempts is reassuring

By: cks
23 October 2024 at 03:24

A while back I added rate-limits to failed SMTP authentication attempts. Mostly I did it because I was irritated at seeing all of the failed (SMTP) authentication attempts in logs and activity summaries; I didn't think we were in any actual danger from the usual brute force mass password guessing attacks we see on the Internet. To my surprise, having this rate-limit in place has been quite reassuring, to the point where I no longer even bother looking at the overall rate of SMTP authentication failures or their sources. Attackers are unlikely to make much headway or have much of an impact on the system.

Similarly, we recently updated an OpenBSD machine that has its SSH port open to the Internet from OpenBSD 7.5 to OpenBSD 7.6. One of the things that OpenBSD 7.6 brings with it is the latest version of OpenSSH, 9.8, which has per-source authentication rate limits (although they're not quite described that way and the feature is more general). This was also a reassuring change. Attackers wouldn't be getting into the machine in any case, but I have seen the machine use an awful lot of CPU at times when attackers were pounding away, and now they're not going to be able to do that.

(We've long had firewall rate limits on connections, but they have to be set high for various reasons including that the firewall can't tell connections that fail to authenticate apart from brief ones that did.)
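
(For the curious, the sort of firewall rate limit I mean looks roughly like the following in OpenBSD PF; this is a generic sketch with made-up numbers, not our actual rules. Note that max-src-conn-rate counts all connections, failed or not, which is part of why such limits have to be relatively generous.)

table <ssh-abusers> persist
block in quick from <ssh-abusers>
pass in on egress proto tcp to port ssh \
     keep state (max-src-conn-rate 6/60, overload <ssh-abusers> flush global)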

I can wave my hands about why it feels reassuring (and nice) to know that we have rate-limits in place for (some) commonly targeted authentication vectors. I know it doesn't outright eliminate the potential exposure, but I also know that it helps reduce various risks. Overall, I think of it as making things quieter, and in some sense we're no longer getting constantly attacked as much.

(It's also nice to hope that we're frustrating attackers and wasting their time. They do sort of have limits on how much time they have and how many machines they can use and so on, so our rate limits make attacking us more 'costly' and less useful, especially if they trigger our rate limits.)

PS: At the same time, this shows my irrationality, because for a long time I didn't even think about how many SSH or SMTP authentication attempts were being made against us. It was only after I put together some dashboards about this in our metrics system that I started thinking about it (and seeing temporary changes in SSH patterns and interesting SMTP and IMAP patterns). Had I never looked, I would have never thought about it.

Quoting and not quoting command substitution in the Bourne shell

By: cks
22 October 2024 at 02:49

Over on the Fediverse, I said something:

Bourne shell trivia of the day:
  var=$(program ...)
is the same as
  var="$(program ...)"
so the quotes are unnecessary.

But:
  program2 $(program ...)
is not the same as:
  program2 "$(program ..)"
and often the quotes are vital.

(I have been writing the variable assignment as var="$(...)" for ages without realizing that the quotes were unnecessary.)

This came about because I ran an old shell script through shellcheck, which recommended replacing its use of var=`...` with var=$(...), and then I got to wondering why shellcheck wasn't telling me to write the second as var="$(...)" for safety against multi-word expansions. The answer is of course that multi-word expansion doesn't happen in this context; even if the $(...) produces what would normally be multiple words of output, they're all assigned to 'var' as a single word.
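
To make the difference concrete, here's a tiny demonstration that should behave the same in any POSIX shell:

words() { echo "one  two three"; }

var=$(words)        # no word splitting; $var is exactly 'one  two three'
set -- $(words)     # unquoted substitution: split into three arguments
set -- "$(words)"   # quoted substitution: a single argument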

On the one hand, this is what you want; there's almost no circumstance where you want a command that produces multiple words of output to have the first word assigned to 'var' and then the rest interpreted as a command and its arguments. On the other hand, the Bourne shell is generally not known for being friendly about its quoting. It would be perfectly in character for the Bourne shell to require you to quote the '$(...)' even in variable assignment.

On the one hand, shellcheck doesn't complain about the quoted version and it's consistent with quoting $(...) in other circumstances (when it really does matter). On the other hand, you can easily forget or not know (as I did) that the quoting is unnecessary here, and then you can be alarmed when you see an unquoted 'var=$(...)' in the wild or have it suggested. Since I've mostly written the quoted version, I'll probably continue doing so in my scripts unless I'm dealing with a script that already has some unquoted examples, where I should probably make everything unquoted so that no one reading the script in the future ever thinks there's a difference between the two.

Two visions of 'software supply chain security'

By: cks
21 October 2024 at 03:04

Although the website that is insisting I use MFA if I want to use it to file bug reports doesn't use the words in its messages to me, we all know that the reason it is suddenly demanding I use MFA is what is broadly known as "software supply chain security" and the 'software supply chain' (which is a contentious name for deciding that you're going to rely on other people's open source code). In thinking about this, I feel that you can have (at least) two visions of "software supply chain security".

In one vision, software supply chain security is a collection of well intentioned moves and changes that are intended to make it harder for bad actors to compromise open source projects and their source code. For instance, all of the package repositories and other places where software is distributed try to get everyone to use multi-factor authentication, so people with the ability to publish new versions of packages can't get their (single) password compromised and have that password used by an attacker to publish a compromised version of their package. You might also expect to see people looking into heavily used, security critical projects to see if they have enough resources and then some moves to provide those resources.

In the other vision, software supply chain security is a way for corporations to avoid being blamed when there's a security issue in open source software that they've pulled into their products or their operations (or both). Corporations mostly don't really care about achieving actual security, especially since real security may not be legibly secure, but they are sensitive to blame, especially because it can result in lawsuits, fines, and other consequences. If a corporation can demonstrate that it was following convincing best practices to obtain secure (open source) software, maybe it can deflect the blame. And when doing this, it's useful if the 'best practices' are clearly legible and easy to assess, such as 'where we get open source software from insists on MFA'.

In the second vision, you might expect a big (corporate) push for visible but essentially performative 'security' steps, with relatively little difficult analysis of underlying root causes of various security risks, much less much of an attempt to address deep structural issues like sustainable open source maintenance.

(If you want an extremely crude measuring stick, you can simply ask "would this measure have prevented the XZ Utils backdoor". Generally the answer is 'no'.)

Forced MFA is effectively an annoying, harder to deal with second password

By: cks
20 October 2024 at 02:32

Suppose, not hypothetically, that some random web site you use is forcing you to enable MFA on your account, possibly an account that in practice you use only to do unimportant things like report issues on other people's open source software. I've written before how MFA is both 'simple' and non-trivial work, but that entry half assumed that you might actually care about the extra security benefits of MFA. If some random unimportant (to you) website is forcing you to get MFA, this goes out the window.

What the website is really doing is forcing you to enable a second password for your account, one that you must use in addition to your first password. Instead of using a password saved in your password manager of choice, you must now use the same saved password plus an additional password that is invariably slower and more work to produce. We understand today that websites that prevent you (or your password manager) from pasting in passwords and force you to type them out by hand are doing it wrong; well, that's what MFA is doing, except that often you're going to need a second device to get that password (whether that is a phone or a security key).

(For extra bonus points, losing the second 'password' alone may be enough to permanently lose your account on the website. At the very least, you're going to need to do a number of extra things to avoid this.)

My view is that if something unimportant is forcing MFA on you and you don't feel like giving up on the site entirely, you might as well use the simplest, easiest to use MFA approach that you can. If the website will never let you in with the second factor alone, then it's perfectly okay for it to be relatively or completely insecure, and in any case you don't need to make it any more secure than your existing password management. In fact you might as well put it in your existing password management if possible, although I suspect that there are no current password managers that will both hold your password for a site and (automatically) generate the related TOTP MFA codes to go with it.

(You can get this on the same device, when you log in from your smartphone using its saved passwords and whatever authenticator app you're using. Don't ask how this is actually 'multi-factor', since anyone with your unlocked phone can use both factors; almost everyone in the MFA space is basically ignoring the issue because it would be too inconvenient to take it seriously.)
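
If you want to see how little magic is involved, here's a minimal sketch (in Go, using only the standard library) of generating the ordinary RFC 6238 TOTP codes that authenticator apps produce, assuming the website uses plain TOTP. The only secret is the base32 seed the site shows you at enrollment time (the seed below is a stock made up example), so anything that can store your password can equally well store the seed and produce the codes.

package main

import (
    "crypto/hmac"
    "crypto/sha1"
    "encoding/base32"
    "encoding/binary"
    "fmt"
    "strings"
    "time"
)

// totp computes a six digit RFC 6238 code for a base32 seed at the
// given time, using the common 30 second step and HMAC-SHA1.
func totp(seed string, t time.Time) (string, error) {
    // Sites hand out the seed in base32, often lower cased and with
    // spaces or '=' padding; normalize it before decoding.
    s := strings.ToUpper(strings.ReplaceAll(seed, " ", ""))
    s = strings.TrimRight(s, "=")
    key, err := base32.StdEncoding.WithPadding(base32.NoPadding).DecodeString(s)
    if err != nil {
        return "", err
    }

    // The moving factor is the number of 30 second steps since the epoch.
    var counter [8]byte
    binary.BigEndian.PutUint64(counter[:], uint64(t.Unix()/30))

    // HMAC-SHA1 the counter, then do RFC 4226 dynamic truncation.
    mac := hmac.New(sha1.New, key)
    mac.Write(counter[:])
    sum := mac.Sum(nil)
    offset := sum[len(sum)-1] & 0x0f
    code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff

    return fmt.Sprintf("%06d", code%1000000), nil
}

func main() {
    // 'JBSWY3DPEHPK3PXP' is a common example seed, not a real one.
    code, err := totp("JBSWY3DPEHPK3PXP", time.Now())
    if err != nil {
        panic(err)
    }
    fmt.Println(code)
}

In other words, the 'second factor' here is just another stored secret plus a bit of arithmetic.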

Will this defeat the website's security goals for forcing MFA down your throat? Yes, absolutely. But that's their problem, not yours. You are under no obligation to take any website (or your presence on it) as seriously as it takes itself. MFA that is not helping anything you care about is an obstacle, not a service.

Of course, sauce for the goose is sauce for the gander, so if you're implementing MFA for your good local security needs, you should be considering if the people who have to use it are going to think of your MFA in this way. Maybe they shouldn't, but remember, people don't actually care about security (and people matter because security is people).

The Go module proxy and forcing Go to actually update module versions

By: cks
19 October 2024 at 03:05

Suppose, not hypothetically, that you have two modules, such as a program and a general module that it uses. Through working on the program, you realize that there are some bugs in the general module, so you fix them and then test them in the program by temporarily using a replace directive, or perhaps a workspace. Eventually you're satisfied with the changes to your module, so you commit them and push the change to the public repository. Now you want to update your program's go.mod to use the module version you've just pushed.

As lots of instructions will tell you, this is straightforward; you want some version of 'go get -u', perhaps 'go get -u .'. However, if you try this immediately, you may discover that Go is not updating the module's version. Nothing you do, not even removing the module from 'go.mod' and then go-get'ing it again, will make Go budge. As far as Go seems to be concerned, your module has not updated and the only available version is the previous one.

(It's possible that 'go get -u <module>@latest' will work here, I didn't think to try it when this happened to me.)

As far as I can tell, what is going on here is the Go module proxy. By default, 'go get' will consult the (public) Go module proxy, and the Go module proxy can have a delay between when you push an update to the public repositories and when the module proxy sees it. I assume that under the hood there are various sorts of rate limiting and other caching, since I expect neither the Go proxy nor the various forges out there want the Go proxy to query forges on every single request just in case an infrequently updated module has been updated this time around.

The blunt hammer way of defeating this is to force 'go get -u' to not use the Go module proxy, with 'GOPROXY=direct go get -u'. This will force Go to directly query the public source and so make it notice your just-pushed update.

PS: If you tagged a new version I believe you can hand edit your go.mod to have the new version. This is more difficult if your module is not officially released, has no version tags, and is using the 'v0.0.0-<git information>' format in go.mod.

PPS: Possibly there is another way you're supposed to do this. If so, it doesn't seem to be well documented.

Syndication feed readers now seem to leave Last-Modified values alone

By: cks
18 October 2024 at 03:08

A HTTP conditional GET is a way for web clients, such as syndication feed readers, to ask for a new copy of a URL only if the URL has changed since they last fetched it. This is obviously appealing for things, like syndication feed readers, that repeatedly poll URLs that mostly don't change, although syndication feed readers not infrequently get parts of this wrong. When a client makes a conditional GET, it can present an If-Modified-Since header, an If-None-Match header, or both. In theory, the client's If-None-Match value comes from the server's ETag, which is an opaque value, and the If-Modified-Since comes from the server's Last-Modified, which is officially a timestamp but which I maintain is hard to compare except literally.

I've long believed and said that many clients treat the If-Modified-Since header as a timestamp and so make up their own timestamp values; one historical example is Tiny Tiny RSS, and another is NextCloud-News. This belief led me to consider pragmatic handling of partial matches for HTTP conditional GET, and due to writing that entry, it also led me to actually instrument DWiki so I could see when syndication feed clients presented If-Modified-Since timestamps that were after my feed's Last-Modified. The result has surprised me. Out of the currently allowed feed fetchers, almost no syndication feed fetcher seems to present its own, later timestamp in requests, and on spot checks, most of them don't use too-old timestamps either.

(Even Tiny Tiny RSS may have changed its ways since I last looked at its behavior, although I'm keeping my special hack for it in place for now.)

Out of my reasonably well behaved, regular feed fetchers (other than Tiny Tiny RSS), only two uncommon ones regularly present timestamps after my Last-Modified value. And there are a lot of different User-Agents that managed to do a successful conditional GET of my syndication feed.

(There are, unfortunately, quite a lot of User-Agents that fetched my feed but didn't manage even a single successful conditional GET. But that's another matter, and some of them may simply poll very infrequently. It would take me a lot more work to correlate this with which requests didn't even try any conditional GETs.)

This genuinely surprises me, and means I have to revise my belief that everyone mangles If-Modified-Since. Mostly they don't. As a corollary, parsing If-Modified-Since strings into timestamps and doing timestamp comparisons on them is probably not worth it, especially if Tiny Tiny RSS has genuinely changed.

(My preliminary data also suggests that almost no one has a different timestamp but a matching If-None-Match value, so my whole theory on pragmatic partial matches is irrelevant. As mentioned in an earlier entry, some feed readers get it wrong the other way around.)

PS: I believe that rachelbythebay's more systematic behavioral testing of feed readers has unearthed a variety of feed readers that have more varied If-Modified-Since behavior than I'm seeing; see eg this recent roundup. So actual results on your website may vary significantly depending on your readers and what they use.

Our various different types of Ubuntu installs

By: cks
17 October 2024 at 02:15

In my entry on how we have lots of local customizations I mentioned that the amount of customization we do to any particular Ubuntu server depends on what class or type of machine they are. That's a little abstract, so let's talk about how our various machines are split up by type.

Our general install framework has two pivotal questions that categorize machines. The first question is what degree of NFS mounting the machine will do, with the choices being: NFS mounting all of the NFS filesystems from our fileservers (more or less); NFS mounting just our central administrative filesystem, either with our full set of accounts or with just staff accounts; rsync'ing that central administrative filesystem (which implies only staff accounts); or being a completely isolated machine that doesn't have even the central administrative filesystem.

Servers that people will use have to have all of our NFS filesystems mounted, as do things like our Samba and IMAP servers. Our fileservers don't cross-mount NFS filesystems from each other, but they do need a replicated copy of our central administrative filesystem and they have to have our full collection of logins and groups for NFS reasons. Many of our more stand-alone, special purpose servers only need our central administrative filesystem, and will either NFS mount it or rsync it depending on how fast we want updates to propagate. For example, our local DNS resolvers don't particularly need fast updates, but our external mail gateway needs to be up to date on what email addresses exist, which is propagated through our central administrative filesystem.

On machines that have all of our NFS mounts, we have a further type choice; we can install them either as a general login server (called an 'apps' server for historical reasons), as a 'comps' compute server (which includes our SLURM nodes), or only install a smaller 'base' set of packages on them (which is not all that small; we used to try to have a 'core' package set and a larger 'base' package set but over time we found we never installed machines with only the 'core' set). These days the only difference between general login servers and compute servers is some system settings, but in the past they used to have somewhat different package sets.

The general login servers and compute servers are mostly not further customized (there are a few exceptions, and SLURM nodes need a bit of additional setup). Almost all machines that get only the base package set are further customized with additional packages and specific configuration for their purpose, because the base package set by itself doesn't make the machine do anything much or be particularly useful. These further customizations mostly aren't scripted (or otherwise automated) for various reasons. The one big exception is installing our NFS fileservers, which we decided was both a big enough job and something we do often enough that we wanted to script it so that everything came out the same.

As a practical matter, the choice between NFS mounting our central administrative filesystem (with only staff accounts) and rsync'ing it makes almost no difference to the resulting install. We tend to think of the two types of servers it creates as almost equivalent and mostly lump them together. So as far as operating our machines goes, we mostly have 'all NFS mounts' machines and 'only the administrative filesystem' machines, with a few rare machines that don't have anything (and our NFS fileservers, which are special in their own way).

(In the modern Linux world of systemd, much of our customizations aren't Ubuntu specific, or even specific to Debian and derived systems that use apt-get. We could probably switch to Debian relatively easily with only modest changes, and to an RPM based distribution with more work.)

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks
16 October 2024 at 03:30

Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories, and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)

We have lots of local customizations (and how we keep track of them)

By: cks
15 October 2024 at 03:02

In a comment on my entry on forgetting some of our local changes to our Ubuntu installs, pk left an interesting and useful comment on how they manage changes so that the changes are readily visible in one place. This is a very good idea and we do something similar to it, but a general limitation of all such approaches is that it's still hard to remember all of your changes off the top of your head once you've made enough of them. Once you're changing enough things, you generally can't put them all in one directory that you can simply 'ls' to be reminded of everything you change; at best, you're looking at a list of directories where you change things.

Our system for customizing Ubuntu stores the master version of customizations in our central administrative filesystem, although split across several places for convenience. We broadly have one directory hierarchy for Ubuntu release specific files (or at least ones that are potentially version specific; in practice a lot are the same between different Ubuntu releases), a second hierarchy (or two) for files that are generic across Ubuntu versions (or should be), and then a per-machine hierarchy for things specific to a single machine. Each hierarchy mirrors the final filesystem location, so that our systemd unit files will be in, for example, <hierarchy root>/etc/systemd/system.

Our current setup embeds the knowledge of what files will or won't be installed on any particular class of machines into the Ubuntu release specific 'postinstall' script that we run to customize machines, in the form of a whole bunch of shell commands to copy each of the files (or collections of files). This gives us straightforward handling of files that aren't always installed (or that vary between types of machines), at the cost of making it a little unclear if a particular file in the master hierarchy will actually be installed. We could try to do something more clever, but it would be less obvious than the current straightforward approach, where the postinstall script has a lot of 'cp -a <src>/etc/<file> /etc/<file>' commands and it's easy to see what you need to do to add a file or handle one specially.

(The obvious alternate approach would be to have a master file that listed all of the files to be installed on each type of machine. However, one advantage of the current approach is that it's easy to have various commentary about the files being installed and why, and it's also easy to run commands, install packages, and so on in between installing various files. We don't install them all at once.)

Based on some brute force approximation, it appears that we install around 100 customization files on a typical Ubuntu machine (we install more on some types of machines than on other types, depending on whether the machine will have all of our NFS mounts and whether or not it's a machine regular people will log in to). Specific machines can be significantly customized beyond this; for example, our ZFS fileservers get an additional scripted customization pass.

PS: The reason we have this stuff scripted and stored in a central filesystem is that we have over a hundred servers and a lot of them are basically identical to each other (most obviously, our SLURM nodes). In aggregate, we install and reinstall a fair number of machines and almost all of them have this common core.

Our local changes to standard (Ubuntu) installs are easy to forget

By: cks
14 October 2024 at 03:08

We have been progressively replacing a number of old one-off Linux machines with up to date replacements that run Ubuntu and so are based on our standard Ubuntu install. One of those machines has a special feature where a group of people are allowed to use passworded sudo to gain access to a common holding account. After we deployed the updated machine, these people got in touch with us to report that something had gone wrong with the sudo system. This was weird to me, because I'd made sure to faithfully replicate the old system's sudo customizations to the new one. When I did some testing, things got weirder; I discovered that sudo was demanding the root password instead of my password. This was definitely not how things were supposed to work for this sudo access (especially since the people with sudo access don't know the root password for the machine).

Whether or not sudo does this is controlled by the setting of 'rootpw' in sudoers or one of the files it includes (at least with Ubuntu's standard sudo.conf). The stock Ubuntu sudoers doesn't set 'rootpw', and of course this machine's sudoers customizations didn't set it either. But when I looked around, I discovered that we had long ago set up an /etc/sudoers.d customization file to set 'rootpw' and made it part of our standard Ubuntu install. When I rebuilt this machine based on our standard Ubuntu setup, the standard install stuff had installed this sudo customization. Since we'd long ago completely forgotten about its existence, I hadn't remembered it while customizing the machine to its new purpose, so it had stayed.

(We don't normally use passworded sudo, and we definitely want access to root to require someone to know the special root password, not just the password to a sysadmin's account.)

There are probably a lot of things that we've added to our standard install over the years that are like this sudo customization. They exist to make things work (or not work), and as long as they keep quietly doing their jobs it's very easy to forget them and their effects. Then we do something exceptional on a machine and they crop up, whether it's preventing sudo from working like we want it to or almost giving us a recursive syslog server.

(I don't have any particular lesson to draw from this, except that it's surprisingly difficult to de-customize a machine. One might think the answer is to set up the machine from scratch outside our standard install framework, but the reality is that there's a lot from the standard framework that we still want on such machines. Even with issues like this, it's probably easier to install them normally and then fix the issues than do a completely stock Ubuntu server install.)

Some thoughts on why 'inetd activation' didn't catch on

By: cks
13 October 2024 at 02:06

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them; it dates from the era of 4.3 BSD. In theory inetd can act as a service manager of sorts for daemons like the BSD r* commands, saving them from having to implement things like daemonization, and in fact it turns out that one version of this is how these daemons were run in 4.3 BSD. However, running daemons under inetd never really caught on (even in 4.3 BSD some important daemons ran outside of inetd), and these days it's basically dead. You could ask why, and I have some thoughts on that.

The initial version of inetd only officially supported running TCP services in a mode where each connection ran a new instance of the program (call this the CGI model). On the machines of the 1980s and 1990s, this wasn't a particularly attractive way to run anything but relatively small and simple programs (and ones that didn't have to do much work on startup). In theory you could possibly run TCP services in a mode where they were passed the server socket and then accepted new connections themselves for a while; in practice, no one seems to have really written daemons that supported this. Daemons that supported an 'inetd mode' generally meant the 'run a copy of the program for each connection' mode.

(Possibly some of them supported both modes of inetd operation, but system administrators would pretty much assume that if a daemon's documentation said just 'inetd mode' that it meant the CGI model.)

Another issue is that inetd is not a service manager. It will start things for you, but that's it; it won't shut down things for you (although you can get it to stop listening on a port), and it won't tell you what's running (you get to inspect the process list). On Unixes with a System V init system or something like it, running your daemons as standalone things gave you 'start', 'stop', 'restart', 'status', and so on as service management options that might even work (depending on the quality of the init.d scripts involved). Since daemons had better usability when run as standalone services, system administrators and others had relatively little reason to push for inetd support, especially in the second mode.

In general, running any important daemon under inetd has many of the same downsides as systemd socket activation of services. As a practical matter, system administrators like to know that important daemons are up and running right away, and that they don't have some hidden issue that will cause them to fail to start just when you want them. The normal CGI-like inetd mode also means that any changes to configuration files and the like take effect right away, which may not be what you want; system administrators tend to like controlling when daemons restart with new configurations.

All of this is likely tied to what we could call 'cultural factors'. I suspect that authors of daemons perceived running standalone as the more serious and prestigious option, the one for serious daemons like named and sendmail, and inetd activation to be at most a secondary feature. If you wrote a daemon that only worked with inetd activation, you'd practically be proclaiming that you saw your program as a low importance thing. This obviously reinforces itself, to the degree that I'm surprised sshd even has an option to run under inetd.

(While some Linuxes are now using systemd socket activation for sshd, they aren't doing it via its '-i' option.)

PS: There are some services that do still generally run under inetd (or xinetd, often the modern replacement, cf). For example, I'm not sure if the Amanda backup system even has an option to run its daemons as standalone things.

Potential pragmatic handling of partial matches for HTTP conditional GET

By: cks
12 October 2024 at 02:02

In HTTP, a conditional GET is a GET request that potentially can be replied with a HTTP '304 Not Modified' status; this is quite useful for polling relatively unchanging resources like syndication feeds (although syndication feed readers don't always do so well at it). Generally speaking, there are two potential validators for conditional GET requests; the If-None-Match header, validated against the ETag of the reply, and the If-Modified-Since header, validated against the Last-Modified of the reply. A HTTP client can remember and use either or both of your ETag and your Last-Modified values (assuming you provide both).

When a HTTP client sends both If-Modified-Since and If-None-Match, the fully correct, specifications compliant validation is to require both to match. This makes intuitive sense; both your ETag and your Last-Modified values are part of the state of what you're replying with, and if one doesn't match, the client has a different view of the URL's state than you do so you shouldn't claim it's 'not modified' from their state. Instead you should return the entire response so that they can update their view of your Last-Modified state.

In practice, two things potentially get in the way. First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified. Programs will put in what's probably some default time value, they'll use timestamps from internal events, and various other fun things. By contrast, your ETag value is opaque and has no meaning for programs to interpret, guess at, and make up; if a HTTP client sends an ETag, it's very likely to be one you provided (although this isn't certain). Second, it's not unusual for your ETag to be a much stronger validator than your Last-Modified; for example, your ETag may be a cryptographic hash of the contents and will definitely change if they do, while your Last-Modified is an imperfect approximation and may not change even if the content does.

In this situation, if a client presents an If-None-Match header that matches your current ETag and a If-Modified-Since that doesn't match your Last-Modified, it's extremely likely that they have your current content but have done one of the many things that make their 'timestamp' not match your Last-Modified. If you know you have a strong validator in your ETag and they're doing something like fetching your syndication feed (where it's very likely that they're going to do this a lot), it's pragmatically tempting to give them a HTTP 304 response even though you're technically not supposed to.

To reduce the temptation, you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp (if you can parse their value that way), and giving people a HTTP 304 response if their timestamp is equal to or after yours. This is what I'd do today given how people actually handle If-Modified-Since, and it would work around many of the bad things that people do with If-Modified-Since (since usually they'll create timestamps that are more recent than your Last-Modified, although not always).
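
To make that concrete, here's a minimal Go sketch of this sort of validation (it's not the actual code used here). It insists that If-None-Match match our ETag if the client sent one, and it accepts an If-Modified-Since that either literally matches our Last-Modified or parses to a timestamp at or after it.

package main

import (
    "fmt"
    "net/http"
)

// isNotModified reports whether a conditional GET can be answered
// with a HTTP 304, given our current ETag and Last-Modified values
// and the client's If-None-Match and If-Modified-Since headers
// (empty strings mean the header wasn't sent).
func isNotModified(etag, lastMod, inm, ims string) bool {
    // If the client sent If-None-Match, it must match our ETag.
    if inm != "" && inm != etag {
        return false
    }
    // A matching If-None-Match with no If-Modified-Since is enough.
    if inm != "" && ims == "" {
        return true
    }
    if ims == "" {
        return false
    }
    // A literal If-Modified-Since match is always acceptable.
    if ims == lastMod {
        return true
    }
    // The relaxed version: treat If-Modified-Since as a timestamp and
    // accept anything at or after our Last-Modified, since clients
    // often invent their own (usually later) values.
    imsT, err1 := http.ParseTime(ims)
    lmT, err2 := http.ParseTime(lastMod)
    return err1 == nil && err2 == nil && !imsT.Before(lmT)
}

func main() {
    lm := "Mon, 07 Oct 2024 12:00:07 GMT"
    fmt.Println(isNotModified(`"abc"`, lm, `"abc"`, "Mon, 07 Oct 2024 14:00:00 GMT")) // true
    fmt.Println(isNotModified(`"abc"`, lm, `"old"`, lm))                              // false
}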

Despite everything I've written above, I don't know if this happens all that often. It's entirely possible that syndication feed readers and other programs that invent things for their If-Modified-Since values are also not using If-None-Match and ETag values. I've recently added instrumentation to the software here so that I can tell, so maybe I'll have more to report soon.

(If I was an energetic person I would hunt through the data that rachelbythebay has accumulated in their feed reader behavioral testing project to see what it has to say about this (the most recent update for which is here and I don't know of an overall index, see their archives). However, I'm not that energetic.)

Linux software RAID and changing your system's hostname

By: cks
11 October 2024 at 03:46

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that, mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)

PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems and your /etc/fstab specifies the filesystem's device using 'UUID=' or '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu), the /dev/mdN name never matters.

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)

The history of inetd is more interesting than I expected

By: cks
10 October 2024 at 03:11

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them. When inetd listens on a port, it can act in two different modes. In the simplest mode, it starts a separate copy of the configured program for every connection (much like the traditional HTTP CGI model), which is an easy way to implement small, low volume services but usually not good for bigger, higher volume ones. The second mode is more like modern 'socket activation'; when a connection comes in, inetd starts your program and passes it the master socket, leaving it to you to keep accepting and processing connections until you exit.

(In inetd terminology, the first mode is 'nowait' and the second is 'wait'; this describes whether inetd immediately resumes listening on the socket for connections or waits until the program exits.)
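
To make the two modes concrete, here's a toy Go sketch of what each one looks like from the service's side (this is purely an illustration, not based on any real inetd service).

package main

import (
    "bufio"
    "fmt"
    "net"
    "os"
    "strings"
)

// nowait mode: inetd has already accepted a connection and made it
// our standard input and output, so we just talk on them and exit.
func nowait() {
    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        // A toy 'echo it back in upper case' protocol.
        fmt.Println(strings.ToUpper(in.Text()))
    }
}

// wait mode: inetd hands us the listening socket itself as file
// descriptor 0 and stops listening until we exit, so we do our own
// accepting for as long as we feel like it.
func waitmode() {
    ln, err := net.FileListener(os.NewFile(0, "inetd-socket"))
    if err != nil {
        os.Exit(1)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            return
        }
        go func(c net.Conn) {
            defer c.Close()
            fmt.Fprintln(c, "hello from a wait mode service")
        }(c)
    }
}

func main() {
    // Which function applies depends on how the inetd.conf entry is
    // written; 'nowait' is the common case.
    nowait()
}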

Inetd turns out to have a more interesting history than I expected, and it's a history that's entwined with daemonization, especially with how the BSD r* commands daemonize themselves in 4.2 BSD. If you'd asked me before I started writing this entry, I'd have said that inetd was present in 4.2 BSD and was being used for various low-importance services. This turns out to be false in both respects. As far as I can tell, inetd was introduced in 4.3 BSD, and when it was introduced it was immediately put to use for important system daemons like rlogind, telnetd, ftpd, and so on, which were surprisingly run in the first style (with a copy of the relevant program started for each connection). You can see this in the 4.3 BSD /etc/inetd.conf, which has the various TCP daemons and lists them as 'nowait'.

(There are still network programs that are run as stand-alone daemons, per the 4.3 BSD /etc/rc and the 4.3 BSD /etc/rc.local. If we don't count syslogd, the standard 4.3 BSD tally seems to be rwhod, lpd, named, and sendmail.)

While I described inetd as having two modes and this is the modern state, the 4.3 BSD inetd(8) manual page says that only the 'start a copy of the program every time' mode ('nowait') is to be used for TCP programs like rlogind. I took a quick read over the 4.3 BSD inetd.c and it doesn't seem to outright reject a TCP service set up with 'wait', and the code looks like it might actually work with that. However, there's the warning in the manual page and there's no inetd.conf entry for a TCP service that is 'wait', so you'd be on your own.

The corollary of this is that in 4.3 BSD, programs like rlogind don't have the daemonization code that they did in 4.2 BSD. Instead, the 4.3 BSD rlogind.c shows that it can only be run under inetd or some equivalent, as rlogind immediately aborts if its standard input isn't a socket (and it expects the socket to be connected to some other end, which is true for the 'nowait' inetd mode but not how things would be for the 'wait' mode).

This 4.3 BSD inetd model seems to have rapidly propagated into BSD-derived systems like SunOS and Ultrix. I found traces that relatively early on, both of them had inherited the 4.3 style non-daemonizing rlogind and associated programs, along with an inetd-based setup for them. This is especially interesting for SunOS, because it was initially derived from 4.2 BSD (I'm less sure of Ultrix's origins, although I suspect it too started out as 4.2 BSD derived).

PS: I haven't looked to see if the various BSDs ever changed this mode of operation for rlogind et al, or if they carried the 'per connection' inetd based model all through until each of them removed the r* commands entirely.

OpenBSD kernel messages about memory conflicts on x86 machines

By: cks
9 October 2024 at 02:44

Suppose you boot up an OpenBSD machine that you think may be having problems, and as part of this boot you look at the kernel messages for the first time in a while (or perhaps ever), and when doing so you see messages that look like this:

3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000

Or maybe the messages are like this:

memory map conflict 0xe00fd000/0x1000
memory map conflict 0xfe000000/0x11000
[...]
3:0:0: mem address conflict 0xfffc0000/0x40000
3:0:1: mem address conflict 0xfffc0000/0x40000

This sounds alarming, but there's almost certainly no actual problem, and if you check logs you'll likely find that you've been getting messages like this for as long as you've had OpenBSD on the machine.

The short version is that both of these are reports from OpenBSD that it's finding conflicts in the memory map information it is getting from your BIOS. The messages that start with 'X:Y:Z' are about PCI(e) device memory specifically, while the 'memory map conflict' errors are about the general memory map the BIOS hands the system.

Generally, OpenBSD will report additional information immediately after about what the PCI(e) devices in question are. Here are the full kernel messages around the 'rom address conflict':

pci3 at ppb2 bus 3
3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000
bge0 at pci3 dev 0 function 0 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy0 at bge0 phy 1: BCM5720C 10/100/1000baseT PHY, rev. 0
bge1 at pci3 dev 0 function 1 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy1 at bge1 phy 2: BCM5720C 10/100/1000baseT PHY, rev. 0

Here these are two network ports on the same PCIe device (more or less), so it's not terribly surprising that the same ROM is maybe being reused for both. I believe the two messages mean that both ROMs (at the same address) are conflicting with another unmentioned allocation. I'm not sure how you find out what the original allocation and device is that they're both conflicting with.

The PCI related messages come from sys/dev/pci/pci.c and in current OpenBSD come in a number of variations, depending on what sort of PCI address space is detected as in conflict in pci_reserve_resources(). Right now, I see 'mem address conflict', 'io address conflict', the already mentioned 'rom address conflict', 'bridge io address conflict', 'bridge mem address conflict' (in several spots in the code), and 'bridge bus conflict'. Interested parties can read the source for more because this exhausts my knowledge on the subject.

The 'memory map conflict' message comes from a different place; for most people it will come from sys/arch/amd64/pci/pci_machdep.c, in pci_init_extents(). If I'm understanding the code correctly, this is creating an initial set of reserved physical address space that PCI devices should not be using. It registers each piece of bios_memmap, which according to comments in sys/arch/amd64/amd64/machdep.c is "the memory map as the bios has returned it to us". I believe that a memory map conflict at this point says that two pieces of the BIOS memory map overlap each other (or one is entirely contained in the other).

I'm not sure it's correct to describe these messages as harmless. However, it's likely that they've been there for as long as your system's BIOS has been setting up its general memory map and the PCI devices as it has been, and you'd likely see the same address conflicts with another system (although Linux doesn't seem to complain about it; I don't know about FreeBSD).

Things syndication feed readers do with 'conditional GET'

By: cks
8 October 2024 at 02:54

In HTTP, a conditional GET is a nice way of saving bandwidth (but not always work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often try fetches much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other things did conditional GETs to fetch feeds doesn't mean that they actually do it. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).

For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.

  • Some or perhaps all versions of NextCloud-News send an If-Modified-Since header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is always going to fail validation and turn into a regular GET request, whether you compare If-Modified-Since values literally or consider them as a timestamp and do timestamp comparisons. NextCloud-News might as well not bother sending an If-Modified-Since header at all.

  • A number of feed readers appear to only update their stored ETag value for your feed if your Last-Modified value also changes. In practice there are a variety of things that can change the ETag without changing the Last-Modified value, and some of them regularly happen here on Wandering Thoughts, which causes these feed readers to effectively decay into doing unconditional GET requests the moment, for example, someone leaves a new comment.

  • One feed reader sends If-Modified-Since values that use a numeric time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is also not a reformatted version of a timestamp I've ever given out, and is after the current Last-Modified value at the time the request was made. This client reliably attempts to pull my feed three times a day, at 02:00, 08:00, and 20:00, and the times of the If-Modified-Since values for those fetches are reliably 00:00, 06:00, and 12:00 respectively.

    (I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)

  • Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were set after the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (ie, it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.

All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which shows a conditional GET succeeding). Probably there are more fun things lurking out there.

(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)

DKIM signatures from mailing list providers don't mean too much

By: cks
7 October 2024 at 02:43

Suppose, hypothetically, that you're a clever email spammer and you'd like to increase the legitimacy of your (spam) email by giving it a good DKIM signature, such as a DKIM signature from a reasonably reputable provider of mailing list services. The straightforward way to do this is to sign up to the provider, upload your spam list, and send your email to it; the provider will DKIM sign your message on the way through. However, if you do this you'll generally get your service cancelled and have to go through a bunch of hassles to get your next signup set up. Unfortunately for everyone else, it's possible for spammers to do better.

The spammer starts by signing up to the provider and setting up a mailing list. However, they don't upload a bunch of addresses to it. Instead, they set the list to be as firmly anti-spam, 'confirmed opt in through the provider' as the provider supports. Then they use a bunch of email addresses under their own control to sign up to the mailing list, opt in to everything, and so on. They may then even spend a bit of time sending marketing emails to their captive mailing list of their own addresses, which will of course not complain in the least.

Then the spammer sends their real spam mailing to the mailing list, goes to one of their captive addresses, copies the entire raw message, headers and all, and strips out the 'Received:' headers that were added after the message left the mailing list provider. Then they go to their (rented) spam sending infrastructure and queue up sending this message to the real targets, setting it to have a '<>' null SMTP MAIL FROM. This message has a valid DKIM signature put on by the mailing list provider and its SMTP envelope sender is not (quite) in conflict with it. The only thing that will give the game away is inspecting the Received: headers, which will show it coming from some random IP, with no headers that explain how it got from the mailing list provider to that random IP.

The spammer set up their mailing list to be so strictly anti-spam in order to deflect complaints submitted to the mailing list provider, especially more or less automatic ones created by people clicking on 'report this as spam' in their mail environment (which will often use headers put in by the mailing list provider). The mailing list provider will get the complaint(s) and hopefully not do much to the spammer overall, because all of the list members have fully confirmed subscriptions, a history of successful deliveries of past messages that look much like the latest one, and so on.

I don't know if any spammers are actively doing this, but I have recently seen at least one spammer that's doing something like it. Our mail system has logged a number of incoming (spam) messages with a null SMTP envelope sender that come from random IPs but that have valid DKIM signatures for various places. In some cases we have captured headers that suggest a pattern like this.

(You can also play this trick with major providers of free mail services; sign up, send email from them to some dummy mail address, and take advantage of the DKIM signature that they'll put on their outgoing messages. The abuse handling groups at those places will most likely take a look at the 'full' message headers and say 'it's obviously not from us', and they may not have the tools to even try to verify the DKIM signature to see that actually, it is from them.)

Daemonization in Unix programs is probably about restarting programs

By: cks
6 October 2024 at 02:55

It's standard for Unix daemon programs to 'daemonize' themselves when they start, completely detaching from how they were run; this behavior is quite old and these days it's somewhat controversial and sometimes considered undesirable. At this point you might ask why programs even daemonize themselves in the first place, and while I don't know for sure, I do have an opinion. My belief is that daemonization is because of restarting daemon programs, not starting them at boot.

During system boot, programs don't need to daemonize in order to start properly. The general Unix boot time environment has long been able to detach programs into the background (although the V7 /etc/rc didn't bother to do this with /etc/update and /etc/cron, the 4.2BSD /etc/rc did do this for the new BSD network daemons). In general, programs started at boot time don't need to worry that they will be inheriting things like stray file descriptors or a controlling terminal. It's the job of the overall boot time environment to insure that they start in a clean environment, and if there's a problem there you should fix it centrally, not make it every program's job to deal with the failure of your init and boot sequence.

However, init is not a service manager (not historically), which meant that for a long time, starting or restarting daemons after boot was entirely in your hands with no assistance from the system. Even if you remembered to restart a program as 'daemon &' so that it was backgrounded, the newly started program could inherit all sorts of things from your login session. It might have some random current directory, it might have stray file descriptors that were inherited from your shell or login environment, its standard input, output, and error would be connected to your terminal, and it would have a controlling terminal, leaving it exposed to various bad things happening to it when, for example, you logged out (which often would deliver a SIGHUP to it).

This is the sort of thing that even very old daemonization code deals with, which is to say that it fixes. The 4.2BSD daemonization code closes (stray) file descriptors and removes any controlling terminal the process may have, in addition to detaching itself from your shell (in case you forgot or didn't use the '&' when starting it). It's also easy to see how people writing Unix daemons might drift into adding this sort of code to them as people restarted the daemons (by hand) and ran into the various problems (cf). In fact the 4.2BSD code for it is conditional on 'DEBUG' not being defined; presumably if you were debugging, say, rlogind, you'd build a version that didn't detach itself on you so you could easily run it under a debugger or whatever.

It's a bit of a pity that 4.2 BSD and its successors didn't create a general 'daemonize' program that did all of this for you and then told people to restart daemons with 'daemonize <program>' instead of '<program>'. But we got the Unix that we have, not the Unix that we'd like to have, and Unixes did eventually grow various forms of service management that tried to encapsulate all of the things required to restart daemons in one place.

(Even then, I'm not sure that old System V init systems would properly daemonize something that you restarted through '/etc/init.d/<whatever> restart', or if it was up to the program to do things like close extra file descriptors and get rid of any controlling terminal.)

PS: Much later, people did write tools for this, such as daemonize. It's surprisingly handy to have such a program lying around for when you want or need it.

Traditionally, init on Unix was not a service manager as such

By: cks
5 October 2024 at 03:05

Init (the process) has historically had a number of roles but, perhaps surprisingly, being a 'service manager' (or a 'daemon manager') was not one of them in traditional init systems. In V7 Unix and continuing on into traditional 4.x BSD, init (sort of) started various daemons by running /etc/rc, but its only 'supervision' was of getty processes for the console and (other) serial lines. There was no supervision or management of daemons or services, even in the overall init system (stretching beyond PID 1, init itself). To restart a service, you killed its process and then re-ran it somehow; getting even the command line arguments right was up to you.

(It's conventional to say that init started daemons during boot, even though technically there are some intermediate processes involved since /etc/rc is a shell script.)

The System V init had a more general /etc/inittab that could in theory handle more than getty processes, but in practice it wasn't used for managing anything more than them. The System V init system as a whole did have a concept of managing daemons and services, in the form of its multi-file /etc/rc.d structure, but stopping and restarting services was handled outside of the PID 1 init itself. To stop a service you directly ran its init.d script with 'whatever stop', and the script used various approaches to find the processes and get them to stop. Similarly, (re)starting a daemon was done directly by its init.d script, without PID 1 being involved.

As a whole system the overall System V init system was a significant improvement on the more basic BSD approach, but it (still) didn't have init itself doing any service supervision. In fact there was nothing that actively did service supervision even in the System V model. I'm not sure what the first system to do active service supervision was, but it may have been daemontools. Extending the init process itself to do daemon supervision has a somewhat controversial history; there are Unix systems that don't do this through PID 1, although doing a good job of it has clearly become one of the major jobs of the init system as a whole.

That init itself didn't do service or daemon management is, in my view, connected to the history of (process) daemonization. But that's another entry.

(There's also my entry on how init (and the init system as a whole) wound up as Unix's daemon manager.)

(Unix) daemonization turns out to be quite old

By: cks
4 October 2024 at 02:51

In the Unix context, 'daemonization' means a program that totally detaches itself from how it was started. It was once very common and popular, but with modern init systems it's often no longer considered to be all that good an idea. I have some views on the history here, but today I'm going to confine myself to a much smaller subject, which is that in Unix, daemonization goes back much further than I expected. Some form of daemonization dates to Research Unix V5 or earlier, and an almost complete version appears in network daemons in 4.2 BSD.

As far back as Research Unix V5 (from 1974), /etc/rc is starting /etc/update (which does a periodic sync()) without explicitly backgrounding it. This is the giveaway sign that 'update' itself forks and exits in the parent, the initial version of daemonization, and indeed that's what we find in update.s (it wasn't yet a C program). The V6 update is still in assembler, but now the V6 update.s is clearly not just forking but also closing file descriptors 0, 1, and 2.

In the V7 /etc/rc, the new /etc/cron is also started without being explicitly put into the background. The V7 update.c seems to be a straight translation into C, but the V7 cron.c has a more elaborate version of daemonization. V7 cron forks, chdir's to /, does some odd things with standard input, output, and error, ignores some signals, and then starts doing cron things. This is pretty close to what you'd do in modern daemonization.

The first 'network daemons' appeared around the time of 4.2 BSD. The 4.2BSD /etc/rc explicitly backgrounds all of the r* daemons when it starts them, which in theory means they could have skipped having any daemonization code. In practice, rlogind.c, rshd.c, rexecd.c, and rwhod.c all have essentially identical code to do daemonization. The rlogind.c version is:

#ifndef DEBUG
	if (fork())
		exit(0);
	for (f = 0; f < 10; f++)
		(void) close(f);
	(void) open("/", 0);
	(void) dup2(0, 1);
	(void) dup2(0, 2);
	{ int tt = open("/dev/tty", 2);
	  if (tt > 0) {
		ioctl(tt, TIOCNOTTY, 0);
		close(tt);
	  }
	}
#endif

This forks with the parent exiting (detaching the child from the process hierarchy), then the child closes any (low-numbered) file descriptors it may have inherited, sets up non-working standard input, output, and error, and detaches itself from any controlling terminal before starting to do rlogind's real work. This is pretty close to the modern version of daemonization.

(Today, the ioctl() stuff is done by calling setsid() and you'd probably want to close more than the first ten file descriptors, although that's still a non-trivial problem.)

Go's new small language features from 1.21 and 1.22 are nice

By: cks
3 October 2024 at 01:42

Recently I was writing some Go code involving goroutines. After I was done, I realized that I had used some new small language features added in Go 1.21 and Go 1.22, without really thinking about it, despite not having paid much attention when the features were added. Specifically, what I used are the new builtins of max() and min(), and 'range over integers' (and also a use of clear(), but only in passing).

Ranging over integers may have sounded a bit silly to me when I first read about it, but it turns out that there is one situation where it's a natural idiom, and that's spawning a certain number of goroutines:

for range min(maxpar, len(args)) {
   wg.Add(1)
   go func() {
     resolver()
     wg.Done()
   }()
}

Before Go 1.21, I would have wound up writing this as:

for i := 0; i < maxpar; i++ {
  [...]
}

I wouldn't have bothered writing and using the function equivalent of min(), because it wouldn't be worth the extra hassle for my small scale usage, so I'd always have started maxpar goroutines even if some of them would wind up doing nothing.

The new max() and min() builtins aren't anything earthshaking, and you could do them as generic functions, but they're a nice little ergonomic improvement in Go. Ranging over integers is something you could always do but it's more compact now and it's nice to directly see what the loop is doing (and also that I'm not actually using the index variable for anything in the loop).
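
As an illustration of the generic function route for min(), the version you'd otherwise write (or import) yourself is short, but it's still one more thing to carry around. Here's a minimal sketch, using cmp.Ordered (which itself only arrived in Go 1.21; before that you needed golang.org/x/exp/constraints):

import "cmp"

// Min is roughly what the min() builtin now gives you for free,
// for any two values of an ordered type.
func Min[T cmp.Ordered](a, b T) T {
  if a < b {
    return a
  }
  return b
}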

(The clear() builtin is nice, but it also has a good reason for existing. I was only using it on a slice, though, where you can fully duplicate its effects.)
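
For a slice, the hand-written equivalent of clear() is just a loop that resets every element to its zero value (the slice's length and capacity are untouched). A small sketch for a hypothetical []string:

// What clear(names) does for a []string: reset every element
// up to len(names) back to the zero value.
for i := range names {
  names[i] = ""
}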

Go doesn't strictly need max(), min(), and range over integers (although the latter is obviously connected to ranging over functions, which is important for putting user container types closer to par with builtin ones). But adding them makes it nicer, and they're small (although growing the language and its builtins does have a quiet cost), and Go has never presented itself as a mathematically minimal language.

(Go will have to draw the line somewhere, because there are a lot of little conveniences that could be added to the language. But the Go team is generally conservative and they're broadly in a position to not do things, so I expect it to be okay.)

Two views of what a TLS certificate verifies

By: cks
2 October 2024 at 01:58

One of the things that you could ask about TLS is what a validated TLS certificate means or is verifying. Today there is a clear answer, as specified by the CA/Browser Forum, and that answer is that when you successfully connect to https://microsoft.com/, you are talking to the "real" microsoft.com, not an impostor who is intercepting your traffic in some way. This is known as 'domain control' in the jargon; to get a TLS certificate for a domain, you must demonstrate that you have control over the domain. The CA/Browser Forum standards (and the browsers) don't require anything else.

Historically there has been a second answer, what TLS (then SSL) sort of started with. A TLS certificate was supposed to verify not just the domain but that you were talking to the real "Microsoft" (which is to say the large, worldwide corporation with its headquarters in Redmond WA, not any other "Microsoft" that might exist). More broadly, it was theoretically verifying that you were talking to a legitimate and trustworthy site that you could, for example, give your credit card number to over the Internet, which used to be a scary idea.

This second answer has a whole raft of problems in practice, which is why the CA/Browser Forum has adopted the first answer, but it started out and persists because it's much more useful to actual people. Most people care about talking to (the real) Google, not some domain name, and domain names are treacherous things as far as identity goes (consider IDN homograph attacks, or just 'facebook-auth.com'). We rather want this human version of identity and it would be very convenient if we could have it. But we can't. The history of TLS certificates has convincingly demonstrated that this version of identity has comprehensively failed for a collection of reasons including that it's hard, expensive, difficult or impossible to automate, and (quite) fallible.

(The 'domain control' version of what TLS certificates mean can be automated because it's completely contained within the Internet. The other version is not; in general you can't verify that sort of identity using only automated Internet resources.)

A corollary of this history is that no Internet protocol that's intended for widespread usage can assume a 'legitimate identity' model of participants. This includes any assumption that people can only have one 'identity' within your system; in practice, since Internet identity can only verify that you are something, not that you aren't something, an attacker can have as many identities as they want (including corporate identities).

PS: The history of commercial TLS certificates also demonstrates that you can't use costing money to verify legitimacy. It sounds obvious to say it, but all that charging someone money demonstrates is that they're willing and able to spend some money (perhaps because they have a pet cause), not that they're legitimate.

Resetting the backoff restart delay for a systemd service

By: cks
1 October 2024 at 02:48

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m

This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will put on for the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
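
As a sketch of that idea (and only a sketch; the unit name, the pppd invocation, and the six hour threshold are all assumptions of mine, not anything systemd or pppd require), the start script could look something like:

#!/bin/sh
# Run pppd in the foreground; if this invocation lasted long enough,
# clear the unit's accumulated restart backoff before exiting.
unit="pppoe.service"      # assumed unit name
threshold=21600           # six hours, an arbitrary choice

start=$(date +%s)
/usr/sbin/pppd call dsl-provider nodetach    # assumed pppd invocation
status=$?

if [ $(( $(date +%s) - start )) -ge "$threshold" ]; then
  systemctl reset-failed "$unit"
fi
exit $status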

(If systemd has a unit setting that will already do this, I was unable to spot it.)

Brief notes on making Prometheus's SNMP exporter use additional SNMP MIB(s)

By: cks
30 September 2024 at 03:13

Suppose, not entirely hypothetically, that you have a DSL modem that exposes information about the state of your DSL link through SNMP, and you would like to get that information into Prometheus so that you could track it over time (for reasons). You could scrape this information by 'hand' using scripts, but Prometheus has an officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only"; how you do things with it if its stock configuration doesn't meet your needs is what I would call rather underdocumented.

The first thing you'll need to do is find out what generally known and unknown SNMP attributes ('OIDs') your device exposes. You can do this using tools like snmpwalk, and see also some general information on reading things over SNMP. Once you've found out what OIDs your device supports, you need to find out if there are public MIBs for them. In my case, my DSL modem exposed information about network interfaces in the standard and widely available 'IF-MIB', and ADSL information in the standard but not widely available 'ADSL-LINE-MIB'. For the rest of this entry I'll assume that you've managed to fetch the ADSL-LINE-MIB and everything it depends on and put them in a directory, /tmp/adsl-mibs.

The SNMP exporter effectively has two configuration files (as I wrote about recently); a compiled ('generated') configuration file (or set of them) that lists in exhausting detail all of the SNMP OIDs to be collected, and an input file to a separate tool, the generator, that creates the compiled main file. To collect information from a new MIB, you need to set up a new SNMP exporter 'module' for it, and specify the root OID or OIDs involved to walk. This looks like:

---
modules:
  # The ADSL-LINE-MIB MIB
  adsl_line_mib:
    walk:
      - 1.3.6.1.2.1.10.94
      # or:
      #- adslMIB

Here adsl_line_mib is the name of the new SNMP exporter module, and we give it the starting OID of the MIB. You can't specify the name of the MIB itself as the OID to walk, although this is how 'snmpwalk' will present it. Instead you have to use the MIB's 'MODULE-IDENTITY' line, such as 'adslMIB'. Alternately, perusal of your MIB and snmpwalk results may suggest alternate names to use, such as 'adslLineMib'. Using the top level OID is probably easier.

The name of your new module is arbitrary, but it's conventional to use the name of the MIB in this form. You can do other things in your module; reading the existing generator.yml is probably the most useful documentation. As various existing modules show, you can walk multiple OIDs in one module.

This configuration file leaves out the 'auths:' section from the main generator.yml, because we only need one of them, and what we're doing is generating an additional configuration file for snmp_exporter that we'll use along with the stock snmp.yml. To actually generate our new snmp-adsl.yml, we do:

cd snmp_exporter/generator
go build
make # builds ./mibs
./generator generate \
   -m ./mibs \
   -m /tmp/adsl-mibs \
   -g generator-adsl.yml \
   -o /tmp/snmp-adsl.yml

We give the generator both its base set of MIBs, which will define various common things, and the directory with our ADSL-LINE-MIB and all of the MIBs it may depend on. Although the input is small, the snmp-adsl.yml will generally be quite big; in my case, over 2,000 lines.
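
To actually use the new module you start the SNMP exporter with both the stock snmp.yml and your new snmp-adsl.yml (recent versions of the exporter let you repeat --config.file), and then select the module in your Prometheus scrape configuration. Here's a minimal sketch; the exporter address, the 'public_v2' auth name, the job name, and the modem's IP are all assumptions for illustration:

./snmp_exporter --config.file=snmp.yml --config.file=/tmp/snmp-adsl.yml

scrape_configs:
  - job_name: 'snmp-adsl'
    metrics_path: /snmp
    params:
      module: [adsl_line_mib]
      auth: [public_v2]          # an entry from the stock snmp.yml 'auths:'
    static_configs:
      - targets: ['192.168.1.1']     # the DSL modem
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116  # where snmp_exporter listens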

As I mentioned the other day, you may find that some of the SNMP OIDs actually returned by your device don't conform to the SNMP MIB. When this happens, your scrape results will not be a success but instead an HTTP 500 error with text that says things like:

An error has occurred while serving metrics:
error collecting metric Desc{fqName: "snmp_error", help: "BITS type was not a BISTRING on the wire.", constLabels: {}, variableLabels: {}}: error for metric adslAturCurrStatus with labels [1]: <nil>

This says that the actual OID(s) for adslAturCurrStatus from my actual device didn't match what the MIB claimed. In this case, my raw snmpwalk output for this OID is:

.1.3.6.1.2.1.10.94.1.1.3.1.6.1 = BITS: 00 00 00 01 31

(I don't understand what this means, since I'm not anywhere near an SNMP expert.)

If the information is sufficiently important, you'll need to figure out how to modify either the MIB or the generated snmp-adsl.yml to get the information without snmp_exporter errors. Doing so is far beyond the scope of this entry. If the information is not that important, the simple way is to exclude it with a generator override:

---
modules:
  adsl_line_mib:
    walk:
      # ADSL-LINE-MIB
      #- 1.3.6.1.2.1.10.94
      - adslMIB
    overrides:
     # My SmartRG SR505N produces values for this metric
     # that make the SNMP exporter unhappy.
     adslAturCurrStatus:
       ignore: true

You can at least get the attribute name you need to ignore from the SNMP exporter's error message. Unfortunately this error message is normally visible only in scrape output, and you'll only see it if you scrape manually with something like 'curl'.

Options for adding IPv6 networking to your libvirt based virtual machines

By: cks
29 September 2024 at 02:47

Recently, my home ISP switched me from an IPv6 /64 allocation to a /56 allocation, which means that now I can have a bunch of proper /64s for different purposes. I promptly celebrated this by, in part, extending IPv6 to my libvirt based virtual machine, which is on a bridged internal virtual network (cf). Libvirt provides three different ways to provide (public) IPv6 to such virtual machines, all of which will require you to edit your network XML (either inside the virt-manager GUI or directly with command line tools). The three ways aren't exclusive; you can use two of them or even all three at the same time, in which case your VMs will have two or three public IPv6 addresses (at least).

(None of this applies if you're directly bridging your virtual machines onto some physical network. In that case, whatever the physical network has set up for IPv6 is what your VMs will get.)

First, in all cases you're probably going to want an IPv6 '<ip>' block that sets the IPv6 address for your host machine and implicitly specifies your /64. This is an active requirement for two of the options, and typically looks like this:

<ip family='ipv6' address='2001:19XX:0:1102::1' prefix='64'>
[...]
</ip>

Here my desktop will have 2001:19XX:0:1102::1/64 as its address on the internal libvirt network.

The option that is probably the least hassle is to give static IPv6 addresses to your VMs. This is done with <host> elements inside a <dhcp> element (inside your IPv6 <ip>, which I'm not going to repeat):

<dhcp>
  <host name='hl-fedora-36' ip='2001:XXXX:0:1102::189'/>
</dhcp>

Unlike with IPv4, you can't identify VMs by their MAC address because, to quote the network XML documentation:

[...] The IPv6 host element differs slightly from that for IPv4: there is no mac attribute since a MAC address has no defined meaning in IPv6. [...]

Instead you probably need to identify your virtual machines by their (DHCP) hostname. Libvirt has another option for this but it's not really well documented and your virtual machine may not be set up with the necessary bits to use it.

The second least hassle option is to provide a DHCP dynamic range of IPv6 addresses. In the current Fedora 40 libvirt, this has the undocumented limitation that the range can't include more than 65,535 IPv6 addresses, so you can't cover the entire /64. Instead you wind up with something like this:

<dhcp>
  <range start='2001:XXXX:0:1102::1000' end='2001:XXXX:0:1102::ffff'/>
</dhcp>

Famously, not everything in the world does DHCP6; some things only do SLAAC, and in general SLAAC will allocate random IPv6 IPs across your entire /64. Libvirt uses dnsmasq (also) to provide IP addresses to virtual machines, and dnsmasq can do SLAAC (see the dnsmasq manual page). However, libvirt currently provides no directly exposed controls to turn this on; instead, you need to use a special libvirt network XML namespace to directly set up the option in the dnsmasq configuration file that libvirt will generate.

What you need looks like:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
[...]
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-range=2001:XXXX:0:1102::,slaac,64'/>
  </dnsmasq:options>
</network>

(The 'xmlns:dnsmasq=' bit is what you have to add to the normal <network> element.)

I believe that this may not require you to declare an IPv6 <ip> section at all, although I haven't tested that. In my environment I want both SLAAC and a static IPv6 address, and I'm happy to not have DHCP6 as such, since SLAAC will allocate a much wider and more varied range of IPv6 addresses.

(You can combine a dnsmasq SLAAC dhcp-range with a regular DHCP6 range, in which case SLAAC-capable IPv6 virtual machines will get an IP address from both, possibly along with a third static IPv6 address.)

PS: Remember to set firewall rules to restrict access to those public IPv6 addresses, unless you want your virtual machines fully exposed on IPv6 (when they're probably protected on IPv4 by virtue of being NAT'd).

Brief notes on how the Prometheus SNMP exporter's configurations work

By: cks
28 September 2024 at 03:19

A variety of devices (including DSL modems) expose interesting information via SNMP (which is not simple, despite its name). If you have a Prometheus environment, it would be nice to get (some of) this information from your SNMP capable devices into Prometheus. You could do this by hand with scripts and commands like 'snmpget', but there is also the officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only". Understanding how to do things even a bit out of standard with it is, well, a bit tricky. So here are some notes.

The SNMP exporter ships with a 'snmp.yml' configuration file that's what the actual 'snmp_exporter' program uses at runtime (possibly augmented by additional files you provide). As you'll read when you look at the file, this file is machine generated. As far as I can tell, the primary purpose of this file is to tell the exporter what SNMP OIDs it could try to read from devices, what metrics generated from them should be called, and how to interpret the various sorts of values it gets back over SNMP (for instance, network interfaces have a 'ifType' that in raw format is a number, but where the various values correspond to different types of physical network types). These SNMP OIDs are grouped into 'modules', with each module roughly corresponding to a SNMP MIB (the correspondence isn't necessarily exact). When you ask the SNMP exporter to query a SNMP device, you normally tell the exporter what modules to use, which determines what OIDs will be retrieved and what metrics you'll get back.

The generated file is very verbose, which is why it's generated, and its format is pretty underdocumented, which certainly does help contribute to the "no user serviceable parts" feeling. There is very little support for directly writing a new snmp.yml module (which you can at least put in a separate 'snmp-me.yml' file) if you happen to have a few SNMP OIDs that you know directly, don't have a MIB for, and want to scrape and format specifically. Possibly the answer is to try to write a MIB yourself and generate a snmp-me.yml from it, but I haven't had to do this so I have no opinions on which way is better.

The generated file and its modules are created from various known MIBs by a separate program, the generator. The generator has its own configuration file to describe what modules to generate, what OIDs go into each module, and so on. This means that reading generator.yml is the best way to find out what MIBs the SNMP exporter already supports. As far as I know, although generator.yml doesn't necessarily specify OIDs by name, the generator requires MIBs for everything you want to be in the generated snmp.yml file and generate metrics for.

The generator program and its associated data isn't available as part of the pre-built binary SNMP exporter packages. If you need anything beyond the limited selection of MIBs that are compiled into the stock snmp.yml, you need to clone the repository, go to the 'generator' subdirectory, build the generator with 'go build' (currently), run 'make' to fetch and process the MIBs it expects, get (or write) MIBs for your additional metrics, and then write yourself a minimal generator-me.yml of your own to add one or more (new) modules for your new MIBs. You probably don't want to regenerate the main snmp.yml; you might as well build a 'snmp-me.yml' that just has your new modules in it, and run the SNMP exporter with snmp-me.yml as an additional configuration file.

As a practical matter, you may find that your SNMP capable device doesn't necessarily conform to the MIB that theoretically describes it, including OIDs with different data formats (or data) than expected. In the simple case, you can exclude OIDs or named attributes from being fetched so that the non-conformance doesn't cause the SNMP exporter to throw errors:

modules:
  adsl_line_mib:
[...]
    overrides:
     adslAturCurrStatus:
       ignore: true

More complex mis-matches between the MIB and your device will have you reading whatever you can find for the available options for generator.yml or even for snmp.yml itself. Or you can change your mind and scrape through scripts or programs in other languages instead of the SNMP exporter (it's what we do for some of our machine room temperature sensors).

(I guess another option is editing the MIB so that it corresponds to what your device returns, which should make the generator produce a snmp-me.yml that matches what the SNMP exporter sees from the device.)

PS: A peculiarity of the SNMP exporter is that the SNMP metrics it generates are all named after their SNMP MIB names, which produce metric names that are not at all like conventional Prometheus metric names. It's possible to put a common prefix, such as 'snmp_metric_', on all SNMP metrics to make them at least a little bit better. Technically this is a peculiarity of snmp.yml, but changing it is functionally impossible unless you hand-edit your own version.

The impact of the September 2024 CUPS CVEs depends on your size

By: cks
27 September 2024 at 03:16

The recent information security news is that there are a series of potentially serious issues in CUPS (via), but on the other hand a lot of people think that this isn't an exploit with a serious impact because, based on current disclosures, someone has to print something to a maliciously added new 'printer' (for example). My opinion is that how potentially serious this issue is for you depends on the size and scope of your environment.

Based on what we know, the vulnerability requires the CUPS server to also be running 'cups-browsed'. One of the things that cups-browsed does is allow remote printers to register themselves on the CUPS server; you set up your new printer, point it at your local CUPS print server, and everyone can now use it. As part of this registration, the collection of CUPS issues allows a malicious 'printer' to set up server side data (a CUPS PPD) that contains things that will run commands on the print server when a print job is sent to this malicious 'printer'. In order to get anything to happen, an attacker needs to get someone to do this.

In a personal environment or a small organization, this is probably unlikely. Either you know all the printers that are supposed to be there and a new one showing up is alarming, or at the very least you'll probably assume that the new printer is someone's weird experiment or local printer or whatever, and printing to it won't make either you or the owner very happy. You'll take your print jobs off to the printers you know about, and ignore the new one.

(Of course, an attacker with local knowledge could target their new printer name to try to sidestep this; for example, calling it 'Replacement <some existing printer>' or the like.)

In a larger organization, such as ours, people don't normally know all of the printers that are around and don't generally know when new printers show up. In such an environment, it's perfectly reasonable for people to call up a 'what printer do you want to use' dialog, see a new to them printer with an attractive name, and use it (perhaps thinking 'I didn't know they'd put a printer in that room, that's conveniently close'). And since printer names that include locations are perpetually misleading or wrong, most of the time people won't be particularly alarmed if they go to the location where they expect the printer (and their print job) to be and find nothing. They'll shrug, go back, and re-print their job to a regular printer they know.

(There are rare occasions here where people get very concerned when print output can't be found, but in most cases the output isn't sensitive and people don't care if there's an extra printed copy of a technical paper or the like floating around.)

Larger scale environments, possibly with an actual CUPS print server, are also the kind of environment where you might deliberately run cups-browsed. This could be to enable easy addition of new printers to your print server or to allow people's desktops to pick up what printers were available out there without you needing to even have a central print server.

My view is that this set of CVEs shows that you probably can't trust cups-browsed in general and need to stop running it, unless you're very confident that your environment is entirely secure and will never have a malicious attacker able to send packets to cups-browsed.

(I said versions of this on the Fediverse (1, 2), so I might as well elaborate on it here.)

Using a small ZFS recordsize doesn't save you space (well, almost never)

By: cks
26 September 2024 at 01:54

ZFS filesystems have a famously confusing 'recordsize' property, which in the past I've summarized as the maximum logical block size of a filesystem object. Sometimes I've seen people suggest that if you want to save disk space, you should reduce your 'recordsize' from the default 128 KBytes. This is almost invariably wrong; in fact, setting a low 'recordsize' is more likely to cost you space.

How a low recordsize costs you space is straightforward. In ZFS, every logical block requires its own DVA to point to it and contain its checksum. The more logical blocks you have, the more DVAs you require and the more space they take up. As you decrease the 'recordsize' of a filesystem, files (well, filesystem objects in general) that are larger than your recordsize will use more and more logical blocks for their data and have more and more DVAs, taking up more and more space.
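
As a rough illustration, assuming ZFS block pointers of about 128 bytes each (before any compression of the indirect blocks that hold them): a 1 GByte file at the default 128 KByte 'recordsize' is 8,192 logical blocks, which is about 1 MByte of block pointers; the same file at a 4 KByte 'recordsize' is 262,144 logical blocks, which is about 32 MBytes of block pointers. That's roughly 3% of overhead instead of about 0.1%, before you even get to the effects on compression.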

In addition, ZFS compression operates on logical blocks and must save at least one disk block's worth of space to be considered worthwhile. If you have compression turned on (and if you care about space usage, you should), the closer your 'recordsize' gets to the vdev's disk block size, the harder it is for compression to save space. The limit case is when you make 'recordsize' be the same size as the disk block size, at which point ZFS compression can't do anything.

(This is the 'physical disk block size', or more exactly the vdev's 'ashift', which these days should basically always be 4 KBytes or greater, not the disk's 'logical block size', which is usually still 512 bytes.)

The one case where a large recordsize can theoretically cost you disk space is if you have large files that are mostly holes and you don't have any sort of compression turned on (which these days means specifically turning it off). If you have a (Unix) file that has 1 KByte of data every 128 KBytes and is otherwise not written to, without compression and with the default 128 KByte 'recordsize', you'll get a bunch of 128 KByte blocks that have 1 KByte of actual data and 127 KBytes of zeroes. If you reduced your 'recordsize', you would still waste some space but more of it would be actual holes, with no space allocated. However, even the most minimal compression (a setting of 'compression=zle') will entirely eliminate this waste.

(The classical case of reducing 'recordsize' is helping databases out. More generally, you reduce 'recordsize' when you're rewriting data in place in small sizes (such as 4 KBytes or 16 KBytes) or appending data to a file in small sizes, because ZFS can only read and write entire logical blocks.)

PS: If you need a small 'recordsize' for performance, you shouldn't worry about the extra space usage, partly because you should also have a reasonable amount of free disk space to improve the performance of ZFS's space allocation.

Go and my realization about what I'll call the 'Promises' pattern

By: cks
25 September 2024 at 03:23

Over on the Fediverse, I had a belated realization:

This is my face when I realize I have a situation that 'promises'/asynchronously waitable objects would be great for, but I would have to build them by hand in Go. Oh well.

(I want asynchronous execution but to report the results in order, as each becomes available. With promises as I understand them, generate all the promises in an array, wait for each one in order, report results from it, done.)

A common pattern with work(er) pools in Go and elsewhere is that you want to submit requests to a pool of asynchronous workers and you're happy to handle the completion of that work in any order. This is easily handled in Go with a pair of channels, one for requests and the other for completions. However, this time around I wanted asynchronous requests but to be able to report on completed work in order.

(The specific context is that I've got a little Go program to do IP to name DNS lookups (it's in Go for reasons), and on the one hand it would be handy to do several DNS lookups in parallel because sometimes they take a while, but on the other hand I want to print the results in command line order because otherwise it gets confusing.)

In an environment with 'promises' or some equivalent, asynchronous work with ordered reporting of completion is relatively straightforward. You submit all the work and get an ordered collection of Promises or the equivalent, and then you go through in order harvesting results from each Promise in turn. In Go, I think there are two plausible alternatives; you can use a single common channel for results but put ordering information in them, or you can use a separate reply channel for each request. Having done scratch implementations of both, my conclusion is that the separate reply channel version is simpler for me (and in the future I'm not going to be scared off by thoughts of how many channels it can create).

For the common reply channel version, your requests must include a sequence number and then the replies from the workers will also include that sequence number. You'll receive the replies in some random sequence and then it's on you to reassemble them into order. If you want to start processing replies in order before everything has completed, you have to do additional work (you may want, for example, a container/heap).

For the separate reply channel version, you'll be creating a lot of channels (one per request) and passing them to workers as part of the request; remember to give them a one element buffer size, so that workers never block when they 'complete' each request and send the answer down the request's reply channel. However, handling completed requests in order is simple once you've accumulated a block of them:

var replies []chan ...
for _, req := range worktodo {
  // 'pool' is your worker pool; submit() hands it the request and
  // returns the request's (buffered) reply channel.
  replies = append(replies, pool.submit(req))
}

// Harvest the replies in submission order.
for i := range replies {
  v := <-replies[i]
  // process v
}

If a worker has not yet finished processing request number X when you get to trying to use the reply, you simply block on the channel read. If the worker has already finished, it will have sent the reply into the (buffered, remember) channel and moved on, and the reply is ready for you to pick up immediately.

(In both versions, if you have a lot of things to process, you probably want to handle them in blocks, submitting and then draining N items, repeating until you've handled all items. I think this is probably easier to do in the separate reply channel version, although I haven't implemented it yet.)
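
To make the separate reply channel version concrete, here's a small self-contained Go sketch. The worker pool structure and all of the names here are my own invention for illustration, with a trivial stand-in for the real work (which in my case would be the DNS lookups):

package main

import (
  "fmt"
  "strings"
  "time"
)

type request struct {
  arg   string
  reply chan string // buffered, one element
}

type pool struct {
  requests chan request
}

// newPool starts a fixed number of workers that service requests.
// (Ranging over an integer needs Go 1.22 or later.)
func newPool(workers int) *pool {
  p := &pool{requests: make(chan request)}
  for range workers {
    go func() {
      for req := range p.requests {
        // Stand-in for the real, possibly slow work.
        time.Sleep(10 * time.Millisecond)
        req.reply <- strings.ToUpper(req.arg)
      }
    }()
  }
  return p
}

// submit queues one piece of work and returns the channel its reply
// will arrive on. The one element buffer means the worker never
// blocks when it sends the reply.
func (p *pool) submit(arg string) chan string {
  reply := make(chan string, 1)
  p.requests <- request{arg: arg, reply: reply}
  return reply
}

func main() {
  worktodo := []string{"alpha", "beta", "gamma", "delta"}
  p := newPool(2)

  var replies []chan string
  for _, arg := range worktodo {
    replies = append(replies, p.submit(arg))
  }

  // Replies are harvested in submission order, regardless of the
  // order the workers finish in. (A real program would eventually
  // close p.requests to let the workers exit.)
  for i, ch := range replies {
    fmt.Println(worktodo[i], "->", <-ch)
  }
}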

Mostly getting redundant UEFI boot disks on modern Ubuntu (especially 24.04)

By: cks
24 September 2024 at 02:44

When I wrote about how our primary goal for mirrored (system) disks is increased redundancy, including being able to reboot the system after the primary disk failed, vowhite asked in a comment if there was any trick to getting this working with UEFI. The answer is sort of, and it's mostly the same as you want to do with BIOS MBR booting.

In the Ubuntu installer, when you set up redundant system disks it's long been the case that you wanted to explicitly tell the installer to use the second disk as an additional boot device (in addition to setting up a software RAID mirror of the root filesystem across both disks). In the BIOS MBR world, this installed GRUB bootblocks on the disk; in the UEFI world, this causes the installer to set up an extra EFI System Partition (ESP) on the second drive and populate it with the same sort of things as the ESP on the first drive.

(The 'first' and the 'second' drive are not necessarily what you think they are, since the Ubuntu installer doesn't always present drives to you in their enumeration order.)

I believe that this dates from Ubuntu 22.04, when Ubuntu seems to have added support for multi-disk UEFI. Ubuntu will mount one of these ESPs (the one it considers the 'first') on /boot/efi, and as part of multi-disk UEFI support it will also arrange to update the other ESP. You can see what other disk Ubuntu expects to find this ESP on by looking at the debconf selection 'grub-efi/install_devices'. For perfectly sensible reasons this will identify disks by their disk IDs (as found in /dev/disk/by-id), and it normally lists both ESPs.

All of this is great but it leaves you with two problems if the disk with your primary ESP fails. The first is the question of whether your system's BIOS will automatically boot off the second ESP. I believe that UEFI firmware will often do this, and you can specifically set this up with EFI boot entries through things like efibootmgr (also); possibly current Ubuntu installers do this for you automatically if it seems necessary.

The bigger problem is the /boot/efi mount. If the primary disk fails, a mounted /boot/efi will start having disk IO errors and then if the system reboots, Ubuntu will probably be unable to find and mount /boot/efi from the now gone or error-prone primary disk. If this is a significant concern, I think you need to make the /boot/efi mount 'nofail' in /etc/fstab (per fstab(5)). Energetic people might want to go further and make it either 'noauto' so that it's not even mounted normally, or perhaps mark it as a systemd automounted filesystem with 'x-systemd.automount' (per systemd.mount).
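
For illustration, the resulting /etc/fstab line might look something like this (the UUID is a placeholder, and whether you want the automount option is a matter of taste):

# Don't fail the boot if this ESP's disk is gone, and only mount it
# on demand through systemd's automount support.
UUID=ABCD-1234  /boot/efi  vfat  umask=0077,nofail,x-systemd.automount  0  1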

(The disclaimer is that I don't know how Ubuntu will react if /boot/efi isn't mounted at all or is a systemd automount mountpoint. I think that GRUB updates will cope with having it not mounted at all.)

If any disk with an ESP on it fails and has to be replaced, you have to recreate a new ESP on that disk and then, I believe, run 'dpkg-reconfigure grub-efi-amd64', which will ask you to select the ESPs you want to be automatically updated. You may then need to manually run '/usr/lib/grub/grub-multi-install --target=x86_64-efi', which will populate the new ESP (or it may be automatically run through the reconfigure). I'm not sure about this because we haven't had any UEFI system disks fail yet.

(The ESP is a vfat formatted filesystem, which can be set up with mkfs.vfat, and has specific requirements for its GUIDs and so on, which you'll have to set up by hand in the partitioning tool of your choice or perhaps automatically by copying the partitioning of the surviving system disk to your new disk.)

If it was the primary disk that failed, you will probably want to update /etc/fstab to get /boot/efi from a place that still exists (probably with 'nofail' and perhaps with 'noauto'). This might be somewhat easy to overlook if the primary disk fails without the system rebooting, at which point you'd get an unpleasant surprise on the next system reboot.

The general difference between UEFI and BIOS MBR booting for this is that in BIOS MBR booting, there's no /boot/efi to cause problems and running 'grub-install' against your replacement disk is a lot easier than creating and setting up the ESP. As I found out, a properly set up BIOS MBR system also 'knows' in debconf what devices you have GRUB installed on, and you'll need to update this (probably with 'dpkg-reconfigure grub-pc') when you replace a system disk.

(We've been able to avoid this so far because in Ubuntu 20.04 and 22.04, 'grub-install' isn't run during GRUB package updates for BIOS MBR systems so no errors actually show up. If we install any 24.04 systems with BIOS MBR booting and they have system disk failures, we'll have to remember to deal with it.)

(See also my entry on multi-disk UEFI in Ubuntu 22.04, which goes deeper into some details. That entry was written before I knew that a 'grub-*/install_devices' setting of a software RAID array was actually an error on Ubuntu's part, although I'd still like GRUB's UEFI and BIOS MBR scripts to support it.)

Old (Unix) workstations and servers tended to boot in the same ways

By: cks
23 September 2024 at 02:50

I somewhat recently read j. b. crawford's ipmi, where, in part, crawford talks about how old servers of the late 80s and 90s (Unix and otherwise) often had various features for management like serial consoles. What makes something an old school 80s and 90s Unix server and why they died off is an interesting topic I have views on, but today I want to mention and cover a much smaller one, which is that this sort of early boot environment and low level management system was generally also found on Unix workstations.

By and large, the various companies making both Unix servers and Unix workstations, such as Sun, SGI, and DEC, all used the same boot time system firmware on both workstation models and server models (presumably partly because that was usually easier and cheaper). Since most workstations also had serial ports, the general consequence of this was that you could set up a 'workstation' with a serial console if you wanted to. Some companies even sold the same core hardware as either a server or workstation depending on what additional options you put in it (and with appropriate additional hardware you could convert an old server into a relatively powerful workstation).

(The line between 'workstation' and 'server' was especially fuzzy for SGI hardware, where high end systems could be physically big enough to be found in definite server-sized boxes. Whether you considered these 'servers with very expensive graphics boards' or 'big workstations' could be a matter of perspective and how they were used.)

As far as the firmware was concerned, generally what distinguished a 'server' that would talk to its serial port to control booting and so on from a 'workstation' that had a graphical console of some sort was the presence of (working) graphics hardware. If the firmware saw a graphics board and no PROM boot variables had been set, it would assume the machine was a workstation; if there was no graphics hardware, you were a server.

As a side note, back in those days 'server' models were not necessarily rack-mountable and weren't always designed with the 'must be in a machine room to not deafen you' level of fans that modern servers tend to be found with. The larger servers were physically large and could require special power (and generate enough noise that you didn't want them around you), but the smaller 'server' models could look just like a desktop workstation (at least until you counted up how many SCSI disks were cabled to them).

Sidebar: An example of repurposing older servers as workstations

At one point, I worked with an environment that used DEC's MIPS-based DECstations. DEC's 5000/2xx series were available either as a server, without any graphics hardware, or as a workstation, with graphics hardware. At one point we replaced some servers with better ones; I think they would have been 5000/200s being replaced with 5000/240s. At the time I was using a DECstation 3100 as my system administrator workstation, so I successfully proposed taking one of the old 5000/200s, adding the basic colour graphics module, and making it my new workstation. It was a very nice upgrade.

TLS certificates were (almost) never particularly well verified

By: cks
22 September 2024 at 02:32

Recently there was a little commotion in the TLS world, as discussed in We Spent $20 To Achieve RCE And Accidentally Became The Admins Of .MOBI. As part of this adventure, the authors of the article discovered that some TLS certificate authorities were using WHOIS information to validate who controlled a domain (so if you could take over a WHOIS server for a TLD, you could direct domain validation to wherever you wanted). This then got some people to realize that TLS Certificate Authorities were not actually doing very much to verify who owned and controlled a domain. I'm sure that there were also some people who yearned for the hypothetical old days when Certificate Authorities actually did that, as opposed to the modern days when they don't.

I'm afraid I have bad news for anyone with this yearning. Certificate Authorities have never done a particularly strong job of verifying who was asking for a TLS (then SSL) certificate. I will go further and be more controversial; we don't want them to be thorough about identity verification for TLS certificates.

There are a number of problems with identity verification in theory and in practice, but one of them is that it's expensive, and the more thorough and careful the identity verification, the more expensive it is. No Certificate Authority is in a position to absorb this expense, so a world where TLS certificates are carefully verified is also a world where they are expensive. It's also probably a world where they're difficult or impossible to obtain from a Certificate Authority that's not in your country, because the difficulty of identity verification goes up significantly in that case.

(One reason that thorough and careful verification is expensive is that it takes significant time from experienced, alert humans, and that time is not cheap.)

This isn't the world that we had even before Let's Encrypt created the ACME protocol for automated domain verifications. The pre-LE world might have started out with quite expensive TLS certificates, but it shifted fairly rapidly to ones that cost only $100 US or less, which is a price that doesn't cover very much human verification effort. And in that world, with minimal human involvement, WHOIS information is probably one of the better ways of doing such verification.

(Such a world was also one without a lot of top level domains, and most of the TLDs were country code TLDs. The turnover in WHOIS servers was probably a lot smaller back then.)

PS: The good news is that using WHOIS information for domain verification is probably on the way out, although how soon this will happen is an open question.

Our broad reasons for and approach to mirroring disks

By: cks
21 September 2024 at 02:51

When I talked about our recent interest in FreeBSD, I mentioned the issue of disk mirroring. One of the questions this raises is what we use disk mirroring for, and how we approach it in general. The simple answer is that we mirror disks for extra redundancy, not for performance, but we don't go too far to get extra redundancy.

The extremely thorough way to do disk mirroring for redundancy is to mirror with different makes and ages of disks on each side of the mirror, to try to avoid both age related failures and model or maker related issues (either firmware or where you find out that the company used some common problematic component). We don't go this far; we generally buy a block of whatever SSD is considered good at the moment, then use them for a while, in pairs, either fresh in newly deployed servers or re-using a pair in a server being re-deployed. One reason we tend to do this is that we generally get 'consumer' drives, and finding decent consumer drives is hard enough at the best of times without having to find two different vendors of them.

(We do have some HDD mirrors, for example on our Prometheus server, but these are also almost always paired disks of the same model, bought at the same time.)

Because we have backups, our redundancy goals are primarily to keep servers operating despite having one disk fail. This means that it's important that the system keep running after a disk failure, that it can still reboot after a disk failure (including of its first, primary disk), and that the disk can be replaced and put into service without downtime (provided that the hardware supports hot swapping the drive). The less this is true, the less useful any system's disk mirroring is to us (including 'hardware' mirroring, which might make you take a trip through the BIOS to trigger a rebuild after a disk replacement, which means downtime). It's also vital that the system be able to tell us when a disk has failed. Not being able to reliably tell us this is how you wind up with systems running on a single drive until that single drive then fails too.

On our ZFS fileservers it would be quite undesirable to have to restore from backups, so we have an elaborate spares system that uses extra disk space on the fileservers (cf) and a monitoring system to rapidly replace failed disks. On our regular servers we don't (currently) bother with this, even on servers where we could add a third disk as a spare to the two system disks.

(We temporarily moved to three way mirrors for system disks on some critical servers back in 2020, for relatively obvious reasons. Since we're now in the office regularly, we've moved back to two way mirrors.)

Our experience so far with both HDDs and SSDs is that we don't really seem to have clear age related or model related failures that take out multiple disks at once. In particular, we've yet to lose both disks of a mirror before one could be replaced, despite our habit of using SSDs and HDDs in basically identical pairs. We have had a modest number of disk failures over the years, but they've happened by themselves.

(It's possible that at some point we'll run a given set of SSDs for long enough that they start hitting lifetime limits. But we tend to grab new SSDs when re-deploying important servers. We also have a certain amount of server generation turnover for important servers, and when we use the latest hardware it also gets brand new SSDs.)

OpenBSD versus FreeBSD pf.conf syntax for address translation rules

By: cks
20 September 2024 at 02:53

I mentioned recently that we're looking at FreeBSD as a potential replacement for OpenBSD for our PF-based firewalls (for the reasons, see that entry). One of the things that will determine how likely we are to try this is how similar the pf.conf configuration syntax and semantics are between OpenBSD pf.conf (which all of our current firewall rulesets are obviously written in) and FreeBSD pf.conf (which we'd have to move them to). I've only done preliminary exploration of this but the news has been relatively good so far.

I've already found one significant syntax (and to some extent semantics) difference between the two PF ruleset dialects, which is that OpenBSD does BINAT, redirection, and other such things by means of rule modifiers; you write a 'pass' or a 'match' rule and add 'binat-to', 'nat-to', 'rdr-to', and so on modifiers to it. In FreeBSD PF, this must be done as standalone translation rules that take effect before your filtering rules. In OpenBSD PF, strategically placed (ie early) 'match' BINAT, NAT, and RDR rules have much the same effect as FreeBSD translation rules, causing your later filtering rules to see the translated addresses; however, 'pass quick' rules with translation modifiers combine filtering and translation into one thing, and there's not quite a FreeBSD equivalent.

That sounds abstract, so let's look at a somewhat hypothetical OpenBSD RDR rule:

pass in quick on $INT_IF proto {udp tcp} \
     from any to <old-DNS-IP> port = 53 \
     rdr-to <new-DNS-IP>

Here we want to redirect traffic to our deprecated old DNS resolver IP to the new DNS IP, but only DNS traffic.

In FreeBSD PF, the straightforward way would be two rules:

rdr on $INT_IF proto {udp tcp} \
    from any to <old-DNS-IP> port = 53 \
    -> <new-DNS-IP> port 53

pass in quick on $INT_IF proto {udp tcp} \
     from any to <new-DNS-IP> port = 53

In practice we would most likely already have the 'pass in' rule, and also you can write 'rdr pass' to immediately pass things and skip the filtering rules. However, 'rdr pass' is potentially dangerous because it skips all filtering. Do you have a single machine that is just hammering your DNS server through this redirection and you want to cut it off? You can't add a useful 'block in quick' rule for it if you have a 'rdr pass', because the 'pass' portion takes effect immediately. There are ways to work around this but they're not quite as straightforward.

(Probably this alone would push us to not using 'rdr pass'; there's also the potential confusion of passing traffic in two different sections of the pf.conf ruleset.)

Fortunately we have very few non-'match' translation rules. Turning OpenBSD 'match ... <whatever>-to <ip>' pf.conf rules into the equivalent FreeBSD '<whatever> ...' rules seems relatively mechanical. We'd have to make sure that the IP addresses our filtering rules saw continued to be the internal ones, but I think this would be work out naturally; our firewalls that do NAT and BINAT translation do it on their external interfaces, and we usually filter with 'pass in' rules.
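
As a (hypothetical) illustration of that mechanical conversion, an OpenBSD outbound NAT rule like:

match out on $EXT_IF inet from $INT_NET to any nat-to $EXT_IP

would become a FreeBSD translation rule along these lines:

nat on $EXT_IF inet from $INT_NET to any -> $EXT_IP

with the 'pass in' filtering rules on the inside interface left alone, still seeing the untranslated internal addresses.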

(There may be more subtle semantic differences between OpenBSD and FreeBSD pf rules. A careful side by side reading of the two pf.conf manual pages might turn these up, but I'm not sure I can read the two manual pages that carefully.)

Open source maintainers with little time and changes

By: cks
19 September 2024 at 03:02

'Unmaintained' open source code represents a huge amount of value, value that shouldn't and can't be summarily ignored when considering issues like language backward compatibility. Some of that code is more or less unmaintained, but some of it is maintained by people spending a bit of time working on things to keep projects going. It is perhaps tempting to say that such semi-maintained projects should deal with language updates and so on. I maintain that this is wrong.

These people keeping the lights on in these projects often have limited amounts of time that they either can or will spend on their projects. They don't owe the C standard or anyone else any amount of that time, not even if the C standard people think it should be small and insignificant and easy. Outside backward incompatible changes (in anything) that force these people to spend their limited time keeping up (or force them to spend more time) are at the least kind of rude.

(Such changes are also potentially ineffective or dangerous, in that they push people towards not updating at all and locking themselves to old compilers, old compiler settings, old library and package versions, and so on. Or abandoning the project entirely because it's too much work.)

Of course this applies to more than just backward incompatible language changes; especially it applies to API changes. Both language and API changes force project maintainers into a Red Queen's Race, where their effort doesn't improve their project, it just keeps it working. Does this mean that you can never change languages or APIs in ways that break backward compatibility? Obviously not, but it does mean that you should make sure that the change is worth the cost, and the more used your language or API is, the higher the cost. C is an extremely widely used language, so the cost of any break with backward compatibility in it (including in the C standard library) is quite high.

The corollary of this for maintainers is that if you want your project to not require much of your time, you can't depend on APIs that are prone to backward incompatible changes. Unfortunately this may limit the features you can provide or the languages that you want to use (depending not just on the rate of change in the language itself but also in the libraries that the language will force you to use).

(For example, as a pragmatic thing I would rather write a low maintenance TLS using program in Go than in anything else right now, because the Go TLS package is part of the core Go library and is covered by the Go 1 compatibility guarantee. C and C++ may be pretty stable languages and less likely to change than Go, but OpenSSL's API is not.)

My "time to full crawl" (vague) metric

By: cks
18 September 2024 at 02:43

This entry, along with all of Wandering Thoughts (this blog) and in fact the entire wiki-thing it's part of is dynamically rendered from my wiki-text dialect to HTML. Well, in theory. In practice, one of the several layers of caching that make DWiki (this software) perform decently is a cache of the rendered HTML. Because DWiki is often running as an old fashioned Apache CGI, this rendering cache lives on disk.

(DWiki runs in a complicated way that can see it operating as a CGI under low load or as a daemon with a fast CGI frontend under higher load; this entry has more details.)

Since there are only so many things to render to HTML, this on disk cache has a maximum size that it stabilizes at; given enough time, everything gets visited and thus winds up in the disk cache of rendered HTML. The render disk cache lives in its own directory hierarchy, and so I can watch its size with a simple 'du -hs' command. Since I delete the entire cache every so often, this gives me an indicator that I can call either "time to full cache" or "time to full crawl". The time to full cache is how long it typically takes for the cache to reach maximum size, which is how long it takes for everything to be visited by something (or actually, used to render a URL that something visited).

I haven't attempted to systematically track this measure, but when I've looked it usually takes less than a week for the render cache to reach its stable 'full' size. The cache stores everything in separate files, so if I was an energetic person I could scan through the cache's directory tree, look at the file modification times, and generate some nice graphs of how fast the crawling goes (based on either the accumulated file sizes or the accumulated number of files, depending on what I was interested in).

(In theory I could do this from web server access logs. This would give me a somewhat different measure, since I'd be tracking what URLs had been accessed at least once instead of which bits of wikitext had been used in displaying URLs. At the same time, it might be a more interesting measure of how fast things are visited, and I do have a catalog of all page URLs here in the form of an automatically generated sitemap.)

PS: I doubt this is a single crawler visiting all of Wandering Thoughts in a week or so. Instead I expect it's the combination of the assorted crawlers (most of them undesirable), plus some amount of human traffic.

Why my Fedora 40 systems stalled logins for ten seconds or so

By: cks
17 September 2024 at 02:10

One of my peculiarities is that I reboot my Fedora 40 desktops by logging in as root on a text terminal and then running 'reboot' (sometimes or often also telling loginctl to terminate any remainders of my login session so that the reboot doesn't stall for irritating lengths of time). Recently, the simple process of logging in as root has been stalling for an alarmingly long time, enough time to make me think something was wrong with the system (it turns out that the stall was probably ten seconds or so, but even a couple of seconds is alarming when it's your root login that doesn't seem to be working). Today I hit this again and this time I dug into what was happening, partly because I was able to reproduce it with something other than a root login to reboot the machine.

My first step was to use the excellent extrace to find out what was taking so long, since this can trace all programs run from one top level process and report how long they took (along with the command line arguments). This revealed that the time consuming command was '/usr/libexec/pk-command-not-found compinit -c', and it was being run as part of quite a lot of commands being executed during shell startup. Specifically, Bash, because on Fedora root's login shell is Bash. This was happening because Bash's normal setup will source everything from /etc/profile.d/ in order to set up your new (interactive) Bash setup, and it turns out that there's a lot there. Using 'bash -xl' I was able to determine that pk-command-not-found was probably being run somehow in /usr/share/lmod/lmod/init/bash. If you're as puzzled as I was about that, lmod (also) is apparently a system for setting up paths for accessing Lua 'modules', so it wants to hook into shell startup to set up its environment variables.

It took me a bit of time to understand how the bits fit together, partly because there's no documentation for pk-command-not-found. The first step is that Bash has a feature that allows you to hook into what happens when a command isn't found (cf, see the discussion of the (potential) command_not_found_handle function), and PackageKit is doing this (in the PackageKit-command-not-found Fedora RPM package, which Fedora installs as a standard feature). It turns out that Bash will invoke this handler function not just for commands you run interactively, but also commands that aren't found while Bash is sourcing all of your shell startup. This handler is being triggered in Lmod's init/bash code because said code attempts to run 'compinit -c' to set up completion in zsh so that it can modify zsh's function search path. Compinit is a zsh thing (it's not technically a builtin), so there is no exposed 'compinit' command on the system. Running compinit outside of zsh is a bug; in this case, an expensive bug.

My solution was to remove both PackageKit-command-not-found, because I don't want this slow 'command not found' handling in general, and also the Lmod package, because I don't use Lmod. Because I'm a certain sort of person, I filed Lmod issue #725 to report the issue.

In some testing in a virtual machine, it appears that pk-command-not-found may be so slow only the first time it's invoked. This means that most people with these packages installed may not see or at least realize what's happening, because under normal circumstances they probably log in to Fedora machines graphically, at which point the login stall is hidden in the general graphical environment startup delay that everyone expects to be slow. I'm in the unusual circumstance that my login doesn't use any normal shell, so logging in as root is the first time my desktops will run Bash interactively and trigger pk-command-not-found.

(This elaborates on and cleans up a Fediverse thread I wrote as I poked around.)

Why we're interested in FreeBSD lately (and how it relates to OpenBSD here)

By: cks
16 September 2024 at 03:09

We have a long and generally happy history of using OpenBSD and PF for firewalls. To condense a long story, we're very happy with the PF part of our firewalls, but we're increasingly not as happy with the OpenBSD part (outside of PF). Part of our lack of cheer is the state of OpenBSD's 10G Ethernet support when combined with PF, but there are other aspects as well; we never got OpenBSD disk mirroring to be really useful and eventually gave up on it.

We wound up looking at FreeBSD after another incident with OpenBSD doing weird and unhelpful hardware things, because we're a little tired of the whole area. Our perception (which may not be reality) is that FreeBSD likely has better driver support for modern hardware, including 10G cards, and has gone further on SMP support for networking, hopefully including PF. The last time we looked at this, OpenBSD PF was more or less limited by single-'core' CPU performance, especially when used in bridging mode (which is what our most important firewall uses). We've seen fairly large bandwidth rates through our OpenBSD PF firewalls (in the 800 MBytes/sec range), but never full 10G wire bandwidth, so we've wound up suspecting that our network speed is partly being limited by OpenBSD's performance.

(To get to this good performance we had to buy servers that focused on single-core CPU performance. This created hassles in our environment, since these special single-core performance servers had to be specially reserved for OpenBSD firewalls. And single-core performance isn't going up all that fast.)

FreeBSD has a version of PF that's close enough to OpenBSD's older versions to accept much or all of the syntax of our pf.conf files (we're not exactly up to the minute on our use of PF features and syntax). We also perceive FreeBSD as likely more normal to operate than OpenBSD has been, making it easier to integrate into our environment (although we'd have to actually operate it for a while to see if that was actually the case). If FreeBSD has great 10G performance on our current generation commodity servers, without needing to buy special servers for it, and fixes other issues we have with OpenBSD, that makes it potentially fairly attractive.

(To be clear, I think that OpenBSD is (still) a great operating system if you're interested in what it has to offer for security and so on. But OpenBSD is necessarily opinionated, since it has a specific focus, and we're not really using OpenBSD for that focus. Our firewalls don't run additional services and don't let people log in, and some of them can only be accessed over a special, unrouted 'firewall' subnet.)

Getting maximum 10G Ethernet bandwidth still seems tricky

By: cks
15 September 2024 at 02:51

For reasons outside the scope of this entry, I've recently been trying to see how FreeBSD performs on 10G Ethernet when acting as a router or a bridge (both with and without PF turned on). This pretty much requires at least two more 10G test machines, so that the FreeBSD server can be put between them. When I set up these test machines, I didn't think much about them so I just grabbed two old servers that were handy (well, reasonably handy), stuck a 10G card into each, and set them up. Then I actually started testing their network performance.

I'm used to 1G Ethernet, where long ago it became trivial to achieve full wire bandwidth, even bidirectional full bandwidth (with test programs; there are many things that can cause real programs to not get this). 10G Ethernet does not seem to be like this today; the best I could do was get to around 950 MBytes a second in one direction (which is not 10G's top speed). With the right circumstances, bidirectional traffic could total just over 1 GByte a second, which is of course nothing like what we'd like to see.

(This isn't a new problem with 10G Ethernet, but I was hoping this had been solved in the past decade or so.)
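(For concreteness, the sort of simple test program involved can be sketched in Go along these lines; the address, write size, and ten second duration are arbitrary choices, and this is an illustration rather than what I actually used.)

// A minimal one-direction TCP bandwidth test, purely as an illustration:
// run with -listen on one machine to read and discard data, and point
// the other machine at it to write as fast as possible for ten seconds.
// The default port and the buffer size are arbitrary.
package main

import (
	"flag"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

func main() {
	listen := flag.Bool("listen", false, "run as the receiving (discard) side")
	addr := flag.String("addr", "0.0.0.0:9000", "address to listen on or connect to")
	flag.Parse()

	if *listen {
		ln, err := net.Listen("tcp", *addr)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		for {
			c, err := ln.Accept()
			if err != nil {
				continue
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(io.Discard, c) // read and throw away everything
			}(c)
		}
	}

	c, err := net.Dial("tcp", *addr)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer c.Close()

	buf := make([]byte, 1<<20) // 1 MByte writes
	var total int64
	start := time.Now()
	for time.Since(start) < 10*time.Second {
		n, err := c.Write(buf)
		total += int64(n)
		if err != nil {
			break
		}
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("wrote %d MBytes in %.1f seconds: %.1f MBytes/sec\n",
		total>>20, secs, float64(total)/(1<<20)/secs)
}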

There's a lot of things that could be contributing to this, like the speed of the CPU (and perhaps RAM), the specific 10G hardware I was using (including if it lacked performance increasing features that more expensive hardware would have had), and Linux kernel or driver issues (although this was Ubuntu 24.04, so I would hope that they were sorted out). I'm especially wondering about CPU limitations, because the kernel's CPU usage did seem to be quite high during my tests and, as mentioned, they're old servers with old CPUs (different old CPUs, even, one of which seemed to perform a bit better than the other).

(For the curious, one was a Celeron G530 in a Dell R210 II and the other a Pentium G6950 in a Dell R310, both of which date from before 2016 and are something like four generations back from our latest servers (we've moved on slightly since 2022).)

Mostly this is something I'm going to have to remember about 10G Ethernet in the future. If I'm doing anything involving testing its performance, I'll want to use relatively modern test machines, possibly several of them to create aggregate traffic, and then I'll want to start out by measuring the raw performance those machines can give me under the best circumstances. Someday perhaps 10G Ethernet will be like 1G Ethernet for this, but that's clearly not the case today (in our environment).

Threads, asynchronous IO, and cancellation

By: cks
14 September 2024 at 02:23

Recently I read Asynchronous IO: the next billion-dollar mistake? (via), and had a reaction to one bit of it. Then yesterday on the Fediverse I said something about IO in Go:

I really wish you could (easily) cancel io Reads (and Writes) in Go. I don't think there's any particularly straightforward way to do it today, since the io package was designed way before contexts were a thing.

(The underlying runtime infrastructure can often actually do this because it decouples 'check for IO being possible' from 'perform the IO', but stuff related to this is not actually exposed.)

Today this sparked a belated realization in my mind, which is that a model of threads performing blocking IO in each thread is simply a harder environment to have some sort of cancellation in than an asynchronous or 'event loop' environment. The core problem is that in their natural state, threads are opaque and therefore difficult to interrupt or stop safely (which is part of why Go's goroutines can't be terminated from the outside). This is the natural inverse of how threads handle state for you.

(This is made worse if the thread is blocked in the operating system itself, for example in a 'read()' system call, because now you have to use operating system facilities to either interrupt the system call so the thread can return to user level to notice your user level cancellation, or terminate the thread outright.)

Asynchronous IO generally lets you do better in a relatively clean way. Depending on the operating system facilities you're using, either there is a distinction between the OS telling you that IO is possible and your program doing IO, providing you a chance to not actually do the IO, or in an 'IO submission' environment you generally can tell the OS to cancel a submitted but not yet completed IO request. The latter is racy, but in many situations the IO is unlikely to become possible right as you want to cancel it. Both of these let you implement a relatively clean model of cancelling a conceptual IO operation, especially if you're doing the cancellation as the result of another IO operation.

Or to put it another way, event loops may make you manage state explicitly, but that also means that that state is visible and can be manipulated in relatively natural ways. The implicit state held in threads is easy to write code with but hard to reason about and work with from the outside.
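(In Go specifically, one partial workaround for network connections is to abuse read deadlines; the following sketch, with invented names, interrupts a blocked Read by moving the deadline into the past when a context is cancelled.)

// A sketch of interrupting a blocked Read on a net.Conn when a context
// is cancelled, by moving the read deadline into the past. The package
// and function names are invented for illustration.
package cancelread

import (
	"context"
	"net"
	"time"
)

// ReadWithContext behaves like conn.Read() but returns early with
// ctx.Err() if the context is cancelled first. A real version would
// also reset the read deadline afterward so that later Reads work.
func ReadWithContext(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			// Force any in-progress Read to return immediately
			// with a timeout error.
			conn.SetReadDeadline(time.Now())
		case <-done:
		}
	}()
	n, err := conn.Read(buf)
	if ctx.Err() != nil {
		return n, ctx.Err()
	}
	return n, err
}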

Sidebar: My particular Go case

I have a Go program that at its core involves two goroutines, one reading from standard input and writing to a network connection, one reading from the network connection and writing to standard output. Under some circumstances, the goroutine reading from the network will want to close down the network connection and return to a top level, where another two way connection will be made. In the process, it needs to stop the 'read from stdin, write to the network' goroutine while it is parked in 'read from stdin', without closing stdin (because that will be reused for the next connection).

To deal with this cleanly, I think I would have to split the 'read from standard input, write to the network' goroutine into two that communicated through a channel. Then the 'write to the network' side could be replaced separately from the 'read from stdin' side, allowing me to cleanly substitute a new network connection.

(I could also use global variables to achieve the same substitution, but let's not.)
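A rough sketch of that restructuring, with all of the names invented and the details simplified, might look like this:

// A rough sketch of splitting "read stdin, write to network" into two
// goroutines joined by a channel, so that the network writing side can
// be swapped for a new connection without disturbing the stdin reader.
// All of the names and details here are invented for illustration.
package main

import (
	"net"
	"os"
)

// readStdin runs for the life of the program, forwarding chunks of
// standard input over a channel; it never touches the network itself.
func readStdin(out chan<- []byte) {
	for {
		buf := make([]byte, 4096)
		n, err := os.Stdin.Read(buf)
		if n > 0 {
			out <- buf[:n]
		}
		if err != nil {
			close(out)
			return
		}
	}
}

// writeNet copies channel data to one particular network connection,
// returning when the connection fails, the input channel closes, or it
// is told to stop. A new writeNet can then be started with a new
// connection, reusing the same input channel.
func writeNet(conn net.Conn, in <-chan []byte, stop <-chan struct{}) {
	for {
		select {
		case data, ok := <-in:
			if !ok {
				return
			}
			if _, err := conn.Write(data); err != nil {
				return
			}
		case <-stop:
			return
		}
	}
}

func main() {
	data := make(chan []byte)
	go readStdin(data)
	for {
		// In the real program this would be the 'make a new two way
		// connection' step at the top level.
		conn, err := net.Dial("tcp", "localhost:9999") // placeholder
		if err != nil {
			return
		}
		// stop would be closed by the goroutine reading from the
		// network when it wants to tear this connection down.
		stop := make(chan struct{})
		writeNet(conn, data, stop)
		conn.Close()
	}
}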

What admin access researchers have to their machines here

By: cks
13 September 2024 at 03:31

Recently on the Fediverse, Stephen Checkoway asked what level of access fellow academics had to 'their' computers to do things like install software (via). This is an issue very relevant to where I work, so I put a short-ish answer in the Fediverse thread and now I'm going to elaborate it at more length. Locally (within the research side of the department) we have a hierarchy of machines for this sort of thing.

At the most restricted level are the shared core machines my group operates in our now-unusual environment, such as the mail server, the IMAP server, the main Unix login server, our SLURM cluster and general compute servers, our general purpose web server, and of course the NFS fileservers that sit behind all of this. For obvious reasons, only core staff have any sort of administrative access to these machines. However, since we operate a general Unix environment, people can install whatever they want to in their own space, and they can request that we install standard Ubuntu packages, which we mostly do (there are some sorts of packages that we'll decline to install). We do have some relatively standard Ubuntu features turned off for security reasons, such as "user namespaces", which somewhat limits what people can do without system privileges. Only our core machines live on our networks with public IPs; all other machines have to go on separate private "sandbox" networks.

The second most restricted are researcher owned machines that want to NFS mount filesystems from our NFS fileservers. By policy, these must be run by the researcher's Point of Contact, operated securely, and only the Point of Contact can have root on those machines. Beyond that, researchers can and do ask their Point of Contact to install all sorts of things on their machines (the Point of Contact effectively works for the researcher or the research group). As mentioned, these machines live on "sandbox" networks. Most often they're servers that the researcher has bought with grant funding, and there are some groups that operate more and better servers than we (the core group) do.

Next are non-NFS machines that people put on research group "sandbox" networks (including networks where some machines have NFS access); people do this with both servers and desktops (and sometimes laptops as well). The policies on who has what power over these machines are up to the research group and what they (and their Point of Contact) feel comfortable with. There are some groups where I believe the Point of Contact runs everything on their sandbox network, and other groups where their sandbox network is wide open with all sorts of people running their own machines, both servers and desktops. Usually if a researcher buys servers, the obvious person to have run them is their Point of Contact, unless the research work being done on the servers is such that other people need root access (or it's easier for the Point of Contact to hand the entire server over to a graduate student and have them run it as they need it).

Finally there are generic laptops and desktops, which normally go on our port-isolated 'laptop' network (called the 'red' network after the colour of network cables we use for it, so that it's clearly distinct from other networks). We (the central group) have no involvement in these machines and I believe they're almost always administered by the person who owns or at least uses them, possibly with help from that person's Point of Contact. These days, some number of laptops (and probably even desktops) don't bother with wired networking and use our wireless network instead, where similar 'it's yours' policies apply.

People who want access to their files from their self-managed desktop or laptop aren't left out in the cold, since we have a SMB (CIFS) server. People who use Unix and want their (NFS, central) home directory mounted can use the 'cifs' (aka 'smb3') filesystem to access it through our SMB server, or even use sshfs if they want to. Mounting via cifs or sshfs is in some cases superior to using NFS, because they can give you access to important shared filesystems that we can't NFS export to machines outside our direct control.

Rate-limiting failed SMTP authentication attempts in Exim 4.95

By: cks
12 September 2024 at 03:01

Much like with SSH servers, if you have a SMTP server exposed to the Internet that supports SMTP authentication, you'll get a whole lot of attackers showing up to do brute force password guessing. It would be nice to slow these attackers down by rate-limiting their attempts. If you're using Exim, as we are, then this is possible to some degree. If you're using Exim 4.95 on Ubuntu 22.04 (instead of a more recent Exim), it's trickier than it looks.

One of Exim's ACLs, the ACL specified by acl_smtp_auth, is consulted just before Exim accepts a SMTP 'AUTH <something>' command. If this ACL winds up returning a 'reject' or a 'defer' result, Exim will defer or reject the AUTH command and the SMTP client will not be able to try authenticating. So obviously you need to put your ratelimit statement in this ACL, but there are two complications. First, this ACL doesn't have access to the login name the client is trying to authenticate (this information is only sent after Exim accepts the 'AUTH <whatever>' command), so all you can ratelimit is the source IP (or a network area derived from it). Second, this ACL happens before you know what the authentication result is, so you don't want to actually update your ratelimit in it, just check what the ratelimit is.

This leads to the basic SMTP AUTH ACL of:

acl_smtp_auth = acl_check_auth
begin acl
acl_check_auth:
  # We'll cover what this is for later
  warn
    set acl_c_auth = true

  deny
    ratelimit = 10 / 10m / per_cmd / readonly / $sender_host_address
    delay = 10s
    message = You are failing too many authentication attempts.
    # you might also want:
    # log_message = ....

  # don't forget this or you will be sad
  # (because no one will be able to authenticate)
  accept

(The 'delay = 10s' usefully slows down our brute force SMTP authentication attackers because they seem to wait for the reply to their SMTP AUTH command rather than giving up and terminating the session after a couple of seconds.)

This ratelimit is read-only because we don't want to update it unless the SMTP authentication fails; otherwise, you will wind up (harshly) rate-limiting legitimate people who repeatedly connect to you, authenticate, perhaps send an email message, and then disconnect. Since we can't update the ratelimit in the SMTP AUTH ACL, we need to somehow recognize when authentication has failed and update the ratelimit in that place.

In Exim 4.97 and later, there's a convenient and direct way to do this through the events system and the 'auth:fail' event that is raised by an Exim server when SMTP authentication fails. As I understand it, the basic trick is that you make the auth:fail event invoke a special ACL, and have that ACL update the ratelimit. Unfortunately Ubuntu 22.04 has Exim 4.95, so we must be more clever and indirect, and as a result somewhat imperfect in what we're doing.

To increase the ratelimit when SMTP authentication has failed, we add an ACL that is run at the end of the connection and increases the ratelimit if an authentication was attempted but did not succeed, which we detect by the lack of authentication information. Exim has two possible 'end of session' ACL settings, one that is used if the session is ended with a SMTP QUIT command and one that is used if the SMTP session is just ended without a QUIT.

So our ACL setup to update our ratelimit looks like this:

[...]
acl_smtp_quit = acl_count_failed_auth
acl_smtp_notquit = acl_count_failed_auth

begin acl
[...]

acl_count_failed_auth:
  warn
    condition = ${if bool{$acl_c_auth} }
    !authenticated = *
    ratelimit = 10 / 10m / per_cmd / strict / $sender_host_address

  accept

Our $acl_c_auth SMTP connection ACL variable tells us whether or not the connection attempted to authenticate (sometimes legitimate people simply connect and don't do anything before disconnecting), and then we also require that the connection not be authenticated now to screen out people who succeeded in their SMTP authentication. The parameters for the two 'ratelimit =' statements have to match or I believe you'll get weird results.

(The '10 failures in 10 minutes' setting works for us but may not work for you. If you change the 'deny' to 'warn' in acl_check_auth and comment out the 'message =' bit, you can watch your logs to see what rates real people and your attackers actually use.)

The limitation on this is that we're actually increasing the ratelimit based not on the number of (failed) SMTP authentication attempts but on the number of connections that tried but failed SMTP authentication. If an attacker connects and repeatedly tries to do SMTP AUTH in the session, failing each time, we wind up only counting it as a single 'event' for ratelimiting because we only increase the ratelimit (by one) when the session ends. For the brute force SMTP authentication attackers we see, this doesn't seem to be an issue; as far as I can tell, they disconnect their session when they get a SMTP authentication failure.

Ways ATX power supply control could work on server motherboards

By: cks
11 September 2024 at 03:02

Yesterday I talked about how ATX power supply control seems to work on desktop motherboards, which is relatively straightforward; as far as I can tell from various sources, it's handled in the chipset (on modern Intel chipsets, in the PCH), which is powered from standby power by the ATX power supply. How things work on servers is less clear. Here when I say 'server' I mean something with a BMC (Baseboard management controller), because allowing you to control the server's power supply is one of the purposes of a BMC, which means the BMC has to hook into this power management picture.

There appear to be a number of ways that the power control and management could be done and the BMC connected to it. People on the Fediverse replying to my initial question gave me a number of possible answers.

I found documentation for some of Intel's older Xeon server chipsets (with provisions for BMCs) and as of that generation, power management was still handled in the PCH and described in basically the same language as for desktops. I couldn't spot a mention of special PCH access for the BMC, so BMC control over server power might have been implemented with the 'BMC controls the power button wire' approach.

I can also imagine hybrid approaches. For example, you could in theory give the BMC control over the 'turn power on' wire to the power supplies, and route the chipset's version of that line to the BMC, in addition to routing the power button wire to the BMC. Then the BMC would be in a position to force a hard power off even if something went wrong in the chipset (or a hard power on, although if the chipset refuses to trigger a power on there might be a good reason for that).

(Server power supplies aren't necessarily 'ATX' power supplies as such, but I suspect that they all have similar standby power, 'turn power on', and 'is the PSU power stable' features as ATX PSUs do. Server PSUs often clearly aren't plain ATX units because they allow the BMC to obtain additional information on things like the PSU's state, temperature, current power draw, and so on.)

Our recent experience with BMCs that wouldn't let their servers power on when they should have suggests that on these servers (both Dell R340s), the BMC has some sort of master control or veto power over the normal 'return to last state' settings in the BIOS. At the same time, the 'what to do after AC power returns' setting is in the BIOS, not in the BMC, so it seems that the BMC is not the sole thing controlling power.

(I tried to take a look at how this was done in OpenBMC, but rapidly got lost in a twisty maze of things. I think at least some of the OpenBMC supported hardware does this through I2C commands, although what I2C device it's talking to is a good question. Some of the other hardware appears to have GPIO signal definitions for power related stuff, including power button definitions.)

How ATX power supply control seems to work on desktop motherboards

By: cks
10 September 2024 at 03:11

Somewhat famously, the power button on x86 PC desktop machines with ATX power supplies is not a 'hard' power switch that interrupts or enables power through the ATX PSU but a 'soft' button that is controlled by the overall system. The actual power delivery is at least somewhat under software control, by both the operating system (which enables modern OSes to actually power off the machine under software control) and the 'BIOS', broadly defined, which will do things like signal the OS to do an orderly shutdown if you merely tap the power button instead of holding it down for a few seconds. Because they're useful, 'soft' power buttons and the associated things have also spread to laptops and servers, even if their PSUs are not necessarily 'ATX' as such. After recent events, I found myself curious about what actually did handle the chassis power button and associated things. Asking on the Fediverse produced a bunch of fascinating answers, so today I'm starting with plain desktop motherboards, where the answer seems to be relatively straightforward.

(As I looked up once, physically the power button is normally a momentary-contact switch that is open (off) when not pressed. A power button that's stuck 'pressed' can have odd effects.)

At the direct electrical level, ATX PSUs are either on, providing their normal power, or "off", which is not really completely off but has the PSU providing +5V standby power (with a low current limit) on a dedicated pin (pin 9, the ATX cable normally uses a purple wire for this). To switch an ATX PSU from "off" to on, you ground the 'power on' pin and keep it grounded (pin 16; the green wire in normal cables, and ground is black wires). After a bit of stabilization time, the ATX PSU will signal that all is well on another pin (pin 8, the grey wire). The ATX PSU's standby power is used to power the RTC and associated things, to provide the power for features like wake-on-lan (which requires network ports to be powered up at least a bit), and to power whatever handles the chassis power button when the PSU is "off".

On conventional desktop motherboards, the actual power button handling appears to be in the PCH or its equivalent (per @rj's information on the ICH, and also see Whitequark's ICH/PCH documentation links). In the ICH/PCH, this is part of general power management, including things like 'suspend to RAM'. Inside the PCH, there's a setting (or maybe two or three) that controls what happens when external power is restored; the easiest to find one is called AFTERG3_EN, which is a single bit in one of the PCH registers. To preserve this register's settings over loss of external power, it's part of what the documentation calls the "RTC well", which is apparently a chunk of stuff that's kept powered as part of the RTC, either from standby power or from the RTC's battery (depending on whether or not there's external power available). The ICH/PCH appears to have a direct "PWRBTN#" input line, which is presumably eventually connected to the chassis power button, and it directly implements the logic for handling things like the 'press and hold for four seconds to force a power off' feature (which Intel describes as 'transitioning to S5', the "Soft-Off" state).

('G3' is the short Intel name for what Intel calls "Mechanical Off", the condition where there's no external power. This makes the AFTERG3_EN name a bit clearer.)

As far as I can tell there's no obvious and clear support for the modern BIOS setting of 'when external power comes back, go to your last state'. I assume that what actually happens is that the ICH/PCH register involved is carefully updated by something (perhaps ACPI) as the system is powered on and off. When the system is powered on, early in the sequence you'd set the PCH to 'go to S0 after power returns'; when the system is powered off, right at the end you'd set the PCH to 'stay in S5 after power returns'.

(And apparently you can fiddle with this register yourself (via).)

All of the information I've dug up so far is for Intel ICH/PCH, but I suspect that AMD's chipsets work in a similar manner. Something has to do power management for suspend and sleep, and it seems that the chipset is the natural spot for it, and you might as well put the 'power off' handling into the same place. Whether AMD uses the same registers and the same bits is an open question, since I haven't turned up any chipset documentation so far.

I should probably reboot BMCs any time they behave oddly

By: cks
9 September 2024 at 03:13

Today on the Fediverse I said:

It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.

(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)

I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.

We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.

Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').

As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.

(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)

I wish (Linux) WireGuard had a simple way to restrict peer public IPs

By: cks
8 September 2024 at 02:32

WireGuard is an obvious tool to build encrypted, authenticated connections out of, over which you can run more or less any network service. For example, you might expose the rsync daemon only over a specific WireGuard interface, instead of running rsync over SSH. Unfortunately, if you want to use WireGuard as a SSH replacement in this fashion, it has one limitation; unlike SSH, there's no simple way to restrict the public IP address of a particular peer.

The rough equivalent of a WireGuard peer is a SSH keypair. In SSH, you can restrict where a keypair will be accepted from with the 'from="..."' restriction in your .ssh/authorized_keys. This provides an extra layer of protection against the key being compromised; not only does an attacker have to acquire the key, they have to be able to use it from exactly the same IP (or the expected IPs). However, more or less by design WireGuard doesn't have a particular restriction on where a WireGuard peer key can be used from. You can set an expected public IP for the peer, but if the peer contacts you from another IP, your (Linux kernel) WireGuard will update its idea of where the peer is. This is handy for WireGuard's usual usage cases but not what we necessarily want for a wired down connection where the IPs should never change.

(I don't think this is a technical restriction in the WireGuard protocol, just something not done in most or all implementations.)
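(You can at least watch for unexpected peer endpoints from user space, even if you can't block them this way. As a purely illustrative sketch, something like the following Go program using the wgctrl package could compare each peer's current endpoint against an expected IP; the interface name, public key, and IP are placeholders.)

// A purely illustrative sketch of watching for unexpected peer endpoints
// with the wgctrl package; this is monitoring, not enforcement, and the
// interface name, public key, and IP below are placeholders.
package main

import (
	"fmt"
	"os"

	"golang.zx2c4.com/wireguard/wgctrl"
)

// expected maps peer public keys (in base64) to the peer IP we expect
// each one to talk to us from.
var expected = map[string]string{
	"PEER-PUBLIC-KEY-BASE64=": "192.0.2.10",
}

func main() {
	client, err := wgctrl.New()
	if err != nil {
		fmt.Fprintln(os.Stderr, "wgctrl:", err)
		os.Exit(1)
	}
	defer client.Close()

	dev, err := client.Device("wg0") // placeholder interface name
	if err != nil {
		fmt.Fprintln(os.Stderr, "device:", err)
		os.Exit(1)
	}

	for _, peer := range dev.Peers {
		want, ok := expected[peer.PublicKey.String()]
		if !ok || peer.Endpoint == nil {
			continue
		}
		if got := peer.Endpoint.IP.String(); got != want {
			fmt.Printf("peer %s: current endpoint %s, expected %s\n",
				peer.PublicKey.String(), got, want)
		}
	}
}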

The normal answer is firewall rules that restrict access to the WireGuard port, but this has two limitations. The first and lesser limitation is that it's external to WireGuard, so it's possible to have WireGuard active but your firewall rules not properly applied, theoretically allowing more access than you intend. The bigger limitation is that if you have more than one such wired down WireGuard peer, firewall rules can't tell which WireGuard peer key is being used by which external peer. So in a straightforward implementation of firewall rules, any peer public IP can impersonate any other (if it has the required WireGuard peer key), which is different from the SSH 'from="..."' situation, where each key is restricted separately.

(On the other hand, the firewall situation is better in one way in that you can't accidentally add a WireGuard peer that will be accepted from anywhere the way you can with a SSH key by forgetting to put in a 'from="..."' restriction.)

To get firewall rules that can tell peers apart, you need to use different listening ports for each peer on your end. Today, this requires different WireGuard interfaces (and probably different server keys) for each peer. I think you can probably give all of the interfaces the same internal IP to simplify your life, although I haven't tested this.

(Having written this entry, I now wonder if it would be possible to write an nftables or iptables extension that hooked into the kernel side of WireGuard enough to know peer identities and let you match on them. Existing extensions are already able to be aware of various things like cgroup membership, and there's an existing extension for IPsec. Possibly you could do this with eBPF programs, since there's a BPF/eBPF iptables extension.)

Operating system threads are always going to be (more) expensive

By: cks
7 September 2024 at 04:01

Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:

Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]

I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires the operating system to go through non-trivial work in order to be safe from user level attacks.

(The whole speculative execution set of attacks has only made this worse.)

A great deal of the low level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest system call possible still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).

The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just for the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work to get around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO once you peer behind the scenes of the abstractions that programs are seeing.

(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)

(See also thinking about event loops versus threads.)

The problems (Open)ZFS can have on new Linux kernel versions

By: cks
6 September 2024 at 03:00

Every so often, someone out there is using a normal released version of OpenZFS on Linux (currently ZFS 2.2.6, which was just released) on a distribution that uses very new kernels (such as Fedora). They may then read that their version of ZFS (such as 2.2.5) doesn't list the latest kernel (such as 6.10) as a 'supported platform'. They may then wonder why this is so.

Part of the answer is that OpenZFS developers are cautious people who don't want to list new kernels as officially supported until people have carefully inspected and tested the situation. Even if everything looks good, it's possible that there is some subtle problem in the interface between (Open)ZFS and the new kernel version. But another part of the answer comes down to how the Linux kernel has no stable internal API, which is also part of how you can get subtle problems in new kernels.

The Linux kernel is constantly changing how things work internally. Functions appear or go away (or simply mutate); fields are added or removed from C structs, or sometimes change their meaning; function arguments change; how you're supposed to do things shifts. It's up to any out of tree code, such as OpenZFS, to keep up with these changes (and that's why you want kernel modules to be in the main Linux kernel if possible, because then other people do some of this work). So to merely compile on a new kernel version, OpenZFS may need to change its own code to match the kernel changes. Sometimes this will be simple, requiring almost no changes; other times it may lead to a bunch of modifications.

(Two examples are the master pull request for 6.10, which had only a few changes, and the larger master pull request for 6.11, which may not even be quite complete yet since 6.11 is not yet released.)

Having things compiling is merely the first step. The OpenZFS developers need to make sure that they're making the right changes, and also they generally want to try to see if things have changed in a way that doesn't break compiling code. To quote a message from Rob Norris on the ZFS on Linux mailing list:

"Support" here means that the people involved with the OpenZFS are reasonably certain that the traditional OpenZFS goals of stability, durability, etc will hold when used with that kernel version. That usually means the test suites have passed, there's no significant new issues reported, and at least three people have looked at the kernel changes, the matching OpenZFS changes, and thought very hard about it.

As a practical matter (as Rob Norris notes), this often means that development versions of OpenZFS will often build and work on new kernel versions well before they're officially supported. Speaking from personal experience, it's possible to be using kernel versions that are not yet 'supported' without noticing until you hit an RPM version dependency surprise.

Using rsync to create a limited ability to write remote files

By: cks
5 September 2024 at 02:56

Suppose that you have an isolated high security machine and you want to back up some of its data on another machine, which is also sensitive in its own way and which doesn't really want to have to trust the high security machine very much. Given the source machine's high security, you need to push the data to the backup host instead of pulling it. Because of the limited trust relationship, you don't want to give the source host very much power on the backup host, just in case. And you'd like to do this with standard tools that you understand.

I will cut to the chase: as far as I can tell, the easiest way to do this is to use rsync's daemon mode on the backup host combined with SSH (to authenticate either end and encrypt the traffic in transit). It appears that another option is rrsync, but I just discovered that and we have prior experience with rsync's daemon mode for read-only replication.

Rsync's daemon mode is controlled by a configuration file that can restrict what it allows the client (your isolated high security source host) to do, particularly where the client can write, and can even chroot if you run things as root. So the first ingredient we need is a suitable rsyncd.conf, which will have at least one 'module' that defines parameters:

[backup-host1]
comment = Backup module for host1
# This will normally have restricted
# directory permissions, such as 0700.
path = /backups/host1
hosts allow = <host1 IP>
# Let's assume we're started out as root
use chroot = yes
uid = <something>
gid = <something>

The rsyncd.conf 'hosts allow' module parameter works even over SSH; rsync will correctly pull out the client IP from the environment variables the SSH daemon sets.

The next ingredient is a shell script that forces the use of this rsyncd.conf:

#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .

As with the read-only replication, this script completely ignores command line arguments that the client may try to use. Very cautious people could inspect the client's command line to look for unexpected things, but we don't bother.

Finally you need a SSH keypair and a .ssh/authorized_keys entry on the backup machine for that keypair that forces using your script:

from="<host1 IP>",command="/backups/host1-script",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]

(Since we're already restricting the rsync module by IP, we definitely want to restrict the key usage as well.)

On the high security host, you transfer files to the backup host with:

rsync -a --rsh="/usr/bin/ssh -i /client/identity" yourfile LOGIN@SERVER::backup-host1/

Depending on what you're backing up and how you want to do things, you might want to set the rsyncd.conf module parameters 'write only = true' and perhaps 'refuse options = delete', if you're sure you don't want the high security machine to be able to retrieve its files once it has put them there. On the other hand, if the high security machine is supposed to be able to routinely retrieve its backups (perhaps to check that they're good), you don't want this.

(If the high security machine is only supposed to read back files very rarely, you can set 'write only = true' until it needs to retrieve a file.)

There are various alternative approaches, but this one is relatively easy to set up, especially if you already have a related rsync daemon setup for read-only replication.

(On the one hand it feels annoying that there isn't a better way to do this sort of thing by now. On the other hand, the problems involved are not trivial. You need encryption, authentication of both ends, a confined transfer protocol, and so on. Here, SSH provides the encryption and authentication and rsync provides the confined transfer protocol, at the cost of having to give access to a Unix account and trust rsync's daemon mode code.)

TLS Server Name Indications can be altered by helpful code

By: cks
4 September 2024 at 03:25

In TLS, the Server Name Indication is how (in the modern TLS world) you tell the TLS server what (server) TLS certificate you're looking for. A TLS server that has multiple TLS certificates available, such as a web server handling multiple websites, will normally use your SNI to decide what server TLS certificate to provide to you. If you provide an SNI that the TLS server doesn't know, or don't provide one at all, the TLS server can do a variety of things, but many will fall back to some default TLS certificate. Use of SNI is pervasive in web PKI but less universal elsewhere; for example, SMTP clients don't always send SNI when establishing TLS with a SMTP server.

The official specification for SNI is section 3 of RFC 6066, and it permits exactly one format of the SNI data, which is, let's quote:

"HostName" contains the fully qualified DNS hostname of the server, as understood by the client. The hostname is represented as a byte string using ASCII encoding without a trailing dot. [...]

Anything other than this is an incorrectly formatted SNI. In particular, sending a SNI using a DNS name with a dot at the end (the customary way of specifying a fully qualified name in the context of DNS) is explicitly not allowed under RFC 6066. RFC 6066 SNI names are always fully qualified and without the trailing dots.

So what happens if you provide a SNI with a trailing dot? That depends. In particular, if you're providing a name with a trailing dot to a client library or a client program that does TLS, the library may helpfully remove the trailing dot for you when it sends the SNI. Go's crypto/tls definitely behaves this way, and it seems that some TLS libraries may. Based on observing behavior on systems I have access to, I believe that OpenSSL does strip the trailing dot but GnuTLS doesn't, and probably Mozilla's NSS doesn't either (since Firefox appears to not do this).

(I don't know what a TLS server sees as the SNI if it uses these libraries, but it appears likely that OpenSSL on the server side doesn't strip the trailing dot and instead passes it through literally.)

This dot stripping behavior is generally silent, which can lead to confusion if you're trying to test the behavior of providing a trailing dot in the SNI (which can cause web servers to give you errors). At the same time it's probably sensible behavior for the client side of TLS libraries, since some of the time they will be deriving the SNI hostname from the host name the caller has given them to connect to, and the caller may want to indicate a fully qualified DNS name in the customary way.

PS: Because I looked it up, the Go crypto/tls client code strips a trailing dot while the server code rejects a TLS ClientHello that includes a SNI with a trailing dot (which will cause the TLS connection to fail).
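(If you want to see the client side behavior for yourself, a small Go program along these lines will do; it simply reports whether the handshake worked and what certificate names came back, without assuming a particular outcome.)

// A quick way to poke at the client side behavior: hand crypto/tls a
// ServerName with a trailing dot and see what happens. This just reports
// the result; the host name here is only an example.
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	conf := &tls.Config{ServerName: "utcc.utoronto.ca."}
	conn, err := tls.Dial("tcp", "utcc.utoronto.ca:443", conf)
	if err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	defer conn.Close()
	state := conn.ConnectionState()
	fmt.Println("handshake succeeded; certificate names:",
		state.PeerCertificates[0].DNSNames)
}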

Apache's odd behavior for requests with a domain with a dot at the end

By: cks
3 September 2024 at 03:16

When I wrote about the fun fact that domains can end in dots and how this affects URLs, I confidently said that Wandering Thoughts (this blog) reacted to being requested through 'utcc.utoronto.ca.' (with a dot at the end) by redirecting you to the canonical form, without the final dot. Then in comments, Alex reported that they got an Apache '400 Bad Request' response when they did it. From there, things got confusing (and are still confusing).

First, this response is coming from Apache, not DWiki (the code behind the blog). You can get the same '400 Bad Request' response from https://utcc.utoronto.ca./~cks/ (a static file handled only by this host's Apache). Second, you don't always get this response; what happens depends on what you're using to access the URL. Here's what I've noticed and tested so far:

  • In some tools you'll get a TLS certificate validation failure due to a name mismatch, presumably because 'utcc.utoronto.ca.' doesn't match 'utcc.utoronto.ca'. GNU Wget2 behaves this way.

    (GNU Wget version 1.x doesn't seem to have this behavior; instead I think it may strip the final '.' off before doing much processing. My impression is that GNU Wget2 and 'GNU Wget (1.x)' are fairly different programs.)

  • on some Apache configurations, you'll get a TLS certificate validation error from everything, because Apache apparently doesn't think that the 'dot at end' version of the host name matches any of its configured virtual host names, and so it falls back to a default TLS certificate that doesn't match what you asked for.

    (This doesn't happen with this host's Apache configuration but it does happen on some other ones I tested with.)

  • against this host's Apache, at least lynx, curl, Safari on iOS (to my surprise), and manual testing all worked, with the request reaching DWiki and DWiki then generating a redirect to the canonical hostname. By a manual test, I mean making a TLS connection to port 443 with a tool of mine and issuing:

    GET /~cks/space/blog/ HTTP/1.0
    Host: utcc.utoronto.ca.
    

    (And no other headers, although a random User-Agent doesn't seem to affect things.)

  • Firefox and I presume Chrome get the Apache '400 Bad Request' error (I don't use Chrome and I'm not going to start for this).

I've looked at the HTTP headers that Firefox's web developer tools says it's sending and they don't look particularly different or unusual. But something is getting Apache to decide this is a bad request.

(It's possible that some modern web security related headers are triggering this behavior in Apache, and only a few major browsers are sending them. I am a little bit surprised that Safari on iOS doesn't trigger this.)

The status of putting a '.' at the end of domain names

By: cks
2 September 2024 at 02:29

A variety of things that interact with DNS interpret the host or domain name 'host.domain.' (with a '.' at the end) as the same as the fully qualified name 'host.domain'; for example this appears in web browsers and web servers. At this point one might wonder whether this is an official thing in DNS or merely a common convention and practice. The answer is somewhat mixed.

In the DNS wire protocol, initially described in RFC 1035, we can read this (in section 3.1):

Domain names in messages are expressed in terms of a sequence of labels. Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. [...]

DNS has a 'root', which all DNS queries (theoretically) start from, and a set of DNS servers, the root nameservers, that answer the initial queries that tell you what the DNS servers for a top level domain are (such as the '.edu' or the '.ca' DNS servers). In the wire format, this root is explicitly represented as a 'null label', with zero length (instead of being implicit). In the DNS wire format, all domain names are fully qualified (and aren't represented as plain text).
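To make the wire format concrete, here is a small illustrative Go function (not from any DNS library) that encodes a textual fully qualified name into this label format; note the final zero byte for the root's null label.

// An illustrative encoding of a textual fully qualified DNS name into
// the RFC 1035 wire format: length-prefixed labels terminated by the
// root's null (zero length) label. No validation is done here.
package main

import (
	"fmt"
	"strings"
)

func encodeName(name string) []byte {
	// 'host.domain.' and 'host.domain' encode identically; the trailing
	// dot only marks the name as absolute in the textual form.
	name = strings.TrimSuffix(name, ".")
	var out []byte
	if name != "" {
		for _, label := range strings.Split(name, ".") {
			out = append(out, byte(len(label)))
			out = append(out, label...)
		}
	}
	// The null label of the root terminates the name.
	return append(out, 0)
}

func main() {
	fmt.Printf("% x\n", encodeName("utcc.utoronto.ca."))
	// 04 75 74 63 63 08 75 74 6f 72 6f 6e 74 6f 02 63 61 00
}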

RFC 1035 also defines a textual format to represent DNS information, Master files. When processing these files there is usually an 'origin', and textual domain names may be relative to that origin or absolute. The RFC says:

[...] Domain names that end in a dot are called absolute, and are taken as complete. Domain names which do not end in a dot are called relative; the actual domain name is the concatenation of the relative part with an origin specified in a $ORIGIN, $INCLUDE, or as an argument to the master file loading routine. A relative name is an error when no origin is available.

So in textual DNS data that follows RFC 1035's format, 'host.domain.' is how you specify an absolute (fully qualified) DNS name, as opposed to one that is under the current origin. Bind uses this format (or something derived from it; here in 2024 I don't know if it's strictly RFC 1035 compliant any more), and in hand-maintained Bind format zone files you can find lots of use of both relative and absolute domain names.

DNS data doesn't have to be represented in text in RFC 1035 form (and doing so has some traps), either for use by DNS servers or for use by programs who do things like look up domain names. However, it's not quite accurate to say that 'host.domain.' is only a convention. A variety of things use a more or less RFC 1035 format, and in those things a terminal '.' means an absolute name because that's how RFC 1035 says to interpret and represent it.

Since RFC 1035 uses a '.' at the end of a domain name to mean a fully qualified domain name, it's become customary for code to accept one even if the code already only deals with fully qualified names (for example, DNS lookup libraries). Every program that accepts or reports this format creates more pressure on other programs to accept it.

(It's also useful as a widely understood signal that the textual domain name returned through some API is fully qualified. This may be part of why Go's net package consistently returns results from various sorts of name resolutions with a terminating '.', including in things like looking up the name(s) of IP addresses.)
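(For example, a quick Go check of this; the IP address here is just a well known public resolver, picked arbitrarily, and the names printed should end with a '.'.)

// Reverse-resolve an IP address with the net package; the names that
// come back should have a terminating '.'. The IP is just a well known
// public resolver, picked arbitrarily.
package main

import (
	"fmt"
	"net"
)

func main() {
	names, err := net.LookupAddr("8.8.8.8")
	fmt.Println(names, err)
}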

At the same time, this syntax for fully qualified domain names is explicitly not accepted in certain contexts that have their own requirements. One example is in email addresses, where 'user@some.domain.' is almost invariably going to be rejected by mail systems as a syntax error.

In practice, abstractions hide their underlying details

By: cks
1 September 2024 at 01:58

Very broadly, there are two conflicting views of abstractions in computing. One camp says that abstractions simplify the underlying complexity but people still should know about what is behind the curtain, because all abstractions are leaky. The other camp says that abstractions should hide the underlying complexity entirely and do their best not to leak the details through, and that people using the abstraction should not need to know those underlying details. I don't particularly have a side, but I do have a pragmatic view, which is that many people using abstractions don't know the underlying details.

People can debate back and forth about whether people should know the underlying details and whether they are incorrect to not know them, but the well established pragmatic reality is that a lot of people writing a lot of code and building a lot of systems don't know more than a few of the details behind the abstractions that they use. For example, I believe that a lot of people in web development don't know that host and domain names can often have a dot at the end. And people who have opinions about programming probably have a favorite list of leaky abstractions that people don't know as much about as they should.

(One area a lot of programming abstractions 'leak' is around performance issues. For example, the (C)Python interpreter is often much faster if you make things local variables inside a function than if you use global variables because of things inside the abstraction it presents to you.)

That this happens should not be surprising. People have a limited amount of time and a limited amount of things that they can learn, remember, and keep track of. When presented with an abstraction, it's very attractive to not sweat the details, especially because no one can keep track of all of them. Computing is simply too complicated to see behind all of the abstractions all of the way down. Almost all of the time, your effort is better focused on learning and mastering your layer of the abstraction stack rather than trying to know 'enough' about every layer (especially when it's not clear how much is enough).

(Another reason to not dig too deeply into the details behind abstractions is that those details can change, especially if one reason the abstraction exists is to allow the details to change. We call some of these abstractions 'APIs' and discourage people investigating and using the specific details behind the current implementations.)

One corollary of this is that safety and security related abstractions need to be designed with the assumption that people using them won't know or remember all of the underlying details. If forgetting one of those details will leave people using the abstraction with security problems, the abstraction has a design flaw that will inevitably lead to a security issue sooner or later. This security issue is not the fault of the people using the abstraction, except in a mathematical security way.
