When su replaced login for becoming another Unix login

By: cks

I recently read Simon Tatham's Nitpicking the shell history scene in Tron: Legacy, where one thing that surprised Tatham was the film using 'login -n root' to become root instead of 'su'. This surprised me because I found that perfectly ordinary, and this turns up both a bit of Unix history and a difference between modern Unixes.

Plain 'su' can let you become another user, including root, but what it explicitly doesn't do by default is create a new login shell for that user. If you do 'su root', the new root shell normally inherits most of your environment, your current directory, and so on. Sometimes this is what you want and sometimes you really want a new login environment, and originally in Unix how you got the latter was to run 'login' from your existing shell session (and this meant that login was setuid root, like su).

This split usage of su(1) and login(1) is present in Research Unix V7 (and for login goes back to at least V3), where the respective manual pages clearly say that su doesn't change your environment or your current directory, while login's normal use (from a shell) is to 'change from one user to another'. Similar wording remains in the 4.2 BSD su(1), but in System III, su(1) picked up an option to make the new shell a login shell (and it even describes the mechanism) and login(1) lost the ability to be run from a normal shell. The 4.3 BSD su(1) picked up the System III su change, but login(1) can still be used from a normal shell, and I believe this continued on the BSD lineage in general.

As you might expect, all of the modern versions of su across Linux and the free BSDs support starting a login shell (cf the normal Linux su (also), FreeBSD su(1), NetBSD su(1), and OpenBSD su(1)). On Linux and OpenBSD, login isn't setuid root and so can't be used from a regular shell environment to become a new user; your only option is su. On FreeBSD and NetBSD, login is still setuid root and can be used to switch to another account with a login shell, although this usage doesn't seem to be explicitly documented in either's manual page. Illumos (the open source successor of Solaris) also still supports using login from a command shell, and explicitly documents this in login(1).

(OpenBSD making login not be setuid fits their general security posture, since a setuid login has been a vector for security issues in the past. I can't easily find out if Linux versions of login were ever setuid.)

PS: It's possible that login is still setuid on some Linux distributions. The normal util-linux login specifically says that it doesn't work from a shell session, but the shadow-utils login may still, and some distributions might enable that.

(This sort of elaborates on a Fediverse post I made.)

Sidebar: The early history of su

The su command goes back to V1 Unix, but at the time it was only used to let you become root ('superuser', likely the source of the 'su' command name). We don't have much from V2 (well, sort of [PDF]), but in V3 su's manpage moves to section 8 (for 'administrative commands') as su(8), where it stayed in Research Unix V4, V5 (per the V5 manual [PDF]), and V6. Only in V7 does su gain the ability to change to any user and its manual page was moved to section 1 (for general commands) as su(1).

(Potentially of interest is this reconstruction of old Unix manual pages.)

I'm not sure we'd use AppArmor much even if we could

By: cks

The news of the time interval is a string of local privilege escalation vulnerabilities in Linux (in part in the kernel). We very much need the security boundary of Unix logins, and some of these vulnerabilities are mitigated or blocked by various Linux kernel security modules ('LSMs') (cf), so I've recently been thinking if we'd use AppArmor, the LSM that Ubuntu supports.

(AppArmor didn't block as many of the vulnerabilities as a proper SELinux setup did, but SELinux needs distribution buyin and that's not what Canonical provides.)

We've traditionally disabled AppArmor because it's had issues in our environment of NFS home directories in our own locations for them (also). So let's assume that AppArmor magically works now for NFS home directories and other directories (or can easily be set with tuning knobs), and still provides meaningful security afterward. Setting up AppArmor for our environment will take some amount of work (cf), so the question is how much protection against local privilege escalation we get.

Roughly speaking, our systems fall into two categories; systems that normal people can access and run programs on, and systems that are purely for services (including things such as IMAP mail). For services, in theory we (or the people writing AppArmor profiles) can work out what the services should be allowed to do and not do, and thus lock things down against local privilege escalations in kernel systems that the services shouldn't be touching anyway (and other vulnerabilities, such as information disclosure from reading files the service shouldn't be accessing). However, this protects against an unlikely set of chained issues, where there's both a vulnerability in a service itself and then an additional vulnerability in the kernel.

(If these issues aren't unlikely, we have bigger problems.)

That leaves the systems where normal people can run their own programs (which are the ones where we really need the security boundary of logins). On these systems we have to assume that an attacker can gain the ability to run relatively arbitrary programs, either by compromising an account outright or through, for example, a compromised package that people are using in the code they're writing for their research (or a compromised editor extension, or etc; there are lots of ways in). Since people are effectively running arbitrary code, we can't protect ourselves by having AppArmor restrict what specific programs can do the way we can on service-based machines. Instead, we have to find and inventory kernel features that people will never legitimately use, and then block them through AppArmor rules.

(This is how a strict SELinux setup appears to protect against the recent vulnerabilities; a normal login is simply not allowed to use, eg, RDS sockets.)

The Linux kernel has a lot of features and facilities, although some of them are blocked off because we don't allow user namespaces, and people doing CS research do a lot of things, some of them at least unusual. Could an AppArmor profile (or a set of them) be written so that people would be allowed access to what they use and not allowed access to things that they don't? Probably (although AppArmor is more focused on programs than on people, well, logins). Would we be able to find an out of the box set of AppArmor rules and so on that worked? Maybe, and this depends on exploits not being found in areas that people pretty much have to be given access to.

If we had a reliable set of AppArmor or SELinux profiles, we might well use them because it would be easy enough. Without a reliable set of AppArmor profiles, I'm not sure we'd try to build some ourselves unless we were desperate. And if we were going to do the work, it appears that we might get more results for less effort through things like explicitly blocking all the loadable kernel modules for Linux socket types that we don't use.
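As a rough illustration of the module-blocking approach, here is a minimal sketch that renders a modprobe.d fragment. The 'install NAME /bin/false' form is the usual modprobe.d way of making an on-demand module load fail. The module list is an assumption for illustration: 'rds' comes from the RDS example above, and the other names are stand-ins for whatever socket families a site doesn't use. The function and variable names are hypothetical.

```python
# Sketch: generate a modprobe.d fragment that stops selected socket-family
# modules from being loaded on demand. The module list is illustrative only.

UNUSED_SOCKET_MODULES = ["rds", "tipc", "dccp", "sctp"]  # assumption: adjust per site


def render_modprobe_block(modules):
    """Return modprobe.d text that makes loading each named module fail."""
    lines = ["# Block socket families we never use (hypothetical list)."]
    for name in sorted(set(modules)):
        # 'install NAME COMMAND' tells modprobe to run COMMAND instead of
        # loading the module; /bin/false makes the load attempt fail.
        lines.append(f"install {name} /bin/false")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modprobe_block(UNUSED_SOCKET_MODULES), end="")
```

The output would go in a file under /etc/modprobe.d/. Un-blocking something later is a matter of deleting its line.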

(Some people even block all kernel modules that their current configuration doesn't use. I'm not sure I'd go that far, but I suppose you can always un-block things like the netfilter modules if you turn out to want to add some nftables rules later.)

Our unusual system of "web home directories" for people

By: cks

One of the things we operate for the research side of the department is an old fashioned general purpose web server, where everyone has a home page area of their own in the traditional '/~<login>/' style (cf). This web server has been there for a very long time, and one of the decisions that was made very early on was that for security reasons, the web server would not NFS mount people's regular home directories from our fileservers.

The traditional Apache way to do '/~<login>/' home pages is to have some location under your home directory that's exposed as your web home page area; the traditional name for this is 'public_html'. One alternative is to relocate this to a separate directory tree, but this directory tree is flat, which makes it awkward to have different pools of disk space for different people (which is absolutely required for us). Since we didn't want to use people's regular home directories for security reasons and we couldn't put everyone in one directory, we did the obvious hack: people have a different, special home directory on the web server. These home directories are in special 'webdir' filesystems on our fileservers, and these webdir filesystems are the only NFS filesystems that the web server NFS mounts.

The result is that everyone actually has two home directories in two different filesystems (although those two filesystems will come from the same ZFS pool). They have their regular home directory filesystem, which is accessible on our login and compute servers but not the web server, and their 'webdir' home directory, which is accessible everywhere. To make this more convenient to people, we create a 'public_html' symlink in people's regular home directories that points to the 'public_html' in their webdir home directory. If people have personally run web servers, these and their support files also live in the 'webdir' home directory, for relatively obvious reasons.

(We have a special short name form of people's home directories, so on the web server this short form points to their web home directory. The public_html symlink combined with this means that '/u/<login>/public_html' always refers to your web home page directory tree no matter what machine you're on.)

Because everyone's web home directory filesystem is in the same ZFS pool as their normal home directory filesystem, the web server still depends on all of our ZFS fileservers. Since our web server is reasonably active (also, also), it tends to react very rapidly to any NFS fileserver hiccups.

PS: The web home directory security decision predates me, so I don't know why it was made, but in my view it's a perfectly sensible decision. In general you should probably assume that your web server can be coaxed into reading and disclosing any Unix file that it has filesystem level access to. If you don't like the implications of this, you need to arrange for it to have access to fewer files. A dedicated set of filesystems is one relatively straightforward way to do that.

Our servers mostly don't seem to have high peak power usage

By: cks

I wrote a bit ago about how our servers seem to have surprisingly low power consumption, where I looked at IPMI based power consumption information to look at their current, typical power usage and found that several of them were sitting at lower power usage than my desktops. I'm still interested in typical or average power usage for reasons beyond the scope of this entry, but now I'm also interested in maximum power usage that we've observed.

One reason to be interested in maximum power usage is that servers are often given relatively high capacity power supplies. A 600 watt power supply is on the small side for even a 1U server, and my impression is that 800 watt and even 1200 watt PSUs are reasonably normal in basic servers. On the one hand, the vendors building servers don't know what they're going to be used for, so they're likely to be conservative. On the other hand, that's a lot of spare power for a system that is typically using, say, 24 watts. A PSU that is that far under its rating may not be all that efficient (although the one 600 watt server PSU I looked at had a 'platinum' rating).

Obviously a server's PSU has to support not just its typical power draw but also its maximum power draw; if you idle at 107 watts but reach 500 watts or more at full load (as one of our servers does), then you need that big PSU. But if the maximum observed power draw is substantially lower, the PSU looks more like overkill.

Because we only collect IPMI sensor information every minute, I can't be confident that we've captured the absolute highest peak usage for any of our servers. If a server has a high power draw for only fifteen or thirty seconds or so, it would have to happen at just the right time for us to capture that in a sensor reading. But if the power draw lasts for more than a minute, it becomes increasingly likely that we'll capture it.
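To put rough numbers on this, here is a small sketch under an idealized model: samples are taken exactly once per interval, and a single burst of high power draw starts at a uniformly random moment relative to the sampling. Under those assumptions the chance that a sample lands inside the burst is the burst length divided by the interval, capped at 1. The function name is hypothetical.

```python
def capture_probability(burst_seconds: float, sample_interval: float = 60.0) -> float:
    """Chance that at least one periodic sample lands inside a single burst,
    assuming the burst starts at a uniformly random point in the interval."""
    if burst_seconds <= 0:
        return 0.0
    return min(1.0, burst_seconds / sample_interval)


for burst in (15, 30, 60, 90):
    print(f"{burst:>3}s burst: {capture_probability(burst):.0%} chance of being sampled")
```

In this model a fifteen second burst is caught a quarter of the time and a thirty second one half of the time. Real collection has jitter in when the sensors are actually read, so bursts of about a minute are likely rather than certain to be seen.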

With that said, the regular 1U servers I have reliable IPMI power usage from top out at around 155 watts for a few servers (over the past year), and our NFS fileservers, which are loaded with SSDs, are under 100 watts. These machines don't seem to be putting much stress on their PSUs, to say the least (I think they all have 600 watt PSUs or better, ie bigger). I don't know what to make of this, or if it matters much from an efficiency point of view. Some extra percent of 25 watts is not necessarily a large number, and I don't know if reducing the PSU from 600 watts to, say, 400 watts would improve things much.

(Even if it did, I suspect that there's not enough of a market for it to make it an economical option for basic server vendors. I have a feeling that most people's servers are more loaded and more power-consuming than ours are.)

The Go language server can do some impressive code navigation

By: cks

For reasons outside the scope of this entry, I recently dug into how the Go runtime did (Unix) signal handling on 64-bit x86 Linux. When I undertook this quest, I decided that the easiest way to navigate through the code of the Go runtime was to use the code navigation features exposed by the standard Go language server, gopls. In the process I was surprised by just how good its code navigation was, even in the Go runtime.

On Linux, Go's signal handling talks directly to the Linux kernel rather than going through the C library. As you can imagine, this is relatively architecture and Linux specific, as well as being relatively specific to Unix. The result is a tangle of OS and architecture specific code in a variety of signal related files in src/runtime. The first challenge for code navigation is picking out the right ones that apply to the environment you're interested in; in Go this is handled through build tags, which gopls understands. So gopls had no problem navigating from general Unix signal handling to setsig() in Linux-specific code and the 64-bit x86 Linux definition of the struct involved.

But that was only half the puzzle, because I was looking into how the Go runtime receives signals. This is done by 'sigtramp()' and the related function sigreturn__sigaction(), and it turns out that these functions are not defined in Go. All you'll find in Go is stubs of them at the start of os_linux.go. But gopls had no problems navigating from the stubs to the actual amd64 assembly version, despite the fact that the assembly version has an odd name, and then it was able to navigate from the assembly version of 'sigtramp()' back to the Go 'sigtrampgo()'.

(It turns out that one area where gopls is currently limited for Go assembly language is finding references for assembly language symbols. Fortunately I didn't need that here, all I needed was 'find definition'.)

Code navigation among Go code is not surprising, because that's what you expect from a (Go) language server like gopls. What surprised and impressed me is code navigation into and out of Go assembly code, where I was expecting to have to resort to manual searches with (rip)grep. This is almost certainly a relatively niche feature, yet gopls has basic support for it. This support doesn't come from the standard Go library; instead it's implemented specifically in gopls in internal/asm and internal/goasm.

Another nice trick that gopls can do (that I just investigated) is navigate from an interface or an interface method to everything in your codebase that implements the interface. This is done through (of course) the LSP 'find implementation' code navigation action (in Eglot in GNU Emacs, this is 'C-c i'). Gopls will also navigate backward from a concrete thing to all of the (in-scope) interfaces that it implements (again using 'find implementation'). Slightly inconveniently, if your thing has a String() method, this will report a number of interfaces in the Go standard library. Gopls currently includes non-exported interfaces in the standard library, which is technically correct but extra not useful.

(Specifically, currently this will include context.stringer and runtime.stringer, as well as fmt.Stringer, the public version (and expvar.Var, which has the same shape but incompatible return value requirements). I assume the Go runtime and standard library have multiple versions of this interface internally to limit cross-imports.)

Update: My GNU Emacs 'C-c i' key binding for Eglot's 'find implementation' command (eglot-find-implementation) is a custom personal key binding, not a standard one. Oops.

Using typing in Python leads to different sorts of code

By: cks

So what happened is that I converted a big pile of (highly untyped) Python 2 to Python 3 recently, and then I wanted to experiment with typing-heavy Python LSP servers in GNU Emacs, so I decided to try them out by experimentally adding some type annotations to DWiki, the aforementioned pile of untyped Python (and the code powering Wandering Thoughts). The experience was educational and taught me some new things about type annotations, but it also firmed up my view that typed Python code is different than untyped Python code (although not quite to the extent that they create a different language, as I sort of felt before). There are idioms that are perfectly natural in untyped Python that are pretty annoying to deal with in typed Python.

One of these idioms is dictionaries with multiple types of values. For instance, DWiki has a dictionary that is basically 'a collection of information about the HTTP request'. The authentic type of the values in this dictionary is "str | bool | SimpleCookie | dict[str, str]", which is to say that values can be any of a string, a boolean, a HTTP Cookie, or a dictionary of string key/value pairs. Of course, individual keys in the dictionary have a fixed type for their value; for example, the key 'request-fullpath' only ever has a string value, so in untyped Python code it's natural to write something like:

if reqdata['request-fullpath'] and \
   reqdata['request-fullpath'][-1] != '/':
    [...]

If you do this in typed Python, your type checker will almost certainly complain that this indexing isn't valid for booleans and HTTP Cookies. You need to either check or type-assert that the value is a string.
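A minimal sketch of the two options, assuming the value union described above. The alias and function names are hypothetical. The first version uses an isinstance() check, which type checkers understand as narrowing the union to str. The second uses typing.cast(), which is a pure assertion to the type checker and does nothing at runtime.

```python
from http.cookies import SimpleCookie
from typing import Union, cast

# The union of value types described above (alias name is hypothetical).
ReqValue = Union[str, bool, SimpleCookie, dict[str, str]]


def needs_trailing_slash(reqdata: dict[str, ReqValue]) -> bool:
    path = reqdata.get("request-fullpath")
    # isinstance() narrows the union to str, so indexing is accepted.
    if isinstance(path, str) and path and path[-1] != "/":
        return True
    return False


def needs_trailing_slash_cast(reqdata: dict[str, ReqValue]) -> bool:
    # cast() only tells the checker "trust me"; nothing is verified at runtime.
    path = cast(str, reqdata["request-fullpath"])
    return bool(path) and path[-1] != "/"
```

The isinstance() version is safer if the wrong type ever ends up under the key. The cast() version stays closer to the original untyped code.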

In untyped Python, this is a perfectly decent data structure (although it might not be good style). In typed Python, this is a bad data structure that will cause you pain. There are ways around the pain that preserve the underlying dictionary, but they exist almost entirely to pacify the type checker. A proper data structure in typed Python is not multi-typed like this, or at least it's not multi-typed with a lot of keys.

(One way is to use typing.TypedDict, but if you have a lot of keys it gets painful).
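For illustration, a TypedDict version might look like the sketch below. Only 'request-fullpath' is a real key from the example above. The other keys and all of the names are hypothetical. Because the key has a hyphen in it, the class-based TypedDict syntax can't be used and the functional form is needed.

```python
from http.cookies import SimpleCookie
from typing import TypedDict

# Functional syntax, because 'request-fullpath' is not a valid identifier.
# total=False means every key is optional.
RequestData = TypedDict(
    "RequestData",
    {
        "request-fullpath": str,    # from the example above
        "is-head-request": bool,    # hypothetical key
        "cookies": SimpleCookie,    # hypothetical key
        "form-fields": dict[str, str],  # hypothetical key
    },
    total=False,
)


def needs_trailing_slash(reqdata: RequestData) -> bool:
    # The checker now knows this particular key holds a str.
    path = reqdata.get("request-fullpath", "")
    return bool(path) and path[-1] != "/"


req: RequestData = {"request-fullpath": "/blog/entry"}
print(needs_trailing_slash(req))
```

At runtime this is still an ordinary dict, so existing code that passes it around keeps working. Every key has to be listed with its type, which is where the pain comes from with a lot of keys.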

There's a good reason for this insistence in typed Python, because right now there's nothing preventing me from putting in the wrong type of value for a particular key in this dictionary. I could slip up and set some key that's supposed to have a string value to a boolean, or a key that's supposed to have a dictionary to a plain string. Typing can't detect those errors because any of those are valid for the dictionary in general, just not for that particular key. A proper data structure in typed Python is one where the type checker itself can check your invariants, so string values are separated from boolean values and so on. This would probably also be clearer code.

This is a general issue for any sort of variable-typed container object, return values, or the like. I saw a similar thing when typing my program that uses the email packages; the email packages have old-school polymorphic API return values that typing is not fond of and that required type checks or casts. This is relatively valid on the part of programs determining typing (they're unlikely to ever do full flow control analysis to determine actual types), and is clearly part of the style of typed Python.

(Another case of this in DWiki is that I have a general caching layer that uses pickle to store and retrieve arbitrary objects. The callers know what they're storing and retrieving under a particular key, but this isn't visible in any types I could assign.)

As far as I can see, typing also changes how you want to structure multi-file code with classes and other data structures. In untyped Python such as DWiki, it's natural to have one source file declare a data structure, create an instance of it, and pass it as an argument to a function (or a class) from another file that the first file imports. In typed Python, this doesn't work so well. Because everything that either takes data structures as arguments or returns them wants to name the data structure in type hints, you need the classes for those data structures to eventually be accessible in everything that touches them, which means a tangle of circular imports.

(This is different from forward references in that the code that accepts instances of these data structures will normally never import the code that defines them, cf.)

Circular imports work, technically (as I've sort of written about before), but they make me unhappy. I lack enough experience with typed Python to know the correct approach, but it certainly feels like one should define as many data structures as possible in low level files that are relatively standalone so they can be imported into everything without circular imports. I'm not sure how this works once you want to put methods on your classes that take other classes as arguments and so on.

(Mypy has some suggestions but its answers don't make me feel happy.)
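One commonly suggested approach is to import the class only for the type checker's benefit, guarded by typing.TYPE_CHECKING, so that no import happens at runtime and no import cycle is created. A minimal sketch, where 'pagestore' and 'Page' are hypothetical names standing in for the module that defines the data structure:

```python
from __future__ import annotations  # annotations are not evaluated at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers follow this import; at runtime TYPE_CHECKING is False.
    # 'pagestore' and 'Page' are hypothetical names.
    from pagestore import Page


def render_title(page: Page) -> str:
    """Uses a Page without importing its defining module at runtime."""
    return page.title.strip().title()
```

This keeps the runtime import graph the same as the untyped code. The cost is that the type-level dependencies are still circular, just invisible when the program runs.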

Another practical issue I ran into was that DWiki has a stack of middleware functions to fiddle with HTTP requests. All of the middleware functions take a standard set of four arguments, each with a specific type, and I have enough of these functions that going through and adding the appropriate type annotation to each argument for each function (and the return value) was clearly a pain (in my experiment I only did this for a few). I found myself really wishing for a way to say that the function as a whole had a particular type shape, which would automatically infer the argument and return types. I think the proper way to do this is to pass each function fewer arguments (ideally one), but I'm not sure I like it (and the four arguments aren't tightly coupled to each other).

(I also wound up feeling that I should create a 'types.py' file that had all of the basic type definitions that didn't depend on classes and so on. This would be things like the shape of callable functions, that 'data about the HTTP request' dictionary, and so on. Many of these are used in multiple files in DWiki and this avoids various sorts of annoyances. I don't know if such a 'types.py' file is considered a code smell.)
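As a sketch of what such shared definitions could look like, the following names the middleware shape once as a Callable alias and uses it to type the middleware stack. The four arguments here (a next handler, a label, an environment, and the request data) are invented stand-ins, as are all of the names. This doesn't give the wished-for inference: as far as I know, an unannotated function is accepted where the alias is expected, but its body isn't checked against the alias.

```python
from typing import Callable

# Hypothetical stand-ins for the standard middleware arguments.
Environ = dict[str, str]
ReqData = dict[str, object]
Response = str
Handler = Callable[[Environ, ReqData], Response]

# The shape every middleware function shares, named once.
Middleware = Callable[[Handler, str, Environ, ReqData], Response]


def add_banner(next_handler, label, environ, reqdata):
    # Unannotated on purpose: accepted in STACK below, but not checked inside.
    return f"[{label}] " + next_handler(environ, reqdata)


def upper_case(next_handler: Handler, label: str,
               environ: Environ, reqdata: ReqData) -> Response:
    # Fully annotated: a checker verifies this against Middleware.
    return next_handler(environ, reqdata).upper()


STACK: list[Middleware] = [add_banner, upper_case]


def run(stack: list[Middleware], final: Handler, label: str,
        environ: Environ, reqdata: ReqData) -> Response:
    """Wrap 'final' in each middleware, outermost first, then call it."""
    handler = final
    for mw in reversed(stack):
        # Default arguments pin the current mw and handler for this lambda.
        handler = lambda e, r, mw=mw, nxt=handler: mw(nxt, label, e, r)
    return handler(environ, reqdata)


print(run(STACK, lambda e, r: "hi", "x", {}, {}))
```

A function with the wrong argument count or annotated types would be reported where it is put into STACK, not where it is defined.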

I don't regret my scratch experiments with adding some types to DWiki (partly because I learned more useful things about Python typing), but it's clear that doing it properly is somewhere between infeasible and impossible (and Python typing acknowledges that this can be the case). A reasonable typed version of DWiki would be structured significantly differently, and getting from the current code to any new type-friendly structure would be a significant rewrite (which would fix some old mess but likely introduce new mess).

(The semi-typed results of my experimentation are messy enough that I'm going to discard that copy of the source code.)

(I said something about type hints on the Fediverse and some interesting things came up in the replies, eg.)

My views on some Python LSP servers in GNU Emacs (as of mid 2026)

By: cks

Some languages have to make do with one LSP server. By contrast, Python has an embarrassment of riches; I know of at least five modern LSP servers for it. I've recently been experimenting with some of them in GNU Emacs, specifically Eglot, so before I forget I want to note down my views. The five Python things with LSP servers that I believe are modern and current are python-lsp-server ('pylsp'), Facebook's pyrefly, Astral's ty, Microsoft's pyright, and technically Astral's ruff.

The easiest to talk about is ruff, because it's not intended as a full-featured LSP server that does everything; instead it only does code diagnostics and formatting, and you need another LSP server for code navigation. Currently Eglot doesn't easily support multiple LSP servers and code navigation is a lot of what I care about, so direct use of ruff is off the table for me. Also off the table is pyright, since I don't have any interest in touching a Microsoft Python project or finding out how badly it works with anything other than VS Code (although there's basedpyright as a less-Microsofted pyright option).

Python-lsp-server is my default choice and is a solid basic LSP server with the code navigation features I normally care about, along with support for code diagnostics through either or both of mypy and ruff (via python-lsp-ruff). Python-lsp-server is also what I'd call a 'quiet' LSP server by default, without a lot of stuff popping up and being filled in in Eglot. It's supported by the community and is probably going to endure, but it's written in Python (so it's not the fastest thing) and my impression is that it's more focused on code navigation than on type checking your code. My view is that it's probably your best option if you have a lot of untyped Python code, which is my normal case.

(So after playing around with both ty and pyrefly for some time, I'm probably going to stick with python-lsp-server most of the time.)

Both ty and pyrefly are strongly into type checking and type annotations, in addition to supporting code navigation. Both support 'inlay hints' in Eglot, which fill in known or deduced types for you (and can also attach names to positional arguments in function calls; ty defaults this to on, pyrefly to off). There are some differences in what types they fill in, for example ty will tell me 'Unknown' for types while pyrefly is silent about them (with no inlay hint), and I suspect that there are differences in what types they deduce for things. I don't have enough experience with Python type checking to have strong opinions on the general choice between ty and pyrefly. Both support more or less all LSP code navigation features (ty's LSP documentation, pyrefly's LSP documentation), with pyrefly currently having one more supported navigation ('go to implementations', which lets you find the reimplementation of methods in sub-classes, and now that I've tried it that's kind of handy and it's not currently supported by python-lsp-server).

(Eglot allows you to easily toggle inlay hints off and on with 'eglot-inlay-hints-mode', in case you don't like the noise of them but do want, for example, pyrefly's code navigation. I'm not sure how much unwanted type diagnostics and notes pyrefly or ty will spit out at you on untyped, anarchic Python code bases.)

As before, I think setting up Python LSP support in GNU Emacs is worth it, especially if you're working with typed Python and pick a good LSP server for this. LSP server code navigation is really quite nice and will work across files in your Python project (and pyrefly's support for 'find everything that overrides this method' is handy if you have that kind of code base).

(GNU Emacs can do some amount of code navigation in Python code without a LSP, but you want to create and maintain a tags table and in brief experimentation the experience is not as smooth and more annoying.)

If you want the most deluxe Eglot based Python LSP experience, I think you want to set up pyrefly with however many inlay hints you want. Since I slogged through the effort to determine what special Eglot configuration you need for this, I will save people the effort:

(setq-default eglot-workspace-configuration
   '([...]
     :python (:analysis (:inlayHints (:callArgumentNames "partial")))
    )
 )

As (sort of) covered in pyrefly's LSP documentation, pyrefly doesn't use its own name for these settings, it uses names that pyright apparently originated. Fortunately Eglot will send (all of) your settings to whatever LSP you're currently running, regardless of their names. I believe you can also configure this in per-project configuration files, which would also let you entirely disable pyrefly type checking in places where you don't want it (per the configuration documentation).

(Some bits of the pyrefly experience in GNU Emacs will get more deluxe in GNU Emacs 31, when Eglot will acquire support for reporting things like call and type hierarchies.)

Sidebar: A brief experience with basedpyright

I ran a little poll on the Fediverse and a surprising number of people (to me) turned out to use pyright or basedpyright, so I gave it a try. The result is, effectively, a failure for my code. Even code that I thought was well typed and free of problems came out full of diagnostics in basedpyright's default configuration. It does have more or less the same code navigation features as pyrefly, but for me the cost of getting them is too high.

But if you want to write extremely strictly typed and careful Python code, basedpyright will make you do it (assuming you make it have no errors and keep its strict default settings).

(The poll also suggested that very few people use pyrefly, which surprised me a bit.)

Anti-robot techniques can be nice but the problem is, they're not static

By: cks

I've recently come up with what I expect would be a quite good anti-robot, anti-crawler tactic, which I will give the snappy label and summary of "robots don't POST". Simply require a HTTP cookie to see your web pages and then if visitors don't have the cookie, put up an interstitial page with a HTML form that requires them to POST it to get the cookie. All the form needs is a "click me to get your entrance cookie", because right now, few or no robots or crawlers will make that HTTP POST request; they only do HTTP GETs. To distract bad crawlers you might need some other links on the interstitial page, optionally going to content tarpits.

(If you're going to do this in practice you'll want to exempt syndication feed requests and perhaps requests from bingbot, Googlebot, and so on. Although maybe not Googlebot any more.)
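The core decision logic of "robots don't POST" can be sketched as a small function, independent of any web framework. Everything here is an assumption for illustration: the cookie name and value, the '/enter' form endpoint, and the function names. It also skips the exemptions mentioned above, and a real deployment would want a signed or random cookie value rather than a fixed one.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "entrance"   # hypothetical cookie name
COOKIE_VALUE = "ok"        # fixed value for the sketch only

INTERSTITIAL = (
    "<html><body><form method='post' action='/enter'>"
    "<button type='submit'>Click me to get your entrance cookie</button>"
    "</form></body></html>"
)


def has_entrance_cookie(cookie_header):
    """Check a raw Cookie: header value for the entrance cookie."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(COOKIE_NAME)
    return morsel is not None and morsel.value == COOKIE_VALUE


def decide(method, path, cookie_header):
    """Return (status, headers, body) for a request."""
    if has_entrance_cookie(cookie_header):
        return 200, {}, "real page content for " + path
    if method == "POST" and path == "/enter":
        # The visitor submitted the form: hand out the cookie and redirect
        # them back to the site with a GET.
        headers = {
            "Set-Cookie": f"{COOKIE_NAME}={COOKIE_VALUE}; Path=/; HttpOnly",
            "Location": "/",
        }
        return 303, headers, ""
    # Everyone else without the cookie only ever sees the form.
    return 200, {"Content-Type": "text/html"}, INTERSTITIAL
```

For simplicity this redirects to the front page. A friendlier version would remember the page the visitor originally asked for and send them back there.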

The obvious problem with this technique is that if people start doing it in any quantity, the "robots don't POST" thing won't last. Bad crawlers will start hitting POST endpoints for forms that just have a "click me" button, and then POST endpoints for forms that have an "I am human" tick box to mark or a field to fill in or whatever the elaboration people come up with is, and so on. Bad crawlers are in an arms race with websites and this is a problem.

Arms races require two active participants. An inactive participant in an arms race usually loses by default. In today's environment with aggressively bad crawlers, you can't simply set up a website and walk away from it, not if you want it to survive; you're forced to participate in the arms race. Your website may be static but your operation of your website increasingly can't be, not unless you want to wake up one day and discover that you don't have a website, you have a smoking hole in the ground and perhaps a big bandwidth bill from your hosting provider.

I don't have any answers to this. Instead, it feels like this whole situation is another obstacle in the way of people having their own low-attention websites (after the comment spammers made it impossible to have your own low-attention comment system). Someone has to pay attention, so that's either you or someone you outsource it to, and that someone is most likely going to need to be paid sooner or later.

(There are exceptions, but they're rare. Also, if you run your own website you sort of have to maintain the software involved, but automatic updates (and static websites) have mostly made that easier.)

An idea: user level WireGuard for UDP based encryption and authentication

By: cks

In some environments, you want to connect programs together with mutual authentication and encryption of their traffic (so each end can trust the other and the traffic is immune to easy eavesdropping). If the programs are talking to each other over TCP, there's a well developed solution for this in the form of mutual TLS (mTLS) (although you'll probably get to enjoy the fun of running your own private Certificate Authority). But if you're using UDP, things are less clear. When this came up recently in a Fediverse discussion I was a peripheral part of, it occurred to me that we already have an existing, well regarded, UDP-based mechanism for authentication and encryption in the form of WireGuard.

(Yes, there's QUIC, but that still leaves you with TLS and it gives you a reliable stream model instead of a UDP model.)

WireGuard is normally used to create general networking connections between two machines, a connection that other programs can use to pass whatever traffic they want. But in theory it doesn't have to be used this way. There are purely user-level WireGuard libraries, and if you have a suitable library, you can have the program it's embedded in receive and handle the packets from inside the WireGuard connection, without injecting them into the operating system and exposing them to other programs (and it can also send its own packets back). You're probably going to need your own user level IP implementation to make the WireGuard library happy, and you may want or need a user level UDP or TCP implementation to make handling your own traffic simpler, but all of those are available if you hunt around.

What you get out of this is a well regarded protocol with simple, straightforward authentication, and to some degree it handles out of order packet delivery and packet loss, which is presumably something you care about if you picked UDP to start with. You don't have to deal with the complexities of any variant of TLS, you don't need a private Certificate Authority, and you can always directly know what one program is willing to talk to, because you have a list of public keys.

The drawback of this is that you have to put it together yourself. If you can find a suitable QUIC library (eg), it should do all of this for you in a hopefully straightforward API that looks a lot like (TCP based) TLS. The one potential drawback of the QUIC approach is that I believe it's only a stream based protocol without UDP-like, out of order delivery (and possible packet loss). If what you want is 'an authenticated, encrypted stream but over UDP', then QUIC is probably more like this than WireGuard. If what you want is 'authenticated, encrypted UDP', then WireGuard might be closer to that than QUIC.

Sidebar: WireGuard, QUIC, and out of order packets

As far as I can tell from reading the basic WireGuard protocol summary, each WireGuard encrypted packet is independent. While packets are sort of ordered by a counter, WireGuard can handle them out of order and will accept them out of order within a window (cf). If you're sending UDP datagram traffic within WireGuard, I believe this means that your underlying system can still have the UDP properties of non-blocking, non-sequential receives.
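To illustrate what "accept them out of order within a window" can mean, here is a minimal Python sketch of a sliding window check on packet counters. This is a generic anti-replay window and not WireGuard's actual code; the class name and the default window size of 64 are assumptions for illustration.

```python
class ReplayWindow:
    """Accept each counter at most once, tolerating reordering in a window."""

    def __init__(self, size=64):
        self.size = size      # how far behind the newest counter we accept
        self.highest = -1     # highest counter accepted so far
        self.bitmap = 0       # bit i set means (highest - i) was seen

    def accept(self, counter):
        if counter > self.highest:
            # A new highest counter: slide the window forward and
            # mark this counter (bit 0) as seen.
            shift = counter - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = counter
            return True
        offset = self.highest - counter
        if offset >= self.size:
            return False      # too old, it has fallen out of the window
        bit = 1 << offset
        if self.bitmap & bit:
            return False      # already seen, so a duplicate or a replay
        self.bitmap |= bit
        return True           # late, but inside the window and new
```

A receiver would call accept() with each packet's counter after authenticating the packet, and drop the packet if it returns False. Packets that arrive late but within the window are still delivered, which is what preserves the UDP-like behavior.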

As I understand it, a single QUIC connection carries multiple streams within it and each of these streams runs independently, so packet loss and delays on one stream don't affect other streams. However, I believe packet loss and packet reordering will block a single stream because it's, well, a stream.

How our environment still needs the security boundary of Unix logins

By: cks

In a comment on this recent entry, I was asked if we still considered Unix logins to be a serious security boundary. This is a sensible question; there are a horde of Linux local privilege escalation vulnerabilities going around right now (and one FreeBSD one for spice), and in general (some) security people have been saying for years that once an attacker has local code execution, the game is over. Our answer is that yes, we consider it a serious security boundary, and if that situation ever changed we'd need a drastically different system environment from our current environment.

Our current environment has shared NFS fileservers where people keep all their files and data, shared login servers for both general usage and compute, a (shared) SLURM compute cluster, and a reasonably flexible shared web server environment where people can run programs. While some people are still using our login servers interactively, others are running software (such as VSCode) that connects to them somewhat behind the scenes and uses them to run tools. All of this is critically dependent on the security provided by Unix logins; if Unix logins weren't a real security boundary any more, anyone on any of these machines could read other people's files or run programs as them.

Since these machines are all shared machines with multiple people logged in at once, switching to Kerberos authenticated NFS wouldn't solve the problem. If we assume that attackers can merely become any other person, then they can gain access to the Kerberos tickets of anyone else who's currently logged in and access their files. If we assume that attackers can compromise root, then all bets are off and once a person has used that machine it can't be trusted for any future use (since the attacker could have compromised programs to capture the login credentials of future people logging in).

Basically, if you lose the security boundaries of Unix logins, you lose shared machines. You need to create a new environment without sharing (or with sharing boundaries that people can't break out of). Today, it appears that the only way to do that securely is a separate virtual machine for each person, with Kerberos authentication to our NFS fileservers (given some of the Linux security issues, containers are clearly not good enough). I'm not sure how you manage a SLURM cluster in this environment, but it certainly wouldn't be the straightforward way we do it today.

This would be a drastic change for people here and it would also be a significant increase in resource requirements (since realistic virtual machines are much more heavyweight than even full login sessions). We couldn't leave 'your' virtual machine (or machines) running all the time (we have too many people using our systems for that), so you'd have to use some web interface to request it be started with some resource allocation. Managing, maintaining, and updating these virtual machine images and running VMs would be at least a bit painful, and people would probably experience more disruption in their activities. Some things would become effectively impossible, such as running CGIs on our web server.

My views on Flymake and Flycheck in GNU Emacs (as of mid 2026)

By: cks

One of the divisions in GNU Emacs people is between using Flymake, which is built into GNU Emacs and is well supported by other standard GNU Emacs packages such as Eglot, and using Flycheck. I've used Flycheck for a long time (cf) and recently tried using Flymake, which has given me some pragmatic opinions for my own usage.

(For non GNU Emacs people, Flymake and Flycheck both exist to present (and to some extent detect) 'diagnostics' about your code or whatever file you're editing.)

For me, Flymake and Flycheck are about as good as each other, at least in LSP based environments and Emacs Lisp. Flymake is better integrated into Eglot and can make errors more visible, Flycheck comes with more keybindings by default, and I go back and forth about how I feel about their modelines (after I diminished Flymake's verbose modeline name down to 'FlyM' and changed the colours a bit). Why I prefer Flycheck is that it's more flexible in one way that matters to me.

My particular taste with checkers is that by default I only want to see actual errors (or relatively strong style issues), but I want to have access to linters that express views I may not agree with in order to see what they say and maybe fix some things they complain about. This way I can keep my code free of real, core issues (that are reported by the error linters) and have a nice clear modeline showing '0' issues (and not have to remember how many baseline non-issues a file has), while still being able to conveniently see style issues if I want to consider them.

As far as I can tell, Flymake has no built in support for (easily) changing what sources of diagnostics it draws on. Things are just magically supposed to get it right, which is fine if they actually do but sub-optimal if they don't. One case where they don't necessarily is in Eglot, where as far as I know the normal diagnostics will only come from the LSP server you're running and will cover only what it provides. Even in cases where it's possible, changing what diagnostics you get from a LSP server isn't simple.

Perhaps because you can switch Flycheck checkers around, there are a bunch of third party Flycheck packages that support optional Go and Python style checkers (and some for other languages). Flymake has some third party checkers, but not really in the way Flycheck does (and what third party checkers it has can be rather out of date). The Flycheck situation is convenient and useful for me, because it means I can easily run (for example) golangci-lint against my Go code within the Flycheck framework with all sorts of jump to complaint support.

(There is an adapter to connect Flycheck checkers to Flymake, but as far as I know you're still left without a convenient way to pick your checker.)

Although Flycheck is my default, I've kept my Flymake configuration around and wired up some personal functions so that I can switch back and forth (either buffer locally or globally). Sometimes I flip over to Flymake to see what it says or use some of its other features.

(There's also Flycheck's comparison page with Flymake. A bunch of the differences that Flycheck lists aren't important to me, partly because I don't use GNU Emacs to edit everything in sight so the large collection of languages and configuration files that Flycheck supports aren't as important.)

PS: I'm dating this in the title because both Flymake and Flycheck have changed over time. My impression is that Flymake stagnated for a while, putting Flycheck clearly ahead in those days, but that things are more even today (especially in LSP environments, where both are getting the same diagnostics from the LSP server).

Notes on respectfully getting a personal copy of a website's contents

By: cks

Suppose, hypothetically, that you want to have a personal copy of the content of some website that you feel is important (to you). There are perfectly good reasons to want such a copy; websites go away all the time on the Internet, and not everyone is online all of the time. It's generally possible to do this (and it's certainly possible to do this with Wandering Thoughts), but there's some things the hypothetical you is going to more or less need to do. These things will be work, but that's the difference between successfully getting a personal copy and turning a brute force crawler loose and then getting ratelimited and blocked. It's also the difference between being polite and being rude, and hopefully you care about that.

(With the increasing decay of Internet search engines, you might also want to build your own personal index of useful website content.)

First, you need to work out the URLs for the real content of the website. Many websites of interest have some mixture of real pages and various sorts of indexes and other aggregations of those real pages, and it's not uncommon for the index pages to outnumber the real pages, sometimes vastly. Your personal copy of the website contents doesn't need all of those index pages, you probably don't want them because they'll inflate the size of your copy, and the website itself will probably be unhappy that you're fetching a ton of redundant index pages.

(The amount of index pages varies with site design. Static sites are usually much friendlier than dynamic sites because it's more work to have a lot of index pages in a static site.)

If you're extremely lucky, the website will have an accurate, up to date (XML) sitemap and will put a tag mentioning this in the HTML <head> of its pages. If you're not so lucky you will have to manually look around to see if it has any particular index pages that you can mine for URLs (eg) and then work out what additional links and pages you need to also fetch to get what you consider a full copy (for example, to also get comments or 'talk' pages or the like, or to fetch images used in the web pages). In less friendly cases you'll have to go through a whole collection of category pages to accumulate the URLs.
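If the website does have an XML sitemap, pulling the URLs out of it is simple. Here is a minimal Python sketch, assuming the sitemap uses the standard sitemaps.org namespace and that you have already fetched it; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The namespace used by standard XML sitemaps (and sitemap index files).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_data):
    """Return every <loc> URL in a sitemap, in document order."""
    root = ET.fromstring(xml_data)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls
```

A sitemap index file also uses <loc> elements, so on one of those this returns the URLs of further sitemaps that you would then fetch (slowly) and parse in turn. You'll still want to look over the result and weed out any index pages that the sitemap lists.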

(It's possible that the website supports paged syndication feeds and you can go back through its syndication feed to collect a full set of initial URLs, but I suspect that's not any more likely than a discoverable sitemap.)

Having accumulated your list of URLs, it's time to start fetching them, respectfully. Respectful fetching means doing two things: working slowly, and having an honest HTTP User-Agent. Working slowly means that getting a full copy will take a significant amount of time, but unless you think the website is going to go away tomorrow, you have that time. By 'slowly' I mean a request rate of one every 30 seconds or every minute, and if you get HTTP 429s or other indications of rate limits, you should slow down, even if you think this is absurdly slow. In my view, an honest HTTP User-Agent admits to what you're doing and optionally names the software you're using to do the fetching, because the web site operator cares much more about why these requests are happening than that you're using curl, wget, or whatever to make them.

(You especially shouldn't pretend to be a regular browser, or directly use a headless one. In these days of aggressive stealth crawlers, that makes you look extremely suspicious and may well get you blocked rapidly.)

Once you start fetching, you should monitor your fetching for problem indicators. Basically anything other than a HTTP 200 success may be a sign that either you have the wrong URLs or that you're in some way not welcome to do what you're doing. Continuing despite a spate of HTTP redirections or HTTP errors isn't particularly useful for your content copying project; you're only going to have to weed the results out of your copy.
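Put together, respectful fetching can be sketched in Python as follows. This is only an illustration under assumptions: the User-Agent text, the function names, the 30 second starting delay, and the ten minute cap are all made up, and you should put your own contact information in the User-Agent. The fetch and sleep functions are parameters so that the pacing logic can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request

# Hypothetical User-Agent: say what you're doing and how to reach you.
USER_AGENT = ("personal-copy-fetcher/0.1 "
              "(making one personal copy; contact: you@example.org)")


class FetchStopped(Exception):
    """Raised when the website appears to be telling us 'no'."""


def http_get(url):
    """Fetch one URL; return (status, final_url, body)."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status, resp.geturl(), resp.read()
    except urllib.error.HTTPError as e:
        return e.code, url, b""


def fetch_all(urls, fetch=http_get, sleep=time.sleep,
              delay=30.0, max_delay=600.0):
    """Fetch URLs one at a time, slowly; stop at the first problem."""
    results = {}
    for url in urls:
        status, final_url, body = fetch(url)
        while status == 429:
            # We were told to slow down: double the delay and retry,
            # giving up if we'd have to go slower than max_delay.
            delay *= 2
            if delay > max_delay:
                raise FetchStopped(f"still rate limited at {url}")
            sleep(delay)
            status, final_url, body = fetch(url)
        if status != 200 or final_url != url:
            # Redirections and errors mean either bad URLs or that
            # we're not welcome; either way, stop and look.
            raise FetchStopped(f"{url}: HTTP {status}, ended at {final_url}")
        results[url] = body
        sleep(delay)
    return results
```

Stopping outright at the first non-200 result is deliberately conservative. The point is that a human looks at what went wrong before any more requests are made.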

(Also, continuing when a website is telling you 'no' is being rude. You're saying that your desires are more important than the website's views, and this generally makes you a certain sort of person.)

What all of this will get you is a personal copy of the website's content, possibly in addition to a skeletal set of index pages that you can use to navigate through it (you collected these pages when you built the initial URL set). It won't get you a complete archive of the website in HTML form that you could stick up somewhere else. A full website archive is a different thing, one that websites may be much more hostile to depending (in part) on how much redundant content you will wind up crawling in order to assemble your 'complete' version.

(Even if what you want is a full archive of everything, including index pages, starting with the important content first gets you the important content if something goes wrong.)

PS: Wandering Thoughts has a sitemap, which I bashed together many years ago to make Google happy and then found it was convenient for testing because it gave me a list of all pages that I really cared about the HTML rendering of. Interested parties can access it by putting a '?sitemap' on any directory URL. It's not (currently) in the HTML <head> of any pages because when I set it up, that wasn't really a thing. Given the modern web environment, I'm not certain I'll ever make it visible in the HTML <head> because I'm not certain I want to hand every abusive crawler a nice obvious map to the juicy bits.

(I have no idea how long it's been since Google accessed the sitemap; I suspect it's been years. But then, I increasingly don't care about Googlebot, although that's another entry.)

Unix has been changing, but in places where I don't see it

By: cks

For reasons beyond the scope of this entry, I've wound up thinking about how stable or unstable the Unix landscape has been 'recently' (which means for more than a decade, and especially as compared with the 1990s and early 00s). I've written about aspects of this before, such as the fading out of multi-architecture Unix environments. In thinking about it more after my Fediverse post, I've come to feel that the Unix environment has still been changing but in places where I'm not as conscious of it.

The biggest change is probably the growth of cloud Unix, which I could characterize as "Unix machines on demand". In practice, cloud Unix is a whole new Unix environment that is quite different from traditional Unix, with different tools and especially different practices. Some of the practices are (sort of) extensions from old fashioned large scale Unix administration but many aren't really. I'm aware of cloud Unix and this gulf between operating it and what we do if I think about it, but I don't usually.

Cloud Unix practices spill over into what people want to do outside of the cloud, in the form of things like containers. Operating software through containers is quite different from traditional Unix system administration, especially if responsibility for the containers themselves gets moved from the system administration team to other people.

(There's also the idea of immutable systems created through declarative means, which isn't mainstream but also isn't a tiny corner any more. You can find plenty of people using Unix this way on servers and even desktops.)

I think that all of this has led to a significant change in how people experience Unix. Increasingly, Unix is either a desktop environment (not necessarily a graphical one, consider WSL in Windows) or a backend target; it's not something you explicitly (remotely) log in to very much. We've seen less and less direct use of our login servers and more use that is, for example, modern desktop IDEs starting remote sessions to run development tools on our servers. If VSCode could start SLURM jobs for people, some people here might never explicitly log in to our compute servers. I personally still log in to lots of remote Unix machines, but I'm increasingly an exception.

(I can't throw stones here since I recently carefully set up my desktop GNU Emacs so I could run remote LSP servers (and Git) through Tramp.)

A quiet but significant development is that after narrowing to x86 in practice for a while, Unix is moving back to being multi-architecture. There are a steadily increasing number of ARM servers and ARM devices that run Linux and other Unixes and that you'll find in the wild, primarily in clouds and as small Unix computers that you might put on your network to do specific jobs. It's plausible that some day we'll also get RISC-V servers and devices, or see ARM on general (Unix) desktops. People now routinely care about multi-architecture support for languages, compilers, distributions, and so on, where I think ten or twenty years ago that was a relatively niche concern.

(We've actually looked at small ARM-based Unix devices repeatedly and passed on trying to actually operate any of them for various reasons. Moderate sized, general purpose ARM servers don't seem to really be a thing so far, but maybe someday.)

In Linux, systemd is a drastic (and good) change in how init systems work and how you interact with them, and makes that part of system administration relatively different from the pre-systemd days. Although I don't know when exactly it happened, the BSDs have gone through a similar evolution that regularized and improved the old ad-hoc BSD init system, making it rather easier to operate. This is probably the most dramatic change a system administrator from 2006 would notice if you jumped them 20 years ahead to today (and had them work with on premise servers without containerization).

There are certainly things that are part of my day to day use or at least administration of Unix that weren't there a decade or two ago. Even on old fashioned on premise servers, there's a lot more JSON and YAML than there used to be, partly because JSON has become the universal program-readable output format that everyone can agree on (and good tools, such as jq, have become widely available). But broadly, I feel that Unix has carried on being Unix and the experience of logging in and using the environment hasn't changed dramatically. If anything, different Unixes have become more similar, partly because lots of Unixes use the same programs (such as Bash and vim) and partly because Unixes have converged on common options for common programs (through both POSIX and pressure from people using them).

(Bash and vim aren't necessarily the default experience on all Unixes, but they're commonly available, partly because people want them.)

PS: The switch from X to Wayland is (still) a change that's in progress, but at the same time it's broadly supposed to be an invisible one to most people. Whether it should count as a change in Unix I will leave up to you.

Sidebar: My history with universal dotfiles

A long time ago I tried to have universal dotfiles for my shell environment across all of the multiple Unixes that I then had accounts on. The result was complicated, with lots of per-Unix and per-group settings. Today, I'm relatively certain that I could do a version for the surviving Unixes and system environments (and accounts) that had almost no conditionals. Some of this is through Unixes converging, some of it is through vendors with weird Unixes going away or becoming irrelevant to me (I'm unlikely to ever log in to another AIX machine, or a Solaris family one), and some of it through a relative convergence in how to administer machines.

Notes about reading messages with the Python email packages

By: cks

I have a long standing personal program to display MIME formatted email messages in the terminal in a sensible way (it was mentioned in this old entry on my email tools and its comments). For a long time this was a Python 2 program, using the Python 2 version of the email package. Recently, I moved this program to Python 3 as part of my sudden enthusiasm for Python 3 conversions, using the Python 3 version of email and its sub-packages. In the process I have wound up with some notes and opinions on practical use of the Python 3 email packages.

(The Python 2 version of email had its own quirks and oddities, but I worked all of those out the hard way years ago, have mostly forgotten them since, and they're not interesting any more now that the era of Python 2 is over.)

The Python 3 email documentation will tell you that the modern interface for email messages is email.message.EmailMessage. The older email.message.Message is (theoretically) only there for Python 3.2 compatibility and you should ignore its methods and use only the EmailMessage methods. This is not entirely the case. If you look behind the curtain, you'll discover that many of the EmailMessage APIs for reading message contents are in fact Message APIs with masks on, and especially they're various masks for Message.get_payload(). That get_payload() isn't obsolete in practice matters, because it turns out that get_payload() is the only way to do certain things you (I) need.

As with decoding email headers, my strong impression is that the entire set of email parsing and message reading APIs are only really designed to deal with well formed email messages with fully correct MIME. This isn't what you find out in the real world, both due to programs being imperfect and also due to things like other mail systems sending you a bounce message that includes a message/rfc822 version of the original message where the other mail system has retained all of the message headers, including the Content-Type that says the original message was a multipart/alternative, but has replaced the entire body of the message with '(Body suppressed)'. As far as I can tell, there's no EmailMessage API that will give you (just) the body text of that (malformed) message/rfc822; your only way to dig it out is to use the older Message.get_payload() API.

(That bounce example is a real case that I've seen.)

At the same time, EmailMessage.get_content() is a handy API that does a lot of the work for you for things like extracting a de-mangled, Unicode version of a text part (or anything that's sufficiently text-like, although you will get back a bytes thing instead of a str and then decode it yourself). So I use get_content() as much as possible but some things have to fall back to get_payload(). The one thing I'm cautious about with get_content() is that it has a cheerful trust in the asserted character set encoding of the MIME part, when I'm pretty certain that some mail creation programs blithely assume you'll typically interpret stuff as UTF-8 (especially if it has no type specified, which in theory means ASCII).
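A minimal sketch of this "get_content() first, get_payload() as the fallback" approach might look like the following. The function name is made up, decoding bytes as UTF-8 is an assumption, and it relies on my understanding that the default content manager raises KeyError when it has no handler for a part's content type (which is what happens for a part that claims to be multipart but has no actual sub-parts).

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def part_text(part: EmailMessage) -> str:
    """Best-effort text for one MIME part that should be a leaf."""
    try:
        content = part.get_content()
    except KeyError:
        # No get_content() handler for this content type, eg a part
        # labeled multipart/* whose body is just a lump of text.
        payload = part.get_payload()
        return payload if isinstance(payload, str) else ""
    if isinstance(content, bytes):
        # Non-text parts come back as bytes; UTF-8 is only a guess.
        return content.decode("utf-8", errors="replace")
    return content if isinstance(content, str) else ""


# A cut-down version of the malformed bounce attachment described above.
broken = message_from_bytes(
    b'Content-Type: multipart/alternative; boundary="xyz"\n'
    b"\n"
    b"(Body suppressed)\n",
    policy=policy.default,
)
print(part_text(broken))
```

This sketch doesn't try to second-guess the declared character set of text parts; it takes whatever get_content() decoded.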

(get_payload() will also probably give you heartburn if you're trying to use typing, but this is a general email problem with API typing.)

The email package parses your messages with stuff in email.parser, which has some additional notes on how it theoretically parses things. Some of these notes are experimentally false, especially the one for message/delivery-status. The actual story is in comments in the source code:

message/delivery-status contains blocks of headers separated by a blank line. We'll represent each header block as a separate nested message object, but the processing is a bit different than standard message/* types because there is no body for the nested messages. A blank line separates the subparts.

Although the actual text of a message/delivery-status part is plain text (admittedly in a specific format, in theory), the parsed version is a multipart EmailMessage object containing a series of text/plain EmailMessage children, where the actual contents are in the headers of those text/plain children (and the 'body' is empty). The best way to extract the actual contents as text to print or process them is to use EmailMessage.as_string() on each child. This is quite confusing if you expect a message/delivery-status to have obvious contents or to match the documentation (and EmailMessage.get_content() doesn't work right on the multipart parent object; this may be a bug that will be fixed at some point).
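Here is a minimal sketch of getting the text of a message/delivery-status part this way, using a small made-up delivery status as the example input. The function name is hypothetical and it assumes the message was parsed with a modern policy such as email.policy.default.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def delivery_status_blocks(part: EmailMessage) -> list[str]:
    """Return each header block of a message/delivery-status part as text."""
    payload = part.get_payload()
    if not isinstance(payload, list):
        # Not parsed into children; hand back any raw text as one block.
        return [payload] if isinstance(payload, str) else []
    # Each child holds one block of status fields in its headers, with
    # an empty body, so as_string() gives us the text of the block.
    return [child.as_string() for child in payload]


# A made-up example with a per-message block and a per-recipient block.
raw = (
    b"Content-Type: message/delivery-status\n"
    b"\n"
    b"Reporting-MTA: dns; mx.example.org\n"
    b"\n"
    b"Final-Recipient: rfc822; user@example.org\n"
    b"Action: failed\n"
    b"Status: 5.1.1\n"
)
msg = message_from_bytes(raw, policy=policy.default)
for block in delivery_status_blocks(msg):
    print(block.rstrip())
    print("--")
```

Since the children have no Content- headers of their own, as_string() here doesn't add anything that wasn't in the original text.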

PS: The reason you don't want to use .as_string() on text or broken MIME parts is that MIME parts have headers, namely the various Content- ones, and .as_string() will give you those headers as well as the text you want. There's no option in the EmailMessage API to not get the headers.

Sidebar: Types for email stuff

Because sometimes I get enthusiasms, I added types to my program that's using email. It was somewhat painful and the kind of thing that you describe after the fact as "a valuable learning experience". In order for future me to not lose that learning experience, here's some notes.

My first problem was that often, mypy inferred that something was an email.message.Message instead of an email.message.EmailMessage; the latter is a subclass of the former. Much of this could be fixed with isinstance() to create type narrowing. I found the most convenient way to do this to be an assert(), for example:

prs = email.parser.BytesParser(policy=...)
m = prs.parse(fp)
assert(isinstance(m, EmailMessage))
[...]

Here I know that email.parser.BytesParser will return an EmailMessage because that's what my policy is set up to do (cf), but mypy can't see that.

A more involved situation is the return value of Message.get_payload(), which mypy typically typed as including 'list[Message]' when I know that what I have is a 'list[EmailMessage]'. Fixing this requires typing.cast():

def showalternative(p: EmailMessage) -> None:
  m = p.get_payload()
  if isinstance(m, str):
    [...]
    return

  assert(isinstance(m, list)) # for safety
  m = typing.cast(list[EmailMessage], m)
  [...]

You need to use typing.cast() to correct mypy's idea of the member type of a list or other container.

(Technically mypy and any other type checker that does similar inference. I don't know my way around the Python typechecker landscape, although I've wound up with a few of them installed.)

The hardware needs of our mail system (as of mid 2026)

By: cks

In a comment on my entry on universities, email, and the issues of running things in house, I mentioned that our departmental email system has a non-trivial cost in hardware alone to keep going. To better illustrate that, I'll describe all of the servers that our email system currently requires (because it's more than one). Some of these servers exist for historical reasons and may go away at some point, but many of them don't.

Currently, we have:

  • A server as our external mail gateway (our DNS MX target). This is separate from other mail servers because it's much simpler to configure and operate this way.

  • A server for the (FOSS) anti-spam and anti-virus software we use (and everyone needs some version of). This could be folded into the mail gateway server (and it was in our recent backup MX), but we weren't sure about the software's resource usage and system impact when we set it up. Keeping it separate also means we can move it to a new OS version for more up to date software without having to worry about any changes in new versions of the mailer that the mail gateway runs.

  • A server for our central mail machine that handles all aspects of email to local addresses, which for various reasons (cf) can include sending email to the outside world. This machine doesn't store any email locally; instead, to simplify slightly, email lives on our general purpose NFS fileservers.

  • A separate server to handle forwarding known spam to outside email addresses. We're required to support this by people using our email system and we found it necessary to put this work on a separate machine.

  • A server to handle unauthenticated mail submission from inside our networks. Separating mail submission from the central mail machine makes for a simpler configuration for both (eg), and we historically started with only an unauthenticated mail submission machine.

  • A fairly powerful server to handle IMAP and authenticated SMTP submission, which these days also has /var/mail (where all our inboxes live) on local storage and thus also acts as a NFS server.

  • A server for a webmail frontend (to our IMAP server). We put this on a separate server from IMAP for multiple reasons, including resource usage and that it decouples the OS and packaged software version requirements of our webmail (for instance, certain versions of PHP and Apache) from everything else.

We've found it very important for practical reasons to use separate IP addresses for different sorts of outgoing email (also). We can do this on a single machine (and we do), but in many ways it's simpler to use separate machines for different sorts of email. It's also simpler to handle things like rate limits if we use different machines for things that need different rate limits.

All of these servers rely on existing elements of our general infrastructure, such as our general purpose NFS fileservers, our local DNS resolvers, and our system of propagating account information. I hope that at some point in the future our IMAP server machine will also wind up relying on our local OIDC identity provider (and indirectly on the LDAP server it uses), but that's currently not possible in practice. I'm mentioning these because a stand-alone mail environment would require some equivalent of all of them; you have to store mailboxes somewhere, get account and authentication information, do DNS resolution, and so on.

Most of these servers are 'basic' 1U servers, which these days means that they have 16 GB to 32 GB of RAM, a mirrored pair of SATA SSDs, a reasonable CPU, and traditionally cost a few thousand dollars each if bought new (their prices are probably higher at the moment). These specifications are good enough that we don't have to worry about the exact resource requirements of each server's job (although we made sure to give the anti-spam software machine 32 GB of RAM and a decent CPU). If we used smaller machines we'd have to be more careful; I'm pretty sure that not all of these roles would be happy with only 8 GB of RAM in practice (much less 4 GB). Basic 1U servers used to be cheaper, and these days we've got a stock of older servers that are good enough for these jobs. But if we were setting up a green field environment from scratch and had to buy all of these new, five or six servers (possibly plus a spare) would be a non-trivial cost.

(Because we're using the same sort of servers for these as we use for everything else, there's no dedicated spare for specific machines; we have spare server hardware in general.)

The one server that is an exception is our IMAP server. The current version has 64 GB, four relatively large SATA SSDs, a decent CPU, and 10G-T networking, and because it's so important we have a spare server ready to be pressed into use immediately in case of a hardware failure. The current hardware is old enough that we'd like to replace it, this time with more memory (so more things get cached) and NVMe SSDs instead of SATA ones. Unfortunately, in the current environment the price quotes we got are jaw dropping and unpleasant (especially since we have to buy two of the basic server to have a spare, although we don't need two sets of the NVMe drives).

All of this serves a department with somewhat over a thousand active people, about 1.5 TBytes of inboxes (if we talk about the likely uncompressed size; since we use ZFS for /var/mail, we have compression turned on), and an inbound mail volume that is probably around 10,000 messages a day. As mail system sizes go, this is modest.

(We have several thousand inboxes (and Unix accounts to go with them), but many of them are inactive for various reasons. The size distribution of inboxes is also extremely uneven, as you might guess.)

(Publication of this entry was delayed by me getting distracted and forgetting to actually publish it last night. I didn't realize it was still sitting in my drafts area until I noticed the stray editor window just now.)

Unicode and Emoji in terminals, or my simple but difficult wish

By: cks

On the Fediverse, I had a simple sounding wish:

This is my face that I need a simple pagination program for Unix (show a page, pause, hit CR to show the next page) that is Unicode and emoji aware, so that it knows how long lines with them are. AFAIK less can't be used for this when I don't want it to ever clear the screen, just to keep printing the next page for however long.

(... because you have to know how long lines of text are so that you know when you've printed a full page.)

(So what I'd like is 'cat, but paginated'.)

This sounds like a simple, easy wish. Some of my readers are now laughing flatly, because it's not. In fact I believe it's impossible to write a simple general program to do this; you need either terminal program specific knowledge or to do some relatively extreme tricks as you print text.

Once upon a time, physical terminals and thus terminal programs were simple. They showed a set of characters in a monospaced grid, commonly with bytes mapping one to one to displayed characters (I'm ignoring DEC's double-sized character escape sequences). In this world, 'cat but paginated' is relatively simple, and indeed I have a program that does exactly this job; the only real complexity is handling tabs (where you have to work out what the next tab stop is in order to correctly track the width of the line).

(You have to track the width of a line because you need to know when the line you're printing spills over to a second physical line despite the lack of a newline character.)
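As a rough illustration of the simple case, here is a minimal Python sketch of the width tracking involved. It assumes the old world where every character is one column wide except for tabs; the function names, the 8-column tab stops, and the 80-column default are illustrative assumptions, not anything taken from an existing program.

```python
def display_width(line, tabstop=8):
    """Columns used by `line`, assuming one column per character except tab."""
    col = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of `tabstop`.
            col += tabstop - (col % tabstop)
        else:
            col += 1
    return col


def physical_lines(line, columns=80, tabstop=8):
    """How many terminal rows `line` occupies once it wraps (at least one)."""
    width = display_width(line, tabstop)
    # Ceiling division; an empty line still takes up one row.
    return max(1, -(-width // columns))
```

A pager built on this would sum physical_lines() over the lines it prints and pause once the total reaches the terminal height. Everything after this point in the entry is about why display_width() stops being this easy.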

The first problem that terminal programs give a pagination program in the non-Latin world is over-sized characters. Latin text has relatively simple character shapes that are easy to read at modest font sizes, but other scripts and other sorts of characters have much more complex shapes that are hard to read if you squeeze them into the same monospaced grid block as a Latin character at a given point size. So some of the time, some terminal programs don't; they render the characters larger. Which characters are rendered larger? It depends on the terminal program and, I think, the font (and certainly the character; see Let's Stop Ascribing Meaning to Code Points).

The second problem is emoji. Emoji are one of the common cases of Unicode characters combining together, or more exactly I should say Unicode code points. Famously, many flag emoji are actually two emoji put together. For example, the Canadian flag emoji, 🇨🇦, is the 🇨 emoji followed by the 🇦 emoji (CA is the ISO country code for Canada). Whether this renders as a Canadian flag or as C followed by A depends on whether the terminal program and the font rendering environment know about this specific combination and are willing to turn it into a flag.

(There can be multiple reasons for not rendering an emoji flag as a flag, including that sometimes flags are new and sometimes flags are politically charged, for example 🇹🇼 or 🇵🇸. I would not be surprised if in some environments, one or both of those flags is not rendered as a flag, but as two emoji characters.)

As a practical example, in my X environment the only terminal program that combined 🇨 and 🇦 to make a Canadian flag is konsole. None of xterm, rxvt-unicode, or to my surprise gnome-terminal did (gnome-terminal does render many emoji, but apparently not flags or even emoji characters). What this means is that how many displayed characters this sequence takes up depends on the terminal program. A pagination program that assumes it's some fixed width is guaranteed to be wrong some of the time.

(Emoji rendering can also be an example of wider character rendering. In konsole, 🇨 is as wide as two regular Latin characters; in the other three terminal programs, it's single width. My graphical GNU Emacs also combines emoji characters to make flags and displays emoji characters as double width in a monospace environment. In gnome-terminal, emoji that are displayed properly (as emoji) are typically double width.)
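To make the 'two code points' part concrete, here is a small Python sketch that builds a flag sequence from an ISO country code and shows what is actually in the string. The helper name flag_for() is made up for illustration. Nothing in the string itself says whether a terminal will draw one flag or two separate letters.

```python
import unicodedata

# U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; the rest of the
# alphabet follows in order.
REGIONAL_INDICATOR_A = 0x1F1E6


def flag_for(country_code):
    """Map a two-letter ISO country code to its pair of regional indicators."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


canada = flag_for("CA")
print(len(canada))  # 2 code points, however the terminal chooses to draw them
for ch in canada:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```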

If you want your pagination program to strictly print output without manipulating things through cursor positioning, I'm not sure what a good way to handle this is. From where I sit, it certainly looks like a program that satisfied my simple sounding wish would have to hard code knowledge of how various terminal programs I use render emoji.

(I'd be remiss if I didn't point to It's Not Wrong that "🤦🏼‍♂️".length == 7.)
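For the curious, the 7 there is a count of UTF-16 code units. A short Python sketch (with a made-up helper name) shows the other ways the same single displayed emoji can be measured, none of which is its width on screen:

```python
# Facepalm + skin tone modifier + zero width joiner + male sign
# + variation selector 16, written with escapes to be explicit.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def length_report(s):
    """Count a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(s),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


print(length_report(facepalm))
# {'code_points': 5, 'utf16_units': 7, 'utf8_bytes': 17}
```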

PS: The 'less' pager seems to be able to cope with this, so it's possible in general if I'm willing to give up my quixotic wish for a 'cat with pagination' instead of something that clears and overwrites the screen, ruining my scrollback.

PPS: Arguably the correct place for this sort of pagination is in the terminal program itself, but xterm doesn't do that and I'm very attached to xterm. Also, I'm not sure if any existing good X terminal program does this today.

Our servers seem to have surprisingly low power consumption

By: cks

For reasons beyond the scope of this entry, today I was curious how much power our servers were using. If you have sufficiently fancy PDUs, I think you can get per-outlet measurements that are trustworthy, but we don't have such PDUs. Instead, the best I can do is look at the information that some of our servers report through IPMI. I'm not sure how accurate the IPMI information is, but at least some of the numbers seem plausible, so I'll assume that it's not massively off. The results surprise me with how low they were.

Most of our servers are basic 1U servers with a pair of SATA SSDs. They're typically not very active, and the not particularly active servers that report IPMI power usage are reporting anything from 22 watts to 26 watts right now. A few servers also have a pair of HDDs, and one has four SSDs and is typically active; all of them report 44 watts right now. Our NFS fileservers currently have 24 SSDs in each and are reporting a range of power usage from 47 watts to 62 watts. One interesting case is our perimeter firewall, where we have the active server and an identical running hot spare, both 1U servers. The active server is at 52 watts (and handling about 40 Mbytes/sec of network traffic); the hot spare is idle at 26 watts.

We have a few compute servers that report power information through their IPMI, and the ones that are currently active are reporting the highest power usage. However, even this isn't spectacularly high. The most power hungry machine is a GPU SLURM node, where its IPMI is reporting 330 watts of total power while its GPU is claiming about 166 watts of power draw (its CPU is busy too).

Some of our servers don't report power usage in their IPMI sensor data but will report it through the web interface of their BMCs. I checked two of them, both powerful 1U servers that are essentially identical to each other. The one that is our primary login server is reporting an average of 149 watts and a peak of 195 watts over the past hour. The SLURM compute node, which is currently in active use, averaged 501 watts in the past hour with a peak slightly higher (when not in use it appears to idle around 107 watts).

One of the reasons these numbers surprise me is that many of the idle numbers are lower than my desktops. I have a mental image of servers as not being particularly low power or power efficient, just as they're not particularly quiet, but that seems to be wrong. I suppose it's not too odd that people making 1U servers care about power usage and power density, since that's definitely a concern in general in data centers; it just hadn't really occurred to me before.

(Our own use of 1U servers is not particularly constrained by power and cooling.)

Getting C code navigation even for Debian (or Ubuntu) packages

By: cks

Every so often, I want (or need) to make modifications to programs in an Ubuntu package, and often the programs are written in C (and these days I'm using dgit to manipulate the package). One of my challenges when I do this is that I generally don't start out knowing where and how to change the code to do what I want; instead, I have to navigate around an unfamiliar code base and work out enough of its structure to find the specific bit of code I need to change.

These days, the dominant way to get smart code navigation and other code knowledge things is through LSP servers and clients. A variety of modern and semi-modern languages have LSP servers that you can immediately use in your editor of choice and then navigate around random code bases with handy features like 'find definition' and 'find references' (for example, Go, Python, and Rust). Unfortunately, C isn't such a language. In the general case, understanding C code requires knowing how it's compiled, and that means you often have to tell C LSP servers this information. Well, specifically you have to tell this stuff to clangd, the dominant LSP server for C and C++.

(There's also ccls, which may work out part of this information on its own, but it seems to be less popular and I have no experience with it.)

Fortunately for people like me, there is a simple way to gather this compilation information even if the program's build system doesn't do it for you, and that's Bear (which is available as a standard Ubuntu package for extra convenience). Bear operates as a front-end on however you normally build your program; you build your program (or collection of programs) with 'bear -- <build command>', and Bear monitors compiler execution and records everything. This is slower than a normal build (sometimes significantly so), but you get a compilation database out of it and then you can use LSP tooling to jump around the source code.

(My understanding is that gcc, clang, and so on can generate this compilation information if they're asked, and modern build systems often ask them to do so, but an old fashioned build system using things like 'make' won't include the magic compiler options necessary. Possibly you can include them yourself by hand, but Bear takes care of the work for you.)

Somewhat to my surprise, Bear not only works with programs built by 'make', it also works when you build Debian or Ubuntu packages under Bear with 'bear -- dpkg-buildpackage -uc -b'. If you're building a substantial package (such as Dovecot), you're definitely going to notice the slowdown, but you do get LSP based code intelligence out of it (and you only have to do this once, not every time you change the code).

(Under some circumstances you may have to edit the generated compile_commands.json to take out gcc options that clang doesn't support, but fortunately the JSON file is in a human friendly format where each compiler option is on its own line. Possibly there's a way to manipulate the Debian/Ubuntu package build process to not use such options in the first place.)
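If hand editing gets tedious, the filtering can be scripted. This is a minimal Python sketch, assuming the standard JSON compilation database layout where each entry has either an 'arguments' list or a 'command' string. The option name used in the usage note is a placeholder, not a real gcc flag; you'd substitute whatever options clangd actually complains about for your package.

```python
import json
import shlex


def strip_options(entries, unwanted):
    """Return a copy of the database entries without the unwanted options."""
    cleaned = []
    for entry in entries:
        entry = dict(entry)  # shallow copy so the input is left alone
        if "arguments" in entry:
            entry["arguments"] = [
                arg for arg in entry["arguments"] if arg not in unwanted
            ]
        elif "command" in entry:
            args = [a for a in shlex.split(entry["command"]) if a not in unwanted]
            entry["command"] = shlex.join(args)
        cleaned.append(entry)
    return cleaned


def rewrite_database(path, unwanted):
    """Filter compile_commands.json in place, keeping it human readable."""
    with open(path) as f:
        entries = json.load(f)
    with open(path, "w") as f:
        json.dump(strip_options(entries, unwanted), f, indent=2)
        f.write("\n")
```

Usage would be something like rewrite_database("compile_commands.json", {"-fsome-gcc-only-flag"}), run once after the Bear build.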

Building Debian and Ubuntu packages contaminates your source directory, so once you've run a build under Bear to generate the compile_commands.json file, you need to move the file to safety and then reset your source directory somehow. If you're using dgit (which I very much think you should be), I believe this can be done with a variant of the standard dgit source directory reset instructions:

git clean -xdf -e compile_commands.json
git reset --hard

The process I suspect I'm going to follow in future dgit modifications of Ubuntu packages is to set up the package with dgit, build it once under Bear in unmodified state, rm the generated .deb and .ddeb files, and then start poking around the source code with LSP intelligence to find where I need to make my modifications (and then commit them and do a dgit build as usual).

(This elaborates on some Fediverse posts.)

I've finally ported DWiki from Python 2 to Python 3

By: cks

DWiki is the pile of code that underlies Wandering Thoughts. It started out many years ago as a Python 2 program (partly because there was no Python 3 at the time), and it stayed that way for a long time, making it the most significant and by far the most substantial Python 2 program I still cared deeply about. Years ago I said I'd port it to Python 3 someday and somewhat to my surprise, that day has now come (well, it came yesterday).

The direct trigger was discovering that Python 3.13 had dropped 2to3, which made me feel that I should run 2to3 over DWiki's current Python 2 code base while I still could (I had an old conversion from many years ago, but that converted code base was very out of date). One thing led to another, as it often does with me, and I wound up doing a full port and then putting it into production, which is to say serving this blog. I suspect that part of me just felt it was time.

(The 2to3 removal is in the Python 3.13 release notes, and it comes after 2to3 and its infrastructure were deprecated in 3.11 for reasonable reasons.)

As I expected years ago, the stuff that 2to3 could handle was the easy part. Much of the actual work of the port was sorting out the boundary between Unicode strings and byte strings in a Python 3 world. Some of this would have been easier if I'd found PEP 3333 earlier and followed it in my own discount WSGI implementation, but a bunch of it I had to find the hard way, by trying things and having them blow up, sometimes in production.

(I wound up in the same place as PEP 3333 just from the inherent requirements of the web. For example, the HTTP Content-Length is in octets, so if you're using it to read a POST body, the object you're reading from has to be providing bytes. And it turns out that you can't write HTTP headers to a text mode file object because that will turn \r\n sequences into \n, which will make things unhappy with you.)
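The Content-Length point is easy to see in a small Python sketch. The helper name is made up, and the environ dictionary here is a hand-built stand-in for what a real WSGI server would provide; the key observation is that the character count and the octet count differ as soon as the body isn't pure ASCII.

```python
import io


def read_post_body(environ):
    """Read exactly CONTENT_LENGTH octets from the (binary) wsgi.input."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return environ["wsgi.input"].read(length)


# 'café=1' is 6 characters but 7 octets in UTF-8.
body = "caf\u00e9=1".encode("utf-8")
environ = {
    "CONTENT_LENGTH": str(len(body)),
    # Extra trailing data stands in for whatever else is on the connection.
    "wsgi.input": io.BytesIO(body + b"not part of this request"),
}
print(read_post_body(environ) == body)
```

If wsgi.input were a text mode object, read(7) would count characters instead and read one character too many.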

Not all of the changes were at the IO boundaries of DWiki (and the IO boundaries themselves weren't always simple or obvious). Python 3's handling of cryptographic hashes requires bytes, which rippled through to several places where I use them in DWiki (and the hmac API changed a bit, which wasn't fixed up by 2to3). Python 3 also really wants your regular expressions to be in r"..." strings, because otherwise it will complain about you using regular expression backslash escapes like '\s' that aren't string backslash escapes.
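A minimal Python 3 sketch of these non-IO changes, under some assumptions: the function names are invented, and SHA-256 is used purely as an example since the entry doesn't say which hashes DWiki uses.

```python
import hashlib
import hmac
import re


def page_digest(text):
    """Hash a str by encoding it first; hashlib refuses un-encoded strings."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sign(key, message):
    """HMAC over a str message; key is bytes and digestmod must be given."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# A raw string, so that \s is only ever seen by the regular expression
# engine and not treated as a (bogus) string escape.
WHITESPACE = re.compile(r"\s+")
```

Calling hashlib.sha256("some str") directly raises TypeError in Python 3, which is the sort of failure that only shows up when the code path is actually exercised.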

I don't have a DWiki test suite, but long ago I built scripts that would crawl and collect all real pages from an old and a new version of DWiki. I originally used these to check for changes in how pages got rendered when I changed the wikitext processing code (often I wanted no changes), but this time around I was able to use them to verify that the Python 3 DWiki could at least render all existing pages into essentially the same thing (there were \r\n sequences that turned into \n instead of being passed through, but that's probably a good change). But that still left things like writing comments, and also the two sets of code involved in how DWiki runs in production instead of in testing.

I probably wouldn't have tried to do this if I hadn't had a relatively substantial block of free time. It took me more or less all day yesterday to get up to the current production state, with a lot of back and forth, experimentation, and tweaking. There was a lot of code and problem context that I might not have retained if I'd had to slice my work up into half hour or hour long chunks of work, and once I started running the Python 3 version as the live server I was relatively committed to fixing any problems that came up on the spot.

(I could have rolled back to the Python 2 version but it would have been at least a bit awkward for various reasons, including a pickle format change.)

The current Python 3 DWiki code still needs additional cleanups, partly to undo unnecessary 2to3 changes like changing 'for ... in dct.keys():' to 'for ... in list(dct.keys()):'. But it has been running stably for, well, not quite 24 hours yet, but through at least a decent sample of the typical traffic that Wandering Thoughts gets. Probably there aren't any remaining Unicode conversion issues, although re-reading one of my old entries makes me feel I should audit every use of EnvironmentError when dealing with files.

(2to3 appears to always put list() around things that changed to return iterators or views instead of lists in Python 3. Sometimes this is important, but it's not necessary if the result is only being used in a 'for' that doesn't modify the underlying dictionary.)
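A small Python sketch of when the list() matters and when it doesn't; the function names are illustrative only.

```python
def drop_empty(dct):
    """Deleting while looping needs the snapshot that list() provides."""
    for key in list(dct.keys()):
        if not dct[key]:
            del dct[key]
    return dct


def drop_empty_unsafe(dct):
    """Same loop without list(); raises RuntimeError once it deletes a key."""
    for key in dct.keys():
        if not dct[key]:
            del dct[key]
    return dct


def count_keys(dct):
    """A read-only loop; wrapping dct.keys() in list() would only add a copy."""
    n = 0
    for _ in dct.keys():
        n += 1
    return n
```

Only loops shaped like drop_empty() need to keep what 2to3 added; the count_keys() style can go back to plain iteration.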

I also want to think about what Unicode error handling to use in various circumstances, although these days I'm inclined to be draconian. For example, if someone tries to write a comment with invalid UTF-8, I probably don't want to backslash escape the invalid bits, so the default 'replace' handling is fine (in my case, this comes from using urllib to decode POST bodies). And currently all of the existing content in Wandering Thoughts is UTF-8 clean, at least as far as I can tell.
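For reference, here is what the common error handlers do with one invalid byte, as a Python sketch. It assumes POST bodies are decoded with urllib.parse.parse_qs(), which is a guess at 'using urllib'; the 'comment' field name is made up.

```python
from urllib.parse import parse_qs

raw = b"caf\xe9"  # Latin-1 bytes for 'café', which are not valid UTF-8

# 'replace' substitutes U+FFFD for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# 'backslashreplace' keeps the byte visible as a literal \xe9 escape.
escaped = raw.decode("utf-8", errors="backslashreplace")

# parse_qs() decodes percent-escapes as UTF-8 with errors='replace' by default.
fields = parse_qs("comment=caf%E9")


def strict_fields(query):
    """The draconian option: raise UnicodeDecodeError on invalid UTF-8."""
    return parse_qs(query, errors="strict")
```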

(The whole Unicode and bytes issue is something where types would be handy (or an option to turn off all of Python 3's implicit conversions), but adding typing to DWiki's 'originated in Python 2' codebase is both a lot of work and also extremely messy, because it uses things in ways that mypy is already unhappy about.)

PS: The Github version of DWiki is now significantly out of date and I'm probably not going to update it for reasons that don't fit in the margins of this entry.

Sidebar: The Python 3 WSGI rules in a nutshell

To summarize PEP 3333 in my own way, HTTP headers are Unicode strings, ie str, but must be limited to iso-8859-1 characters (at least when you write them). The wsgi.input file object produces bytes and your HTTP response body is also bytes. In a CGI environment, you read from sys.stdin.buffer and your WSGI CGI implementation writes to sys.stdout.buffer (including the headers, after encoding to iso-8859-1).

If your WSGI implementation is talking to a network socket, you can and must leave the network socket as a binary file object. In my case, this generally means wsgi.input is created with 'os.fdopen(fd, "rb")'.
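Put together, the output side of a CGI style WSGI runner looks roughly like the following Python sketch. The function name is hypothetical and this is not DWiki's actual code; it only illustrates the rules above (str headers encoded as iso-8859-1, bytes body, binary output object).

```python
def write_cgi_response(out, status, headers, body_chunks):
    """Write a WSGI style response to a binary file object.

    status and headers are str and get encoded as iso-8859-1;
    body_chunks is an iterable of bytes and is written as-is.
    """
    lines = [f"Status: {status}"]
    lines.extend(f"{name}: {value}" for name, value in headers)
    head = "\r\n".join(lines) + "\r\n\r\n"
    # Raises UnicodeEncodeError if a header strays outside iso-8859-1.
    out.write(head.encode("iso-8859-1"))
    for chunk in body_chunks:
        out.write(chunk)
    out.flush()
```

In a CGI environment this would be called with sys.stdout.buffer as the output object, for example write_cgi_response(sys.stdout.buffer, "200 OK", [("Content-Type", "text/plain")], [b"hello\n"]).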

A GNU Emacs learning experience with text-mode hooks

By: cks

For a while, one of my little irritations with my Emacs environment was that sometimes, when I fired up Emacs to edit some code and then quit out of it, Emacs would complain that there was still an ispell process running and ask me what to do with it. This was especially mysterious to me as I don't normally use flyspell-prog-mode (I find it too irritating for general use). Recently I got sufficiently irritated to use a combination of the ELisp debugger and strategic '(message ...)' usage to track this down, which initially looked like one issue and actually turned out to be another one that I discovered only as part of writing this entry.

One of the major modes in GNU Emacs is text-mode. I have a text-mode hook, probably like many people, and one of the things it does is turn on flyspell-mode in that buffer, which causes flyspell to invoke ispell and thus start an ispell process. It's also my custom from long ago to set the default major mode of buffers to text-mode (the out of the box default is fundamental-mode). If I'm editing something and it's not program source code, it's almost always text and having to say 'M-x text-mode' all the time is the kind of annoyance GNU Emacs is designed to erase.

When I used debug-on-entry to find out where the ispell process was starting from, it pointed to my text-mode hook. At first I theorized that code buffers were starting out in the default mode (and thus triggering my text-mode hook) before being switched to their proper mode, but strategic use of '(message ...)' in my text-mode hook revealed that it was actually being triggered on a scratch buffer for Flycheck. So I switched my theory to Flycheck creating scratch buffers without specifying their mode, so they wound up in the default major mode, which for normal setups is fundamental-mode but for me is text-mode, triggering my text-mode hook and starting ispell.

Except I looked at the Flycheck source and this is wrong. Here, let me quote a small bit:

(define-derived-mode flycheck-error-message-mode text-mode
  "Flycheck error messages"
  "Major mode for extended error messages.")

Flycheck explicitly derives the mode for some of its scratch buffers from text-mode, which of course means that they run text-mode hooks. This is a perfectly reasonable thing to do in general, since text-mode is the appropriate mode in general for, well, text, but it leads me to today's GNU Emacs learning experience which is that text-mode hooks may run in surprising buffers, not just text files I'm visiting and editing. I shouldn't put anything in my text-mode hook that I want only for real text files that I'm editing, at least not without guarding it somehow. One of those things is flyspell, not just because of its side effects of starting an ispell process but also because I don't particularly want flyspell to mark 'misspelled' words in, for example, Flycheck diagnostics.

(Flyspell's markings also get in the way of mouse based copy and paste.)

My solution was to guard what my text-mode hook did so that it only happens in buffers associated with a file:

(defun cks/text-mode-hook ()
  (when buffer-file-name
     ....))

It's possible that some day I'll want my text-mode setup in an anonymous buffer, but until that day I'll leave such scratch buffers alone. I could probably do a bit better by looking for buffer names that start and end with * (this is the usual GNU Emacs naming convention for explicit scratch buffers), but that would take a bit more work.

(Although not much more, now that I've found string-prefix-p and string-suffix-p.)

Going from a ZFS object ID to its path the easier way

By: cks

It's not uncommon that people using filesystems want to map from an internal object number (an 'inode number' for normal filesystems, an object id or object number in ZFS) to a path. ZFS itself wants to do this efficiently for things like 'zfs diff' and the 'zpool status' report on what files are damaged. To help with this, ZFS stores the likely parent object for every normal filesystem object. If you use zdb to do a sufficiently verbose dump of any particular object, you can find this as the 'parent' attribute.

If you want to do this mapping yourself, you can use zdb or something like it to manually follow these 'parent' pointers (and also look up the name of everything in its parent directory). However, that would require high privileges, and ZFS doesn't want to make things like 'zpool status' require that, so the kernel and libzfs expose an API for this. In libzfs, this is 'zpool_obj_to_path()', which uses the kernel's ZFS_IOC_OBJ_TO_PATH ioctl(). Because it's intended for internal usage, this API doesn't take a pool and filesystem name (in addition to the object ID); instead it takes a pool handle and a dataset ID. It's up to callers, such as 'zpool status', to do the mapping.

(One reason you might want to go from an inode number (object id) to a path is that various things only give you inode numbers, such as NFS v4 locks on Linux NFS servers. Or you might have NFS activity tracing software that can only reliably report the inode number of files and directories that people are using heavily.)

In OpenZFS, years ago someone wrote a command that used this libzfs API to do all the work for us, zfs_ids_to_path (also). Like the API, this requires the dataset ID. Helpfully we don't need to use 'zdb' to get this; instead we can ask 'zfs list' for it. This gives us:

# zfs list -o name,objsetid ssddata/homes
NAME           OBJSETID
ssddata/homes       431
# zfs_ids_to_path ssddata 431 1920047
/homes/cks/.rcenv

Illumos and FreeBSD don't ship a version of zfs_ids_to_path, but the source code is sufficiently small and self contained that you could probably compile it yourself.

(Although my test FreeBSD 15 instance doesn't have the libshare.h header that's needed by libzfs.h, presumably through a packaging mistake.)

If you needed to do this frequently and found it annoying to look up the dataset ID every time, I believe that it wouldn't be too hard to work out and write the code you needed in order to go from a name like 'ssddata/homes' to a pool object and a dataset ID. Sorting through, for example, the source code for 'zfs list' might take some work (there's a whole collection of callbacks and so on), but it's doable (and perhaps someday people will write a slightly handier version).

(The lazy person can write a front end script today that combines 'zfs list' with zfs_ids_to_path.)
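A sketch of such a front end in Python might look like this. The function names are invented. It assumes that 'zfs list -H -o objsetid' prints just the dataset ID and that zfs_ids_to_path takes the pool name followed by the dataset ID and the object ID, which is my reading of its manual page; check both against your OpenZFS version.

```python
import subprocess


def objsetid_command(dataset):
    """Command to print only the dataset's objsetid (-H drops the header)."""
    return ["zfs", "list", "-H", "-o", "objsetid", dataset]


def path_command(dataset, objsetid, object_id):
    """Command to map (pool, dataset ID, object ID) to a path."""
    pool = dataset.split("/", 1)[0]
    return ["zfs_ids_to_path", pool, str(objsetid), str(object_id)]


def object_to_path(dataset, object_id, run=subprocess.check_output):
    """Look up the dataset ID, then ask zfs_ids_to_path for the path.

    `run` is replaceable so the logic can be exercised without ZFS.
    """
    objsetid = run(objsetid_command(dataset), text=True).strip()
    return run(path_command(dataset, objsetid, object_id), text=True).strip()
```

With this, object_to_path("ssddata/homes", 1920047) would do both steps from the example above in one go (and would need the same privileges).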

In praise of the Linux kernel netconsole (in the right circumstances)

By: cks

The Linux kernel's netconsole is a kernel module that will "log kernel printk messages over UDP" to a remote system, which makes it another form of kernel (message) console. These days it can be activated either on boot or after boot, and in the past I've had mixed views of it. However, I recently had a nice experience with netconsole that's left me better disposed toward it in specific situations.

A while back, my home desktop started locking up every once in a while. Several years ago my home desktop had a somewhat similar problem that was due to hardware issues, but the lockups this time were different, in that the machine would lock up for a bit and then reboot on its own. Local logs showed nothing, but I happen to have another machine sitting around so I thought I might as well try netconsole again. These days netconsole can be enabled on the fly:

modprobe netconsole
cd /sys/kernel/config/netconsole
mkdir heedra
cd heedra
echo em0 >dev_name
echo 192.168.X.Y >remote_ip
echo 1 >enabled

(This other machine is called heedra for obscure reasons.)

On the other machine I ran a simple script to capture output inside a screen session:

#!/bin/sh
while :; do
   nc --recv-only -u -l 6666 |
      tee $HOME/work/h-logs/netconsole
done

(The advantage of --recv-only is that nc won't complain if I hit CR a few times in the screen session to create blank lines, so new messages are more obvious.)
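If nc isn't available on the receiving machine, the same job can be sketched in a few lines of Python. This is an illustrative alternative, not what was actually used; the function names are made up, and port 6666 is simply the port the nc command above listens on.

```python
import socket


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket to receive netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def log_datagrams(sock, logfile, limit=None):
    """Append each received datagram to a binary log file.

    Runs forever unless `limit` is given; returns the datagram count.
    """
    seen = 0
    while limit is None or seen < limit:
        data, _addr = sock.recvfrom(65535)
        logfile.write(data)
        # Flush every message; the last ones before a lockup matter most.
        logfile.flush()
        seen += 1
    return seen
```

A typical use would open the log file in append mode ("ab"), so that restarting the listener doesn't discard earlier messages, and then call log_datagrams(open_listener(), logfile).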

After a while, my home desktop locked up again and rebooted soon afterward. When I checked the netconsole log file on the other machine, I discovered that I had actually captured kernel log messages, and reasonably useful ones at that.

The kernel logs revealed that this appears to be a kernel 'soft lockup', where all cores had gone to 100% system usage during what appears to be TLB flushes or cross-core kernel communication. In several of the kernel stack backtraces, bpf_trace_run4 appears, so I suspect that there's an uncommon eBPF locking race or issue that's infrequently tickled by the eBPF metrics gathering programs I normally run on my desktop.

(It's probably not from the eBPF programs systemd uses for network access control, since those are used widely.)

Capturing these kernel messages doesn't give me a solution, but at least it gives me a way forward if the lockups get too frequent and annoying (I can try disabling my eBPF metrics collectors). And I couldn't have gotten these messages with anything else except a serial console, which I don't have available on my home desktop and anyway would have needed a second machine in physical proximity (which is awkward in my home setup).

My understanding is that netconsole isn't quite as reliable as a serial console for getting last gasp kernel panic messages out, since you need more kernel pieces to still be working to transmit network packets. But it's more reliable than anything short of a serial console, and serial consoles are generally in short supply on modern desktops and desktop-like things (including hand-built SLURM nodes). For one off, small scale use my listening script would be fine, although if we needed to use it on a larger scale, we'd need some infrastructure to collect netconsole logs from multiple machines.

(Some suggestions for that are in the comments on my earlier entry.)

A code (reformatting) conundrum in Python, and heuristics

By: cks

Suppose that you are a Python code reformatter, and someone hands you the following snippet of Python code to act on:

if something:
    blah blah blah
    [...]
    final-line
some-statement

[... more statements ...]

Here's the question: should you reindent 'some-statement' so that it's part of the 'if' block?

One answer is that you absolutely should not. The current code is valid Python code, and you are a reformatter for style, not to correct (presumed) errors. Since this is valid code, you should re-flow line wrapping and so on within blocks, but not change what block valid code is part of.

Another answer is that maybe the person writing this code made a mistake. Style wise, it's common to add a blank line between the end of an indented block and following code; the lack of a blank line suggests that a mistake was made. So maybe you should reindent 'some-statement' to where it properly should be, especially if you have a style rule that says that there should be blank lines in this sort of situation.

(Of course, you could also opt to add the blank line that your style guide says should be there and not change what block a statement goes in. But we're in heuristics territory here.)

If you're a heuristic reformatter, your opinion may change depending on what the 'final-line' is. For instance, if the final statement in the if block is 'return', it is pretty obvious that there's not supposed to be anything after it. Anything after it is dead code, which would be a different and less likely error. So you should leave 'some-statement' alone and it's valid style to not have a blank line between the last statement in the 'if' block and 'some-statement'.

Python doesn't have all that many statements that definitively end blocks, but it does have some that are extremely suggestive. Consider this pattern of code:

try:
   something
except SomeError:
   pass
some-statement

The pass statement is a no-op, not something that affects control flow, so it's perfectly valid to have statements after a 'pass'; they will be executed normally. At the same time it's commonly used this way when there's not going to be anything after it, so a heuristic Python code formatter that moved 'some-statement' up into the 'except' would make lots of people unhappy.

One such heuristic Python code reformatter is the one used in GNU Emacs in both its conventional python-mode (which 'parses' Python code with regular expressions) and python-ts-mode (which fully parses Python code with a tree-sitter grammar). I'm not sure if these are the same reformatters, but they have the same effects. This particular reformatter heuristic turns out to be the root cause of my Python code reformatting glitches.

(In fact the GNU Emacs Python code reformatting appears to take a 'pass' as a hard end of block and will out-dent anything after it, regardless of what this does to control flow. If you add a 'pass' in the middle of a function and reflow with M-q, GNU Emacs will happily make all statements after it module level ones.)

I experimented with some stand-alone Python code formatters I had sitting around, and none of them behaved this way, which I guess isn't surprising (I tried black, ruff, and yapf). Since the normal pylsp Python LSP server relies on one of them for code reformatting (which one depends on your configuration), this also means LSP-driven code reformatting won't do this. It's possible that only GNU Emacs has this (arguably incorrect) heuristic reformatting.

(I was led to discover all of this by a comment ae left on my earlier entry about Python 2 LSP problems.)

PS: There are other heuristic decisions you can make depending on what 'some-statement' is and where it currently is in the overall block. For example, if 'some-statement' is the last statement in a function and is a 'return', then it's almost certainly correct in its current place. But these heuristics multiply endlessly.

Moving from lsp-mode in GNU Emacs to Eglot

By: cks

Recently, I decided to take my long standing, perfectly good GNU Emacs lsp-mode setup and completely replace it with Eglot, the now built in GNU Emacs LSP solution. At one level I didn't have any particularly strong specific reason to switch; I started by trying out Eglot after switching entirely to Corfu then just kept going to see how far I could get towards a good Eglot environment. The result is perfectly good and some things work better (Eglot will do 'complete to common prefix' in Go and Python modes) but it took more than a little bit of yak shaving to get here.

At another level, lsp-mode with lsp-ui is what I'd call a busy interface, with all sorts of things going on, and these days I've decided that I want a quieter LSP experience. Eglot is famously more minimal and quiet than lsp-mode, although you can and should augment Eglot's interface with additional packages. I could have tamed lsp-ui more with additional settings and fiddling, but switching to Eglot took care of all of that all at once, with other benefits. Overall I'm happy to have switched, although it was more work than I was entirely expecting.

(Should you switch? I don't know, but if you stick with GNU Emacs and use it in the modern way, I think you will sooner or later.)

As I've described in an earlier entry, Eglot's minimalism is because it's a modern GNU Emacs package that expects you to fill in features with other packages that interact with it through standard Emacs Lisp APIs. This means that for a good (but non-busy) LSP experience in Eglot, I needed to hook up a variety of additional things.

  • Corfu just worked for completion; my general Corfu settings were fine.
  • To get a good cross reference setup where I could get lsp-ui like previews of references to something, I needed to connect consult to the general Emacs xref system by setting 'xref-show-xrefs-function' to 'consult-xref'.

  • I went back and forth between Flycheck with flycheck-eglot and Flymake before eventually settling on Flycheck. Flymake is better integrated with Eglot (in a way that I notice a bit) but I can make Flycheck work well enough and I prefer it in general. Eglot normally automatically puts buffers into flymake-mode, so to shut that off I do (in my use-package declaration for Eglot):

    :config
    (add-to-list 'eglot-stay-out-of 'flymake)
    

    And then to automatically activate flycheck-eglot:

    :hook
    (eglot-managed-mode . (lambda () (if (eglot-managed-p) (flycheck-eglot-mode 1))))
    

    (In theory flycheck-eglot has a global mode, in practice it didn't work out reliably for me and the brute force of a hook was the easiest approach.)

Eglot has some configuration settings that you'll want to experiment with. I found that I wanted 'eglot-extend-to-xref' to be 't', partly because that makes M-? find other uses in my own project of whatever external thing I've jumped to.

Eglot doesn't ship with any key bindings and I definitely needed some, partly to make LSP code actions more accessible. Since it's early in my Eglot usage, my key bindings are probably going to change, but my current set are:

("C-c r" . eglot-rename)
("C-c o" . eglot-code-action-organize-imports)
("C-c h" . eldoc)
("C-c a" . eglot-code-actions)
("C-c q" . eglot-code-action-quickfix)
("C-M-<mouse-2>" . eglot-code-actions-at-mouse)

The mouse binding exists because of one way flycheck-eglot isn't as fully hooked into Eglot as I'd wish, but it turns out to be generally convenient for access to LSP 'code actions'.

(I have deliberately not bound eglot-format to anything. In Go, the one language where I would trust LSP-driven code formatting, I already have go-mode's gofmt command that I'm accustomed to using. I also don't expect to use the LSP 'organize imports' often, but maybe in Python.)

This is in addition to key bindings for other packages, such as Flymake, where in order to get nice navigation of Flymake reports, I needed to set up a key binding for consult-flymake along with a few others for Flymake functions. This became a somewhat unnecessary side trip when I went back to Flycheck, but since I built a working Flymake setup, I'm keeping it for any time when I want to use Flymake instead.

Looking back, I'd estimate that most of my work in switching from lsp-mode to Eglot wasn't in configuring Eglot, it was in configuring other packages. But to say it that way makes it sound more straightforward than it was. The actual process involved a lot of looking around for additional packages, trying things out, discovering things that didn't work for me, and so on (and some amount of backtracking, like my adventures with Flymake). To be fair, this is more or less what I went through with lsp-mode when I first set it up.

Eglot officially recommends that you start it by hand (cf), but I'm too lazy for that. Instead, as I did with lsp-mode, I arranged to start it automatically for local files in the relevant modes.

(use-package eglot
  :defer t
  :init
  (defun eglot-ensure-local-only ()
    "Enable Eglot only on local buffers."
    (unless (file-remote-p default-directory) (eglot-ensure)))
  :hook
  (python-mode . eglot-ensure-local-only)
  (go-mode . eglot-ensure-local-only)
  [...]

One potential limitation of eglot-ensure as compared to eglot is that if you have multiple LSP servers for a particular language (such as 'pylsp' and 'ruff' for Python), eglot-ensure just picks the default one while eglot offers you a choice. To change afterward, you need to shut down the current LSP server and invoke 'eglot'.

(There's a program to multiplex LSP servers (discussion) if I ever want to run several at once.)

LSP servers can offer you a profusion of 'code actions'. Sadly Eglot doesn't make these particularly conveniently accessible (but then neither did my lsp-mode setup), although I hacked around that with a mouse binding (mentioned above). At one level this is technically fair and correct, because LSP servers only offer you code actions when you ask (and code actions are specific to a particular spot). Eglot also doesn't give you any way of filtering what specific code actions it will show you out of a potentially long server list that you find mostly irrelevant (and some of which don't work), which sadly makes them rather 'busy' for both Go and Python.

Once I had a basic Eglot setup working, I had a fun time learning how to disable some checkers in pylsp, the Python LSP server I use, because my tastes are strongly against style-based linters in 'present all the time' diagnostics. Lsp-mode provides convenient controls to turn off, for example, diagnostics from the 'mccabe' complexity linter. With Eglot, I got to learn all about user specified workspace configuration, which is definitely the morally correct approach to this but which is much more complex. Here, let me show you:

(setq-default eglot-workspace-configuration
   '(:pylsp (:plugins (:mccabe (:enabled :json-false)
                       :pylint (:enabled :json-false)
                       :pylsp_mypy (:enabled :json-false)
                       :mypy (:enabled :json-false)
                       :pycodestyle (:enabled :json-false)))))

Yes, sometimes the mypy stuff is "pylsp_mypy" and sometimes it's just "mypy". This is an internal pylsp detail that Eglot makes you learn. Also, that 'setq-default' is load bearing; you can't use setq.

I find it unfortunate that Eglot doesn't have any convenient way to temporarily set LSP server parameters for a project. If you have specific settings, your life will be much easier if you put them in a correctly formatted .dir-locals.el file, which may look like this:

((nil
  . ((eglot-workspace-configuration
      . (:gopls (:analyses
                 (:unusedresult :json-false
                  :QF1012 :json-false
                  :fmtappendf :json-false)))))))

(As you can tell, what you need to set varies from LSP server to LSP server. Gopls for Go is completely different than pylsp. This is a directory local setting for me rather than a global one because they only mis-fire on some of my code.)

If you want to change these settings on the fly, Eglot has documentation on that but it's not fun to deal with. If you sometimes want to turn on mypy for your Python (LSP) code but not always, as I do, you'll get to use 'dir-locals-set-class-variables' to set up a new class, then use a function that looks like this:

(defun cks/mypy-enable ()
  "Set Python eglot workspace configuration to enable mypy."
  (interactive)
  (let ((server (eglot--current-server-or-lose)))
    (dir-locals-set-directory-class
     (project-root (eglot--project server))
     'cks-mypy-enabled)
    (eglot-signal-didChangeConfiguration server)))

That this elaborate process is required is an accurate reflection of reality. Eglot is running one LSP server (per language) across your entire 'project' (directory tree), and settings for that LSP apply to all files you're editing in the project, so it can't have any notion of file or buffer local LSP server settings; they have to be project wide. By extension, setting 'eglot-workspace-configuration' through conventional means is a bad idea; that makes it a buffer local variable, which does nothing useful and will only confuse you.

Sidebar: My journey with Flymake and Flycheck in Eglot

Eglot works better with Flymake than with Flycheck and flycheck-eglot, at least currently. Specifically, with Flymake, Emacs will put a button 2 popup menu on the note itself with any LSP server driven corrections (usually a 'quickfix' LSP code action), but with Flycheck, all you get is the error being marked and you have to look for and trigger LSP code actions in another way. I initially switched to Flymake because of this, but Flymake took me some effort to configure so that I liked it.

However, after switching from Flycheck to Flymake, I found that there were still some things that Flycheck did better and sometimes I wanted Flycheck instead. So I retained my Flycheck setup as well (with flycheck-eglot too), which was convenient when the flycheck-eglot author came up with a nice workaround for my issue.

There's stuff to use Flycheck checkers in Flymake but I haven't done much experimentation with it, although I installed the package and set up some support infrastructure. My impression is that Flycheck has a larger collection of checkers than Flymake does and it's easier to shuffle among them. In theory a LSP server should make all other checkers unimportant, but in practice not so, especially if you want to sometimes invoke 'linter' level checkers.

I do sort of miss Flymake's 'show diagnostics at end of line' option, because it was a good way to make LSP diagnostics glaringly obvious, for times when I want that. There's flycheck-inline, but that only displays the current warning when you're on it, not all of the warnings when you scroll through. Sideline with sideline-flycheck has the same limitation but in my view a better UI experience.

Using a Python 3 LSP server with Python 2 code works (more or less)

By: cks

I still have a certain amount of Python 2 code, both for work and for personal projects (for example, DWiki, the wiki software behind this blog; it will be Python 3 someday, but not so far). For a long time, I've preferred to do any significant editing of Python code in GNU Emacs, my normal choice for a superintelligent editor, and for a while, I've used LSP based Python editing. There's a very old LSP server for Python 2, but all of the Python LSP servers you actually want to use are specifically for Python 3, and recently I hit a problem that made me turn off the Python 2 LSP server. Since then I've been editing my Python 2 code (cautiously) with pylsp (my normal Python 3 LSP server) and recently, a little bit with 'ruff'. Somewhat to my surprise, this has more or less worked.

My minimum standard for more or less working is that the LSP doesn't malfunction obviously or deluge me with errors and other diagnostics that aren't applicable because it's applying Python 3 rules to Python 2 code. It's even better if the LSP can actually identify real problems, such as misspelled variable names or function names, and recently I've had pylsp do that for some of my code (code that was never tested or used, or I'd have found the problems much earlier; possibly this is a sign that I should have deleted the code instead of fixing it).

(The LSP server does obviously complain about Python 2 code that's using 'print' as a statement, since it's invalid Python 3 syntax, but this is easily fixed even in Python 2 code, and I want to fix it in anything I intend to maintain.)

Much of my Python 2 code mixes spaces and tabs for indentation, and I expected this to upset the Python 3 LSP servers. To my surprise, it hasn't for either pylsp or ruff. Although I can't tell for sure, I think that they're even still correctly interpreting the result (in terms of indentation levels and so on), or at least they're not complaining about syntax errors or other things I'd expect them to if they had the wrong idea of the code's structure.

(Parts of GNU Emacs' python-mode do seem to get confused and (re)indent stuff incorrectly in my old school Python 2 code with 8 space indents and real tabs, which is somewhat surprising. But I guess very few people are editing Python 2 code with tabs in GNU Emacs these days.)

I've done some testing, and as far as I can tell LSP features like 'go to definition' and 'find references' more or less work as I'd expect them to in pylsp. In my (GNU Emacs) environment I think pylsp is limited to cross references within the set of Python files that the editor has loaded and told it about, but within that it's handy.

All of this makes it clearly worthwhile to me to keep LSP stuff enabled for my Python 2 code and to continue to use a superintelligent editor for editing it (although I still make quick changes to Python 2 code with vim). Which is good, because it's also easier and sometimes I'm lazy.

(Work still has Python 2 programs because those programs are load bearing and don't particularly need to change, at least most of the time. Could we port them to Python 3? Sure. Could we be sure they didn't have lurking Unicode issues or other problems? No, not necessarily. I did one Python 2 to Python 3 conversion for a load bearing set of programs, our suite of ZFS management tools (including our spares management system), and it was somewhat nerve wracking.)

PS: In my current GNU Emacs environment using Eglot, I don't think the LSP server is called when I hit TAB or M-q (based on the server events reported by eglot-events-buffer), so it's not going to be involved in any rerun of my problem with lsp-mode and the Python 2 LSP server. The LSP server will reindent and reflow the entire file (Emacs buffer), but I have to very specifically ask it to do that. If I have Eglot ask pylsp to reformat a function (selected as a region), pylsp sends back a null result, which I believe means 'no changes', so perhaps pylsp is throwing up its hands at my mixed tabs and spaces indentation.

Notes on using GNU Emacs' Tramp system in an unusual shell environment

By: cks

Tramp is a famous and often praised GNU Emacs system for editing remote files; lots of people will call it one of Emacs' compelling features. I've always had a decidedly different view of Tramp because Tramp has mostly not worked for me in opaque ways. I recently took another run at getting Tramp working (so I could have an informed opinion on why I'm not a fan), and in the process I've learned a bunch of things that I don't want to forget.

Although Tramp has a bunch of ways to get access to files remotely ('methods' in Tramp jargon), the dominant way is for Tramp to SSH in to the remote system and do stuff. In order to work with your remote shell, Tramp really wants your login on the remote system to have a conventional shell environment, ideally one that uses the Bourne shell (especially Bash).

(But see Remote shell setup hints and the Tramp FAQ.)

Specifically, Tramp has some requirements for its ssh method in a stock setup:

  • Your shell must have a relatively conventional shell prompt. Defining this is beyond the scope of this entry; see the definition of tramp-shell-prompt-pattern in tramp.el.
  • Your shell must accept and use backslash quoting of more or less arbitrary characters in command lines.
  • Your shell login can't pause to ask questions; it can produce some additional output but it needs to drop you to a shell prompt (that Tramp can recognize).

All of these are required because with the 'ssh' method, Tramp ssh's in and starts a full login session, then switches to /bin/sh (or the Tramp remote shell you've set) with some special things that will let it reliably recognize its own Tramp (shell) prompts. Using the 'sshx' method can bypass a lot of this because with it, Tramp directly runs /bin/sh without going through your remote login session. I believe sshx is also often going to be faster, at the cost of not establishing all of the environment variables and so on that your login session would (including your remote shell's normal $PATH).

If your login shell environment doesn't match all of these you're going to have a varying amount of problems, especially with the 'ssh' method. If you have an unconventional prompt, you can sort of fix it, but a shell with different quoting rules will be painful. Tramp has some mechanisms to deal with additional questions but my impression is that they're at least a slog (see parts of Remote shell setup).

(Since I went through this, to deal with quoting issues you need to redefine tramp-end-of-output to something that doesn't require quoting that your shell doesn't support, and then make sure that your tramp-shell-prompt-pattern matches it in addition to everything else. The only characters that won't be quoted with backslashes by GNU Emacs are -, ., /, 0-9, and a-zA-Z (this is deep in shell-quote-argument). There are some things that may break inside GNU Emacs and Tramp if you do this but I haven't had any problems yet.)

If you ask Tramp to use the (remote) $PATH your remote environment sets up, it must be able to run '/bin/sh -l -c ...' in a way that successfully runs the command string without having your .profile blow things up, despite your .profile probably not being able to detect this. This is typically triggered by you putting 'tramp-own-remote-path' somewhere in tramp-remote-path (either the global version or a connection profile). Because Tramp is that way, the remote path is not part of the predefined connection information that you can set directly.

Despite Tramp carefully initializing your remote login session (if you use 'ssh'), Tramp then normally ignores your remote $PATH and instead generates its own, based on tramp-remote-path. Various bits of Tramp documentation will imply that you can use '~' in things you add to tramp-remote-path (cf some of the examples), but as far as I can tell this is what you would call inoperable. As part of connection setup, Tramp reduces tramp-remote-path down to the directories that exist on the remote machine, and the mechanism Tramp uses for this appears to be incompatible with the use of either '~' or environment variables like '$HOME'.

(Tramp does this path check using the tramp-bundle-read-file-names defconst and you can read what that expands to in order to see the details, along with the tramp-get-remote-path function and the stuff it calls. Since the shell snippet Tramp sends to the remote end quotes all of the directory names it checks, whether or not the remote shell supports '~' is irrelevant and it won't expand $HOME for you. It's possible that this is a bug and Tramp will get fixed some day, but don't hold your breath.)
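As a small sketch of why quoting defeats '~' and '$HOME' here: this is not Tramp's actual snippet, and 'existing_dirs' is a hypothetical helper that I made up. Like Tramp's check, it tests each candidate directory as a literal, quoted string.

```shell
#!/bin/sh
# Hypothetical helper, not Tramp's real code: print each argument that
# names an existing directory. "$d" is quoted, so the shell never gets a
# chance to expand a leading '~' or a '$HOME' inside the string.
existing_dirs() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done
}

# The single quotes stand in for the quoting Tramp applies: the helper
# sees the literal strings '~/bin' and '$HOME/bin', which are not
# directories (unless you really have directories with those odd names).
existing_dirs '~/bin' '$HOME/bin' /
```

Only the directories written out in full survive the check, which matches what I see with tramp-remote-path.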

There's no particularly good fix to this that I know of; instead, I think you have two options. The first is to make tramp-own-remote-path work (it probably will if you use a conventional shell and .profile), add it to tramp-remote-path, and set up and handle your $PATH properly in each machine's .profile. This is probably the better option if you can arrange it, in part because you probably want a correctly set remote $PATH for when you're logged in to the machine directly. The second option, suitable only if you have a common home directory name pattern or two across all your machines, is to add all likely directories to your tramp-remote-path in whatever variations of your home directory you might have:

(dolist (pe '("/home/cks/go/bin" "/u/cks/go/bin" ....))
  (add-to-list 'tramp-remote-path pe))

(Or you could write an ELisp function that generated the list from multiple sublists, one for things relative to your home directory and one a list of possible home directories.)

Many modern Unix systems in standard configurations will make your home directory be /home/<login>, so you can cover all of them by a few paths in tramp-remote-path. Well, assuming you have the same login on all of them. Otherwise, you'll probably have to venture into the world of connection local variables and profiles.

When changing tramp-remote-path there is something very important that can cause you (me) a great deal of frustration if you don't know the full story. At the very end of Tramp's documentation on remote programs, there is this critically important bit:

When remote search paths are changed, local Tramp caches must be recomputed. To force Tramp to recompute afresh, call M-x tramp-cleanup-this-connection RET or friends (see Cleanup remote connections).

If you're me, you might innocently think that it's safe to, for example, set or modify tramp-remote-path before you make any connections. This is false, and calling tramp-cleanup-this-connection is not sufficient to force 'local Tramp caches' to be recomputed. In fact, not even quitting and restarting Emacs will do so. Tramp maintains a persistent file based cache of information about each host you've ever connected to, including the remote $PATH it determined at the time of the first connection (with the first connection's tramp-remote-path), and it will use that cached remote $PATH value until and unless you clear the entire cache by, for example, deleting ~/.emacs.d/tramp (with Emacs not running), or you use tramp-cleanup-all-connections, which I think is probably sufficient.

Given its persistent and dangerous effects, you might want to disable this Tramp cache file. The fine documentation asserts that you can do this by setting tramp-persistency-file-name to nil. This appears to be technically correct but practically inoperative, because you cannot customize the variable to nil (only to a filename) or usefully setq it before Tramp is loaded. You can only setq it to nil (and have it stick) after Tramp is loaded (and you probably also want to invoke tramp-cleanup-all-connections to get rid of anything Tramp may have loaded).

Tramp isn't a mode and so doesn't have any hook that fires when it loads and starts to activate, which would be the right time to augment tramp-remote-path, clear any cached data Tramp loaded, and so on. This is unfortunate but use-package provides a way to work around it:

(use-package tramp
  :defer t
  ;; :config will be run right after Tramp loads.
  :config
  (cks/tramp-setup))

This appears to reliably fire as I start to enter '/sshx:' or '/ssh:' or what have you.

(The manual version of this would be to directly use eval-after-load, but I might as well stick with use-package even if that's what use-package is using under the (macro) hood.)

When it works, Tramp can be pretty magical. However, my voyage of getting to this point was anything but smooth, and parts of it were extremely frustrating. The most frustrating part was the Tramp file cache, which made various changes to tramp-remote-path have no effect, then sometimes have an effect, and then go back to having no effect, because I wasn't religiously clearing and removing the cache.

(Tramp badly needs a command that reports all of the relevant parameters for the current connection, such as the current remote path that Tramp is using. I could probably put my own version together with enough determination, but I shouldn't have to.)

PS: This entry was written in my working Tramp configuration from my home desktop, but I'm not sure I'm going to bother doing this again (I normally write entries in vim on the host that Wandering Thoughts is on). The red squiggles under (potentially) misspelled words are sort of nice, but on the other hand I turn out to have lots of vim reflexes for writing Wandering Thoughts entries.

(The reflexes aren't triggered by writing in general, because these days I write a lot of email in GNU Emacs and that goes fine.)

Detecting (or not) the use of -l and -c together in Bourne shells

By: cks

Many Bourne shells go slightly beyond the POSIX sh specification to also support a '-l' option that makes the shell act as a 'login shell'. POSIX's omission of -l isn't only because it doesn't really talk about login shells at all, it's also because Unix has a special way of marking login shells that goes back very far in its history. The -l option isn't necessarily what login and sshd and so on use, it's something that you can use if you specifically want to get a login shell in an unusual circumstance.

Bourne shells also have a '-c <command string>' option that causes the shell to execute the command string rather than be interactive (this is a long standing option that is in POSIX). It may surprise you to hear that most or all Bourne shells that support -l also allow you to use -l and -c together. Basically all Bourne shells interpret this as first executing your .profile and so on, then executing the command string instead of going interactive. One use for this is to non-interactively run a command line in the context of your fully set up shell, with $PATH and other environment variables ready for use.
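If you want to see the '-l' plus '-c' behavior for yourself without involving your real .profile, here is a minimal sketch. It assumes your 'sh' accepts '-l', and it uses a throwaway $HOME; the 'demo_login_c' function and the DEMO_FROM_PROFILE variable are names made up for this example.

```shell
#!/bin/sh
# Sketch: show that 'sh -l -c' runs .profile before the command string.
# A temporary HOME is used so that your real .profile is not touched
# (the system-wide /etc/profile, if any, will still be read).
demo_login_c() {
    tmphome=$(mktemp -d) || return 1
    echo 'DEMO_FROM_PROFILE=yes; export DEMO_FROM_PROFILE' > "$tmphome/.profile"
    # The single quotes keep the variable reference for the new shell,
    # which should only see it set if it read the temporary .profile.
    HOME=$tmphome sh -l -c 'echo "profile ran: ${DEMO_FROM_PROFILE:-no}"'
    rm -rf "$tmphome"
}

demo_login_c
```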

Now, suppose (not hypothetically) that you have some things in your .profile that you would like to not run in this situation. Perhaps they're unnecessary and expensive, or perhaps they cause problems in the rest of your environment if they're run outside the context of a genuine login shell. It would be nice for your .profile to be able to detect this. Unfortunately, as far as I know at the moment this is impossible in general.

If you're using Bash specifically (and /bin/sh is Bash), Bash will set '$BASH_EXECUTION_STRING' in this case when invoked as either 'bash -l -c ...' or 'sh -l -c'. As far as I know, Bash is the only common Bourne compatible shell that provides any way to detect this. I've been unable to find any other shell that provides any indicator of this, either as environment variables or by, for example, setting '$*' or '$0' to any special values.
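A minimal sketch of what the Bash-only check could look like in a .profile. The 'ran_with_dash_c' function name is hypothetical, and the two branches are placeholders for whatever you want to skip or run.

```shell
# Fragment for the top of a .profile. Bash sets BASH_EXECUTION_STRING
# when it was started with -c; in other shells it is normally unset, so
# there this test always reports "not -c".
ran_with_dash_c() {
    [ -n "${BASH_EXECUTION_STRING+set}" ]
}

if ran_with_dash_c; then
    : # 'sh -l -c ...' under Bash: skip the expensive or login-only parts
else
    : # a genuine login (or a shell that can't tell): do the full setup
fi
```

The '${VAR+set}' form only tests whether the variable exists, so it's harmless in shells that have never heard of it.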

(I checked Dash on Ubuntu LTS, which is their standard /bin/sh, the OpenBSD ksh (which is also /bin/sh), and FreeBSD /bin/sh (also a descendant of Kenneth Almquist's shell, like Dash). Fedora Linux uses Bash as /bin/sh.)

If you use Bash as your login shell but /bin/sh is something else (and things are specifically doing '/bin/sh -l -c ...'), you can rename your .profile to be .bash_profile, so it will only be read by Bash. If you have common setup stuff, you could put it in .profile and source that from your .bash_profile (although Bash has some complicated tangles around startup files, also).

If what you really care about is your .profile being run a second time inside an existing session, you can set a marker environment variable at the end of your .profile and then have your .profile look to see if the environment variable is set. If it is, you can skip re-doing things. This doesn't help if some clever program is doing 'ssh yourhost sh -l -c ....' in order to run some command line with your full environment set up (or if a graphical login environment does this, partly because of the Unix shell initialization problem). Under some circumstances you can detect this in your .profile because it won't be connected to a tty and you can check this with, for example, 'test -t 0'. Under other circumstances, you may have a tty despite being 'non interactive' in practice.
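Putting the marker variable and the tty check together, a .profile fragment might look something like this sketch. PROFILE_DONE is an arbitrary marker name I picked for illustration, and 'profile_mode' is a hypothetical helper that exists only to make the logic easy to try out.

```shell
# Fragment for a .profile (all names here are made up for illustration).
profile_mode() {
    if [ -n "${PROFILE_DONE:-}" ]; then
        echo nested          # .profile already ran in a parent session
    elif [ -t 0 ]; then
        echo login-tty       # first run, with a terminal on stdin
    else
        echo login-notty     # first run, no tty: likely 'sh -l -c ...'
    fi
}

case "$(profile_mode)" in
    nested)      : ;;   # skip everything that is already set up
    login-notty) : ;;   # do only cheap setup that is safe without a tty
    login-tty)   : ;;   # do the full interactive login setup
esac

# Set the marker at the very end so that nested shells inherit it.
PROFILE_DONE=1
export PROFILE_DONE
```

As noted, the tty test is only a heuristic; something that allocates a tty for a non-interactive command will still look like 'login-tty' here.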

(It's possible that there is some generally available marker that I'm not aware of. If so, I'd be happy to hear about it.)

Your Linux distribution may no longer auto-generate new SSH host keys

By: cks

All Linux distributions (and all systems) face the need to generate SSH host keys when your system gets installed. One traditional way this was done was if the system started and discovered it had no SSH host keys, it would generate new ones. One way this was handy was that if you wanted to generate new SSH host keys for some reason, you could remove the existing ones and either reboot or restart the SSH daemon (which would usually trigger this).

As I found out the hard way the other day, some Linux distributions don't do this any more. In particular, Ubuntu doesn't. If you remove your SSH host keys, your SSH daemon will refuse to (re)start, and as far as I know there's no convenient, simple way to regenerate the necessary keys. If you make this mistake (as I did), you'll get to have fun looking up the ssh-keygen arguments you need (and then typing them in on the system console or a serial connection).
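For future reference, here is a sketch of the sort of ssh-keygen invocations involved. The 'regen_host_keys' function is a hypothetical wrapper of my own, /etc/ssh is only the usual location (check the HostKey lines in your sshd_config), and the set of key types is an assumption about what your sshd wants.

```shell
# Sketch only: generate any missing host keys in the given directory.
regen_host_keys() {
    dir=${1:-/etc/ssh}
    for kt in rsa ecdsa ed25519; do
        key="$dir/ssh_host_${kt}_key"
        if [ ! -f "$key" ]; then
            # -N '' sets an empty passphrase, -q keeps ssh-keygen quiet,
            # and -f names the private key file (the .pub goes next to it).
            ssh-keygen -q -t "$kt" -N '' -f "$key"
        fi
    done
}
```

You would run this as root (for example 'regen_host_keys /etc/ssh') and then restart the SSH daemon. OpenSSH's ssh-keygen manual page also documents a '-A' option that generates all missing host keys of the default types in one go, which may be all you need.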

Before I started writing this entry, I would have guessed that this was common behavior across multiple distributions, because in this day and age it makes sense for your SSH keys to be set up in the installer rather than (possibly) on system boot, in a situation where the kernel's random number generation may not have accumulated much entropy. However, it turns out that Fedora doesn't behave like this.

Fedora's OpenSSH package has an entire set of systemd units and a script to generate SSH host keys if any of them are missing. Fedora has a templated sshd-keygen@.service, which uses /usr/libexec/openssh/sshd-keygen to generate a host key of the appropriate type if it doesn't exist. Then Fedora's sshd.service unit 'wants' sshd-keygen.target, which in turn wants sshd-keygen@rsa.service, sshd-keygen@ecdsa.service, and sshd-keygen@ed25519.service, so before sshd starts, any missing host keys will be generated (whether or not your specific SSH server configuration uses them).

Since Ubuntu usually follows Debian, I assume Debian also doesn't automatically regenerate SSH host keys (and if it does, it doesn't seem to use the approach Fedora does). Fedora-derived enterprise distributions probably follow Fedora, but I'm not even going to look. Other distributions may go either way; there probably isn't anything you could describe as a standard approach for this.

In the future, if I want to reset an Ubuntu machine's SSH host keys, the simplest thing for me will be to copy the Fedora sshd-keygen over to the system and run it (since my desktops are Fedora, I have convenient access to it). On a quick scan, the script itself is distribution-independent, so in theory you (I) could fish it out of Fedora in advance and stash a copy somewhere.

(Especially for servers, there's an argument that a missing SSH host key should be a fatal error for sshd, not something you should automatically fix up, since something is obviously badly wrong. If you generate new SSH host keys anyway so maybe people can SSH in to check the server, what you're effectively doing is training people to accept mismatched host keys in times of problems.)

Update: In a comment, Andreas pointed out 'ssh-keygen -A', which does exactly this system host key regeneration.
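For future reference, here is a minimal sketch of using it, wrapped in a function so nothing runs by accident. The function name is mine, and the 'ssh' unit name is an assumption for Ubuntu and Debian (Fedora calls it 'sshd'). It needs to be run as root.

```shell
# Sketch only: regenerate missing SSH host keys, then restart the daemon.
# 'regen_host_keys' is a hypothetical helper name, not a standard tool.
regen_host_keys() {
    # 'ssh-keygen -A' creates a host key of each default type that does
    # not already exist under /etc/ssh; existing keys are left alone.
    # To force entirely new keys, remove the old
    # /etc/ssh/ssh_host_*_key and /etc/ssh/ssh_host_*_key.pub files first.
    ssh-keygen -A || return 1

    # Assumed unit name; adjust for your distribution.
    systemctl restart ssh
}
```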

Splitting up my .emacs, or "use-package doesn't solve all problems"

By: cks

Over on the Fediverse, I shared a little story:

The current state of my GNU Emacs yak shaving:
; wc -l .emacs
1550 .emacs

Some of that is comments. Some of that is personal functions that I should move out to other files so I can have only use-package stuff in my .emacs. And some of it is large blocks for lsp-mode and company and some other stuff I'm probably never going to use again, which I should drop.

I got into my .emacs situation despite using use-package, which is the usual way people recommend to tame your Emacs configuration. Today I dealt with the whole thing by splitting my .emacs up into separate files, which is much better in general even if it's a bit more annoying in some ways.

My .emacs had accumulated a number of things over time, probably like many long term Emacs users. Besides infrastructure for use-package, it also had general Emacs settings I want, little personal commands and functions, simple use-package declarations, and a number of large, complex use-package declarations for packages that I need to significantly customize and tweak (and where I had lots of comments about the situation, written for my future self). Plus it also had commented out remains of experiments with things like origami-mode.

Some of these things I could remove to cut down my .emacs size, but a lot of them are intrinsically large, especially various use-package declarations. Things like Eglot, Corfu, and Vertico aren't really small packages with little to configure and adjust; they touch core areas of my Emacs experience where I have some strong opinions that don't match their default configurations. Making them work how I want them to is not necessarily a tiny process of one or two customizations. Using use-package basically encourages putting those customizations inline, as part of the overall use-package declaration, including little helper functions.

(Plus, modern modular things like Eglot require a bunch of additional packages for the full experience, and a bunch of small use-package declarations add up, especially when I add comments about why I have them.)

I took two approaches in my split. For personal functions and key bindings, I set up a number of new personal packages in ~/share/elisp and configured them in new use-package blocks (which also let me set hooks and establish key bindings). Then I took existing large use-package declarations (and small ones tied to them) and moved them all to a collection of separate files that I directly 'load-file' in my .emacs. The separate files are organized by general purpose; I have one for LSP stuff, one for all aspects of completion, one for flymake and flycheck, and one for MH-E. I left unrelated small use-package declarations (many for small packages) in my .emacs rather than try to push them to a 'miscellaneous' file where I'd probably forget about them.

(The resulting .emacs file has 18 use-package declarations left, five of which are for personal things and some of which are present purely so I can ':diminish' their modeline markers.)

What I take from this is that use-package is a perfectly good way to keep things organized but, somewhat obviously, it in no way guarantees that they will stay small or that I will refrain from adding packages.

Sidebar: use-package, load paths, load-file, and require

I don't have my ~/share/elisp directory tree on my Emacs load path, although maybe I should. For my collection of personal functions that I set up with use-package, I used ':load-path':

(use-package cks-misc
  :load-path "~/share/elisp"
  :commands (....)
  [...]
  )

(These things have no use-package usage inside themselves, they're just ELisp functions and so on.)

For the use-package declarations I moved to other files, I put them in ~/share/elisp/startup and did, eg:

(load-file "~/share/elisp/startup/lsp-startup.el")

It's probably more Emacs-proper to put things on my load path and then require the relevant name, but the load-file approach works, it directly expresses what I'm doing, and it hopefully makes it clear to future me that these aren't anything like normal packages.

Incidentally, as I discovered in the process of writing this entry but my readers may already know, when you use use-package's ':load-path', the directory is permanently added to 'load-path'; it's not just used once for this use-package (this is sort of spelled out in the documentation if you read carefully). I'm still going to use ':load-path' on everything that 'needs' it, although now I'm more tempted than before to extend 'load-path' to my ~/share/elisp in my .emacs setup code.

Straightforward checklists don't fit every situation

By: cks

We had a weekend-long power shutdown this past weekend in the building with our main machine room. As is our custom, we powered off the servers beforehand (on Friday evening, with some surprises), and then turned them back on this morning. This isn't the first time we've gone through such a power shutdown (although usually they're shorter), and over time we've written checklists and lessons learned for these things. This time was no exception, so I wrote checklists for both powering down everything and powering it back up (well in advance for once). Then we collectively looked at my nice, detailed, step by step power on checklist and ripped it up.

The issue is that powering things up in our environment is not really an orderly, step by step process. A lot of our systems are both core things and relatively independent of each other, and while there are ordering dependencies (our fileservers have to be up before any NFS client, for example, and the DNS resolvers need to be up before the fileservers), they're small and at the start. Even in the ordering there's a lot that can be done at once, such as booting up all of the fileservers at once.

This structure, or lack of it, doesn't particularly fit in the traditional checklist format and process, which sort of assumes that you have a real order to things. Our power up process is more anarchic than that; at best it proceeds in stages, and even then there are multiple stages that can be done at once (such as turning on most of the firewalls and turning on the fileservers; neither depends on the other). Adding to the mix is the potential need to either troubleshoot things like failed PDUs or non-booted switches, or to decide to defer them to later.

This isn't the first time I've written up a power up list and had it more or less abandoned in practice (and our retrospective 'what actually happened' worklogs even talked about it). This is just the first time I've really admitted it up front.

I'm not sure what the best form of documentation is for our orderly cold start power up requirements, but it's certainly not a detailed checklist or anything that claims to be a linear narrative. Maybe what we want to do is simply list what everything requires, starting from the machines that don't require anything. Then everyone involved can look at what has all its requirements satisfied and go for it.
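As a rough sketch of what 'list what everything requires' could look like in practice, assume a plain text file with one 'host: requirement ...' line per machine. The file format, the function name, and the host names are all invented for illustration. Given the machines that are already up, this prints the ones that aren't up yet but have everything they need:

```shell
# ready_hosts REQFILE [UP_HOST ...]
# Hypothetical helper: REQFILE has lines like 'fileserver: dns', meaning
# 'fileserver' needs 'dns' up first. A line like 'dns:' has no requirements.
# Prints every host that is not yet up and whose requirements are all up.
ready_hosts() {
    reqfile=$1
    shift
    awk -v up="$*" '
        BEGIN {
            # Turn the space-separated list of running hosts into a set.
            n = split(up, a, " ")
            for (i = 1; i <= n; i++) isup[a[i]] = 1
        }
        /^#/ || NF == 0 { next }      # skip comments and blank lines
        {
            host = $1
            sub(/:$/, "", host)
            if (host in isup) next    # already running
            ok = 1
            for (i = 2; i <= NF; i++)
                if (!($i in isup)) ok = 0
            if (ok) print host
        }
    ' "$reqfile"
}
```

With nothing up, this lists the machines that don't require anything. As things come up you re-run it with the growing list of running hosts and see what has become possible. It deliberately doesn't impose an order within a stage, and it says nothing about the 'nice to have early' cases.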

(A complication is that there are also some things that are ideally started early but if they're having problems it's not critical. For example, it's nice to have our central syslog server up early to collect everyone's logs right from the start, but it's not essential in the way that, say, our NFS fileservers or our local DNS resolvers are.)

Some views on Eglot and lsp-mode in GNU Emacs

By: cks

Not content with blowing up my in-buffer LSP completion, I decided to follow it up by first trying out Eglot and then more or less switching from my relatively long standing use of lsp-mode. In the process I've wound up with some opinions on the contrast between lsp-mode and Eglot. I will give you the summary up front.

If you're just starting out with GNU Emacs and you want to have a functional, nice LSP based development environment without going through a long voyage of discovery, install and use lsp-mode, company-mode, and probably lsp-ui-mode (it comes with lsp-mode). If you stick with GNU Emacs you'll eventually want to move to Eglot, but that's for later.

To understand why I say this requires a voyage into the history of GNU Emacs, at least as I understand it.

GNU Emacs has always been in part a programming environment for creating UIs for editing text, especially code. However, for a long time many of the basic native ELisp pieces involved in doing this were relatively monolithic and weren't designed to be extensively modified and customized. If you wanted a modified version of something that GNU Emacs had a basic ELisp version of, you usually didn't hook into the native version; instead you had to replace it entirely with your own version (possibly copying and modifying the original Emacs ELisp code). One area where this was the case was in-buffer completion (especially autocompletion), which gave us third party monolithic packages like auto-complete and company(-mode) that a decade ago were your best or only choices. Lsp-mode dates from this era (its first commits were in 2016) and unsurprisingly, it's a monolith that implements many UI features itself (and it integrates with company-mode, also a monolith).

Somewhat recently (I'm not sure when it started), GNU Emacs has been modularizing many of these internal features, creating APIs that let people hook into aspects of, for one prominent example, completion. Modern GNU Emacs has adopted what you could call a "Unix tools" approach, where you have small, contained packages that handle one aspect of something and work by connecting themselves to these API points. Sometimes this results in very small, modest packages, but even packages that take on bigger jobs are smaller and more limited than past monoliths. Partly this is because they don't have to do everything themselves; they can leave various things as a problem for other people. Is Corfu giving you only limited completions in some programming language? That's not Corfu's problem, you need something else to create completion data.

(When back in the day it was Company's problem, more or less, and Company had to get a bunch of people to write a bunch of things to provide completion data.)

Eglot is a GNU Emacs package for this modern GNU Emacs world, which is why people say it's smaller than lsp-mode and also 'better' or 'more Emacsy'. It's smaller, more limited, and more Emacsy because it relies on standard GNU Emacs facilities that can now be customized and improved by other packages, rather than implementing its own nicer versions of those facilities the way lsp-mode does. Do you want nice autocompletion? That's not Eglot's problem, you can set up corfu yourself. Do you want nice 'go to definition' and 'see (other) references'? That's also not Eglot's problem, see consult-xref. Would you like to display code action possibilities on the right side? You probably want sideline. And so on.

Eglot's choice has a good side and a bad side. The good side is that it's part of this powerful, capable modern Emacs ecology of relatively narrow, focused packages. As you adopt the versions of these packages that you like, these packages improve things all across GNU Emacs, including in Eglot, because they're hooking into those general Emacs features and APIs. Corfu isn't just autocompletion for LSP buffers, it's potentially autocompletion for everything. And your Eglot environment inherits the other general improvements you make in your overall GNU Emacs environment. The whole thing gives you compounding effects from individual improvements.

This good side is why I think you'll wind up with Eglot if you stay with GNU Emacs. Over the long term you get a lot of power from moving into the modern Emacs ecology of narrowly focused but general purpose packages, and the more packages you adopt the more appealing Eglot is as part of that ecology (and the more foreign lsp-mode and company-mode become, and the more attractive it becomes to move to your standard packages). This is more or less my path to Eglot, and I wouldn't be here if I hadn't already adopted a whole collection of packages.

The bad side is that to get a decently nice Eglot experience, you also need a bunch of other packages. This means that you have to hear about those packages, experiment to decide which ones you like, learn how to set them up for your tastes, and so on. Until you do so, your LSP editing will be left with the relatively bare bones base GNU Emacs experience for completion, cross references, and other things. This is functional but by modern standards, not all that appealing. Even once you have all the packages you have to learn how to connect them all up to Eglot; lacking that knowledge at the time is why I bounced off Eglot in an earlier experiment with it.

(You could adopt someone else's modern GNU Emacs configuration, but your tastes may not be their tastes and anyway, that way you're effectively adopting a black box that you don't (yet) understand. I'm not sure this is meaningfully better than using lsp-mode, and lsp-mode will probably be better documented than the combination you've been given.)

Another issue is that integrated packages like lsp-mode and company tend to give you a better, more pleasant experience for some things. For one painful example, Eglot's approach to configuring what LSP servers support is general and clearly the proper way to do it, but lsp-mode's approach is much easier to use. Turning off pylsp's 'mccabe' code complexity metrics is simple in lsp-mode and an extended voyage of discovery in Eglot (at least for me). You may discover that it's beyond your (current) GNU Emacs capabilities to do some things in Eglot that are relatively straightforward in lsp-mode.

(This is kind of the extended version of something I said on the Fediverse.)

How backups work depends on the goals of the people setting them up

By: cks

One of the recent commotions in my corner of the tech sphere was over an incident where a piece of software deleted a company's production database and all of its backups. The software got all of the backups too because, I'll quote:

[Their SaaS provider] stores volume-level backups in the same volume - a fact buried in their own documentation that says "wiping a volume deletes all backups" - [...]

A lot of people were horrified, but I had some sympathy for the SaaS provider. An important thing about backups is that how they work depends on what you're trying to recover from, and for certain sorts of disasters and recoveries, this decision is perfectly sensible. For a SaaS company, backups also depend on customer support needs and what customers are going to want, and the decision can also make sense from that perspective.

In this case, the obvious question is whether the SaaS provider is trying to protect customers from loss of data in the volume or from deliberate deletion of the volume. If what you're protecting people from is an accidental 'DROP TABLE' or an accidental 'rm' (or an accidental overwrite of something important), then in-volume backups such as ZFS snapshots make perfect sense. We use ZFS snapshots ourselves for this purpose on some filesystems (although they're not our only form of backups). As a bonus, restores are much faster than external backups. However, backups tied directly to the volume aren't a good idea if what you're protecting people against is deletion of the volume itself.

(The SaaS provider itself might be concerned about loss of the volume from things other than deliberate deletion, but this isn't a concern customers want to have; they want to pretend that the SaaS provider has 100% reliable handling of volumes until they delete them. Of course, this can lead to unpleasant customer surprises if something goes wrong, which is why wise customers have completely external backups so they don't have to trust the SaaS provider and the SaaS provider's cloud vendor. The people this happened to were not wise customers, but if you've heard of this incident, you already knew that.)

If a SaaS provider wants to potentially protect people from deliberate deletion of a volume, there are a bunch of tradeoffs. For example, you're probably charging people for out of volume backups in some way, which means that if people really want to delete an unused volume, they also want to delete its backups so they're not being charged for those either. If you surface an option for 'also delete backups of this volume' so that people deleting volumes can handle the situation right away and aren't surprised by charges later, what you're surfacing is an easy total data loss option; people will reflexively say "yes" and wipe out their backups too.

(After all, typically people who delete volumes think they're doing the right thing at the time. Software agents don't think but they're generally going to behave in the same way.)

The harder you make it to delete volume backups, the more you're going to annoy some of your customers who really do want to delete their volume backups (or perhaps many of your customers, since you'd hope that almost all volume deletions are customers making the right choice and they probably don't want the backups either). At a certain point, a SaaS provider might take a rational look at their data on what people are deleting and what they're recovering from (and customer support calls), and conclude that hard to delete volume backups aren't worth it because customers don't use the extra resilience and are annoyed by the side effects of it. Perhaps you can design both your systems and your charging to get around this, but it's more product development work and if you're a SaaS company, you have a lot of other product development work you could be doing and that other work may have much higher value to your company.

(Convenient, easily accessible in-volume backups may also have side effects. The space consumption side effects of ZFS backups are why we don't use them pervasively for all of our fileserver ZFS filesystems.)

Locally we use external backups, but this is because we're operating physical storage and so we have to be concerned about all sorts of catastrophic things happening to it. Our external backups are slower to restore from for in-volume damage like deleted files, but we have to make that tradeoff because we absolutely have to be able to recover from a total loss of a ZFS filesystem, ZFS pool, or an entire fileserver (or our entire machine room).

Some of our servers revived themselves unexpectedly

By: cks

We have a whole building, weekend long power shutdown in the building with our machine room that officially starts tomorrow (Saturday) morning at 5am, which is the motivation for our newly added temporary backup MX. Because we like to be in control of both the shutdown and the startup of our machines, we turn machines off in advance for scheduled outages (there's not much we can do about unscheduled ones). For various reasons we did the shutdown earlier this evening.

(One reason to start machines under controlled circumstances is that sometimes hardware fails, things go wrong, or you discover unfortunate aspects of your environment. At least these days we've mostly learned lessons from previous power shutdowns and startups, although there are aspects I hadn't fully absorbed and will write about later.)

During the shutdown, something surprising happened, which is that all of our ZFS fileservers came back to life. We definitely ran 'poweroff' on each of them and they were off the network for some amount of time, but then my co-workers doing work in the machine room noticed that they were all powered back on. We ran 'poweroff' on the rebooted servers and they shut down properly, rather than rebooting, so that part's not the problem. After some discussion we decided to deal with the immediate problem by pulling their power plugs, so they can't come back on even if something on board wants them to (all of these servers have BMCs).

One of the things we did between the fileservers shutting down and them coming back up is that I ran fping to scan the subnet they're on, to see if we'd missed shutting down any machines (and this fping run showed that none of them were on the network at the time). The host I ran fping from was on the same network and would have still had the MAC addresses of the fileservers in its ARP cache, so it could have directly unicast packets to the MAC.

One theory we have is that this triggered some sort of 'Wake on LAN' power up behavior. I wasn't pinging with a WoL 'Magic Packet', but as covered in sources like the Linux ethtool(8) manual pages, your hardware may potentially support a whole host of WoL mechanisms, including 'unicast messages'. This sounds like it might cause a server to wake up if its network interface receives a packet to its hardware MAC. Such as, for example, an ICMP ping packet that didn't need an ARP because the sending host already knew the target's MAC.

(I can't find much documentation on what these Wake on LAN options mean, but see eg here, this chipset documentation, or FreeBSD's ifconfig and its 'wol' options.)

When the power shutdown is over and we bring the fileservers back up on Monday, we'll be looking at what 'ethtool' reports as their Wake on LAN settings. Since they have fully capable BMCs, we may want to force all of them to have no Wake on LAN active at all. Certainly it seems undesirable to have them potentially powering up based on just receiving packets, since there's a whole host of ways they could receive traffic.
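As a sketch of what that check might look like, here is a tiny filter that pulls the active Wake on LAN flags out of 'ethtool' output. The function name and the 'eno1' interface name are hypothetical. The flag letters are the ones listed in ethtool(8).

```shell
# Print the active Wake-on-LAN flags from 'ethtool IFACE' output on stdin.
# Flag letters per ethtool(8): p PHY activity, u unicast messages,
# m multicast, b broadcast, a ARP, g MagicPacket, s SecureOn, d disabled.
# 'wol_active' is a made-up helper name.
wol_active() {
    # The 'Supports Wake-on:' line has a different first field, so only
    # the current setting matches here.
    awk '$1 == "Wake-on:" { print $2 }'
}

# Usage on a real machine, as root ('eno1' is a hypothetical interface):
#   ethtool eno1 | wol_active     # what is currently enabled, eg 'g' or 'ug'
#   ethtool -s eno1 wol d         # turn Wake-on-LAN off entirely
```

Anything with a 'u' in it would fit the unicast theory. Whether a 'wol d' setting survives a reboot or a power cycle is something we'd have to verify on the actual hardware.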

PS: We haven't seen this in past power shutdowns, but our fileserver hardware was refreshed between the last one and now.

Learning my lesson that Python virtual environments aren't always movable

By: cks

I've said before that Python virtual environments can be moved around. Well, technically that entry said 'usually', but in practice I don't remember the limitations I mentioned in that entry. And that is how a while back I renamed the top level directory of a Django virtual environment that I'd also installed the Python LSP server into, and then yesterday I was rather puzzled when I tried some Django development and GNU Emacs gave me a weird error and didn't start my LSP environment.

(Fortunately what I was really doing was seeing how my new Corfu based lsp-mode completion would behave with some Python code.)

The issue is simple: every (Python) program installed into your venv's bin/ directory starts with '#!/path/to/venv/bin/python3', including programs like pylsp, the Python LSP server. They have to do this because they need to run the venv's Python, but that means that they're locked to the original filesystem location of the venv. If you move the venv, either there will be no 'python3' at that path for them to run or worse, you'll be pointing into and using a different venv. Programs outside the venv aren't normally affected, because they're directly using the venv's bin/python3 and the Python interpreter makes that work.

(In my case in GNU Emacs, there was no python3 at the path that pylsp was pointing to, so it failed to start with a weird system message. With no LSP server, Emacs' lsp-mode threw up its hands and gave up.)

Incidentally, this includes the venv's 'pip'. If its '#!' line points to what is now another venv's Python, I believe 'pip install <whatever>' will wind up installing <whatever> into that other venv, not the one you think you're in. This could be anywhere from confusing to somewhat disastrous, depending on what the alternate venv is. Venv name reuse may seem unlikely, but it depends on what your venv naming is like; a worst case option would be something like 'dev-venv' and 'prod-venv', where you remove the old 'prod-venv' venv and rename the 'dev-venv' top level directory to 'prod-venv' (then create a new 'dev-venv' sooner or later).

So far I haven't stubbed my toe on this in anything critical, but it's definitely something I need to remember and it may change how I set up and (don't) move venvs. If I'm going to move venvs very much, it'd be tempting to write something that fixed up all of the '#!' lines in a venv's 'bin/' directory.
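A minimal sketch of such a fixup, assuming you know the venv's old location and that its programs use the plain '#!/path/to/venv/bin/python3' form. The function name is made up, and I'd try it on a copy of a venv first.

```shell
# fix_venv_shebangs OLD NEW
# Hypothetical helper: for each regular file in NEW/bin whose first line is
# '#!OLD/bin/...', rewrite that line to point at NEW/bin/... instead.
fix_venv_shebangs() {
    old=$1
    new=$2
    for f in "$new"/bin/*; do
        # Skip symlinks (such as bin/python3) and non-regular files.
        [ -f "$f" ] && [ ! -L "$f" ] || continue

        # Read only the first line; skip files without a matching '#!'.
        first=
        IFS= read -r first < "$f" || :
        case "$first" in
            "#!$old/bin/"*) ;;
            *) continue ;;
        esac

        # Keep everything after the old venv path (eg '/bin/python3').
        rest=${first#"#!$old"}

        tmp=$(mktemp) || return 1
        {
            printf '#!%s%s\n' "$new" "$rest"
            tail -n +2 "$f"
        } > "$tmp"
        # Write back through the existing file so its permissions survive.
        cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

You would run it as, for example, 'fix_venv_shebangs /old/path/venv /new/path/venv' after the move. It only touches '#!' lines, so anything else in the venv that has the old path written into it (the 'activate' scripts, for one) is left alone.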

(There may already be tools out there that do this, but I'd have to find one of them and Internet search is increasingly bad.)

Switching entirely to Corfu in my GNU Emacs configuration

By: cks

Somewhat recently I read this article on a modular completion framework for GNU Emacs (via) and expressed a thought on the Fediverse:

If lsp-mode in GNU Emacs supported corfu in addition to (or instead of) company-mode, I would probably switch from company to corfu just to have a unified completion environment. But I don't think as-you-type completion with LSP is supported in anything except company-mode, and I'm not moving to eglot (I looked once and rejected it).

Oh well, maybe someday I can unify things a bit more. (Or I will get annoyed with as-you-type completion.)

(I already use corfu for general completion, with company-mode only used in lsp-mode buffers.)

I was wrong; corfu does support as you type completion. Corfu calls this auto completion and doesn't enable it by default, but you can change that if you want, either locally to specific buffers or generally. Today I gave things a try and after the dust has settled, I've switched entirely to corfu, even in lsp-mode, with some additional changes.

As I discovered when I first explored as you type autocompletion in GNU Emacs, I like seeing the completion information but what I don't want is to have my keystrokes stolen just because some autocomplete information showed up. Corfu's default keybindings steal common keys that I might want to type while programming, such as RETURN, TAB, and cursor up and down; this is perfectly reasonable in corfu's normal environment where you have to manually trigger completion, but isn't what I want with autocomplete on. Corfu makes life slightly more difficult for me by using '<remap>' in its corfu-map local keybindings, so I have to unset them by hand:

  (keymap-unset corfu-map "RET" 'remove)
  (keymap-unset corfu-map "TAB" 'remove)
  (keymap-unset corfu-map "<up>" 'remove)
  (keymap-unset corfu-map "<down>" 'remove)
  (keymap-unset corfu-map "<remap> <next-line>" 'remove)
  (keymap-unset corfu-map "<remap> <previous-line>" 'remove)

Then, as with company-mode, I bind C-RET, C-TAB, C-<up>, and C-<down> to do these actions, which are corfu-insert, corfu-complete, corfu-previous, and corfu-next respectively. I also made my wheel mouse scroll up and down through the selections.

While I was fiddling around in corfu, I made the fortuitous discovery of completion-preview-mode. The visually obvious thing completion-preview-mode does for me is that it shows the current completion prefix ahead of what I'm typing (if there is one). The non-obvious thing it does is that I can immediately hit TAB to complete to that prefix (the same way a single tab works in shell filename completion). This completion sort of works even in lsp-mode, where corfu's normal completion expansion gives up entirely. Initially I thought that completion-preview could successfully complete prefixes in lsp-mode, but I was being fooled by how often the prefix was the first completion.

With completion-preview showing me basic information about what I can immediately complete, I decided to slow down how soon corfu's auto-completion popup appears. If I want to trigger it early, I can always hit M-TAB. My current value for 'corfu-auto-delay' is 0.5 (seconds).

The remaining fix needed is that lsp-mode is extremely attached to company-mode. If you have company-mode installed, lsp-mode will activate it in buffers, and if you don't have it installed, lsp-mode will complain. This behavior can be turned off by setting the somewhat oddly named lsp-completion-provider variable to ':none' from its default value of ':capf'. Despite 'capf' being standard GNU Emacs jargon, lsp-mode really means 'company' here. No doubt there's some history involved.

(It's not clear to me if corfu makes some use of company-mode if it's available.)

Although I had to shave a certain amount of yaks to get here, I feel glad to have switched to only using Corfu. Company-mode is a perfectly fine autocompletion environment and I was happy with it for years, but I didn't use it everywhere and once I added corfu I was juggling two sets of reflexes, one for corfu M-TAB initiated completion in places like Emacs Lisp and the other for company autocompletion in LSP buffers. Every so often I'd hit M-TAB in an LSP buffer out of reflex, and sometimes that got confusing. Now I only have one set of reflexes.

One tricky bit of using autocompletion in corfu is that you can't change the value of 'corfu-auto' on the fly. What matters is its value when corfu-mode starts. Fortunately we can use brute force; if we assume that we're only going to change corfu-auto when corfu-mode is on, we can write functions like:

 (defun corfu-enable-auto ()
   "Enable corfu auto-completion in this buffer."
   (interactive)
   (setq-local corfu-auto t)
   (corfu-mode -1)
   (corfu-mode 1))
 (defun corfu-disable-auto ()
   "Disable corfu auto-completion in this buffer."
   (interactive)
   (setq-local corfu-auto nil)
   (corfu-mode -1)
   (corfu-mode 1))

There is probably a better way to do this, and possibly I should turn the mode off before changing the corfu-auto value. Also, don't forget to make these interactive functions, as I did in the first version I wrote.

(Well, it's GNU Emacs, we can always read the source and then duplicate what the source is doing when it goes into or out of corfu-mode. But those are internal details that might change, while having corfu-mode redo its setup should always work.)

PS: Someday I would like to make corfu complete prefixes properly in lsp-mode (which would probably also fix completion-preview, and even company-mode, since they all have the same problem), but that's another and bigger problem. For today I'm happy to have switched.

If it's in JSON, it's not really a configuration file

By: cks

Over on the Fediverse, I said something:

If your idea of a good configuration file format is JSON, you are not a daemon I will ever run voluntarily.

This is not very much of a subtoot of ISC Kea. If we ever have to replace the traditional ISC DHCP server with anything, it will not be with Kea.

If your program's configuration file format is JSON, you're openly advertising that you care far more about programming convenience in reading and loading your configuration file than you do about the people operating your software. "You can generate our JSON with software from something else", yeah, no. You've told me what your priorities are and I'm going to believe you. I would rather run software that actually cares about the people running it.

JSON is a perfectly good format for your internal configuration data store, what you transform a configuration file into and then save for your software's future convenience. It's not a configuration file format, and if you use it as such, you're basically forcing people to write your compiled configuration storage format themselves. The result is a configuration file only in a narrow technical sense that it is a file you force people to supply to configure your software. You could tell them to compile C or their language of choice into a shared .so file that you will load as a plugin to configure things, or to write a Python, Perl, Lua, or JavaScript file (depending on your implementation language) that you will load and execute to create the configuration, and call all of those 'configuration files', and it would not be too far off from the JSON case.

(One of my Python programs can get its configuration from a pickled configuration object loaded from a file. That is a file and it has the program's configuration in it, but I would never call it a configuration file.)

Why all of this matters is something I said on the Fediverse and have said before (more or less):

I should say this out loud: a program's configuration files and configuration file format is part of its user interface. Much like other user interfaces, you cannot necessarily use a generic 'UI' for your configuration files without inflicting pain on people operating your software.

Yes, this means that sometimes you have to design and build your own configuration file format, much like you may have to build other UIs for your program.

(See also.)

If your configuration user interface is JSON, you're making a statement about what and who you care about. You may also be making a statement about how you more or less require your software to be used, and how you expect people to deploy it. Certainly various people are going to read things into your choice, whether or not those are your genuine intentions, because people do that.

Pragmatically, I expect that almost no one is writing those JSON configurations and configuration files by hand. Instead they're probably generating them through a program or translating them from some (slightly) more approachable format, like YAML (which is only mildly better, but at least it has comments and an explicit multi-line structure). I'm sure there are multiple YAML to JSON translators, and some of them probably can take some sort of schema along with the input file, so you can get useful syntax errors when you make certain sorts of mistakes in your configuration.

(This is probably the route we would take if we absolutely had to run such a program.)
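
As a minimal sketch of the 'translate from something more approachable' route, here is a Python version that uses only the standard library to turn a commented INI-style file into JSON, with a crude schema check. INI stands in for YAML purely because Python ships a parser for it. The section and key names and the SCHEMA table are made up for illustration and have nothing to do with Kea's real configuration.

```python
import configparser
import json

# Hypothetical schema: section -> {key: type}. Not any real daemon's schema.
SCHEMA = {"server": {"interface": str, "port": int}}


def ini_to_json(text):
    """Translate INI text to a JSON string, checking it against SCHEMA."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        if section not in SCHEMA:
            raise ValueError(f"unknown section [{section}]")
        result[section] = {}
        for key, raw in parser.items(section):
            if key not in SCHEMA[section]:
                raise ValueError(f"unknown key {key!r} in [{section}]")
            try:
                # Convert the raw string to the type the schema asks for.
                result[section][key] = SCHEMA[section][key](raw)
            except ValueError:
                raise ValueError(f"bad value {raw!r} for {key!r} in [{section}]")
    return json.dumps(result, indent=2, sort_keys=True)


EXAMPLE = """
# Comments survive in the source file, unlike in JSON.
[server]
interface = eth0
port = 67
"""

if __name__ == "__main__":
    print(ini_to_json(EXAMPLE))
```

The useful part is that a mistyped key or a non-numeric port is reported against the human-written file, before the daemon ever sees the generated JSON.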

The easy way to switch my libvirt-based virtual machines to UEFI

By: cks

I mentioned before that I've been switching some libvirt-based virtual machines to UEFI. I've recently had to do some more things there, which has led me to discover what's important about the XML parts of your libvirt machine definitions for this. Or at least, what's important if you use virt-manager to change things.

(There's a long story that boils down to libvirt external snapshots not playing well with virtual CD-ROMs, BIOS PXE booting being annoying, and UEFI Secure Boot causing the Ubuntu 26.04 GRUB to refuse to touch the Ubuntu 22.04 installer kernel.)

As mentioned in the previous entry, what determines whether the machine boots into UEFI or BIOS is whether or not the <os> XML node has a "firmware='efi'" attribute set on it. Once you have UEFI firmware, the <os> XML node can have a '<firmware>' node with some '<feature>' nodes that tell it what to do about Secure Boot:

 <os firmware='efi'>
   <type arch='x86_64' machine='pc-q35-9.2'>hvm</type>
   <firmware>
     <feature enabled='yes' name='enrolled-keys'/>
     <feature enabled='yes' name='secure-boot'/>
   </firmware>
 </os>

By itself this isn't a fully specified UEFI set of attributes, because you need <loader> and <nvram> elements as well, and these vary based on your secure-boot and enrolled-keys settings.

Conveniently for me, if you edit your XML in virt-manager, don't have (or remove) the <loader> and <nvram> elements, and then pick the 'Apply' button, virt-manager will pick appropriate values for you based on your settings for Secure Boot (or the lack of it). This can be used when you're turning off Secure Boot (or turning it on), or when you're moving from BIOS to UEFI.

(This might also happen if you use 'virsh edit', but I haven't tested that. But I suspect it's virt-manager doing some convenient magic for you.)

So the easy way to convert a machine from BIOS booting to UEFI, with or without secure boot, is to add "firmware='efi'" to the <os> element and paste in an appropriate <firmware> block. The block above is for full Secure Boot. For full lack of Secure Boot, I want:

   <firmware>
     <feature enabled='no' name='enrolled-keys'/>
     <feature enabled='no' name='secure-boot'/>
   </firmware>
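
The same XML edit can be sketched in Python with the standard library's ElementTree. This only rewrites a domain XML string: it sets firmware='efi', drops any existing <loader>, <nvram> and <firmware> elements, and adds a fresh <firmware> block. Whether something other than virt-manager fills in <loader> and <nvram> afterward is untested here (as it is for 'virsh edit'), so treat that as an assumption. The sample domain below is made up.

```python
import xml.etree.ElementTree as ET


def switch_to_uefi(domain_xml, secure_boot):
    """Return domain XML edited to boot via UEFI, with or without Secure Boot."""
    root = ET.fromstring(domain_xml)
    os_node = root.find("os")
    if os_node is None:
        raise ValueError("no <os> element in domain XML")
    os_node.set("firmware", "efi")
    # Drop anything that pins specific firmware so it can be chosen again.
    for tag in ("loader", "nvram", "firmware"):
        for node in os_node.findall(tag):
            os_node.remove(node)
    enabled = "yes" if secure_boot else "no"
    firmware = ET.SubElement(os_node, "firmware")
    for name in ("enrolled-keys", "secure-boot"):
        ET.SubElement(firmware, "feature", enabled=enabled, name=name)
    return ET.tostring(root, encoding="unicode")


# A made-up BIOS-booting domain, trimmed to the parts that matter here.
BIOS_XML = """<domain type='kvm'>
  <name>foo</name>
  <os>
    <type arch='x86_64' machine='pc-q35-9.2'>hvm</type>
    <boot dev='hd'/>
  </os>
</domain>"""

if __name__ == "__main__":
    print(switch_to_uefi(BIOS_XML, secure_boot=False))
```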

Apparently if you flip around between Secure Boot and non-Secure Boot, you may want to reset your NVRAM file. One way to do this is to remove the relevant NVRAM file that I will find in /var/lib/libvirt/qemu/nvram/. Another way is to use --reset-nvram with 'virsh start', eg 'virsh start foo --reset-nvram'. You can also use --reset-nvram with 'virsh snapshot-revert', and I may be doing that someday.

(You don't need to reset the NVRAM file when going from BIOS to UEFI because BIOS doesn't have a NVRAM file. If you go from UEFI to BIOS and then back to UEFI, probably you want to reset your NVRAM, but also maybe you want two separate VMs instead of switching between BIOS and UEFI all the time.)

Browsers, OCSP, and a view of the web in practice

By: cks

I recently read Geoff Huston's Revocation of X.509 certificates, which in part talks about OCSP's failure. One of the pragmatic reasons for OCSP being dead is that Chrome dropped support for it more than a decade ago. Specifically, Chrome's replacement for certificate revocation was for Chrome to have an internal set of revoked certificates. Recently, Firefox has adopted a similar approach (with a different technical implementation).

One of my views of this is that it shows browsers recognizing and accepting that if they want something, they have to do it themselves and they can't rely on the behavior of outside parties, especially the behavior of a lot of outside parties. Another way to put it is that browsers can change themselves to get something done but they often have a hard time getting other people to change.

OCSP had two groups of outside parties: Certificate Authorities for direct CA OCSP checks, and web servers for OCSP stapling, and in the end browsers clearly couldn't rely on either group. In my own experience, direct use of CA OCSP checks by Firefox failed so often because of problems with CA OCSP servers that turning it off was my first reaction any time I ran into a TLS problem (cf). When you think about it, browsers clearly couldn't count on other parties to run high volume, critical services with no economic model that were guaranteed to be both reliable and private.

(The kindest thing you can say about OCSP is that it was created in a long ago world where probably no one expected that HTTPS would become as prevalent and as critical as it has. In a world where HTTPS was only used when paying for your shopping cart and interacting with parts of your bank, both the volume and the privacy impacts of OCSP would be much, much lower.)

The answer to the problems with direct OCSP checks with Certificate Authorities was supposed to be OCSP Stapling. However, this had its own problem, which was that for it to really work, all (HTTPS) web servers had to upgrade. This was never really likely to happen, especially on a timely basis, and it probably became obvious fairly soon that it wasn't going to happen in practice (partly because it's hard, also).

So one way to view Chrome's decision to drop support for OCSP (and quite early) is as a recognition that they couldn't count on any other party to handle certificate revocation for them. If Chrome wanted certificate revocation to work, they had to own their own mechanism for it (even if that mechanism was only used to a limited extent for high priority revocations). Browsers building their own mechanism also meant that browsers could handle the situation where a Certificate Authority was slow to handle a revocation for one reason or another, since the revocation data doesn't have to come only from CAs.

(The browsers require Certificate Authorities to promptly handle revocations, but if a CA doesn't do it in practice, resolving this is generally a long process involving people arguing over things, not an immediate thing where browsers remove the Certificate Authority. Immediate removal is reserved for a crisis, such as the Certificate Authority being compromised entirely.)

PS: For similar reasons I think that browsers relying on DNSSEC for TLS security properties in modern web PKI is a non-starter, even beyond all of the other DNSSEC problems in practice.

Understanding the Ubuntu server installer initramfs

By: cks

I recently wrote about all of the various steps of a UEFI network install, where you have a whole collection of DHCP, GRUB fetching things via TFTP and HTTP, and so on, all to boot into your Ubuntu server install ISO image. Specifically, all of the GRUB stuff and much of the complicated DHCP stuff is there because we have to load the installer's kernel and initial ramdisk. Our primary usage for UEFI network installs is to reinstall physical servers that are now in inconvenient locations, so eventually it occurred to me that if we already have running Linux systems, there are simpler ways to boot into a specific kernel and initramfs with specific command line arguments. One way is to add new GRUB boot entries, and another way is kexec.

If we're already using a local kernel and initramfs, it might be convenient to get rid of the need for a DHCP server too, by copying the network parameters from the currently running server and embedding them in both the kernel boot parameters and, more importantly, the cloud-init files that the installer will use. To do this, we need to embed the cloud-init files in the initramfs (and then point to them with 'ds=nocloud;s=/whatever' in the kernel command lines). Well, that's the theory, but it turns out that this is not quite the practice.

The problem is that contrary to what you (I) might think, the Ubuntu server installer is not running from the initramfs. Instead, the initramfs constructs an in-memory root filesystem from various squashfs filesystem images that it gets from /casper on the installer ISO. As part of the initramfs boot, Casper mounts the ISO image (either via NFS or via a HTTP copy), finds those files on it in /casper, and then uses these files to construct the root filesystem that will then have the ISO image (still) mounted in it when Casper pivots the system into running from it. This means that while it's readily possible to add files to the initramfs, your added files are immediately discarded when Casper pivots to its pre-built root filesystem. Since the squashfs filesystem images come from the ISO image, they're generic across your systems and you can't use them to embed per-system configurations.

(In the process of this pivot, Casper will do things like switch to a standard systemd init environment.)

To deal with Casper dropping the initramfs, we must arrange to copy our injected initramfs contents into the root filesystem that Casper builds before Casper pivots into it and discards the initramfs (as far as I know, there's no way to access the initramfs after this, especially with it pre-mounted so that your cloud-init file can be immediately read). Sadly Casper makes this complicated and potentially specific to the Ubuntu server installer you're using.

As part of the Casper initramfs process, Casper will run a collection of scripts from /scripts/casper-bottom, so ideally we can just add our own script to that and have it copy things from the initramfs to appropriate places in /root (the real root filesystem to be). Unfortunately, Casper doesn't scan this directory for scripts to run; instead what scripts to run (in what order) is handled by /scripts/casper-bottom/ORDER (this is the standard Casper way and is used for other Casper 'directories of scripts'). So we have to add our script and also replace the ORDER file from the ISO's initrd with one that includes our script.
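
A minimal Python sketch of producing the replacement ORDER file follows. It assumes ORDER is a list of shell lines, one invoking each script with "$@" and one re-reading /conf/param.conf after it. That is the format I would expect from initramfs-tools style ORDER files, but check it against the ORDER file in your own extracted initrd. The script name is hypothetical, and putting it last is also an assumption.

```python
def add_script_to_order(order_text, script_name,
                        directory="/scripts/casper-bottom"):
    """Return ORDER contents with script_name appended as the last script run.

    Assumes each script is run by a line of the form
        <directory>/<script> "$@"
    followed by a line that re-reads /conf/param.conf.
    """
    lines = order_text.splitlines()
    call = f'{directory}/{script_name} "$@"'
    if call in lines:
        return order_text  # already present, leave the file alone
    lines.append(call)
    lines.append("[ -e /conf/param.conf ] && . /conf/param.conf")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for the ORDER file extracted from the ISO's initrd.
    sample = ('/scripts/casper-bottom/01integrity_check "$@"\n'
              '[ -e /conf/param.conf ] && . /conf/param.conf\n')
    # '99local-copy-cloud-init' is a made-up name for our added script.
    print(add_script_to_order(sample, "99local-copy-cloud-init"), end="")
```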

A Linux kernel initramfs is a collection of cpio archives, with the last archive (usually) compressed. You can put your own uncompressed cpio archive on the front, or (usually) compress your own cpio archive with the same compression method as the compressed archive and stick it on at the end. Files in later cpio archives overwrite files from earlier cpio archives, and since we need to overwrite /scripts/casper-bottom/ORDER, we have to put our cpio archive at the end. Starting no later than Ubuntu 22.04 LTS, the standard installers all have the last cpio archive compressed with zstd, so that's also what we need to use to compress our own cpio archive.

(I believe there are potentially tricky issues with sticking compressed archives together this way, which I will leave to others to investigate. I made a 26.04 version work without problems but that could have been luck.)

To make this less annoying, we can use two local cpio archives. One archive contains only our additions and changes to /scripts/casper-bottom; it's zstd compressed and goes on the end of the initramfs, and we can even prepare generic, amended initramfs images with this already pre-built. Then the only per-machine addition we need to build is our cloud-init configuration files, which can go into an uncompressed cpio archive that we put on the front of our initramfs (perhaps the prepared, modified initramfs). This will give us a full initramfs that we can use as kexec's '--initrd' argument (or set up in a GRUB entry).
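
The assembly step can be sketched in Python as plain concatenation. This assumes the cpio archives already exist, with the one for the end already zstd compressed by some other tool. The only extra step is padding the uncompressed front archive with NUL bytes to a 4-byte boundary, which the kernel's documented initramfs buffer format allows between archives. All file names are placeholders, and this doesn't address whatever tricky issues there may be with stacking compressed archives.

```python
import shutil


def assemble_initramfs(output, front_cpio, *later_parts):
    """Write front_cpio followed by later_parts, in order, to output.

    front_cpio is the uncompressed per-machine archive. later_parts are
    the ISO's initrd and (optionally) our zstd-compressed additions.
    """
    with open(output, "wb") as out:
        with open(front_cpio, "rb") as src:
            shutil.copyfileobj(src, out)
        # NUL-pad so that the next archive starts on a 4-byte boundary.
        out.write(b"\0" * (-out.tell() % 4))
        for path in later_parts:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return output


if __name__ == "__main__":
    # Placeholder file names, not anything the installer dictates.
    assemble_initramfs("initrd.combined",
                       "cloud-init.cpio",
                       "initrd",
                       "casper-bottom-additions.cpio.zst")
```

If you've pre-built a generic modified initramfs, it simply takes the place of the last two arguments.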

(This is not quite enough by itself to enable a DHCP-less network boot and install, because we also have to configure the system's IP address and other details in Casper itself via the 'ip=' command line argument; see casper(7) for the format of that. With a proper ip= setting, Casper can find the ISO image and mount it, and with a proper cloud-init injected into the initramfs and then the installer root filesystem, the server installer will properly set up networking and keep it up so that you can go through the normal over the network installer operation.)
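
For the kexec route, here is a hedged sketch of building the command that loads the installer kernel with the combined initramfs. The paths and the ip= value are placeholders (see casper(7) for the real ip= format), and the arguments that tell Casper where to find the ISO are left out. The code only builds the argument list; it doesn't run anything.

```python
import shlex


def kexec_load_argv(kernel, initrd, kernel_args):
    """Build the argv for loading (not yet booting) a kernel with kexec."""
    return [
        "kexec", "-l", kernel,
        f"--initrd={initrd}",
        "--command-line=" + " ".join(kernel_args),
    ]


# All paths and the ip= value here are placeholders for illustration.
argv = kexec_load_argv(
    "/boot/installer/vmlinuz",
    "/boot/installer/initrd.combined",
    ["ip=PLACEHOLDER", "ds=nocloud;s=/whatever"],
)

if __name__ == "__main__":
    # shlex.join shows how this would need to be quoted in a shell.
    print(shlex.join(argv))
```

Passing the list straight to something like subprocess.run() avoids having to quote the ';' in the ds= argument at all. Actually switching into the loaded kernel is a separate step ('kexec -e' or 'systemctl kexec').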

PS: Apparently I will go through quite a lot to not have to maintain and update DHCP server entries, even through scripts that the future me might have fun writing.

Our backup MX server was easy to build, but yours might not be

By: cks

I recently mentioned that we'd built a backup MX server due to concerns prompted by a scheduled power outage. In a comment on that entry, Greg A. Woods said something that I broadly agree with:

I think backup MX hosts are, generally speaking, a bad idea in modern times (even going back a couple of decades).

[...]

The added maintenance overhead and headache of keeping a full-time backup MX host running and reliably forwarding ALL email it collects, and reliably rejecting all email it should reject, isn't usually worth the bother.

One reason that we implemented a backup MX is that this isn't our experience. Our backup MX was easy to build and is essentially trivial to keep in reliable operation. However, this isn't because we have some special trick to running backup MXes; instead, it's because we have a general mail architecture that enables it.

Many, many years ago we moved from a mail architecture that was essentially monolithic to one that had an external MX gateway that was stuck in front of our central mail server. This transition involved creating what I call a 'white-box' mailer environment, where knowledge of things like valid local addresses and domains was materialized in text files and reusable in many contexts. Our spam and virus filtering is also done with FOSS components, which we can more or less run as many copies of as we like.

So our backup MX is essentially a clone of our regular external MX gateway machine, except that it has the MTA and the anti-spam stuff on the same machine (and we may do this for the next version of the external MX gateway, now we know more about how much load the anti-spam stuff creates). The backup MX server uses the same white-box mail information that our external MX gateway machine does, and we arranged for it to sit in a network environment where it could deliver accepted mail straight to our central mail server (instead of later delivering it to the normal external MX gateway, which would have added more hops and more redundant spam checking).

(All of the changes from the regular external MX gateway were things that we already had in operation on other machines and needed only modest tweaks to deal with the unique parts of this one.)

This is only possible because we already had all of the pieces. We have a general framework for installing and operating servers, we had an external MX gateway separate from the main mail system, that external MX gateway didn't rely on internal services to do things like validate addresses, and we didn't have commercial software involved that might have had license restrictions that prevented us from running an extra copy on our new backup MX.

We're also making life easier on ourselves by only running this backup MX temporarily, and with a configuration for valid email addresses, spam settings, and so on that is effectively frozen because all of the machines and services that could change any of that are powered off. That way we don't have to worry about what happens if the network connection between the backup MX and us gets blocked and the backup MX starts drifting out of sync on what email addresses are valid and so on.

If we hadn't already moved from a monolithic black-box mailer environment to a multi-machine white box one, building and running a backup MX host would have had all of the issues that Greg A. Woods identified. The existence of some of these issues is part of why spammers like to probe your backup MX. Also, in general I still agree with my old entry on the case against a full time backup MX, although modern email makes me nervous about the potential for aggressive mail delivery timeouts.

(In my old terminology, what we've built is technically a redundant MX. But that's a happy accident of the available network connectivity where this machine is going to be located for the power outage, and it could have had to deliver mail to our regular external MX gateway.)

Configuring the ISC DHCP server to pick the right network boot option

By: cks

There are at least three ways that x86 machines can try to boot from the network: BIOS PXE boot, UEFI PXE boot, and UEFI HTTP boot. All of them start by the machine asking a DHCP server for what it should boot, and all of them require different answers from the DHCP server. If you want to support more than one network booting option, your DHCP server needs to give each sort of client the right answer for it, which generally means you have to tell the DHCP server how to tell the types of clients apart.

(If you have all modern machines you can probably get away with only supporting UEFI PXE booting, which will simplify your life slightly.)

The DHCP server we use is the standard and now old-fashioned ISC DHCP server. There are a variety of guides for how to configure your ISC DHCP server for multiple types of network booting, but for various reasons I'm writing my own. This one is actually tested in real use (I've booted machines all three ways from this configuration).

When DHCP clients send out network booting requests, they include two important pieces of information, their 'vendor class identifier' and their 'architecture'; these are DHCP option code 60 and DHCP option code 93 respectively. The vendor class identifier is a string and the architecture is a 16-bit integer. ISC DHCP has names for both options, vendor-class-identifier and pxe-system-type respectively (cf), although the latter appears to be recent enough that a lot of Internet writeups think you have to define it yourself in your dhcpd.conf, eg:

option pxe-arch code 93 = unsigned integer 16;

Since I didn't read up on all of this before this entry, my dhcpd.conf contains this superstition and I haven't (yet) tested a version without it.

If all you care about is UEFI x86 systems, you can use the vendor class identifier to tell apart UEFI PXE booting and UEFI HTTP booting. In PXE booting, it starts with 'PXEClient', and in HTTP booting, it starts with 'HTTPClient'. This results in a configuration snippet that looks like this:

class "pxeclients" {
  # TFTP
  match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
  next-server X.Y.Z.Q;
  filename "/grub/shimx64.efi";
}
class "httpclients" {
  match if substring (option vendor-class-identifier, 0, 10) = "HTTPClient";
  # the v-c-i in the reply is required
  option vendor-class-identifier "HTTPClient";
  filename "http://X.Y.Z.Q/grub/shimx64.efi";
}

If you also want to handle BIOS PXE systems, you need something more complicated, because both BIOS PXE and UEFI PXE have a vendor class identifier that starts with 'PXEClient'. You can be more precise by matching more of the vendor class identifier because it also includes an 'Arch:XXXXX' string (cf), but I think it's simpler to switch to using the 'architecture' number (which is what the 'Arch:' part is telling you anyway). The official list of architecture types is IANA's Processor Architecture Types, and one thing to know when reading it is that 'x64' is 64-bit x86, not Itanium. In practice with x86, what you'll see is 0x00 (BIOS PXE), 0x07 (UEFI PXE), and 0x10 (UEFI HTTP). In your dhcpd.conf, this looks like:

if (option pxe-arch = 00:10) {
  # The v-c-i is required
  option vendor-class-identifier "HTTPClient";
  filename "http://X.Y.Z.Q/grub/shimx64.efi";
} else if (option pxe-arch = 00:07) {
  next-server X.Y.Z.Q;
  filename "/grub/shimx64.efi";
} else {
  next-server X.Y.Z.Q;
  filename "/pxe/lpxelinux.0";
}

(Technically I should check pxe-arch for the last clause.)
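
To make the decision logic easy to check away from a live DHCP server, here is a small Python restatement of it. This is an illustration and not anything dhcpd itself runs. It assumes option 93 carries a single two-byte, big-endian architecture value (dhcpd.conf's '00:10' is the bytes 0x00 0x10), and unlike the dhcpd.conf above it checks explicitly for 0x00 and gives no answer for anything else.

```python
SERVER = "X.Y.Z.Q"  # placeholder, as in the dhcpd.conf above

# Architecture (DHCP option 93) -> what the reply should contain.
ANSWERS = {
    0x10: {  # UEFI HTTP boot
        "vendor-class-identifier": "HTTPClient",
        "filename": f"http://{SERVER}/grub/shimx64.efi",
    },
    0x07: {  # UEFI PXE boot
        "next-server": SERVER,
        "filename": "/grub/shimx64.efi",
    },
    0x00: {  # BIOS PXE boot
        "next-server": SERVER,
        "filename": "/pxe/lpxelinux.0",
    },
}


def parse_arch(option_93):
    """Decode a single two-byte, big-endian architecture value."""
    if len(option_93) != 2:
        raise ValueError("expected exactly two bytes in option 93")
    return int.from_bytes(option_93, "big")


def boot_answer(option_93):
    """Return the reply settings for a client, or None if we don't boot it."""
    return ANSWERS.get(parse_arch(option_93))


if __name__ == "__main__":
    for raw in (b"\x00\x00", b"\x00\x07", b"\x00\x10"):
        print(raw.hex(":"), boot_answer(raw))
```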

I believe you can use the official 'pxe-system-type' here instead of my self-defined version, but I'm copying this example straight from my known-working dhcpd.conf. Also, as covered in dhcp-eval, possibly this would be more clearly written as a switch statement. I may experiment with both changes later, but this is what's working for me today.

(See also my entry on the various steps of a network install from an Ubuntu server ISO, which discusses the shimx64.efi and lpxelinux.0 bits a bit more.)

The various steps of a UEFI network install from an Ubuntu server ISO

By: cks

Suppose, not hypothetically, that you have a locally customized Ubuntu server install ISO image (and have for a while), and you also now have a number of UEFI based machines that it would be convenient to (re)install over the network without having to visit them in person (and they don't have IPMIs/BMCs that support virtual media). It turns out that you can take an Ubuntu ISO and install from it over the network, but how the various steps and stages connect together isn't obvious. Here are my notes on this, before I forget them all. I'll assume that you already have a modern Ubuntu server installer configuration setup, but you can also do this with a stock ISO image that will walk you through the full set of server installer questions.

The process of booting your ISO over the network goes like this (including recommended things):

  1. Your UEFI based server sends out a DHCP request that includes, among other things, its request for one of the UEFI network booting options.
  2. Your DHCP server answers with the server's IP and either a HTTP URL (for UEFI HTTP boot) to shimx64.efi, or a TFTP server and the (TFTP) path to shimx64.efi. Most of your machines will probably want the TFTP option. Provided that your DHCP server gave the server you're installing a usable gateway, this TFTP and HTTP server doesn't have to be on the same network as the server you're network installing.
  3. Shimx64.efi will load grubx64.efi (which must be the signed grubnetx64.efi) from the same server and (relative) path as it was loaded, eg if shimx64.efi was loaded from '/inst/2604/grub/shimx64.efi', it will load '/inst/2604/grub/grubx64.efi'.

    The shimx64.efi and grub(net)x64.efi don't have to be from the Ubuntu version you're booting, but your grubx64.efi should match the GRUB modules you're going to use with it. You probably want to use the latest GRUB you can conveniently get your hands on.

  4. GRUB will load '/grub/grub.cfg' and various other things in '/grub' from your TFTP or HTTP server. Unlike the shimx64 to grubx64 transition, GRUB (at least the Ubuntu version) insists on using an absolute path, not one relative to the directory it was loaded from. GRUB will expect to find various things in, for example, '/grub/x86_64-efi/'.

    In your /grub/grub.cfg, you can switch all future accesses to HTTP by using '(http)' in future references to things, perhaps with a prefix:

    set http=(http)/inst/
    

    Your grub.cfg can be universal for all of your machines, or you can go on to load a machine-specific one using some GRUB variables:

    source $http/grub/by-net/$net_default_ip
    

    (This trick comes from a co-worker, not me.)

    Some GRUB documentation will claim that GRUB will automatically search for a variety of grub.cfg names that are derived from the machine's IP address and other parameters. This is experimentally false for the Ubuntu 26.04 UEFI grubnetx64.efi; my server logs show no attempts for anything other than '/grub/grub.cfg'.

  5. Whatever GRUB configuration file you use now loads the appropriate installer ISO's kernel and initrd, ideally over HTTP instead of TFTP because you switched above. You can get both of these from the /casper directory on the ISO (along with things you don't need). Once you've put these where you want them, you can specify them as, say:

    linux $http/casper/vmlinuz ip=dhcp [other options to come] ---
    initrd $http/casper/initrd
    

    Because the Ubuntu ISO's initrd contains kernel modules, it's specifically tied to the ISO's kernel; you have to use a matching pair and can't just swap in a more modern kernel with better hardware support for your hardware.

  6. GRUB boots the installer kernel with the installer initrd, which makes its own DHCP request (and hopefully gets the same IP back), because once booted into Linux you no longer get to use UEFI services and the UEFI-obtained DHCP stuff. If you forgot to put 'ip=dhcp' into the kernel command line, the Ubuntu server installer initrd won't do DHCP, won't set up any networking, and everything else will fail.

    (It would be nice if the kernel automatically inherited all of the UEFI IP settings, including the TFTP or HTTP server information, but as far as I can tell it doesn't.)

  7. The initrd 'mounts' the ISO. You have two options for how this is done, which are covered in the casper(7) manual page. Either the .iso image itself can be fetched over HTTP, stuffed into RAM, and mounted as a ramdisk image, or you can NFS mount an extracted directory tree version of it from a suitable server (perhaps the very install server that you've been TFTP'ing and HTTP'ing from so far; GRUB's $net_default_server variable may be convenient for this).

    The simpler option is configuring a NFS mount. This is done with the (kernel) command line options:

    netboot=nfs nfsroot=W.X.Y.Z:/some/path/
    

    To fetch the ISO from a URL, the kernel command line parameter is 'iso-url=http://...', but by itself this will probably fail because the default ramdisk is too small. So instead you need to also specify a bigger ramdisk (the size appears to be superstition, cf, but it works for Ubuntu 26.04 beta):

    root=/dev/ram0 ramdisk_size=1500000 iso-url=....
    

    A potential advantage of directly loading the ISO is that once it's loaded, you don't really have to care about the network connection to the install server. With a NFS mount, if something resets the networking you're really up the creek. On the other hand, the NFS mount starts quickly and means you don't have to care about things like ramdisk sizes and how much RAM your servers have.

  8. Something fetches your installer configuration quite early on (I think it may be the installer proper, not the initrd). If you don't provide a configuration, all you've done is network booted the stock install ISO and it's now going to sit there asking you to interact with the installer on the system console (which might be good enough if the server has a BMC with KVM over IP support). To either automatically install your system or to allow you remote SSH access to the installer, you need a cloud-init configuration. I believe that you can use the version you've embedded in your ISO image with your regular ds= parameter, but you may find it more convenient to fetch it via HTTP with more kernel command line parameters:

    "ds=nocloud-net;s=http://..."
    

    (You have to put this in quotes or GRUB will break it at the ';'.)

    If your install isn't fully automated and you want remote access to it to configure the interactive sections, your cloud-init user-data must include a chpasswd section for the user 'installer', or a ssh_authorized_keys with an appropriate key (which will again be used for 'installer').

    (I found this long ago from here.)

    (It's possible that you can configure a kernel serial console and then use IPMI Serial Over LAN to talk to the installer, if you have an IPMI with SoL support but no KVM over IP.)

  9. The Ubuntu server installer will start up as normal, just as if it was booted from a real ISO, except that when the installer gets to configuring the network, it will reset networking and proceed according to your default configured networking, if any. This makes it critical to set 'dhcpv4: true' (or 'dhcpv6: true' if you're that sort of person) in your installer configuration, because otherwise your server will drop off the network, probably breaking its (network) install, especially if you opted to NFS-mount the ISO image's directory tree instead of fetching the ISO into RAM.
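
The kernel command line pieces from steps 5 through 8 can be put together mechanically. The following Python is a sketch of generating the GRUB 'linux' line for either way of getting at the ISO. The function and its parameters are invented for illustration, the addresses are the same placeholders used above, and the cloud-init URL is hypothetical.

```python
def grub_linux_line(kernel, iso_kind, iso_where, cloud_init_url=None):
    """Build a GRUB 'linux' line for network booting the installer.

    iso_kind is 'nfs' (iso_where is host:/path of the extracted tree)
    or 'url' (iso_where is an HTTP URL for the .iso itself).
    """
    args = ["ip=dhcp"]  # without this the initrd sets up no networking
    if iso_kind == "nfs":
        args += ["netboot=nfs", f"nfsroot={iso_where}"]
    elif iso_kind == "url":
        # The default ramdisk is too small to hold the ISO.
        args += ["root=/dev/ram0", "ramdisk_size=1500000",
                 f"iso-url={iso_where}"]
    else:
        raise ValueError(f"unknown iso_kind {iso_kind!r}")
    if cloud_init_url:
        # Quoted so that GRUB doesn't break the argument at the ';'.
        args.append(f'"ds=nocloud-net;s={cloud_init_url}"')
    return f"linux {kernel} " + " ".join(args) + " ---"


if __name__ == "__main__":
    print(grub_linux_line("$http/casper/vmlinuz", "nfs",
                          "W.X.Y.Z:/some/path/",
                          "http://W.X.Y.Z/cloud-init/"))
    print("initrd $http/casper/initrd")
```

This doesn't do anything about step 9; the 'dhcpv4: true' setting lives in your installer configuration, not on the kernel command line.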

Provided that you've configured an appropriate cloud-init password or SSH key, you can SSH in to your network-booted server as 'installer' and be put in the regular server ISO installer environment, where you can go through whatever interactive steps you normally would with an in-person install. You'll want to use a big window and it needs to be a modern terminal program like gnome-terminal (don't try this with xterm). If you set 'network' as one of your interactive sections and you don't want to keep using DHCP in the installed system, you can switch from getting networking through DHCP to the same networking being set statically.

(You can also switch from DHCP to a static networking setup after the system has booted into its new local Ubuntu install; your install DHCP server is probably not going anywhere.)

Some of the kernel parameters here are confusing, because some of the time they can be interpreted by the kernel and some of the time they're ignored by the kernel and interpreted by things like casper(7). This is the case with the 'ip=' parameter, which in theory can be interpreted by the kernel but in practice is interpreted by Casper, with a different syntax. Since I just went on an extended digging session to find this out, I will tell you that the syntax Casper actually accepts for 'ip=' is the extended syntax used for klibc's ipconfig in its -d argument, because if your 'ip=' is something complicated, Casper winds up more or less passing it to ipconfig.

(This contradicts the Casper manual page but I extracted the 26.04 /casper/initrd to find this out. Not that it really matters, because in practice you mostly have to have DHCP working to get UEFI to network boot and then to keep your running install ISO on the network so you can talk to it.)

The minimal-changes version of going from an Ubuntu (server) ISO image to booting it over the network is the iso-url option, although you will need to extract /casper/vmlinuz and /casper/initrd from the ISO. This avoids setting up NFS service on your install server, and also avoids having to unpack the ISO (which is easy enough with the right tools, but you have to know what the right tools are). My personal view is that I prefer the NFS option, and if you're the right kind of person you can use Apache Alias directives to serve /casper right out of the ISO's extracted directory tree rather than copy them into your web server area.

PS: It's possible to do much the same with a BIOS PXE booting server, but you have to use PXELINUX instead of GRUB (in practice you'll want to use the 'lpxelinux.0' variant that understands HTTP). Once you're at the stage of loading and booting the kernel, everything is the same; you need to boot the /casper vmlinuz and initrd, with the same kernel command line options as in the UEFI case. The one gotcha is that you can't use the syslinux INITRD directive because it messes up the kernel command line.

Some general notes on network booting UEFI machines

By: cks

If you need to (re)install a large collection of servers or servers in inconvenient locations for physical access, booting them from the network in order to install them is something that you might be quite interested in. In the pre-UEFI PC 'BIOS' era of MBR booting, this was often called PXE booting, but UEFI changes things around.

UEFI firmware typically has built in support for networking, which is to say that there are UEFI protocols (function calls) for doing common things with the network (also, also). In practice this means that bootloaders and other things don't have to embed their own code to deal with the network (or their own network card drivers); provided that they don't exit from the UEFI preboot environment, they can just use UEFI services. In typical Linux environments, this will handle everything up until the kernel starts with its initial ramdisk (GRUB will load the kernel and initramfs over the network using UEFI services).

As covered in UEFI HTTP Boot, UEFI provides two ways to do network booting. Both ways start with the UEFI firmware doing DHCP to get an initial chunk of information, either by IPv4 or IPv6. In the standard and widely supported way, your DHCP server answers with (among other things) a next-server setting that points to a TFTP server and a 'filename' setting that is the initial EFI file to load and boot from that TFTP server. If you're using UEFI Secure Boot, this EFI file must be signed, so for x86 Linux with GRUB it's typically the (signed) shimx64.efi that you'd use locally (which will then boot 'grubx64.efi', which must really be the (signed) 'grubnetx64.efi'). My understanding is that this looks a lot like old fashioned PXE booting with minor differences in file names, configuration files, and so on.

The other, modern option is to skip using TFTP and load the EFI boot file over HTTP, hence UEFI HTTP Boot; this was apparently added in UEFI 2.5, from 2015. The UEFI firmware signals that it's doing a HTTP boot instead of a TFTP boot by setting special options in its DHCP request; it requests a special architecture and puts special things in its DHCP 'vendor class identifier'. If your DHCP server and your overall environment supports this boot option, you'll reply with a DHCP 'filename' option that is the URL of what to start booting from (often shimx64.efi again) and a special 'vendor class identifier' marker of your own to tell the UEFI firmware that this is a HTTP boot reply.

(See here, here, and the end of here for various DHCP server incantations using either the advertised client DHCP architecture or its vendor class identifier.)
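
To make the two DHCP cases concrete, here is a minimal sketch of the decision a DHCP server has to make. It assumes x86-64 clients, where the client architecture (DHCP option 93) is 7 for ordinary UEFI network boot and 16 for UEFI HTTP Boot, and where the vendor class identifier starts with 'PXEClient' or 'HTTPClient' respectively. The function name, server address, and URL are hypothetical, and a real DHCP server would express this in its own configuration language.

```python
from dataclasses import dataclass
from typing import Optional

# DHCP option 93 (client system architecture) values for x86-64 UEFI.
ARCH_X64_UEFI_TFTP = 7    # ordinary UEFI network boot over TFTP
ARCH_X64_UEFI_HTTP = 16   # UEFI HTTP Boot


@dataclass
class BootReply:
    filename: str                 # file name (TFTP) or full URL (HTTP)
    next_server: Optional[str]    # TFTP server, unused for HTTP boot
    vendor_class: Optional[str]   # 'HTTPClient' marks an HTTP boot reply


def boot_reply(arch, vendor_class, tftp_server, http_base):
    """Pick the boot information to send back to a UEFI client."""
    if arch == ARCH_X64_UEFI_HTTP or vendor_class.startswith("HTTPClient"):
        return BootReply(f"{http_base}/shimx64.efi", None, "HTTPClient")
    if arch == ARCH_X64_UEFI_TFTP or vendor_class.startswith("PXEClient"):
        return BootReply("shimx64.efi", tftp_server, None)
    return None   # not a network boot request we know how to answer


if __name__ == "__main__":
    # The server address and URL are made up for illustration.
    print(boot_reply(16, "HTTPClient:Arch:00016:UNDI:003001",
                     "192.0.2.10", "http://install.example.org/boot"))
```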

Although the UEFI standard's description of UEFI HTTP Boot is somewhat unclear, it clearly envisions that HTTP boot can be used to 'boot' not just EFI programs but also disk images and even ISOs. These will be set up by UEFI firmware as a (UEFI) RAM disk. How your system installer accesses this ISO RAM image after the installer's kernel has started and UEFI firmware services aren't available any more is up to it.

UEFI HTTP booting has a variety of appealing features, like not using TFTP and supporting DNS (and everything that comes with that), and in modern UEFI firmware you apparently don't even need DHCP if you configure everything in the UEFI boot variables (cf, also). However, it has the potentially significant drawback of being modern, which means that older UEFI firmware (which you may have on systems you're now retaining) may either not support it at all or may have bugs and flaky behavior related to it. For that matter, even your modern UEFI firmware may not be entirely free of bugs, especially if you want to do more exotic things like directly boot an ISO image.

If you're already going to get as much as possible of the installer from your HTTP server, my view is that you might as well enable UEFI HTTP booting in your DHCP server. It probably won't hurt and it may enable somewhat better network booting, especially across subnet boundaries. Although ideally you won't be loading very much via TFTP anyway.

A backup MX will get accessed by various sorts of people

By: cks

We have an extended power outage coming up, one that's long enough that I think we want a backup MX that can stay up during it. I've been building out a stand-alone duplicate of our current inbound mail gateway, and today I added a lower priority DNS MX record that points to it. What happened next is predictable:

This is my absolutely not surprised face that mere moments after I add a secondary MX to one of our zones, various IPs show up to poke its SMTP port, despite our primary MX being up (and the backup MX not actually running a SMTP server right now).

Admittedly, there's a reason for some use of our backup MX, which I discovered after I started the MTA on the backup MX machine:

Oops, I have to retract some of my 'the spammers are showing up on schedule' snark, because our primary MX greylists people sometime and if the primary MX is 4xx'ing things, trying the backup MX is reasonable.

(But surprise, the backup MX will greylist you too because our MXes are running the same configuration.)

Another case where things appear to have more or less legitimately shifted over to the backup MX was when one particular amazonses.com IP address opened up so many simultaneous connections to our primary MX that our primary MX started giving that IP temporary failures on connection. Trying the backup MX when the primary MX gives you an immediate 4xx is reasonable.

As far as I can tell, this isn't general SMTP probes against DNS names or IP addresses:

I did some more digging using our firewall PF logs and it appears pretty definite that some people showed up to do SMTP authentication probes only after this host appeared in DNS MX and got a TLS certificate. It's possible that the TLS certificate is the trigger for SMTP auth attempts, but they're very bad SMTP auth attempts (they aren't starting TLS, for a start, and this backup MX doesn't do SMTP auth).

Most of the sending machines that showed up were clearly bad, and many of them were rejected (or at the least sent things that got extremely high spam scores). Very few of them showed any signs of having tried to contact our primary MX. All of this matches what I think of as the expected behavior, where spammers and other bad actors hope that your backup MX is less well protected than your primary MX, so they prefer to talk to it if they can in the hopes that they'll get more bad stuff through.

(I've heard this story about backup MXes for a long time, but I never had a backup MX around to see this happen. It's nice, in a way, to have this story confirmed right in front of me.)

Some sending machines are more mysterious. For example, one outlook.com machine contacted the backup MX instead of the primary MX for no clear reason that I can see. These days, it's entirely possible that there was a transient network glitch on the path between that machine and us when it was trying to contact the primary MX, so it tried the secondary after its first connection glitched out.

Given all of this, if I was building a backup MX for full time use, it would be tempting to build a system where the MTA (mail server) was only enabled once the backup MX detected that the primary MX wasn't responding. Depending on taste, I could make the backup MX's MTA generate 4xx errors on connection or simply have it not be running at all so people got 'connection refused' if they tried. Checking once a minute or once every few minutes would be fine for our intended uses.

(In our planned one time use, we'll just enable and disable the backup MX's MTA by hand.)
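
A minimal sketch of the 'only enable the backup when the primary is down' idea might look like the following. It assumes that 'not responding' means failing to get a 220 SMTP greeting, and that several consecutive failed checks are required before acting so that a single network glitch doesn't flip things. The function names and thresholds are hypothetical, and actually starting or stopping the MTA is left out because it depends on the system.

```python
import socket


def primary_answers(host, port=25, timeout=10.0):
    """Return True if host accepts a connection and greets us with 220."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(512)
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
    return banner.startswith(b"220")


def want_backup_mta(recent_checks, failures_needed=3):
    """Decide whether the backup MX's MTA should be running.

    recent_checks is a list of booleans, oldest first, where True means
    the primary answered. The backup is only wanted once the last
    failures_needed checks have all failed.
    """
    if len(recent_checks) < failures_needed:
        return False
    return not any(recent_checks[-failures_needed:])
```

Run from cron once a minute, this would append each primary_answers() result to a short history and start or stop the MTA when want_backup_mta() changes its answer.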

Ignoring missing TLS "Client Authentication" usage in practice

By: cks

One of the slow moving pieces of TLS news is that Google is effectively requiring everyone to stop issuing TLS certificates that can officially be used for "Client Authentication" (although the actual wording may have walked this back a bit). Certificate Authorities can create new roots that can be used to issue TLS certificates that are officially usable for client authentication, but Let's Encrypt isn't currently planning to do this. This was announced last year and then slowed down a bit this year, but it's still happening.

As part of the TLS handshake, a TLS client can optionally present a TLS certificate of its own to the TLS server. Officially, this TLS certificate and its entire certificate chain must be marked as being authorized for this purpose in an 'extended key usage (EKU)', just as TLS certificates used to identify servers must be marked as being authorized for this purpose. When people like Let's Encrypt talk about 'removing TLS Client Certificate support', what they're talking about is no longer issuing TLS certificates with this client authorization EKU.

However, TLS certificates are TLS certificates, regardless of what EKUs they are or aren't marked with. As a result there's nothing that stops servers from validating a TLS certificate presented to them by a TLS client whether or not it has a client EKU. In particular, you can almost certainly simply collect the TLS certificate without validating it, then turn around and ask your TLS library to validate it as if it was a TLS server certificate (or in general, accept either a TLS client certificate or a TLS server certificate). I expect that more or less any TLS library will let you do this; Go certainly will.
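
As an illustration of how small the policy change is, here is a sketch of just the EKU decision, separated from everything else. It is not real certificate validation (no signatures, chains, names, or expiry are checked), and the function name and the idea of passing in a list of OID strings are assumptions for the example; the OIDs themselves are the standard ones for server and client authentication.

```python
# Standard extended key usage OIDs.
SERVER_AUTH = "1.3.6.1.5.5.7.3.1"   # id-kp-serverAuth
CLIENT_AUTH = "1.3.6.1.5.5.7.3.2"   # id-kp-clientAuth
ANY_EKU = "2.5.29.37.0"             # anyExtendedKeyUsage


def eku_acceptable(cert_ekus, accepted=(CLIENT_AUTH, SERVER_AUTH)):
    """Decide whether one certificate's EKU list is acceptable.

    cert_ekus is an iterable of OID strings taken from the certificate's
    EKU extension, or None if the certificate has no EKU extension at
    all (which means its use isn't restricted by EKU).
    """
    if cert_ekus is None:
        return True
    ekus = set(cert_ekus)
    return ANY_EKU in ekus or any(oid in ekus for oid in accepted)
```

A strict server would pass accepted=(CLIENT_AUTH,); accepting server certificates from clients is just a matter of widening that tuple.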

Various protocols and systems that are used by various people want and require TLS clients to present TLS certificates that will be used to validate the TLS client, with these TLS certificates being public ones obtained from trusted Certificate Authorities. In theory these TLS certificates should have the 'client authentication' EKU set. In practice, that relies on people being able to obtain such TLS certificates without difficulty. If it becomes difficult or impossible to obtain (public) TLS certificates with the client authentication EKU, the easiest thing for everyone involved to do is to change their server code so that it (also) accepts TLS certificates with the readily available 'server' EKU set, which Let's Encrypt and lots of other people will issue to people.

(This is especially likely in FOSS projects, where the people running clients and servers don't particularly have any budget to go out and find someone who will sell them TLS client certificates.)

I'm pretty certain that I've seen news flying by about at least one project that was starting to accept TLS server certificates from TLS clients, although I can't find it now in some Internet searches (and I foolishly didn't save a reference to it). I expect more projects and systems to do it in the future. It's really the inevitable result of no one blinking on this. We'll know that the cookie has really crumbled when commercial service providers start accepting TLS server certificates from TLS clients for purposes such as authenticating inbound SMTP mail (assuming this hasn't already happened).

All of this illustrates a fundamental issue with TLS security in practice, which I can summarize as 'the traffic is going to flow'. No TLS security measure that prevents desired traffic from happening will survive in the real world. And generally the solution that will survive and thrive is whatever is easiest (here, using public TLS server certificates instead of trying to set up your own CA and certificate issuance infrastructure that will work across multiple organizations). Is this a good outcome for public TLS in general? Probably not. But it is what it is.

Tiny Go and Rust programs appear to start equally fast (on some machines)

By: cks

A while back I said something on the Fediverse:

Do I care enough about a couple of milliseconds¹ to make a program I'm considering my first attempt at a Rust program, or do I do it in Go, where I'm confident I can write it without irritation?

¹ This program will be quite short running, so the big difference I expect is in startup times. Go's runtime is (much) more heavyweight (and makes more system calls) than a basic Rust program's 'runtime'.

This is an example of what you'd call a superstition. I assumed that Go had a detectable runtime startup overhead, since Go has to initialize a bunch of things, including a garbage collector and its concurrency system (which involves some background goroutines), and Rust didn't. Eventually I found hyperfine (most basic Unix timing tools can't measure things in the microsecond range) and got around to actually timing things on the machine that I care about.

I've already spoiled the answer, which is that on the machine I care about in this case, any difference in startup time between a 'hello world' program in Rust and Go is down in the noise. Perhaps there is a ten or twenty microsecond difference in timing, but perhaps not and it's an artifact of scheduling, CPU caches, physical memory layout, and other random variations you experience in anything on a normal Unix system. A 'hello world' program written in pure C is typically faster than both the Rust and the Go programs by a visible amount of microseconds, but hyperfine also says it has a higher variation in timing.

(I also compared things to a Python hello world program, which as expected takes many times longer to run than the C, Rust, or Go programs. On this machine, the Python program runs in 13 milliseconds or so as compared to less than a millisecond for all the others.)
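
To show the general shape of this kind of measurement (many runs, then a mean and a spread so you can tell a real difference from noise), here is a crude sketch. It is not how hyperfine works internally, and because it starts each program from Python the process creation overhead is inflated, so it probably can't resolve differences of a few tens of microseconds. The function name and the program names in the usage note are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=50, warmup=5):
    """Run argv repeatedly; return (mean, stdev) wall time in microseconds.

    runs must be at least 2 so that a standard deviation exists.
    """
    for _ in range(warmup):
        # Warm up the page cache and so on before measuring.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    mean, spread = time_command([sys.executable, "-c", "pass"], runs=10)
    print(f"{mean:.0f} us +/- {spread:.0f} us")
```

With compiled test programs you would call something like time_command(["./hello-go"]) and time_command(["./hello-rust"]) and only believe a difference that is clearly larger than the reported spread.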

The machine I care about here is a FreeBSD machine. But I also use Go on Linux machines, so I pulled all of my test programs over to a pretty capable Linux machine and ran them there, and the results surprised me again. On several Linux systems, the Go hello world runs appreciably slower than the Rust hello world program (and the C hello world program remains faster). Typical hyperfine results say the Go program takes roughly twice as long as the Rust program, and it is indeed in the range of a millisecond or more of difference.

This gives me more to think about (and wonder about). I'm probably still going to stick to Go, but at least now I know that as of now (with the current state of Go and Rust), Rust does indeed seem to have appreciably less runtime startup overhead on Linux, but not on FreeBSD. If I'm trying to shave even a single millisecond off the runtime of something, I probably want Rust instead of Go.

(This assumes that the rest of the code will be equally fast in Rust and Go, which may or may not be true in practice. Without writing the same program in both good Go and good Rust, a real comparison is pretty hard.)

Also, on both FreeBSD and Linux, a statically linked C executable runs appreciably faster than a dynamically linked one. How much faster depends on the OS. Unsurprisingly, a statically linked Rust executable runs appreciably faster than the dynamically linked one that is the default 'rustc' result and that I was using above; on both Linux and FreeBSD, a statically linked Rust 'hello world' is faster than the dynamically linked C one (but not as fast as the statically linked C one). I generated the statically linked executable with 'rustc -O -C target-feature=+crt-static'.

The Go executables were all statically linked, since this is the Go default on both OSes and a simple 'hello world' program doesn't do anything that would force Go to dynamically link things.

(See also my Fediverse thread.)

Hiding the option to leave comments from some visitors to here

By: cks

In a comment on a recent entry, Verisimilitude noticed a feature that I quietly added to here not too long ago:

I've noticed the Add Comment button is now conditionally excluded; that's a neat trick.

I've long had precautions against comment spam and they've mostly worked. But not entirely, and so there have always been some network areas that I disallowed comments from even if they didn't run into those precautions. And if a (bad) network area was a sufficiently high source of automatically blocked comment spam attempts, I would add it to the list of blocked areas in case the software doing the comment spam got smart enough to get past my other precautions.

For a long time the only thing this blocked was direct access to the specific URLs used to write comments here (where the 'add comment' links point to). Then, recently, I realized that it made very little sense to give people and their software the link then block them when they used the link, and it would be better not to give them the link in the first place (as well as still blocking direct access). Among other things, I can hope that this stops software from crawling Wandering Thoughts to collect all the 'add comment' links that it will hit later through, for example, a proxy network.

Adding this feature was made easier because DWiki, the wiki software behind Wandering Thoughts, already had a permissions system for whether or not people could leave comments (and who could). As part of that permission system, DWiki had always done the obvious thing of not generating an 'Add Comment' link unless you had commenting permission. So all I had to do was extend the permissions check a little bit.

(The actual implementation has a collection of markers that can be set during processing of the request to influence what additional links are provided and not provided. For example, if you're a known robot, you don't see links to my syndication feeds because I don't allow known robots to request those. So I have a whole set of what is effectively middleware that scrutinizes the request and decides what should be allowed and not allowed, and then the final, low level dynamic page rendering looks at the result and includes or doesn't include various things.)
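
The marker approach can be sketched roughly as follows. This is not DWiki's actual code; the class, the marker names, the blocked network, and the robot name are all made up to illustrate the structure of middleware setting markers and the page renderer consulting them.

```python
import ipaddress
from dataclasses import dataclass, field

# Hypothetical policy data for the example.
NO_COMMENT_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
KNOWN_ROBOTS = ("ExampleBot",)


@dataclass
class Request:
    ip: str
    user_agent: str
    markers: set = field(default_factory=set)


def check_comment_source(req):
    """Mark requests from network areas that can't leave comments."""
    addr = ipaddress.ip_address(req.ip)
    if any(addr in net for net in NO_COMMENT_NETWORKS):
        req.markers.add("no-comments")


def check_robot(req):
    """Mark known robots, which aren't allowed to fetch feeds."""
    if any(bot in req.user_agent for bot in KNOWN_ROBOTS):
        req.markers.add("no-feeds")


MIDDLEWARE = [check_comment_source, check_robot]


def page_links(req):
    """Run the middleware, then only generate links the request may use."""
    for check in MIDDLEWARE:
        check(req)
    links = []
    if "no-comments" not in req.markers:
        links.append("Add Comment")
    if "no-feeds" not in req.markers:
        links.append("Atom Feed")
    return links
```

The code that serves the comment form would check the same 'no-comments' marker, so the block and the missing link can't drift apart.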

So if you visit Wandering Thoughts entries and they don't include an 'add comment' link, that's a sign that something about your request is making my anti-various-things precautions block comments (it might be your IP address or it might be something else).

The general idea strikes me as obvious in retrospect. If you're going to block direct use of something for some request source, you almost certainly want to not serve links to it either. And it's probably a better and less frustrating experience for any innocent bystanders caught up in a 'you can't comment' area. Previously it would have looked like they can comment, but any attempt would fail; now, they don't see the link at all so they can't get mysterious failures.

(Fortunately, DWiki always blocked all access to the 'add comment' link, even the initial one, so no one ever faced the really frustrating experience of writing a comment only to have posting it fail mysteriously.)

Does your DSL little language really need operator precedence?

By: cks

Every so often I create some sort of little language, of lesser or greater power, and when I do I have some heresies (like using recursive descent parsing). One of those heresies is that I usually leave out real operator precedence, other than support for '(' and ')'.

Operator precedence is nice and there are all sorts of cool algorithms for implementing it without tearing your hair out. But it's mostly nice for arithmetic expressions (or if you have a lot of operators), not for other things you may be using expressions for, such as matching incoming connections against some rules, and implementing real operator precedence will complicate your parser and little language. If you do this regularly and have the relevant algorithms memorized, or if you want an extra learning experience, go ahead and implement operator precedence anyway. Otherwise, well, are you sure you need it? I've been pretty happy with little languages that had little or no operator precedence, among other hacks to make them simpler.

(A certain amount of basic operator precedence can be implemented fairly simply in a recursive descent parser, although it can be increasingly tedious as you add more and more levels.)
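
As a sketch of what 'basic operator precedence in recursive descent' looks like, here is a tiny evaluator for a made-up rule language with 'or', 'and', 'not', and parentheses, where each precedence level gets its own method. The language, the function names, and the idea of evaluating against a set of true names are all invented for the example.

```python
import re

TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z_][A-Za-z0-9_.-]*)")


def tokenize(text):
    """Split text into '(' , ')' and word tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad input at {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens, facts):
        self.tokens, self.pos, self.facts = tokens, 0, facts

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of expression")
        self.pos += 1
        return tok

    def expr(self):
        # Lowest precedence level: term ('or' term)*
        value = self.term()
        while self.peek() == "or":
            self.take()
            rhs = self.term()   # always parse the right side
            value = value or rhs
        return value

    def term(self):
        # Next level up: factor ('and' factor)*
        value = self.factor()
        while self.peek() == "and":
            self.take()
            rhs = self.factor()
            value = value and rhs
        return value

    def factor(self):
        # Highest level: 'not' factor | '(' expr ')' | NAME
        tok = self.take()
        if tok == "not":
            return not self.factor()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok in (")", "and", "or"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok in self.facts


def evaluate(text, facts):
    """Evaluate a rule; a name is true if it is in the facts set."""
    parser = Parser(tokenize(text), facts)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return value
```

Here evaluate("a or b and c", {"a"}) is True because 'and' binds tighter than 'or'. Every additional precedence level means another method shaped like term(), which is where the tedium comes from; dropping term() and handling both operators in one loop gives the no-precedence, left to right version.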

The question of whether you need operator precedence is partly one of language design and partly one of how your little language is going to be used in practice. If you have multiple operators and people are going to intermix them, writing out some sample pieces of your little language may rapidly show you that you need operator precedence. Alternately, you may find yourself struggling to find a situation where it's natural to write an expression that requires operator precedence, or at least that requires sophisticated algorithms for it.

Another thing that makes operator precedence easier in little languages is not having very many operators (this is especially the case in recursive descent parsers). The cool algorithms for operator precedence mostly come up if you want to have a lot of operators with a lot of precedence levels; if you're happy to just have a couple of operators, life is rather easier.

(Now that I've looked at parts of my past work, there's a little bit more operator precedence in some of it than I was expecting, although it's all done with basic recursive descent parsing.)

PS: Another thing that happens with operators in the kind of little languages that I wind up creating is that the operators are things like 'and', 'or', 'except', or set intersection and difference, where the precedence I should assign to them isn't particularly obvious. Once again, writing out sample expressions, rules, and so on can clarify how you're likely to want to use your thing in practice.

Our options in remote server installation and management

By: cks

For reasons outside of the scope of this entry, we have an increasing number of servers in an inconvenient location (I called it 'offsite' but that's not quite accurate). Since these servers run Ubuntu LTS, they're going to need to be reinstalled with new versions every so often, starting this summer (as 26.04 comes out), and we really don't want to do that in person, so we've been thinking about our options.

The best option would be to only have servers with full support for remote management through their BMC. By full support, I mean full 'KVM over IP' support for remote access as well as remote media, so that we could continue to use our regular install processes, exactly as if we were physically present to plug in a bootable USB stick and so on. Unfortunately lots of our servers either don't have a BMC at all or have a BMC with restricted and limited features (because we haven't bought an expensive license, for example).

The cheapest solution that is the most work to add and implement is network booting (and this only handles our reinstall issue). Assuming that all of our servers can reliably boot from the network (which isn't a sure thing), I believe that we could set up an environment that booted into our install setup, then had us SSH in to the Ubuntu server installer, where we could handle things more or less as if we were booting and installing locally. In a UEFI environment, in theory we can reliably switch boot entries around to do things like a one-shot network boot. Another nearby group does this for all of their servers (running Debian), so it's definitely possible to handle things this way. This option is arguably the technically correct way to handle installs and reinstalls, but will take much more staff time to set up.

The most general and easy to implement solution is a modern external 'KVM over IP' system. These plug into the video and USB ports of your server or servers, then let you access everything over IP, usually through an embedded web server. Since they use USB for their keyboard and mouse, it's easy for good modern ones to also support 'remote media', presenting either a USB disk or a USB DVD drive to the host. Often you can either upload your image (or images) to the internal storage on the KVM over IP system or stream it from somewhere else. For various reasons this isn't as good as a fully capable BMC, but you can get rack-focused multi-server systems that will let you connect one head unit to 8, 16, or more servers, as well as smaller scale single system units (including FOSS-based ones that can do useful tricks like run WireGuard).

(Most of the small scale systems only support HDMI. The rack systems often support VGA and DVI as well, which is good, because lots of our servers are still VGA. You can get converter dongles but it's better to have fewer pieces of hardware.)

For our purposes we'd like a KVM over IP system that supports more than one server at once but we don't need too many. Our major use is reinstalls and other troubleshooting, and we don't do too many of those at once, so it's okay if we have to walk over to the datacenter location to shuffle which eight servers in a rack the KVM over IP system is currently connected to. However, for reasons outside of the scope of this entry it would be very useful if the KVM over IP system was rack-mountable (in a switch style two-post fashion would be fine).

The external KVM over IP option is more expensive in direct up front costs than building a network booting environment, but it requires a lot less staff time to build and maintain. On the other hand, at a university staff time is often considered a sunk cost that's ignored, so we may wind up with network booting even though we could probably get an 8-server KVM over IP solution for a cost that works out to only a few days' time of a single system administrator.

(We could get a basic, single server, non-rack-mount, HDMI only KVM over IP box for trivial amounts of money, but that doesn't work as well in this situation.)

(A bit of me would enjoy the challenge of designing and building a network boot install environment, but the rest of me is looking at the amount of regular work we have and the fact that this is a repeatedly solved problem and feeling unenthused. Someone out there may even have written up how to go from having an Ubuntu server installer image to network-booting it on some machine, but good luck finding that writeup on today's Internet.)

Universities, email, and the issues of running things in house

By: cks

One of the things that has happened over the past N years, for some value of N, is that a lot of universities have outsourced their email to one of the big providers of this (Microsoft and Google are the two most common). Email is far from the only thing that universities have outsourced; for example, most universities don't run their entire authentication stack, because push based MFA is typically only available through vendors. People online regularly decry this outsourcing, especially for email, and say that universities should bring all of these things back in house the way they used to be, especially in today's time of increasing geopolitical tensions. Another thing that people say is that universities shouldn't depend on clouds as much as they sometimes do, especially the major clouds.

I don't disagree with these people, but at the same time I'm a realist about what's being asked for and what is involved. Universities did not wake up one day and decide to give large amounts of money to vendors for dubious reasons. Universities (such as mine) carefully studied what it would cost to continue their existing in house systems (in both hardware and staffing) and what they would get from it, and compared that to what it would cost and what they would get from the big vendors. Then they decided that one of the big vendors was a better option, and this decision is not wrong. As I wrote a long time ago, the big providers are simply better at this than a university can be, and it's not just email, it's also things like web CMSes and identity providers.

If you're asking universities to move things back in house, what you're asking them to do is to spend more money and have a worse experience, with fewer features and more annoyances. There are arguments for doing this anyway, on things like principle and risk mitigation, but you have to explicitly and honestly make this case. Potentially you have to make this case to people well beyond the universities themselves; if it's going to cost a significant amount of money to bring these things back in house, someone is going to have to pay for the staff and the computer hardware and so on.

(To put it bluntly, if the appropriate level of government isn't on board with this shift, it's almost certainly not a sustainable university policy even if the people in the university are on board with it.)

Since you're making this case in a university, it's not enough to convince the senior leaders. You really need to convince a large body of professors to at least go along with it, and ideally to actively advocate for it. If professors are not on board with this, if the overall culture of the university doesn't shift to considering it important to use in house things, what will happen is that people will quietly outsource things all over again, individually or in small groups. People will forward their email to personal addresses at big providers, professors will spend grant funds on clouds, administrative staff will quietly ignore the FOSS office software you'd like them to use and use the big name stuff that they're already familiar with and everyone else uses, people will conduct their discussions on the SaaS discussion forum you like least, and so on.

I'm convinced that such a shift to bring things back in house can be done (partly because some governments are doing it for government departments). But it's not going to be easy or fast. If you're serious about this, you have a lot of work and a long slog ahead of you to organize, advocate relentlessly, slowly bring people on to your side, and build consensus.

Having an inventory of anything is a non-trivial thing

By: cks

Over on the Fediverse I indulged in some snark:

Network inventory hot and grumpy take: Yep, it's not great that sysadmins and network people don't necessarily have a hardware and network inventory, unlike modern software development where famously everyone knows exactly what their entire dependency tree is and why it's there and has full trust in it staying that way.

(That is sarcasm.)

Let's get this out of the way right at the start: inventories are hard. I don't just mean network inventories or machine inventories or software inventories or dependency inventories. I mean any and all inventories, everywhere. For example, some real businesses periodically take a day or two off from doing business in order to check and reconcile their inventory with actual physical reality. It's ordinary to have a business's website say they have something in stock at a location, but when you go to the location, the people there can only shrug and tell you they have no idea where the theoretically in-stock item is, if it even exists.

(I can also assure you that an inventory of other physical items, even very important ones like keys, can become completely hopeless. One reason lots of people like reprogrammable electronic locks is that you can make your inventory be the authoritative state of the world. Of course, this will also lead to you discovering ways in which your inventory did not reflect reality, as people turn up who should have access but aren't in your lock inventory.)

One reason that all inventories are hard is that they're an attempt to keep two (or more) things in sync with each other, those being the inventory itself and the physical or software reality. Not coincidentally, in our field the most accurate inventories tend to be the ones that are built on self-reporting. Unfortunately there is only so much information that can be accurately self-reported. For example, a machine intrinsically knows that it exists and has certain hardware and software states, but it doesn't intrinsically know why it exists. If you try to make a machine 'self report' why it exists, this is generally going to be the machine echoing back to you something that you told it earlier.

This also relies on being able to get a self report from machines or whatever else is of interest. A machine or a piece of software or whatever that doesn't generate a self report is mostly invisible. Generally self reporting is something that has to be added to machines, software, and other things of interest, and if this isn't complete, that creates gaps in a self reported inventory. You can fill these gaps in the inventory by hand, but then you're trying to keep two things in sync with each other.

The less you can trust self reporting, the harder inventories get. We see this in the perpetual struggle of default deny firewalls, which can be seen as an inventory of allowed network traffic except that we can't allow things to self-report that they should be allowed. This creates a burden of inventory maintenance in the form of firewall rule updates (which is often made more annoying by organizational structure, where you can't update the 'inventory' yourself but have to wait for other people to do it before you can do things).

Ultimately, maintaining an inventory takes work. If you want that work to happen, you must budget time for that work and you must make that work rewarded. If your organization's structure of rewards and demerits makes it clear that maintaining an inventory is not as important as other things, well, you will get what you'd expect.

(Locally, we do budget time to maintain several sorts of inventories, but at the same time many of them are imperfect. Partly there is a trade off between the amount of time spent maintaining inventories and their accuracy, and partly people make mistakes, which is another reason why things self reporting themselves is better if you can manage it.)

Vim and 'forward delete' (in modern terminal programs)

By: cks

On the Fediverse, I had a learning experience:

Another what the heck moment in Fedora 43. In Gnome-terminal (only, not xterm), hitting 'Delete' in vim insert mode no longer deletes characters to the left of the cursor, only characters to the right. Delete is generating ^? in both gnome-terminal and xterm, and Delete works to delete characters in vim in g-t on the ':' command prompt.

Whatever vim / gnome-terminal combined stupidity this is, I want it gone. Now.

Actually I got the name of the key wrong. The key I was hitting is BackSpace (in X keysym terminology). I thought of it as Delete because that's what it generates, but the real Delete key is another key, and this turns out to be relevant. What I described happening is what I think is normally called 'forward delete', as opposed to 'backward delete', the normal BackSpace behavior (and what I want).

I will skip ahead to the fix: for historical reasons, I had Gnome-terminal set to generate ASCII DEL for both the BackSpace and the Delete keys (it's in a per-profile 'Compatibility' tab). The modern proper setting for Delete is 'escape code'. Setting that cured the problem, but how we got here is the interesting bit.

Unix has long had both a stty setting for 'what is your backspace character' and also a termcap/terminfo parameter for 'what does the backspace key generate' (in terminfo, this is 'kbs'). Things such as readline, vim and GNU Emacs can use those to determine the character sequence they should recognize as backspace, or they can ignore both settings and use hard-coded values. This has historically been important because there used to be a great split of what this key generated on serial terminals, and this split propagated into X and people's X configurations.

(Interestingly, terminfo databases aren't consistent across systems about what BackSpace is expected to generate in xterm. Linux and OpenBSD terminfo appear to expect it to generate ^?, but FreeBSD expects ^H. You can check with 'infocmp | fgrep kbs'. Vim appears to take its backspace character from stty, not terminfo, which is the correct approach.)

Unix has never had a stty setting for 'forward delete', but it did soon get a terminfo and termcap distinction between the 'backspace' and 'delete' keys (well, between what character sequences each sends). In terminfo, what the delete key sends is 'kdch1', and I don't know when it appeared; in termcap, it is 'kD', which appeared no later than 4.3 BSD in 1985, per the 4.3 BSD termcap(5) manual page. If your program wants to support forward delete at all (which vim does) and you can't see the physical keys being hit, you have to use 'kdch1'. Well, sort of. You're theoretically supposed to use kdch1, but in practice kdch1 doesn't necessarily correspond to the reality of your terminal program and its current settings.

(Termcap was invented first, in BSD Unix. Terminfo came later and apparently was first generally available in System V Release 2, in 1984. It's possible that terminfo had 'kdch1' from the start, since I believe that by 1984 there were Unix machines with 'full' keyboards with both BackSpace and Delete. Plus the DEC VT-100 serial terminal also had both Backspace and Delete keys, and it was introduced in 1978.)

Vim handles keyboard keys through an internal notion of terminal capabilities and key sequences, and for forward delete the internal capability is called t_kD. Vim can get its t_kD value from a number of places; it can get the value from the regular terminfo kdch1 value, it can derive it from your regular backspace value (as is done by :fixdel), or it can ask your terminal program what the various physical keys generate (this is controlled by the xtermcodes option). When vim does the last, it will get whatever your terminal program is reporting about its current settings, not whatever official settings (for 'kdch1' and other things) are published in the system's terminfo database. This is useful when these settings are controlled through preferences, instead of being fixed values.

(Specifically, vim and other programs will use xterm's XTGETTCAP request to read out various live terminfo settings. This is why it matters that there is a terminfo thing for the delete key as separate from the backspace key.)

Vim's xterm-codes behavior officially happens on 'xterm patchlevel 141 or higher'. In practice a lot of other terminal emulators imitate xterm here, and in particular recent enough versions of gnome-terminal do. If you're curious to see what your terminal program is reporting for itself, you can start vim and use ':echo v:termresponse' (or you can run 'tput RV; cat >/dev/null' at your shell command line, then Ctrl-C it once the terminal's response has echoed). Currently xterm reports its patchlevel and gnome-terminal reports some sort of VTE library version, which on modern versions of Gnome-terminal is (much) larger than '141' (mine reports '8203'), and triggers vim's behavior of asking the terminal program for its actual keyboard mapping.

(Technically 'RV' is send device attributes, not XTVERSION. You can find programs that query XTVERSION specifically, such as xtver.)
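To make the version check concrete, here is a minimal Python sketch of pulling the version field out of the kind of reply that ends up in v:termresponse. It assumes the reply has the usual xterm 'ESC [ > Pp ; Pv ; Pc c' shape; the function names are made up for illustration, the '141' threshold is the one quoted from vim's documentation above, and it ignores any special cases vim itself may apply for specific terminals.

```python
import re

# Assumed reply shape: ESC [ > Pp ; Pv ; Pc c, where Pv is the 'version'.
_DA_REPLY = re.compile(r"\x1b\[>(\d+);(\d+);(\d+)c")

# 'xterm patchlevel 141 or higher', per vim's documentation.
XTERM_CODES_MIN_VERSION = 141


def reported_version(reply):
    """Return the version field (Pv) of a reply, or None if it doesn't parse."""
    match = _DA_REPLY.search(reply)
    return int(match.group(2)) if match else None


def would_request_key_codes(reply):
    """Guess whether vim would go on to ask the terminal for its key codes."""
    version = reported_version(reply)
    return version is not None and version >= XTERM_CODES_MIN_VERSION
```

With the version numbers mentioned in this entry, a reply carrying '8203' passes the check, while ones carrying '95' or '115' don't (the other two fields in a reply don't matter for this).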

When Gnome-terminal reported its keyboard mapping to vim, it apparently reported that both my BackSpace and Delete keys generate ASCII DEL, which was true (at the time). When vim received this report, it set both t_kb and t_kD to ASCII DEL (this is visible with eg ':set t_kb'), and then apparently vim decides that the <Del> version should take priority over the <BS> version so that when I hit a key that generates ASCII DEL, vim will do forward delete instead of backward delete.

(Actual xterm apparently reports something different to vim. Although both BackSpace and Delete generate ASCII DEL in my xterm setup, vim reports that t_kD is '^[[3;*~', the normal escape sequence for it that gnome-terminal will also generate when it's set that way. This means I have no access to forward delete in vim in xterm, but that's okay with me; I basically never use it.)

Gnome-terminal support for xterm's XTGETTCAP stuff was apparently added in mid 2025 through VTE in this feature request and this commit. Fedora 42 shipped with a version of gnome-terminal and VTE before this change, and the Fedora 43 versions are afterward, so now vim can actually find out what the current key mappings are and trigger this behavior.

There are at least two ways to fix this through vim in your .vimrc, and also two ways that don't work. In working ways, if you unset t_RV ('set t_RV='), vim never makes the version query and never goes on to ask for keyboard stuff. If you disable xtermcodes ('set noxtermcodes'), vim will also never ask for the keyboard mapping, but now it knows the nominal xterm version number (which isn't necessarily very useful, given that different projects report wildly different numbers).

The two ways that don't work in your .vimrc are unsetting t_kD and using ':fixdel', although both of them work after vim has finished starting. I assume that both of them don't work because their effects are being overridden when vim asks the terminal program for its key bindings. There may be vim magic that you can use to get around this, but it's better to change your (my) historical gnome-terminal profile settings so that all profiles have their 'Delete key generates' setting in the Compatibility tab set to 'Escape sequence'.

(There is probably some way to set this preference for all profiles through the command line but I just did it by hand through the GUI.)

PS: My view is that vim is the party with the bug. Given the behavior of :fixdel, I think that if vim detects that t_kb and t_kD are both set to ASCII DEL in the terminal response, it should either unset t_kD or make it Ctrl-H.

PPS: Since I checked, urxvt (aka rxvt-unicode) does respond to the xterm RV sequence but reports a version number of '95' as of version 9.31, which is far too low to trigger vim asking for termcap stuff (and I don't know if urxvt supports that). The Fedora 43 konsole reports a version number of 115 (for konsole 25.12.3), and I will leave it up to interested parties to investigate what other terminal programs report.

Apache 2.4, ETag values, and (HTTP) response compression

By: cks

One of the things that Apache and other web servers have been able to do for a long time is to compress responses when the requesting agent indicates that it supports this. Accepting compressed responses is so common that not doing so is potentially a bad sign, although a distressing number of syndication feed fetchers don't request (or accept) compressed responses. Apache is sophisticated enough that it can compress output on the fly and do it for unpredictable sources of dynamic content, such as CGIs and Django web applications (and requests it acts as a reverse proxy for, as far as I know).

Another thing that the web has is the ETag header. An ETag header is supposed to be a unique identifier for a specific version of a 'resource', ie a URL. The place I normally think of ETags being used is in conditional GETs, but it also has a lesser appreciated (by me) role in HTTP caching, and as I understand it, that creates a little problem.

An opportunistic cache is allowed to use the same ETag and If-None-Match headers for cache validation. When an ETag value is only used by the origin server for conditional GET, we generally would prefer that the ETag value not vary based on the compression. However, when an intermediate cache uses an ETag for validation, it's apparently more convenient if the ETag is specific to the compression. As a result, RFC 9110's specification for ETag specifically requires that the ETag vary based on the response compression, not just its contents.

In Apache 2.2, Apache ignored this requirement (at least by default). In particular, it ignored this requirement if you provided the ETag in dynamically generated content, such as CGI output. Apache 2.2 would give your ETag to everyone regardless of the compression it did, and then everyone would make the same If-None-Match query to you and you'd be happy because the ETag you (re)generated was matching their If-None-Match and so lots of people were making, for example, conditional syndication feed fetches.

In Apache 2.4, Apache apparently decided that this was no good and it needed to do better. In ETag values you provide (and at least sometimes ETag values it generates), Apache 2.4 sticks on a suffix, such as '-gzip', to make them unique to the Apache-chosen compression. People who receive these altered ETag values then dutifully copy them into their If-None-Match header, which Apache 2.4 passes back unaltered to your CGI or other web application, and then if you're unaware of this you will compare their modified value to your unmodified value and conclude that almost no one is making valid conditional requests any more (for some reason, starting when you upgraded to Apache 2.4).

This behavior is actually something you can change if you want to, through the mod_deflate directive DeflateAlterETag and the mod_brotli directive BrotliAlterETag. But it's more correct and probably better in the long run to adjust your CGI or web application code to deal with these altered If-None-Match values, although it would be nice if Apache did it for you somehow. Since I looked it up in relatively current Apache 2.4 source code, the two ETag suffixes you're likely to see in the wild are '-gzip' and '-br'.
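As an illustration of what 'dealing with it' can look like in a CGI or web application, here is a minimal Python sketch that strips the two suffixes before comparing. The function names are hypothetical, and it assumes that Apache puts the suffix inside the closing quote of the entity tag (so '"abc"' becomes '"abc-gzip"') and that your own ETag values never legitimately end in '-gzip' or '-br'.

```python
# The two suffixes mentioned above as likely to be seen in the wild.
COMPRESSION_SUFFIXES = ("-gzip", "-br")


def strip_compression_suffix(etag):
    """Undo the suffix on one entity tag, e.g. '"abc-gzip"' -> '"abc"'."""
    weak = etag.startswith("W/")
    body = etag[2:] if weak else etag
    if len(body) >= 2 and body.startswith('"') and body.endswith('"'):
        inner = body[1:-1]
        for suffix in COMPRESSION_SUFFIXES:
            if inner.endswith(suffix):
                inner = inner[: -len(suffix)]
                break
        body = '"' + inner + '"'
    return ("W/" if weak else "") + body


def _opaque(tag):
    """Drop a weak validator prefix, since If-None-Match compares weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def if_none_match_matches(header, our_etag):
    """True if any tag in an If-None-Match header matches our unmodified ETag."""
    if header.strip() == "*":
        return True
    # Simplification: split on commas, which breaks on tags containing commas.
    tags = [strip_compression_suffix(t.strip()) for t in header.split(",")]
    return _opaque(our_etag) in {_opaque(t) for t in tags if t}
```

If this returns true, the application can answer with a 304 as it did before the upgrade; Apache will keep adding its suffix to whatever ETag you send back.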

I'm now using nftables for (new) static rulesets

By: cks

Over on the Fediverse, I said:

I feel I've now written enough Linux nftables configurations that I've come to like it. It's a more pf-like experience than iptables, that's for sure (and that's a good thing when you're writing a coherent ruleset instead of manipulating things on the fly).

I've had to write a few static IP filtering rulesets recently (on Ubuntu), and in each case I immediately reached for nftables and enjoyed the experience. The nftables documentation isn't what I consider great but I can navigate through it and get things done, and I even managed to get NAT working on a recent machine. I'm now mostly considering my iptables knowledge to be a legacy thing that I'll expect to use less and less in the future, although I'm not going to go out and convert iptables rulesets to nftables rulesets.

(Partly this is a conservation of attention thing. Both iptables and nftables have a lot of dim corners that I have to remember if I'm doing anything complicated with them, and I only have so much of a brain.)

One of the ways that nftables is nicer for me is that the natural way to write a nftables ruleset is to edit /etc/nftables.conf (or some other file). This lets you (me) see all of your rules in one place, think about all of them before you try to use them, revise them, and so on. You can even pre-write a nftables.conf elsewhere (in your home directory or whatever), and it's natural to put comments in. Nftables also has an acceptably PF-like concept of symbolic variables and 'anonymous sets' that can be used to write compact rules in straightforward cases in your nftables.conf; as far as I know there's no equivalent of this in iptables.

(In iptables you can use actual sets that you define and populate, or you can write shell scripts with 'for' loops and so on, but neither of these are entirely fun and as far as I know there's no great way to populate sets from nicely formatted files.)

However, this is only for whole, static rulesets. As I expected before, iptables is going to stay what I use if I need to add and remove rules on the fly, for example to block access to a service on startup or to add and remove some rules as network interfaces come and go. I know that you can do on the fly rule changes with nftables (and many of the nft examples in the manual page are of on the fly changes), but this is an area of nftables that I haven't explored and don't really want to. Unless I need to flip back and forth between two (or more) entire sets of rules, I'm going to keep using iptables for on the fly stuff.

(If I'm moving between several rulesets, 'nft -f /etc/some-file' is the easy way to flush and reload a coherent set of rules all at once, and I can write each ruleset as a coherent thing all in one place with helpful comments and so on.)

This is also only for new rulesets. Even with my new fondness for nftables, I'm not likely to rewrite existing, stable collections of iptables rules into nftables rules even if they can be expressed as a static collection of things. The one case where I can imagine doing a conversion is if I need to change existing iptables rules around substantially and rewriting them as nftables rules is easier than recovering iptables stuff that I may have forgotten by then.

PS: In fact the /etc/nftables.conf experience is sufficiently like the BSD pf experience that it fooled my mind recently. When I was working on the rules for the system with NAT, I kept adding filtering rules for the host to the 'forward' chain and then being confused when they didn't work. BSD pf doesn't have an input versus forward distinction, so my mind drifted into 'I'll just put the host rules here along with the forwarding rules'.

Some quick notes to myself on nftables 'symbolic variables'

By: cks

Nftables is the current generation Linux firewall rule system, supplanting iptables (which supplanted ipchains). As covered in the nft manual page, nftables has the concept of 'symbolic variables'. Since I'm used to BSD PF, I will crudely describe these as a combination of some parts of pf tables and PF macros. I personally feel that the nft manual page doesn't do a good job of documenting what's possible in these, so here are some notes.

The simple case is simple values:

define tundev = "tun0";
define outdev = "eno1";
define natip = 128.100.x.y
define tunnet = 172.29.0.0/16

(It turns out that the ';' here is decorative and I put it in out of superstition, judging from actually reading the "Lexical Conventions" section.)

I'm not sure of the rules of when you have to quote things and when you don't. As covered in the manual page, you use these symbolic values in the relevant nftables bits, for example a SNAT rule:

ip saddr $tunnet oifname $outdev counter snat to $natip;

Nftables also has the concept of 'anonymous sets', which are written in the obvious PF-like syntax of '{ ..., ..., ... }'. You can use symbolic variables to define anonymous sets, and if you do they can span multiple lines and have embedded comments, and of course you can have multiple elements on one line (not shown in this example):

define allowed_udp_ports = {
        # DNS
        53,
        # NTP
        123,
        # for HTTP/3 aka QUIC
        443
}

(I suspect that symbolic values written directly in nftables rules can also span multiple lines and have embedded comments, but I haven't checked.)

A comma on the last entry is optional. Unlike in BSD PF, elements must be separated by commas.

You can use this to define port numbers, IP address ranges, and no doubt other things. However, I don't know how efficient it is if you're defining large numbers of things, and of course you can't update your defined things without reloading your entire ruleset. If you need either of these features, you're going to have to figure out named nftables sets or maps.

There's no direct equivalent of the BSD PF syntax for defining a table from a file with eg 'table <SSH_IN> persist file "/etc/pf/SSH-ALLOWED"'. The closest you can come is to define an anonymous set in a file you 'include' in your nftables rules.

(I believe this is also the best you can do for loading named sets and maps from files.)
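If you want to keep the elements in a plain 'one per line' file anyway, one option is to generate the included file. Here is a minimal Python sketch of that; the function name and the input format (one element per line, '#' comments) are my assumptions, not anything nftables defines. It refuses to produce an empty set because I haven't checked whether nft accepts one.

```python
def render_nft_define(name, lines):
    """Turn 'one element per line' text into an nftables 'define NAME = { ... }'."""
    elements = []
    for raw in lines:
        # Drop '#' comments and surrounding whitespace; skip blank lines.
        entry = raw.split("#", 1)[0].strip()
        if entry:
            elements.append(entry)
    if not elements:
        raise ValueError("refusing to write an empty set for " + name)
    body = ",\n".join("        " + element for element in elements)
    return "define {} = {{\n{}\n}}\n".format(name, body)
```

You would write the result to some file and pull it in with an 'include' line in your nftables.conf before the rules that use the variable. This still only takes effect when the whole ruleset is reloaded.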

PS: Apparently there are also anonymous maps, to go with named ones.

Sidebar: Named sets in nftables

Since I just worked this out, well, found an example, here is how you write a set in your nftables.conf:

table inet filter {
    set allowed_tcp_ports {
       typeof tcp dport
       elements = { 22, 25, 80, 443 }
    }

    chain input {
       [...]
       meta iifname $outdev tcp dport @allowed_tcp_ports counter accept;
[...]

Now that I understand the use of 'typeof', I'll probably use it for all sets and maps rather than trying to look up the specific type involved (although nft can help with that with 'nft describe').

Systemd v258's 'systemctl -v restart' and its limitations

By: cks

If you've done much work with systemd services, you've probably gotten entirely used to the traditional dance of 'systemctl restart something; journalctl -f -u something' so you can see the shutdown and restart log messages of what you just theoretically restarted, assuming it's happy with life. In systemd v258, systemctl gained a new feature to help with this, systemctl -v. The help describes it reasonably well:

Display unit log output while executing unit operations.

(This means any unit operation; you can use it with 'systemctl stop', 'systemctl start', and 'systemctl reload' too.)

All of this is nice and I'm certainly going to enjoy using this feature on our future Ubuntu 26.04 machines and on my Fedora machines. However, it has an obvious limitation for 'restart', 'start', and 'reload' that in many cases is going to have me still using the journalctl stuff as well.

That limitation is right there in the description: 'while executing unit operations'. If you do 'systemctl -v restart something', systemctl stops following your service's log output the moment it considers your service to have started. In some services, this will be when the service has genuinely started and reported this to systemd, for example for a Type=notify service. In many others, for example 'Type=exec' services where you directly run some binary and it sits there doing things, systemd will consider the service started the moment your binary is running. Since systemd considers the service started, it will stop following the logs in 'systemctl -v restart'.

This is often not sufficient. Many services have a certain amount of post-exec work to do before they've genuinely started, such as loading configuration files, opening databases, initializing internal services, and so on. Some services can error out at this point, so that (as systemd sees it) they were successfully started but then immediately failed. Sometimes, the service itself intrinsically is only 'up' after it has talked to the outside world and established something, such as a DSL PPPoE link.

All of this isn't systemd's fault, but it means that 'systemctl -v restart' may only tell you the very early part of the story. And that's why for a lot of services I need to keep doing the 'journalctl' part too.
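If what you want is an automated version of 'did it stay up after the restart', one approach is to poll the unit for a little while afterward. Here is a minimal Python sketch; the function names, the number of checks, and the interval are all arbitrary choices of mine, and 'systemctl is-active' only tells you what systemd thinks, not whether the service has finished its own startup work.

```python
import subprocess
import time


def unit_is_active(unit):
    """Ask systemd whether it currently considers a unit active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def stayed_up(unit, checks=5, interval=2.0, probe=unit_is_active, sleep=time.sleep):
    """Poll a unit a few times; return False as soon as it is seen not active."""
    for _ in range(checks):
        sleep(interval)
        if not probe(unit):
            return False
    return True
```

This catches the 'started and then immediately failed' case, but it can't tell you why, so the logs are still where you end up looking.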

Users and session classes in Systemd v258 and later (and a gotcha)

By: cks

So I upgraded my home desktop from Fedora 42 to Fedora 43 and sound stopped working. Having your audio stop working is practically a rite of passage for Linux people, so I've been through the drill, but things rapidly turned weird when trying to restart sound daemons through 'systemctl --user restart ...' failed with systemd errors about not being able to contact the (systemd) user service manager.

Let me skip ahead and show you the culprit:

systemd-logind[2524]: New session '1' of user 'cks' with class 'user-light' and type 'tty'.

Establishing your user service manager when you log in is one of the jobs of pam_systemd. One of the things pam_systemd decides about your session is its class. In systemd v258 and later, one of the possible classes is 'user-light', for which systemd notes:

Similar to user, but sessions of this class will not pull in the user@.service(5) of the user, and thus possibly have no service manager of the user running.

(Emphasis mine.)

This 'possibly' is understated. What it means in practice is that a 'user-light' class session won't have a systemd user service manager running unless something else started it for you, for example another session that wasn't a 'user-light' one (because you only ever have one user service manager; it normally starts with your first session and exits after your last one). In turn, anything that runs as a systemd user service won't start and can't be started indirectly through, for example, systemd socket activation. And in modern Fedora, all of the sound infrastructure is handled as systemd user services (as is your user D-Bus session).

So how did we get here? Well, as the rest of the section notes:

If no session class is specified via either the PAM module option or via the $XDG_SESSION_CLASS environment variable, the class is automatically chosen, depending on various session parameters, such as the session type (if known), whether the session has a TTY or X11 display, and the user disposition.

(The 'user disposition' comes from systemd-userdbd and its JSON User Records. For normal /etc/passwd accounts, the user disposition is determined from their UID.)

The actual process pam_systemd follows is somewhat arcane. To simplify, all SSH logins are 'class=user', root is always 'user-early', and system users on the console are 'user-light'. So if you log in on the console (as I do, also) and you're considered a 'system user', you don't get a user service manager started automatically (and then things break).

Systemd is more or less hard coded to consider all UIDs up to SYS_UID_MAX in /etc/login.defs to be 'system users' (cf). On many machines, this will be all UIDs up to 999, and this number has been drifting upward over time. At various times in the past the first non-system UID and GID has been 200, and then later it was 500, so if you have logins created this long ago, systemd now considers them system users who get special handling. I have been using my Fedora desktops for a very long time, so even without even weirder things I would have fallen victim to this.

(Even on our servers, my UID is 915 and we have a significant number of people with UIDs under 1000. If pam_systemd ever stops forcing all SSH logins into class 'user', we're going to have a whole collection of problems. On my desktops, my 'natural' UID would be either 200 or 500, based on the GIDs that were created to go with it on my home and work desktops.)
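To make the simplified rules concrete, here is a minimal Python sketch of how a console login's class falls out of the UID. It follows my simplified description above, not pam_systemd's actual logic, and it reads SYS_UID_MAX out of login.defs text at run time purely for illustration (the function names and the 999 fallback are assumptions).

```python
def sys_uid_max(login_defs_text, default=999):
    """Find SYS_UID_MAX in the text of a login.defs file, with a fallback."""
    for line in login_defs_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "SYS_UID_MAX":
            return int(fields[1])
    return default


def console_session_class(uid, uid_max):
    """Simplified guess at the session class for a login on the console."""
    if uid == 0:
        return "user-early"
    if uid <= uid_max:
        # Considered a 'system user': no user service manager is pulled in.
        return "user-light"
    return "user"
```

With a SYS_UID_MAX of 999, UIDs of 200, 500, and 915 all come out as 'user-light' on the console, and only UIDs from 1000 up get class 'user'.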

Unfortunately there's no way to set a single account parameter in systemd-userdbd, so there's no way to keep using /etc/passwd but tag your historical, low-UID account to be a regular account. There's also no direct way to manipulate pam_systemd's hard coded class (re)mapping process; your only option is to completely override all class assignments with a 'class=' option on pam_systemd. This is made extra difficult on Fedora because (of course) pam_systemd is invoked in a number of generic PAM stacks such as 'system-auth', and you may not want every use of pam_systemd through them to force a 'user' class for all accounts in all situations.

It's possible to work around this with sufficiently complex PAM conditionals (also). Or I could make /etc/pam.d/login use a different version of system-auth that's customized for it, although that would force root logins into class 'user' instead of 'user-early' unless I engaged in other PAM hacks.

PS: Given how much breaks without a user service manager, it feels like either pam_systemd or the 'login' PAM stack should specifically make it so that everyone who logs in on a console tty has one, with all system UIDs being class 'user-early', not just root.

PPS: I won't be working around this by changing my local UID, however peculiar it is. Partly this is because I can't fix it by adopting the same UID as we have on our servers, which would let me usefully NFS mount my home directory from our fileservers on my work desktop; as mentioned, that UID is also under the current Fedora SYS_UID_MAX.

(You can't truly fix NFS UID mapping issues with NFS v4 without Kerberos.)

Sidebar: Why my work machine didn't experience this

One reason I was willing to impulsively upgrade my home desktop last night was that the upgrade to Fedora 43 had gone fine on my work desktop, and it certainly had no sound problems afterward. My console login on my work machine was still a 'user-light' session, but the reason it had a systemd user service manager was that one had been created earlier and was sticking around. To cut a long story short, on my work desktop I was set up as a loginctl 'linger' account (/var/lib/systemd/linger says this happened May 21st 2021). Such a 'linger' account creates a session at system boot, which creates a user service manager as a result, and that session and user service manager remain until system shutdown.

Regardless of how many times you log in, you only ever have one systemd user service manager. So once a user service manager is created for any reason, including the user service manager that's started at boot for a 'linger' account, your console 'user-light' logins will still get access to that user service manager, Pipewire and other things will start normally, sound will work, and you (I) won't notice anything different.

In theory I could work around this today by setting myself up as a 'loginctl linger' account on my home machine too, and skip any PAM changes. In practice, I'm reluctant to assume that pam_systemd will always create systemd user service managers for system UIDs that are set 'linger'. It strikes me as rather the kind of thing that might get optimized some day, much as 'user-light' was optimized into systemd v258 (cf, also, also).

Wondering about the typical retry times for email today

By: cks

Over on the Fediverse, I had a question:

To the sysadmin population of the Fediverse: do people have any numbers on how long common mail senders will retry sending mail if your MX is unreachable? Once upon a time people retried for many days, but my impression is that quite a few places now stop trying and bounce the email after quite short intervals, like a day.

(Boosts and practical experiences welcome, like "my MX was down for three days and I still got all that email sent from GMail".)

The context for my sudden curiosity is that there's a scheduled all-weekend, whole building power outage at the start of May for the building with our machine room. It seems likely that basically all of our systems will be down for roughly two and a half days, and longer if things go wrong, and this obviously includes our incoming email gateway.

As I mentioned, in the old days you could definitely expect mail systems to retry for more than a long weekend and so we wouldn't really worry about it. But I'm not sure about that in practice any more, hence my sudden curiosity. Based on replies to that post (and some additional research), common Unix MTA software still seems reasonably okay on retry durations; postfix and sendmail default to five days, while exim more or less defaults to four. RFC 5321 recommends four to five days in section 4.5.4.1, for what an RFC on SMTP mail is worth these days.

Unfortunately, what matters in practice is how the dominant sources of your email behave, and generally those aren't going to be people running normal Unix mailers in normal configurations. A lot of our email comes from GMail and Office 365, who are obviously using custom mail systems. Office365 covers email from both people at other universities or organizations that use Office 365 and people using the university's central email system to send email to people in my department. It's also possible that configurations vary between organizations using Office365.

There are also all of the people and organizations sending out newsletters, notifications, and so on through the various mailing list service providers, like Amazon SES. These organizations may well have shorter retry times than for individual human-generated email from GMail, Office365, and so on. Another category of email is activity notification emails from places like Github, which people may also want to (eventually) get.

(We have access to an alternate location with a different network and power setup, so we could deploy a backup MX machine to there. There are some potential drawbacks to that, but we may do it as a precaution.)

Finding out what your big RPMs are, in two different 'sizes'

By: cks

Suppose, not hypothetically, that you have an old Fedora system with a lot of packages installed and a 70 GByte root filesystem, which is now awkwardly small during system upgrades and so on. You would like to find out which of your roughly 7,500 packages are contributing the most to your space usage.

(The real solution is to move to a bigger pair of NVMe drives, but that involves various yak shaving and you want to upgrade to Fedora 43 today.)

The simple version of 'how big are your RPMs' is to ask rpm for the ordinary (binary) size of all of your installed binary RPMs:

rpm -qa --qf '%{SIZE:humaniec} %{N}-%{V}-%{R}.%{ARCH}\n' | sort -hr

This will tell you interesting things, like how the Fedora 43 version of wine-core-11.0-2.fc43.x86_64 is 1.3 GBytes all by itself. However, it's not necessarily the full answer for what is using up your disk space, because a single (source) package can create many binary packages (often these mostly get installed together and it's hard to split them apart in any useful way). For instance, on my work machine with the 70 GByte root partition, there are 263 'texlive' packages and 101 'perl' packages (and 66 'qemu' packages).

Often a more useful way to break down packages is by the total installed size for a particular source package. This is where I turn to my 'sumup' script, and also to 'numfmt', to get the following:

rpm -qa --qf '%{SIZE} %{SOURCERPM} %{N}-%{V}-%{R}.%{ARCH}\n' |
  sumup 2 1 | numfmt --format '%8.1f' --to iec 

This may reveal surprises that you didn't know. For example, my home desktop has 847 MBytes of packages derived from 'rocm-compilersupport', despite my home machine having no AMD GPU (it uses the integrated Intel GPU). These appear to be present as dependencies of Blender (based on what 'dnf remove' told me it wanted to do).

(It can also tell you that lots of binary packages derived from a single source package don't necessarily result in a lot of disk space being consumed. All of those 263 texlive packages amount to 289 MBytes, and those 101 Perl packages, 43 MBytes.)

I preserved the binary name, version, release, and architecture in the second command, even though it's not used, so that I can later copy and paste the 'rpm' command snippet to grep its output to find out all of the binary packages derived from a source package of interest. A smart approach to this would be to split this up into two commands:

rpm -qa --qf '%{SIZE} %{SOURCERPM} %{N}-%{V}-%{R}.%{ARCH}\n' >/tmp/foo
sumup 2 1 </tmp/foo | numfmt --format '%8.1f' --to iec

Putting the initial output in a file is useful because 'rpm -qa --qf ...' is not necessarily the fastest thing in the world, at least if you're asking it for the 'size' of RPMs. With the initial output saved in a file, I can just grep the file, which is going to be very fast.
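The 'grep the file' step can also be sketched in Python. This is a minimal sketch, assuming the saved file has the three whitespace-separated fields from the rpm command above (size in bytes, source RPM, binary package). The function names are hypothetical, and the iec() helper only approximates what numfmt prints.

```python
def iec(nbytes):
    """Format a byte count with power-of-1024 suffixes."""
    size = float(nbytes)
    for suffix in ("", "K", "M", "G", "T"):
        if size < 1024 or suffix == "T":
            return f"{size:.1f}{suffix}"
        size /= 1024

def binaries_for(lines, source_prefix):
    """Return (size, binary package) pairs whose source RPM name
    starts with source_prefix, biggest first."""
    found = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        size, srpm, binary = fields
        if srpm.startswith(source_prefix):
            found.append((int(size), binary))
    return sorted(found, reverse=True)
```

With the output saved in /tmp/foo, something like 'for size, name in binaries_for(open("/tmp/foo"), "texlive"): print(iec(size), name)' lists every binary package built from a texlive source package.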

PS: If your install of Fedora has been around for a while, this may also reveal various obsolete packages. I have llvm-libs packages that seem to go all the way back to Fedora 32. I probably don't need those any more, or at least I hope I don't. But cleaning up old RPMs from past Fedora releases is its own subject and doesn't at all fit in the margins of this entry.

Two little scripts: addup and sumup

By: cks

(Once again it's been a while since the last little script.)

Every so often I find myself in a situation where I have a bunch of lines with multiple columns and I want to either add up all of the numbers in one column (for example, to get total transfer volume from Apache log files) or add up all of the numbers in one column grouped by the value of a second column. This leads to two scripts, which I call 'addup' and 'sumup'.

Addup is a simple awk script that adds up all the values from some column:

#!/bin/sh
# add up column N
awk '{sum += $('$1') } END {print sum}'

(Looking at this now, I should use printf and specify a format to avoid scientific notation. A more sophisticated version would do things like allow you to set the column separator character(s) rather than just using the awk default of whitespace, but so far I haven't needed anything more.)
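A minimal sketch of what that more sophisticated version might look like, written in Python rather than awk. The function names and the 'sep' parameter are made up for illustration, and treating non-numeric fields as zero is an assumption chosen to mimic what awk does.

```python
def addup(lines, column, sep=None):
    """Sum the 1-based 'column' of each line.

    sep=None splits on runs of whitespace, like awk's default.
    """
    total = 0.0
    for line in lines:
        fields = line.split(sep)
        if len(fields) < column:
            continue  # short line: nothing to add
        try:
            total += float(fields[column - 1])
        except ValueError:
            pass  # non-numeric field: counts as zero
    return total

def format_total(total):
    """Format the sum without ever using scientific notation."""
    if total.is_integer():
        return str(int(total))
    return f"{total:.2f}"
```

For example, format_total(addup(open("access.log"), 10)) would add up the tenth field of a hypothetical log file and print a large total as plain digits.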

My version of sumup is more complicated than I've described, because it either counts how many times each value of a particular field occurs or computes the sum of another field for each value of that field. This sounds abstract, so let me make it more concrete. Suppose that you have a file of lines that look like:

300 thing1
800 thing2
900 thing1
100 thing3
[...]

Sumup can either tell you how many times each value of the second field occurs, or sum up the value of the first field for each of the values of the second field (giving you 1200 for thing1, 800 for thing2, and 100 for thing3 in this simple case).

The actual sumup that I currently use is a Python program, partly so that I can conveniently print output sorted by the breakdown field. However, my older awk-based version is:

#!/bin/sh
# sum up field $2, grouped by the value of field $1
# if no $2 is provided, it just counts how often each $1 value occurs.
(
if [ -n "$2" ]; then 
        awk '{sums[$'$1'] += $'$2'} END {for (i in sums) print sums[i], i}'
else
        awk '{sums[$'$1'] += 1} END {for (i in sums) print sums[i], i}'
fi
) | sort -nr

My memory is that this version works fine, although it's been a while since I used it.
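For comparison, here is a minimal Python sketch of the same behaviour. It is an illustration only, not the actual Python sumup mentioned above; the function name is hypothetical, and it assumes the value field is always numeric. The argument order matches the awk version (the field to group by comes first, the field to sum second).

```python
from collections import defaultdict

def sumup(lines, key_field, value_field=None):
    """Group lines by the 1-based key_field.

    With value_field, sum that field per key; without it, count
    how many lines have each key. Returns (total, key) pairs
    sorted by total, largest first.
    """
    sums = defaultdict(float)
    needed = max(key_field, value_field or 0)
    for line in lines:
        fields = line.split()
        if len(fields) < needed:
            continue  # skip lines that are too short
        key = fields[key_field - 1]
        if value_field is None:
            sums[key] += 1
        else:
            sums[key] += float(fields[value_field - 1])
    return sorted(((total, key) for key, total in sums.items()), reverse=True)

data = ["300 thing1", "800 thing2", "900 thing1", "100 thing3"]
for total, key in sumup(data, 2, 1):
    print(f"{total:g} {key}")
```

Leaving out the third argument, as in sumup(data, 2), counts how many times each thing appears instead of summing.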

If there are relatively widely available Unix utilities that will do these jobs, I'm not aware of them, although I wouldn't be surprised if they've emerged by now.

PS: Looking at the sort of things I do with these tools, I should also write an 'avgup', although that strays into the lands of statistical analysis where I may also want things like the median.

Web server ratelimits are a precaution to let me stop worrying

By: cks

These days, Wandering Thoughts has some hacked together HTTP request rate limits. They don't exist for strong technical reasons; my blog engine setup here can generally stand up to even fairly extreme traffic floods (through an extensive series of hacks). It's definitely possible to overwhelm Wandering Thoughts with a high enough request volume, and HTTP rate limits will certainly help with that, but that's not really why they exist. My HTTP rate limits exist for ultimately social reasons and because they let me stop worrying and stop caring about certain sorts of abuse.

As we all know by now here in 2026, abuse definitely happens, even if it isn't killing your web servers. There are things out there who think nothing of making thousands or tens of thousands of requests to your web server a day. Some of them are people running crawlers and other undesired things, and some of them are syndication feed fetchers with very fast polling intervals (which is why the first ratelimits I implemented were syndication feed rate limits). Usually the level of excess requests is moderate. Large abuse doesn't happen very often on typical sites like mine, but it does happen every so often.

The advantage of having HTTP request rate limiting, even in the fallible form I have on Wandering Thoughts, is that I don't have to worry or really care about it. I'll never wake up in the morning to discover that something has made tens of thousands of requests overnight, because these days all but the first few of those requests will have been choked off. I also don't have to be annoyed by badly behaved syndication feed readers and consider various things to maybe get them to behave better, because all that sort of excessive, antisocial behavior gets blocked now.

(I have had the experience of discovering thousands of requests from a single source in the past and not particularly enjoyed it, even if nothing noticed in terms of load and response time and so on.)

For me, HTTP ratelimits have become something that gives me peace of mind. I don't expect them to trigger very often (and generally they don't), but despite their infrequent activation I find them valuable and reassuring. They're a precaution against something that I hope is infrequent or, ideally, nonexistent.

(The corollary of this is that I don't regret the programming effort to add them to DWiki, the engine behind Wandering Thoughts, or even how moderately messy and hacked in the change is. For some changes you do care how often they get used and feel annoyed if they aren't used as often as you expected, but for me ratelimits aren't one of those.)

Using 'pkg' for everything on FreeBSD 15 has been nice

By: cks

Traditionally, the FreeBSD base system was managed through freebsd-update (also), which I would call primarily a patch-based system, while third party software was (usually) managed through pkg, a package manager. This was a quite traditional split, but it had some less than ideal aspects, and as of FreeBSD 15 you can choose to manage FreeBSD through pkg using what is called freebsd-base (which is also known as 'pkgbase'). If you're installing FreeBSD 15 from scratch, the installer will let you choose (and I believe it recommends the pkg based approach). If you upgrade from FreeBSD 14 to FreeBSD 15, there's a post-upgrade conversion process using pkgbasify (also, also).

(Technically you can use pkgbasify on FreeBSD 14, but pkgbase is officially experimental on FreeBSD 14.)

At this point I've been running a pkg based FreeBSD 15 system from more or less when FreeBSD 15 was released, first on a machine that I upgraded from 14 to 15 and then used pkgbasify on, and then on a second machine that I installed FreeBSD 15 on from scratch (partly because I wanted to move my test machine to less valuable hardware). In both cases, things have been fine. Over time the system has gone from FreeBSD 15.0 release to FreeBSD 15.0-p5, and each pkg-based update has been painless.

(Now that I look, the one thing that pkg-based updates haven't done is make ZFS snapshots. I honestly can't remember if freebsd-update did that for patch releases. I don't know how I feel about that, since I never made use of the ZFS snapshots that I believe got made in FreeBSD 14 for at least point upgrades, when going from 14.0 to 14.1 and so on.)

That FreeBSD's pkgbase is a bunch of separate packages means that those packages now have a range of versions from '15.0' through '15.0p5' (and now that I look, I have no '15.0p4' packages, which it turns out is because 15.0-p4 was a kernel update that was replaced by 15.0-p5's kernel updates). Fortunately 'freebsd-version' will let me more or less keep straight which patch level my current setup corresponds to.

We installed another FreeBSD 15 system recently and when we did, I recommended picking the pkg option. It's easier to keep everything straight, since we're already used to that sort of experience with Linux.

(I often had to look up the specific options I wanted to use with freebsd-update depending on what I was using it for this time around. Although I have no clear picture yet of how one goes from point release to point release in the pkgbase world (from 15.0 to a future 15.1), or even to the next major release (from 15.x to a future 16.0).)

I should use argument groups in Python's argparse module more than I do

By: cks

For reasons well outside the scope of this entry, the other day I looked at the --help output from one of my old Python programs. This particular program has a lot of options, but when I'd written it, I had used argparse argument groups to break up the large list of options into logical groups, starting with the most important and running down to the 'you should probably ignore these' ones. The result was far more readable than it would have been without the grouping.

(I want to call these 'option groups', because that's what I use them for.)

I've regularly used mutual exclusion groups in my recent Python programs, but for some reason I've fallen so much out of the habit of using argparse groups to break up walls of options that I'd forgotten they even existed until I was reminded by my own program's --help output. Now that I've been reminded, there are probably some programs that I should go back to and add some groups to.

(Most or all of my programs with a lot of options have a structure to them; it's not just a kitchen sink of a lot of things. Even if there is no real structure I can at least separate things into frequent, less frequent, and obscure options.)

Although you can't put either sort of group inside a mutual exclusion group, the argparse documentation is explicit that you can put a mutual exclusion group inside a regular argument group (a detail that I hadn't remembered until I reread my entry on this). Now that I look, one reason to do this is so that you can give the block of mutually exclusive options a title and description that actually tells people that they're mutually exclusive.
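A minimal sketch of this nesting, with made-up option names and help text. The regular argument group supplies the title and description, and the mutual exclusion group created inside it enforces the rule.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    # A regular argument group provides the title and description
    # that will show up in --help.
    fmt = parser.add_argument_group(
        "output format",
        "These options are mutually exclusive; pick at most one.",
    )
    # The mutual exclusion group is created on the argument group,
    # so its options are listed under that group's title.
    exclusive = fmt.add_mutually_exclusive_group()
    exclusive.add_argument("--json", action="store_true", help="write JSON")
    exclusive.add_argument("--text", action="store_true", help="write plain text")
    return parser
```

With this, 'example --json --text' is rejected with an error, while --help shows both options under 'output format' along with the description.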

(Maybe it would be nicer if a mutual exclusion group could have an optional title and description, but that's not the API we have.)

As the argparse documentation says, anything not in an argument group is put in the usual sections in your --help. Another way to put this is that the moment you put something in an argument group, it drops down to the bottom of your remaining regular --help output (with a blank line between the regular help and the argument groups). Then each argument group is separated from the next with a blank line, whether or not you gave them a title or a description.

My view is that this can make argument groups a relatively all or nothing thing. If you just want to put a blank line and a title to group your already properly ordered options into digestible chunks, the only ones you can leave out of a group are the first options. After you add the first group, everything afterward has to also be in a group or it will get reordered on you. Fortunately this is easy to do in the sort of code I tend to write to set up argparse stuff, but I'm going to have to remember it when I start adding argument groups to my programs.
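The reordering is easy to see in a small sketch (the option and group names here are invented). The last option is defined after the group, but because it isn't in any group it's listed in the regular options section, above the group.

```python
import argparse

parser = argparse.ArgumentParser(prog="example")
parser.add_argument("--first", help="defined before any group")
group = parser.add_argument_group("less common options")
group.add_argument("--grouped", help="defined inside the group")
# Defined last but not in a group, so in --help it joins the regular
# options section and is listed above 'less common options'.
parser.add_argument("--late", help="defined after the group")

help_text = parser.format_help()
print(help_text)
```

Moving --late into the group (or into a second group created after it) is what keeps the --help order matching the definition order.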

(Argparse --help prints options in the order you defined them, so it's conventional to put the most important options first and the least important ones last.)

Updating Ubuntu packages that you have local changes for with dgit

By: cks

Suppose, not entirely hypothetically, that you've made local changes to an Ubuntu package using dgit and now Ubuntu has come out with an update to that package that you want to switch to, with your local changes still on top. Back when I wrote about moving local changes to a new Ubuntu release with dgit, I wrote an appendix with a theory of how to do this, based on a conversation. Now that I've actually done this, I've discovered that there is a minor variation and I'm going to write it down explicitly (with additional notes because I forgot some things between then and now).

I'll assume we're starting from an existing dgit based repository with a full setup of local changes, including an updated debian/changelog. Our first step, for safety, is to make a branch to capture the current state of our repository. I suggest you name this branch after the current upstream package version that you're on top of, for example if the current upstream version you're adding local changes to can be summarized as 'ubuntu2.6':

git branch cslab-2.6

Making a branch allows you to use 'git diff cslab-2.6..' later to see exactly what changed between your versions. A useful thing to do here is to exclude the 'debian/' directory from diffs, which can be done with 'git diff cslab-2.6.. -- . :!debian', although your shell may require you to quote the '!' (cf).

Then we need to use dgit to fetch the upstream updates:

dgit fetch -d ubuntu

We need to use '-d ubuntu', at least in current versions of dgit, or 'dgit fetch' gets confused and fails. At this point we have the updated upstream in the remote tracking branch 'dgit/dgit/jammy,-security,-updates' but our local tree is still not updated.

(All of dgit's remote tracking branches start with 'dgit/dgit/', while all of its local branches start with just 'dgit/'. This is less than optimal for my clarity.)

Normally you would now rebase to shift your local changes on top of the new upstream, but we don't want to immediately do that. The problem is that our top commit is our own dgit-based change to debian/changelog, and we don't want to rebase that commit; instead we'll make a new version of it after we rebase our real local changes. So our first step is to discard our top commit:

git reset --hard HEAD~

(In my original theory I didn't realize we had to drop this commit before the rebase, not after, because otherwise things get confused. At a minimum, you wind up with debian/changelog out of order, and I don't know if dropping your HEAD commit after the rebase works right. It's possible you might get debian/changelog rebase conflicts as well, so I feel dropping your debian/changelog change before the rebase is cleaner.)

Now we can rebase, for which the simpler two-argument form does work (but not plain rebasing, or at least I didn't bother testing plain rebasing):

git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates

(If you are wondering how this command possibly works, as I was part way through writing this entry, note that the first branch is 'dgit/dgit/...', ie our remote tracking branch, and the second branch is 'dgit/...', our local branch with our changes on it.)

At this point we should have all of our local changes stacked on top of the upstream changes, but no debian/changelog entry for them that will bump the package version. We create that with:

gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

Then we can build with 'dpkg-buildpackage -uc -b', and afterward do 'git clean -xdf; git reset --hard' to reset your tree back to its pristine state.
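To have the whole sequence in one place, here is a minimal Python sketch that only builds the list of commands from this entry and prints them; it deliberately runs nothing. The function name and its parameters are hypothetical, and it assumes your branches follow the 'dgit/dgit/SUITE' and 'dgit/SUITE' naming described above.

```python
import shlex

def dgit_update_plan(suite, safety_branch, local_suffix, distro="ubuntu"):
    """Return the commands from this entry, in order, as argument lists.

    Nothing is executed; this only spells out the sequence.
    """
    remote = f"dgit/dgit/{suite}"  # dgit's remote tracking branch
    local = f"dgit/{suite}"        # the local branch with our changes
    return [
        ["git", "branch", safety_branch],
        ["dgit", "fetch", "-d", distro],
        # Drop our debian/changelog commit before rebasing.
        ["git", "reset", "--hard", "HEAD~"],
        ["git", "rebase", remote, local],
        ["gbp", "dch", "--since", remote, "--local", local_suffix,
         "--ignore-branch", "--commit"],
        ["dpkg-buildpackage", "-uc", "-b"],
    ]

for cmd in dgit_update_plan("jammy,-security,-updates", "cslab-2.6", ".cslab."):
    print(shlex.join(cmd))
```

I'd still run the real commands by hand, one at a time, since the rebase can stop for conflicts.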

(My view is that while you can prepare a source package for your work if you want to, the 'source' artifact you really want to save is your dgit VCS repository. This will be (much) less bulky when you clean it up to get rid of all of the stuff (to be polite) that dpkg-buildpackage leaves behind.)

Here in 2026, we're retaining old systems instead of discarding them

By: cks

I mentioned recently that at work, we're retaining old systems that we would have normally discarded. We're doing this for the obvious reason that new servers have become increasingly expensive, due to escalating prices of RAM (especially DDR5 RAM) and all forms of SSDs, especially as new servers might really require us to buy ones that support U.2 NVMe instead of SATA SSDs (because I'm not sure how available SATA SSDs are these days).

Our servers are generally fairly old anyways, so our retention takes two forms. The straightforward one is that we're likely going to slow down completely pushing old servers out of service. Instead, we'll keep them on the shelf in case we want test or low importance machines, and along with that we're probably going to be more careful about which generation of hardware we use for new machines. We've traditionally simply used the latest hardware any time we turn over a machine (for example, updating it to a new Ubuntu version), but this time around a bunch of those will reuse what we consider second generation hardware or even older hardware for machines where we don't care too much if they're down for a day or two.

The second form of retention is that we're sweeping up older hardware that other groups at the university are disposing of, when in the past we'd have passed on the offer or taken only a small number of machines. For example, we just inherited a bunch of Supermicro servers and Lenovo P330 desktops (both old enough that they use DDR4 RAM), and in the past we'd have taken only a few of each at most. These inherited servers are likely to be used as part of what we consider 'second generation' hardware, equivalent to Dell R340s and R240s (and perhaps somewhat better in practice), so we'll use them for somewhat less important machines but ones where we still actually care.

(A couple of the inherited servers have already been reused as test servers.)

The hardware we're inheriting is perfectly good hardware and it'll probably work reliably for years to come (and if not, we have a fair number of spares now). But it's hardware with several years of use and wear already on it, and there's nothing special about it that makes it significantly better than the sort of second generation hardware we already have. However, we're looking at a future where we may not be able to afford to get new general purpose 1U servers and our current server fleet is all we'll have for a few years, even as some of them break or increasingly age out. So we're hoarding what we can get, in case. Maybe we won't need them, but if we do need them and we pass them up now, we'll really regret it.

(The same logic applies to the desktops. We don't have any immediate, obvious use for them, but at the same time they're not something we could get a replacement for if we pass on them now. We'll probably put a number of them to use for things we might not have bothered with if we had to get new machines; for example, I may set one up as a backup for my vintage 2017 office desktop.)

I suspect that there will be more of this sort of retention university-wide, whether or not the retained hardware gets used in the end. We're not in a situation where we can assume a ready supply of fresh hardware, so we'd maybe better hold on to what we have if it still works.

How old our servers are (as of 2026)

By: cks

Back in 2022, I wrote about how old our servers were at the time, partly because they're older than you might expect, and today I want to update that with our current situation. My group handles the general departmental infrastructure for the research side of the department (the teaching side is a different group), and we've tended to keep servers for quite a while. Research groups are a different matter; they often have much more modern servers and turn them over much faster.

As in past installments, our normal servers remain Dell 1U servers. What we consider our current generation are Dell R350s, which it looks like we got about two years ago in 2024 (and are now out of production). We still have plenty of Dell R340s and R240s in production, which were our most recent generation in 2022. We still have some Dell R230s and even R210 IIs in production in less important server roles. We also have a fair number of Supermicro servers in production, of assorted ages and in assorted roles (including our fileservers and our giant login server, which is now somewhat old).

(On a casual look, the Dell R210 IIs are all for machines that we consider decidedly unimportant; they're still in service because we haven't had to touch them. Our current view is that R350s are for important servers, and R340s and R240s are acceptable for less important ones.)

In a change from 2022, we turned over the hardware for our fileservers somewhat recently, 'modernizing' all of our ZFS filesystems in the process. The current fileservers have 512 GBytes of RAM in each, so I expect that we'll run this hardware for more than five years unless prices drop drastically back to what they were when we could afford to get a half-dozen machines with a combined multiple terabytes of (DDR5) RAM.

(Today, a single machine with 128 GBytes of DDR5 RAM and some U.2 NVMe drives came out far more expensive than we hoped (and the prices forced us to lower the amount of RAM we were targeting).)

Our SLURM cluster is quite a mix of machines. We have both CPU-focused and GPU-focused machines, and on both sides there's a lot of hand-built machines stuffed into rack cases. On the GPU side, the vendor servers are mostly Dell 3930s; on the CPU side, they're mostly Supermicro servers. A significant number of these servers are relatively old by now; the 3930s appear to date from 2019, for example. We have updated the GPUs somewhat but we mostly haven't bothered to update the servers otherwise, as we assume people mostly want GPU computation in GPU SLURM nodes. Even the CPU nodes are not necessarily the most modern; half of them (still) have Threadripper 2990WX CPUs (launched in 2018, and hand built into the same systems as in 2022). With RAM prices being the way they are, it's unlikely that we'll replace these CPU nodes with anything more recent in the near future.

With current hardware prices being what they are (and current and future likely funding levels), I don't think we're likely to get a new generation of 1U servers in the moderate future. We have one particular important server getting a hardware refresh soon, but apart from that we'll run servers on the hardware we have available today. This may mean we have to accept more hardware failures than usual (our usual amount of server hardware failures is roughly zero), but hopefully we'll have a big enough pool of old spare servers to deal with this.

(I expect us to reuse a lot more old servers than we traditionally have. For instance, our first generation of Linux ZFS fileservers date from 2018 but they've been completely reliable and they have a lot of disk bays and decent amounts of RAM. Surely we can find uses for that.)

PS: If I'm doing the math correctly, we have roughly 10 TBytes of DDR4 RAM of various sizes in machines that report DMI information to our metrics system, compared to roughly 6 TBytes of DDR5 RAM. That DDR5 RAM number is unlikely to go up by much any time soon; the DDR4 number probably will, for various reasons beyond the scope of this entry. This doesn't include our old fileserver hardware, which is currently turned off and not in service (and so not reporting DMI information about their decent amount of DDR4 RAM).

New old systems in the age of hardware shortages

By: cks

Recently I asked something on the Fediverse:

Lazyweb, if you were going to put together new DDR4-based desktop (because you already have the RAM and disks), what CPU would you use? Integrated graphics would probably be ideal because my needs are modest and that saves wrangling a GPU.

(Also I'm interested in your motherboard opinions, but the motherboard needs 2x M.2 and 2x to 4x SATA, which makes life harder. And maybe 4K@60Hz DisplayPort output, for integrated graphics)

If I was thinking of building a new desktop under normal circumstances, I would use all modern components (which is to say, current generation CPU, motherboard, RAM, and so on). But RAM is absurdly expensive these days, so building a new DDR5-based system with the same 64 GBytes of RAM that I currently have would cost over a thousand dollars Canadian just for the RAM. The only particularly feasible way to replace such an existing system today is to reuse as many components as possible, which means reusing my DDR4 RAM. In turn, this means that a lot of the rest of the system will be 'old'. By this I don't necessarily mean that it will have been manufactured a while ago (although it may have) but that its features and capabilities will be from a while back.

If you want an AMD CPU for your DDR4-based system, it will have to be an AM4 CPU and motherboard. I'm not sure how old good CPUs are for AM4, but the one you want may be as old as a 2022 CPU (Ryzen 5 5600; other more recent options don't seem to be as well regarded). Intel's 14th generation CPUs ("Raptor Lake") from late 2023 still support DDR4 with compatible motherboards, but at this point you're still looking at things launched two years or more ago, which at one point was an eternity in CPUs.

(It's still somewhat of an eternity in CPUs, especially AMD, because AMD has introduced support for various useful instructions since then. For instance, Go's latest garbage collector would like you to have AVX-512 support. Intel desktop CPUs appear to have no AVX-512 at all, though.)

Beyond CPU performance, older CPUs and often older motherboards also often mean that you have older PCIe standards, fewer PCIe lanes, fewer high speed USB ports, and so on. You're not going to get the latest PCIe from an older CPU and chipset. Then you may step down in other components as well (like GPUs and NVMe drives), depending on how long you expect to keep them, or opt to keep your current components if those are good enough.

My impression is that such 'new old systems' have usually been a relatively unusual thing in the PC market, and that historically people have upgraded to the current generation. This led to a steady increase in baseline capabilities over time as you could assume that desktop hardware would age out on a somewhat consistent basis. If people are buying new old systems and keeping old systems outright, that may significantly affect not just the progress of performance but also the diffusion of new features (such as AVX-512 support) into the CPU population.

The other aspect of this is, well, why bother upgrading to a new old system at all, instead of keeping your existing old old system? If your old system works, you may not get much from upgrading to a new old system. If your old system doesn't have enough performance or features, spending money on a new old system may not get you enough of an improvement to remove your problems (although it may mitigate them a bit). New old systems are effectively a temporary bridge and there's a limit to how much people are willing to spend on temporary bridges unless they have to. This also seems likely to slow down both the diffusion of nice new CPU features and the slow increase in general performance that you could assume.

(At work, the current situation has definitely caused us to start retaining machines that we would have discarded in the past, and in fact were planning to discard until quite recently.)

PS: One potentially useful thing you can get out of a new old system like this is access to newer features like PCIe bifurcation or decent UEFI firmware that your current system doesn't support or have.

Canonical's Netplan is hard to deal with in automation

By: cks

Suppose, not entirely hypothetically, that you've traditionally used /etc/resolv.conf on your Ubuntu servers but you're considering switching to systemd-resolved, partly for fast failover if your normal primary DNS server is unavailable and partly because it feels increasingly dangerous not to, since resolved is the normal configuration and what software is likely to expect. One of the ways that resolv.conf is nice is that you can set the configuration by simply copying a single file that isn't used for anything else. On Ubuntu, this is unfortunately not the case for systemd-resolved.

Canonical expects you to operate all of your Ubuntu server networking through Canonical Netplan. In reality, Netplan will render things down to a systemd-networkd configuration, which has some important effects and creates some limitations. Part of that rendered networkd configuration is your DNS resolution settings, and the natural effect of this is that they have to be associated with some interface, because that's the resolved model of the world. As a result, Netplan attaches DNS server information to specific network interfaces in your Netplan configuration. To change it, you must find the specific device name and then modify settings within it, and those settings are intermingled (in the same file) with settings you can't touch.

(Sometimes Netplan goes the other way, separating interface specific configuration out to a completely separate section.)

Netplan does not give you a way to do this; if anything, Netplan goes out of its way to not do so. For example, Netplan can dump its full or partial configuration, but it does so in YAML form with no option for JSON (which you could readily search through in a script with jq). However, if you want to modify the Netplan YAML without editing it by hand, 'netplan set' sometimes requires JSON as input. Lack of any good way to search or query Netplan's YAML matters because for things like DNS settings, you need to know the right interface name. Without support for this in Netplan, you wind up doing hacks to try to get the right interface name.

Netplan also doesn't provide you any good way to remove settings. The current Ubuntu 26.04 beta installer writes a Netplan configuration that locks your interfaces to specific MAC addresses:

  enp1s0:
    match:
      macaddress: "52:54:00:a5:d5:fb"
    [...]
    set-name: "enp1s0"

This is rather undesirable if you may someday swap network cards or transplant server disks from one chassis to another, so we would like to automatically take it out. Netplan provides no support for this; 'netplan set' can't be given a blank replacement, for example (and 'netplan set "network.ethernets.enp1s0.match={}"' doesn't do anything). If Netplan would give you all of the enp1s0 block in JSON format, maybe you could edit the JSON and replace the whole thing, but that's not available so far.

(For extra complication you also need to delete the set-name, which is only valid with a 'match:'.)
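As a rough illustration of the edit that 'netplan set' can't express, here is a minimal Python sketch that drops both 'match' and 'set-name' from every ethernet entry. It assumes the Netplan YAML has already been parsed into plain dicts by some YAML library (not shown, and not something Netplan provides); the function name and the 'dhcp4' setting standing in for the elided '[...]' are hypothetical.

```python
import copy


def unpin_ethernets(config):
    """Return a copy of a parsed Netplan config without MAC address pinning."""
    result = copy.deepcopy(config)
    ethernets = result.get("network", {}).get("ethernets", {})
    for settings in ethernets.values():
        if settings and "match" in settings:
            # set-name is only valid together with match, so both must go.
            settings.pop("match")
            settings.pop("set-name", None)
    return result


# Hypothetical parsed form of the installer's file; 'dhcp4' is a stand-in
# for the settings elided in the excerpt above.
installer_config = {
    "network": {
        "version": 2,
        "ethernets": {
            "enp1s0": {
                "match": {"macaddress": "52:54:00:a5:d5:fb"},
                "dhcp4": True,
                "set-name": "enp1s0",
            }
        },
    }
}

cleaned = unpin_ethernets(installer_config)
print(sorted(cleaned["network"]["ethernets"]["enp1s0"]))
```

This only works cleanly when the YAML key (here 'enp1s0') is already the real interface name, since without a 'match:' the key itself is what identifies the interface. Writing the result back out as YAML is a separate problem.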

Another effect of not being able to delete things in scripts is that you can't write scripts that move things out to a different Netplan .conf file that has only your settings for what you care about. If you could reliably get the right interface name and you could delete DNS settings from the file the installer wrote, you could fairly readily create a '/etc/netplan/60-resolv.conf' file that was something close to a drop-in /etc/resolv.conf. But as it is, you can't readily do that.
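For illustration, this is roughly what generating such a drop-in could look like if you already knew the interface name. It is a sketch under assumptions: the function name is hypothetical, the addresses are documentation placeholders, and it only builds the text of a file using Netplan's 'nameservers' keys ('addresses' and 'search'). It doesn't solve either of the hard parts (finding the interface name, and removing the installer's DNS settings).

```python
def render_dns_dropin(interface, servers, search=()):
    """Build the text of a Netplan file that only sets DNS for one interface."""

    def flow_list(items):
        # Quote every item so IPv6 addresses with colons stay plain strings.
        return "[%s]" % ", ".join('"%s"' % item for item in items)

    lines = [
        "network:",
        "  version: 2",
        "  ethernets:",
        "    %s:" % interface,
        "      nameservers:",
        "        addresses: %s" % flow_list(servers),
    ]
    if search:
        lines.append("        search: %s" % flow_list(search))
    return "\n".join(lines) + "\n"


dropin = render_dns_dropin(
    "enp1s0", ["192.0.2.1", "192.0.2.2"], search=["example.org"]
)
print(dropin, end="")
```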

There are all sorts of modifications you might want to make through a script, such as automatically configuring a known set of VLANs to attach them to whatever the appropriate host interface is. Scripts are good for automation and they're also good for avoiding errors, especially if you're doing repetitive things with slight differences (such as setting up a dozen VLANs on your DHCP server). Netplan fights you almost all the way about doing anything like this.

My best guess is that all of Canonical's uses of Netplan either use internal tooling that reuses Netplan's (C) API or simply re-write Netplan files from scratch (based on, for example, cloud provider configuration information).

(To save other people the time, the netplan Python package on PyPI seems to be a third party package and was last updated in 2019. Which is a pity, because it theoretically has a quite useful command line tool.)

One bleakly amusing thing I've found out through using 'netplan set' on Ubuntu 26.04 is that the Ubuntu server installer and Netplan itself have slightly different views on how Netplan files should be written. The original installer version of the above didn't have the quotes around the strings; 'netplan set' added them.

(All of this would be better if there was a widely agreed on, generally shipped YAML equivalent of 'jq', or better yet something that could also modify YAML in place as well as query it in forms that were useful for automation. But the 'jq for YAML' ecosystem appears to be fragmented at best.)

Considering mmap() versus plain reads for my recent code

By: cks

The other day I wrote about a brute force approach to mapping IPv4 /24 subnets to Autonomous System Numbers (ASNs), where I built a big, somewhat sparse file of four-byte records, with the record for each /24 at a fixed byte position determined by its first three octets (so 0.0.0.0/24's ASN, if any, is at byte 0, 0.0.1.0/24 is at byte 4, and so on). My initial approach was to open, lseek(), and read() to access the data; in a comment, Aristotle Pagaltzis wondered if mmap() would perform better. The short answer is that for my specific case I think it would be worse, but the issue is interesting to talk about.
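The fixed record position is simple arithmetic. As a minimal sketch (the function name is made up for illustration), the first three octets form a 24-bit subnet index that is multiplied by the record size:

```python
RECORD_SIZE = 4  # bytes per /24 record


def record_offset(ip):
    """Byte position of the record for the /24 containing a dotted-quad IPv4 address."""
    a, b, c, _ = (int(part) for part in ip.split("."))
    return ((a << 16) | (b << 8) | c) * RECORD_SIZE


print(record_offset("0.0.0.0"))   # 0
print(record_offset("0.0.1.77"))  # 4
```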

(In general, my view is that you should use mmap() primarily if it makes the code cleaner and simpler. Using mmap() for performance is a potentially fraught endeavour that you need to benchmark.)

In my case I have two strikes against mmap() likely being a performance advantage: I'm working in Python (and specifically Python 2) so I can't really directly use the mmap()'d memory, and I'm normally only making a single lookup in the typical case (because my program is running as a CGI). In the non-mmap() case I expect to do an open(), an lseek(), and a read() (which will trigger the kernel possibly reading from disk and then definitely copying data to me). In the mmap() case I would do open(), mmap(), and then access some page, triggering possible kernel IO and then causing the kernel to manipulate process memory mappings to map the page into my address space. In general, it seems unlikely that mmap() plus the page access handling will be cheaper than lseek() plus read().
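To make the two sequences concrete, here is a sketch of a single lookup done both ways, using only Python's os, mmap, and struct modules. It is not a benchmark and not the actual program; the little-endian record format and the scratch file are assumptions for the demo.

```python
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 4  # four-byte records; little-endian is an assumption


def decode(data):
    # A short read (past end of file) is treated as "no ASN", like a zero record.
    return struct.unpack("<I", data)[0] if len(data) == RECORD_SIZE else 0


def lookup_read(path, offset):
    """open(), lseek(), read(): the kernel copies the record to us."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return decode(os.read(fd, RECORD_SIZE))
    finally:
        os.close(fd)


def lookup_mmap(path, offset):
    """open(), mmap(), then touch a page: the kernel maps the page in on access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        mapping = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        try:
            # Slicing copies the bytes out into a new Python object.
            return decode(mapping[offset:offset + RECORD_SIZE])
        finally:
            mapping.close()
    finally:
        os.close(fd)


# Demo on a tiny scratch file with one record at byte offset 40.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.seek(40)
    scratch.write(struct.pack("<I", 64496))
demo_path = scratch.name
read_result = lookup_read(demo_path, 40)
mmap_result = lookup_mmap(demo_path, 40)
os.unlink(demo_path)
print(read_result, mmap_result)
```

Even in the mmap() version, the slice produces a fresh bytes object that then goes through struct, so Python doesn't get to use the mapped memory in place.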

(In both the mmap() and read() cases I expect two transitions into and out of the kernel. As far as I know, lseek() is a cheap system call (and certainly it seems unlikely to be more expensive than mmap(), which has to do a bunch of internal kernel work), and the extra work the read() does to copy data from the kernel to user space is probably no more work than the kernel manipulating page tables, and could be less.)

If I was doing more lookups in a single process, I could possibly win with the mmap() approach but it's not certain. A lot depends on how often I would be looking up something on an already mapped page and how expensive mapping in a new page is compared to some number of lseek() plus read() system calls (or pread() system calls if I had access to that, which cuts the number of system calls in half). In some scenarios, such as a burst of traffic from the same network or a closely related set of networks, I could see a high hit rate on already mapped pages. In others, the IPv4 addresses are basically random and widely distributed, so many lookups would require mapping new pages.

(Using mmap() makes it unnecessary to keep my own in-process cache, but I don't think it really changes what the kernel will cache for me. Both read()'ing from pages and accessing them through mmap() keeps them recently used.)

Things would also be better in a language where I could easily make zero-copy use of data right out of the mmap()'d pages themselves. Python is not such a language, and I believe that basically any access to the mmap()'d data is going to create new objects and copy some bytes around. I expect that this results in as many intermediate objects and so on as if I used Python's read() stuff.

(Of course if I really cared there's no substitute for actually benchmarking some code. I don't care that much, and the code is simpler with the regular IO approach because I have to use the regular IO approach when writing the data file.)

Early notes on switching some libvirt-based virtual machines to UEFI

By: cks

I keep around a small collection of virtual machines so I don't have to drag out one of our spare physical servers to test things on. These virtual machines have traditionally used traditional MBR-based booting ('BIOS' in libvirt instead of 'UEFI'), partly because for a long time libvirt didn't support snapshots of UEFI based virtual machines and snapshots are very important for my use of these scratch virtual machines. However, I recently discovered that libvirt now can do snapshots of UEFI based virtual machines, and also all of our physical server installs are UEFI based, so in the past couple of days I've experimented with moving some of my Ubuntu scratch VMs from BIOS to UEFI.

As far as I know, virt-manager and virsh don't directly allow you to switch a virtual machine between BIOS and UEFI after it's been created, partly because the result is probably not going to boot (unless you deliberately set up the OS inside the VM with both an EFI boot and a BIOS MBR boot environment). Within virt-manager, you can only select BIOS or UEFI at setup time, so you have to destroy your virtual machine and recreate it. This works, but it's a bit annoying.

(On the other hand, if you've had some virtual machines sitting around for years and years, you might want to refresh all of their settings anyway.)

It's possible to change between BIOS and UEFI by directly editing the libvirt XML to transform the <os> node. You may want to remove any old snapshots first because I don't know what happens if you revert from a 'changed to UEFI' machine to a snapshot where your virtual machine was a BIOS one. In my view, the easiest way to get the necessary XML is to create (or recreate) another virtual machine with UEFI, and then dump and copy its XML with some minor alterations.

For me, on Fedora with the latest libvirt and company, the <os> XML of a BIOS booting machine is:

 <os>
   <type arch='x86_64' machine='pc-q35-6.1'>hvm</type>
 </os>

Here the 'machine=' is the machine type I picked, which I believe is the better of the two options virt-manager gives me.

My UEFI based machines look like this:

 <os firmware='efi'>
   <type arch='x86_64' machine='pc-q35-9.2'>hvm</type>
   <firmware>
     <feature enabled='yes' name='enrolled-keys'/>
     <feature enabled='yes' name='secure-boot'/>
   </firmware>
   <loader readonly='yes' secure='yes' type='pflash' format='qcow2'>/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2</loader>
   <nvram template='/usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2' templateFormat='qcow2' format='qcow2'>/var/lib/libvirt/qemu/nvram/[machine name]_VARS.qcow2</nvram>
 </os>

Here the '[machine name]' bit is the libvirt name of my virtual machine, such as 'vmguest1'. This nvram file doesn't have to exist in advance; libvirt will create it the first time you start up the virtual machine. I believe it's used to provide snapshots of the UEFI variables and so on to go with snapshots of your physical disks and snapshots of the virtual machine configuration.
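The 'dump and copy with minor alterations' step can be sketched in Python with the standard xml.etree module. This is an illustration only, under stated assumptions: both inputs are domain XML as dumped by 'virsh dumpxml', the nvram path embeds the reference machine's name followed by '_VARS' as above, and the function name and the 'uefi-template' machine are hypothetical. It replaces the whole <os> node, so it also carries over the reference machine's 'machine=' type and drops anything else that was in the old <os>.

```python
import copy
import xml.etree.ElementTree as ET


def switch_to_uefi(target_xml, reference_xml):
    """Replace the target domain's <os> node with a copy of the reference's."""
    target = ET.fromstring(target_xml)
    reference = ET.fromstring(reference_xml)
    target_name = target.findtext("name")
    reference_name = reference.findtext("name")

    new_os = copy.deepcopy(reference.find("os"))
    nvram = new_os.find("nvram")
    if nvram is not None and nvram.text:
        # Point the per-machine variable store at the target's own file.
        nvram.text = nvram.text.replace(
            reference_name + "_VARS", target_name + "_VARS"
        )

    old_os = target.find("os")
    position = list(target).index(old_os)
    target.remove(old_os)
    target.insert(position, new_os)
    return ET.tostring(target, encoding="unicode")


# Trimmed, hypothetical domain XML for the demo.
bios_xml = (
    "<domain type='kvm'><name>vmguest1</name>"
    "<os><type arch='x86_64' machine='pc-q35-6.1'>hvm</type></os>"
    "</domain>"
)
uefi_xml = (
    "<domain type='kvm'><name>uefi-template</name>"
    "<os firmware='efi'><type arch='x86_64' machine='pc-q35-9.2'>hvm</type>"
    "<nvram>/var/lib/libvirt/qemu/nvram/uefi-template_VARS.qcow2</nvram></os>"
    "</domain>"
)
converted = switch_to_uefi(bios_xml, uefi_xml)
print(converted)
```

The result would still have to be reviewed and loaded by hand (for example through 'virsh edit'); nothing here talks to libvirt.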

(This feature may have landed in libvirt 10.10.0, if I'm reading release notes correctly. Certainly reading the release notes suggests that I don't want to use anything before then with UEFI snapshots.)

Manually changing the XML on one of my scratch machines has worked fine to switch it from BIOS MBR to UEFI booting as far as I can tell, but I carefully cleared all of its disk state and removed all of its snapshots before I tried this. I suspect that I could switch it back to BIOS if I wanted to. Over time, I'll probably change over all of my as yet unchanged scratch virtual machines to UEFI through direct XML editing, because it's the less annoying approach for me. Now that I've looked this up, I'll probably do it through 'virsh edit ...' rather than virt-manager, because that way I get my real editor.

(This is the kind of entry I write for my future use because I don't want to have to re-derive this stuff.)

PS: Much of this comes from this question and answers.

Going from an IPv4 address to an ASN in Python 2 with Unix brute force

By: cks

For reasons, I've reached the point where I would like to be able to map IPv4 addresses into the organizations responsible for them, which is to say their Autonomous System Number (ASN), for use in DWiki, the blog engine of Wandering Thoughts. So today on the Fediverse I mused:

Current status: wondering if I can design an on-disk (read only) data structure of some sort that would allow a Python 2 program to efficiently map an IP address to an ASN. There are good in-memory data structures for this but you have to load the whole thing into memory and my Python 2 program runs as a CGI so no, not even with pickle.

(Since this is Python 2, about all I have access to is gdbm or rolling my own direct structure.)

Mapping IP addresses to ASNs comes up a lot in routing Internet traffic, so there are good in-memory data structures that are designed to let you efficiently answer these questions once you have everything loaded. But I don't think anyone really worries about on-disk versions of this information, while it's the case that I care about, although I only care about some ASNs (a detail I forgot to put in the Fediverse post).

Then I had a realization:

If I'm willing to do this by /24 (and I am) and represent the ASNs by 16-bit ints, I guess you can do this with a 32 Mbyte sparse file of two-byte blocks. Seek to a 16-byte address determined by the first three octets of the IP, read two bytes, if they're zero there's no ASN mapping we care about, otherwise they're the ASN in some byte order I'd determine.

If I don't care about the specific ASN, just a class of ASNs of interest of which there are at most 255, it's only 16 Mbytes.

(And if all I care about is a yes or no answer, I can represent each /24 by a bit, so the storage required drops even more, to only 2 Mbytes.)

This Fediverse post has a mistake. I thought ASNs were 16-bit numbers, but we've gone well beyond that by now. So I would want to use the one-byte 'class of ASN' approach, with ASNs I don't care about mapping to a class of zero. Alternately I could expand to storing three bytes for every /24, or four bytes to stay aligned with filesystem blocks.
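The sizes for the different record widths all follow from there being 2^24 possible /24s. A quick check of the arithmetic (these are maximum sizes, before any savings from sparseness):

```python
SUBNETS = 2 ** 24  # number of possible IPv4 /24s
MIB = 1024 * 1024

sizes = {}
for label, bits in [
    ("one bit (yes/no)", 1),
    ("one byte (ASN class)", 8),
    ("two bytes", 16),
    ("three bytes", 24),
    ("four bytes", 32),
]:
    sizes[label] = SUBNETS * bits // 8 // MIB
    print("%-22s %3d MiB" % (label, sizes[label]))
```

This gives 2, 16, 32, 48, and 64 MiB respectively.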

That storage requirement is 'at most' because this will be a Unix sparse file, where filesystem blocks that aren't written to aren't stored on disk; when read, the data in them is all zero. The lookup is efficient, at least in terms of system calls; I'd open the file, lseek() to the position, and read two bytes (causing the system to read a filesystem block, however big that is). Python 2 doesn't have access to pread() or we could do it in one system call.
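Here is a sketch of the sparse file behavior, written for Python 3 on Unix (where os.pread() and os.pwrite() are available, unlike in Python 2) and using four-byte little-endian records rather than the two-byte ones described above. The helper names and the scratch file are assumptions for the demo.

```python
import os
import struct
import tempfile

RECORD_SIZE = 4  # four-byte records; little-endian is an assumption


def record_offset(ip):
    a, b, c, _ = (int(part) for part in ip.split("."))
    return ((a << 16) | (b << 8) | c) * RECORD_SIZE


def store(fd, ip, asn):
    # Writing far past the current end of file leaves a hole behind it.
    os.pwrite(fd, struct.pack("<I", asn), record_offset(ip))


def lookup(fd, ip):
    # One system call instead of lseek() plus read().
    data = os.pread(fd, RECORD_SIZE, record_offset(ip))
    return struct.unpack("<I", data)[0] if len(data) == RECORD_SIZE else 0


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "asn.map")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
store(fd, "192.0.2.0", 64496)

found = lookup(fd, "192.0.2.77")      # same /24 as the stored record
in_hole = lookup(fd, "10.1.2.3")      # unwritten block, reads as zeros
past_end = lookup(fd, "203.0.113.9")  # beyond end of file, short read
info = os.fstat(fd)
os.close(fd)
os.unlink(path)
os.rmdir(workdir)

print(found, in_hole, past_end)
print("apparent size:", info.st_size, "allocated:", info.st_blocks * 512)
```

On a filesystem that supports sparse files, the allocated figure should be far smaller than the apparent size of roughly 48 Mbytes, but the exact number depends on the filesystem.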

Within the OS this should be reasonably efficient, because if things are active much of the important bits of the mapping file will be cached into memory and won't have to be read from disk. 32 Mbytes is nothing these days, at least in terms of active file cache, and much of the file will be sparse anyway. The OS obviously has reasonably efficient random access to the filesystem blocks of the file, whether in memory or on disk.

This is a fairly brute force approach that's only viable if you're typically making a single query in your process before you finish. It also feels like something that is a good fit for Unix because of sparse files, although 16 Mbytes isn't that big these days even for a non-sparse file.

Realizing the brute force approach feels quite liberating. I've been turning this problem over in my mind for a while but each time I thought of complicated data structures and complicated approaches and it was clear to me that I'd never implement them. This way is simple enough that I could actually do it and it's not too impractical.

PS: I don't know if I'll actually build this, but every time a horde of crawlers descends on Wandering Thoughts from a cloud provider that has a cloud of separate /24s and /23s all over the place, my motivation is going to increase. If I could easily block all netblocks of certain hosting providers all at once, I definitely would.

(To get the ASN data there's pyasn (also). Conveniently it has a simple on-disk format that can be post-processed to go from a set of CIDRs that map to ASNs to a data file that maps from /24s to ASN classes for ASNs (and classes) that I care about.)
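The post-processing step can be sketched with Python 3's ipaddress module. This assumes input lines of the form 'prefix, whitespace, ASN' with ';' starting comment lines, which should be checked against the real pyasn data file; the function names are hypothetical and the sample uses documentation prefixes and ASNs. Routes are applied from shortest to longest prefix so that more specific routes win.

```python
import ipaddress


def covered_slash24s(net):
    """List the /24 networks that a route touches."""
    if net.prefixlen > 24:
        # Longer than /24: credit the whole containing /24 (lossy).
        return [net.supernet(new_prefix=24)]
    if net.prefixlen == 24:
        return [net]
    return list(net.subnets(new_prefix=24))


def slash24_table(lines):
    """Map the base address of each covered /24 to an ASN."""
    routes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        cidr, asn = line.split()
        routes.append((ipaddress.ip_network(cidr), int(asn)))
    # Shorter prefixes first, so more specific routes overwrite them.
    routes.sort(key=lambda route: route[0].prefixlen)
    table = {}
    for net, asn in routes:
        for subnet in covered_slash24s(net):
            table[str(subnet.network_address)] = asn
    return table


sample = [
    "; hypothetical sample data",
    "198.51.100.0/23\t64497",
    "192.0.2.0/24\t64496",
    "203.0.113.128/25\t64498",
]
table = slash24_table(sample)
for base in sorted(table):
    print(base, table[base])
```

Each entry in the table then becomes one record in the data file, at the position determined by its first three octets.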

Update: After writing most of this entry I got enthused and wrote a stand-alone preliminary implementation (initially storing full ASNs in four-byte records), which can both create the data file and query it. It was surprisingly straightforward and not very much code, which is probably what I should have expected since the core approach is so simple. With four-byte records, a full data file of all recent routes from pyasn is about 53 Mbytes and the data file can be created in less than two minutes, which is pretty good given that the code writes records for about 16.5 million /24s.

(The whole thing even appears to work, although I haven't strongly tested it.)

Fedora's virt-manager started using external snapshots for me as of Fedora 41

By: cks

Today I made an unpleasant discovery about virt-manager on my (still) Fedora 42 machines that I shared on the Fediverse:

This is my face that Fedora virt-manager appears to have been defaulting to external snapshots for some time and SURPRISE, external snapshots can't be reverted by virsh. This is my face, especially as it seems to have completely screwed up even deleting snapshots on some virtual machines.

(I only discovered this today because today is the first time I tried to touch such a snapshot, either to revert to it or to clean it up. It's possible that there is some hidden default for what sort of snapshot to make and it's only been flipped for me.)

Neither virt-manager nor virsh will clearly tell you about this. In virt-manager you need to click on each snapshot and if it says 'external disk only', congratulations, you're in trouble. In virsh, 'virsh snapshot-list --external <vm>' will list external snapshots, and then 'virsh snapshot-list --tree <vm>' will tell you if they depend on any internal snapshots.

My largest problems came from virtual machines where I had earlier internal snapshots and then I took more snapshots, which became external snapshots from Fedora 41 onward. You definitely can't revert to an external snapshot in this situation, at least not with virsh or virt-manager, and the error messages I got were generic ones about not being able to revert external snapshots. I haven't tested reverting external snapshots for a VM with no internal ones.
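One way to spot this tangled situation in a script is to look at the snapshot XML itself. The sketch below assumes you've saved the output of 'virsh snapshot-dumpxml <vm> <snapshot>' for each snapshot, and relies on the 'snapshot' attribute of the <memory> and <disk> elements and the <parent> element from libvirt's snapshot XML format. The function names are hypothetical and the sample XML is heavily trimmed.

```python
import xml.etree.ElementTree as ET


def describe_snapshot(snapshot_xml):
    """Return (name, parent name or None, 'external' or 'internal')."""
    root = ET.fromstring(snapshot_xml)
    modes = {disk.get("snapshot") for disk in root.findall("./disks/disk")}
    memory = root.find("memory")
    if memory is not None:
        modes.add(memory.get("snapshot"))
    kind = "external" if "external" in modes else "internal"
    return root.findtext("name"), root.findtext("parent/name"), kind


def external_on_internal(snapshot_xmls):
    """Names of external snapshots whose parent is an internal snapshot."""
    info = [describe_snapshot(text) for text in snapshot_xmls]
    kinds = {name: kind for name, _, kind in info}
    return [
        name
        for name, parent, kind in info
        if kind == "external" and kinds.get(parent) == "internal"
    ]


# Trimmed, hypothetical snapshot XML for the demo.
older = (
    "<domainsnapshot><name>fedora40-preupgrade</name>"
    "<memory snapshot='internal'/>"
    "<disks><disk name='vda' snapshot='internal'/></disks></domainsnapshot>"
)
newer = (
    "<domainsnapshot><name>fedora41-preupgrade</name>"
    "<parent><name>fedora40-preupgrade</name></parent>"
    "<memory snapshot='no'/>"
    "<disks><disk name='vda' snapshot='external'/></disks></domainsnapshot>"
)
tangled = external_on_internal([older, newer])
print(tangled)
```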

(Not being able to revert to external snapshots is a long standing libvirt issue, but it's possible they now work if you only have external snapshots. Otherwise, Fedora 41 and Fedora 42 defaulting to external snapshots is extremely hard to understand (to be polite).)

Update: you can revert an external snapshot in the latest libvirt if all of your snapshots are external. You can't revert them if libvirt helpfully gave you external snapshots on top of internal ones by switching the default type of snapshots (probably in Fedora 41).

If you have an external snapshot that you need to revert to, all I can do is point to a libvirt wiki page on the topic (although it may be outdated by now) along with libvirt's documentation on its snapshot XML. I suspect that there is going to be suffering involved. I haven't tried to do this; when it came up today I could afford to throw away the external snapshot.

If you have internal snapshots and you're willing to throw away the external snapshot and what's built on it, you can use virsh or virt-manager to revert to an internal snapshot and then delete the external snapshot. This leaves the external snapshot's additional disk file or files dangling around for you to delete by hand.

If you have only an external snapshot, it appears that libvirt will let you delete the snapshot through 'virsh snapshot-delete <vm> <external-snapshot>', which preserves the current state of the machine's disks. This only helps if you don't want the snapshot any more, but this is one of my common cases (where I take precautionary snapshots before significant operations and then get rid of them later when I'm satisfied, or at least committed).

The worst situation appears to be if you have an external snapshot made after (and thus on top of) an earlier internal snapshot and you want to keep the live state of things while getting rid of the snapshots. As far as I can tell, it's impossible to do this through libvirt, although some of the documentation suggests that you should be able to. The process outlined in libvirt's Merging disk image chains didn't work for me (see also Disk image chains).

(If it worked, this operation would implicitly invalidate the snapshots and I don't know how you get rid of them inside libvirt, since you can't delete them normally. I suspect that to get rid of them, you need to shut down all of the libvirt daemons and then delete the XML files that (on Fedora) you'll find in /var/lib/libvirt/qemu/snapshot/<domain>.)

One reason to delete external snapshots you don't need is if you ever want to be able to easily revert snapshots in the future. I wouldn't trust making internal snapshots on top of external ones, if libvirt even lets you, so if you want to be able to easily revert, it currently appears that you need to have and use only internal snapshots. Certainly you can't mix new external snapshots with old internal snapshots, as I've seen.

(The 5.1.0 virt-manager release will warn you to not mix snapshot modes and defaults to whatever snapshot mode you're already using. I don't know what it defaults to if you don't have any snapshots; I haven't tried that yet.)

Sidebar: Cleaning this up on the most tangled virtual machine

I've tried the latest preview releases of the libvirt stuff, but it doesn't make a difference in the most tangled situation I have:

$ virsh snapshot-delete hl-fedora-36 fedora41-preupgrade
error: Failed to delete snapshot fedora41-preupgrade
error: Operation not supported: deleting external snapshot that has internal snapshot as parent not supported

This VM has an internal snapshot as the parent because I didn't clean up the first snapshot (taken before a Fedora 41 upgrade) before making the second one (taken before a Fedora 42 upgrade).

In theory one can use 'virsh blockcommit' to reduce everything down to a single file, per the knowledge base section on this. In practice it doesn't work in this situation:

$ virsh blockcommit hl-fedora-36 vda --verbose --pivot --active
error: invalid argument: could not find base image in chain for 'vda'

(I tried with --base too and that didn't help.)

I was going to attribute this to the internal snapshot but then I tried 'virsh blockcommit' on another virtual machine with only an external snapshot and it failed too. So I have no idea how this is supposed to work.

Since I could take a ZFS snapshot of the entire disk storage, I chose violence, which is to say direct usage of qemu-img. First, I determined that I couldn't trivially delete the internal snapshot before I did anything else:

$ qemu-img snapshot -d fedora40-preupgrade fedora35.fedora41-preupgrade
qemu-img: Could not delete snapshot 'fedora40-preupgrade': snapshot not found

The internal snapshot is in the underlying file 'fedora35.qcow2'. Maybe I could have deleted it safely even with an external thing sitting on top of it, but I decided not to do that yet and proceed to the main show:

$ qemu-img commit -d fedora35.fedora41-preupgrade
Image committed.
$ rm fedora35.fedora41-preupgrade

Using 'qemu-img info fedora35.qcow2' showed that the internal snapshot was still there, so I removed it with 'qemu-img snapshot -d' (this time on fedora35.qcow2).

All of this left libvirt's XML drastically out of step with the underlying disk situation. So I removed the XML for the snapshots (after saving a copy), made sure no libvirt services were running, and manually edited the VM's XML, where it turned out that all I needed to change was the name of the disk file. This appears to have worked fine.

I suspect that I could have skipped manually removing the internal snapshot and its XML and libvirt would then have been happy to see it and remove it.

(I'm writing all of the commands and results down partly for my future reference.)

Mass production's effects on the cheapest way to get some things

By: cks

We have a bunch of networks in a number of buildings, and as part of looking after them, we want to monitor whether or not they're actually working. For reasons beyond the scope of this entry we don't do things like collect information from our switches through SNMP, so our best approach is 'ping something on the network in the relevant location'. This requires something to ping. We want that thing to be stable and always on the network, which typically rules out machines and devices run by other people, and we want it to run from standard wall power for various reasons.

You can imagine a bunch of solutions to this for both wired and wireless networks. There are lots of cheap little computers these days that can run Linux, so you could build some yourself or expect to find someone selling them pre-made. However, these are unlikely to be a mass produced volume product, and it turns out that the flipside of things only being cheap when there is volume is that if there is volume, unexpected things can be the cheapest option.

The cheapest wall-powered device you can put on your wireless network to ping these days turns out to be a remote controlled power plug intended for home automation (as a bonus it will report uptime information for you if you set it up right, so you can tell if it lost power recently). They can fail after a few years, but they're inexpensive so we consider them consumables. And if you have another device that turns out to be flaky and has to be power cycled every so often, you can reuse a 'wifi reachability sensor' for its actual remote power control capabilities.

Similarly, as far as we've found, the cheapest wall powered device that plugs into a wired Ethernet and can be given an IP address so it can be pinged is a basic five port managed switch. You give it a 'management IP', plug one port into the network, and optionally plug up its other four ports so no one uses it for connectivity (because it's a cheap switch and you don't necessarily trust it). You might even be able to find one that supports SNMP so you can get some additional information from it (although our current ones don't, as far as I can tell).

In both cases it's clear that these are cheap because of mass production. People are making lots of wireless remote controlled power plugs and five port managed switches, so right now you can get the switches for about $30 Canadian each and the power plugs for $10 Canadian. In both cases what we get is overkill for what we want, and you could do a simpler version that has a smaller, cheaper bill of materials (BOM). But that smaller version wouldn't have the volume so it would cost much more for us to get it or an approximation.

(Even if we designed and built our own, we probably can't beat the price of the wireless remote controlled power plugs. We might be able to get a cheaper BOM for a single-Ethernet simple computer with case and wall plug power supply, but that ignores staff time to design, program, and assemble the thing.)

At one level this makes me sad. We're wasting the reasonably decent capabilities of both devices, and it feels like there should be a more frugal and minimal option. But it's hard to see what it would be and how it could be so cheap and readily available.

A traditional path to getting lingering duplicate systems

By: cks

In yesterday's entry I described a lingering duplicate system and how it had taken us a long time to get rid of it, but I got too distracted by the story to write down the general thoughts I had on how this sort of thing happens and keeps happening (also, the story turned out to be longer than I expected). We've had other long running duplicate systems, and often they have more or less the same story as yesterday's disk space usage tracking system.

The first system built is a basic system. It's not a bad system, but it's limited and you know it. You can only afford to gather disk usage information once a day and you have nowhere to put it other than in the filesystem, which makes it easy to find and independent of anything else but also stops it updating when the filesystem fills up. Over time you may improve this system (cheaper updates that happen more often, a limited amount of high resolution information), but the fundamental issues with it stick around.

After a while it becomes possible to build a different, better system (you gather disk usage information every few minutes and put it in your new metrics system), or maybe you just realize how to do a better version from scratch. But often the initial version of this new system has its own limitations or works a bit differently or both, or you've only implemented part of what you'd need for a full replacement of the first system. And maybe you're not sure it will fully work, that it's really the right answer, or if you'll be able to support it over the long term (perhaps the cardinality of the metrics will be too overwhelming).

(You may also be wary of falling victim to the "second system effect", since you know you're building a second system.)

Usually this means that you don't want to go through the effort and risk of immediately replacing the old system with the new system (if it's even immediately possible without more work on the new system). So you use the new system for new stuff (providing dashboards of disk space usage) and keep the old system for the old stuff (the officially supported commands that people know). The old system is working so it's easier to have it stay "for now". Even if you replace part of the use of the old system with the new system, you don't replace all of it.

(If your second system started out as only a partial version of the old system, you may also not be pushed to evolve it so that it could fully replace the old system, or that may only happen slowly. In some ways this is a good thing; you're getting practical experience with the basic version of the new system rather than immediately trying to build the full version. This is a reasonable way to avoid the "second system effect", and may lead you to find out that in the new system you want things to operate differently than the old one.)

Since both the old system and the new system are working, you now generally have little motivation to do more work to get rid of the old system. Until you run into clear limitations of the old system, moving back to only having one system is (usually) cleanup work, not a priority. If you wanted to let the new system run for a while to prove itself, it's also easy to simply lose track of this as a piece of future work; you won't necessarily put it on a calendar, and it's something that might be months or a year out even in the best of circumstances.

(The times when the cleanup is a potential priority are when the old system is using resources that you want back, including money for hardware or cloud stuff, or when the old system requires ongoing work.)

A contributing factor is that you may not be sure about what specific behaviors and bits of the old system other things are depending on. Some of these will be actual designed features that you can perhaps recover from documentation, but others may be things that simply grew that way and became accidentally load bearing. Figuring these out may take careful reverse engineering of how the system works and what things are doing with it, which takes work, and when the old system is working it's easier to leave it there.

Lingering duplicate systems and the expense of weeding them out (an illustration)

By: cks

We have been operating a fileserver environment for a long time now, back before we used ZFS. When you operate fileservers in a traditional general Unix environment, one of the things you need is disk usage information. So a very long time ago, before I even arrived, people built a very Unix-y system to do this. Every night, raw usage information was generated for each filesystem (for a while with 'du'), written to a special system directory in the filesystem, and then used to create a text file with a report showing current usage and the daily and weekly change in everyone's usage. A local 'report disk usage' script would then basically run your pager on this file.

After a while, we were able to improve this system by using native ZFS commands to get per-user 'quota' usage information, which made it much faster than the old way (we couldn't do this originally because we started with ZFS before ZFS tracked this information). Later, this made it reasonable to generate a 'frequent' disk usage report every fifteen minutes (with it keeping a day's worth of data), which could be helpful to identify who had suddenly used a lot of disk space; we wrote some scripts to use this information, but never made them as public as the original script. However, all of this had various limitations, including that it stopped updating once the filesystem had filled up.

Shortly after we set up our Prometheus metrics system and actually had a flexible metrics system we could put things into, we started putting disk space usage information into it, giving us more fine grained data, more history (especially fine grained history, where we'd previously only had the past 24 hours), and the ability to put it into Grafana graphs on dashboards. Soon afterward it became obvious that sometimes the best way to expose information is through a command, so we wrote a command to dump out current disk usage information in a relatively primitive form.

Originally this 'getdiskusage' command produced quite raw output because it wasn't really intended for direct use. But over time, people (especially me) kept wanting more features and options and I never quite felt like writing some scripts to sit on top of it when I could just fiddle the code a bit more. Recently, I added some features and tipped myself over a critical edge, where it felt like I could easily re-do the old scripts to get their information from 'getdiskusage' instead of those frequently written files. One thing led to another and so now we have some new documentation and new (and revised) user-visible commands to go with it.

(The raw files were just lines of 'disk-space login', and this was pretty close to what getdiskusage produced already in some modes.)

However, despite replacing the commands, we haven't yet turned off the infrastructure on our fileservers that creates and updates those old disk usage files. Partly this is because I'd want to clean up all the existing generated files rather than leave them to become increasingly out of date, and that's a bit of a pain, and partly it's because of inertia.

Inertia is also a lot of why it took so long to replace the scripts. We've had the raw capability to replace them for roughly six years (since 'getdiskusage' was written, demonstrating that it was easily possible to extract the data from our metrics system in a usable form), and we'd said to each other that we wanted to do it for about that long, but it was always "someday". One reason for the inertia was that the existing old stuff worked fine, more or less, and also we didn't think very many people used it very often because it wasn't really documented or accessible. Perhaps another reason was that we weren't entirely sure we wanted to commit to the new system, or at least to the exact form we first implemented our disk space metrics in.

DMARC DNS record inheritance and DMARC alignment requirements

By: cks

To simplify, DMARC is based on the domain in the 'From:' header, and what policy (if any) that domain specifies. As I've written about (and rediscovered) more than once (here and here), DMARC will look up the DNS record for the DMARC policy in exactly one of two places, either in the exact From: domain or on the organization's top level domain. In other words, if a message has a From: of 'someone@breaking.news.example.org', a receiver will first look for a DMARC TXT DNS record with the name _dmarc.breaking.news.example.org and then one with the name _dmarc.example.org.

(But there will be no lookup for _dmarc.news.example.org.)
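
(As a minimal sketch of this lookup order, here is some Python. It assumes the caller already knows the organizational domain, which in real life takes something like the Public Suffix List to work out; the function name is made up for illustration.)

```python
def dmarc_lookup_names(from_domain, org_domain):
    """Return the DNS names checked for a DMARC policy, in order.

    org_domain is assumed to be supplied by the caller; working it out
    for real requires extra data such as the Public Suffix List.
    """
    from_domain = from_domain.lower().rstrip(".")
    org_domain = org_domain.lower().rstrip(".")
    names = ["_dmarc." + from_domain]
    # Only the exact From: domain and the organizational domain are
    # tried; intermediate names are never looked up.
    if from_domain != org_domain:
        names.append("_dmarc." + org_domain)
    return names


print(dmarc_lookup_names("breaking.news.example.org", "example.org"))
# ['_dmarc.breaking.news.example.org', '_dmarc.example.org']
```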

DMARC also has the concept of policy inheritance, where the example.org DMARC DNS TXT record can specify a different DMARC policy for the organizational domain than for subdomains that don't have their own policy. For example, example.org could specify 'p=reject; sp=none' to say that 'From: user@example.org' should be rejected if it fails DMARC but it has no views on a default for 'From: user@news.example.org'.

If you're an innocent person, you might think that if your organization has 'sp=none' on its organization policy, you don't have to be concerned about the DMARC (and DKIM, and SPF) behavior of sub-names that don't have their own DMARC records, including hosts that send as 'From: local-account@host.dept.example.org'. Your organizational policy says 'sp=none', meaning don't do anything with sub-names for DMARC, and surely everyone will follow that.

This is unfortunately not quite true in an environment where people care about DKIM results regardless of DMARC policy settings. The problem is DKIM (and SPF) alignment. Under relaxed DKIM alignment, a 'From: flash@eng.news.example.org' would pass if it's DKIM signed by anything in example.org, for example 'eng.example.org'. Under strict DKIM alignment, it must be signed specifically by 'eng.news.example.org'.
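
(The difference between the two alignment modes can be sketched in a few lines of Python. This is a simplification: it assumes you pass in the organizational domain yourself, and 'dkim_aligned' is a name I made up, not anything from a real DMARC library.)

```python
def dkim_aligned(from_domain, signing_domain, org_domain, strict=False):
    """Rough check of DKIM alignment between From: and the DKIM d= domain.

    org_domain is assumed to be the already-determined organizational
    domain of the From: domain.
    """
    from_domain = from_domain.lower()
    signing_domain = signing_domain.lower()
    org_domain = org_domain.lower()

    if strict:
        # Strict alignment: the domains must match exactly.
        return signing_domain == from_domain

    def in_org(domain):
        return domain == org_domain or domain.endswith("." + org_domain)

    # Relaxed alignment: both only need to be within the same
    # organizational domain.
    return in_org(from_domain) and in_org(signing_domain)


print(dkim_aligned("eng.news.example.org", "eng.example.org", "example.org"))
# True
print(dkim_aligned("eng.news.example.org", "eng.example.org", "example.org", strict=True))
# False
```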

The choice of what DKIM alignment to require is not a 'policy' and is not covered by 'p=' or 'sp=' in DMARC DNS TXT records. It's instead covered by a separate parameter, 'adkim=', and there is no 'sadkim=' parameter that only applies to subdomains. This means that there's no way for example.org to change the alignment policy for just 'From: user@example.org'; the moment they set 'adkim=s' in the _dmarc.example.org DNS TXT record, all sub-names without their own _dmarc.<whatever> records also switch to strict DKIM alignment. Even if the top level domain specifies 'sp=none', various mail systems out there may actively reject your mail because they no longer consider it properly aligned or increase their suspicion score a bit due to the lack of alignment (in some views your mail went from 'properly DKIM signed' to 'not properly DKIM signed').

The only way to deal with this is the same as with policy inheritance. Any host or domain name within your (sub-)organization that appears in From: headers must have its own valid DMARC DNS TXT record. If you want strict DKIM alignment, you need to set that as 'adkim=s'. If you want relaxed alignment, in theory that's the default, but you might find it clearer to explicitly set 'adkim=r' (and probably 'aspf=r', also for clarity).

(Setting alignment explicitly makes it clear to other people and future you that you're deliberately choosing an alignment that might wind up different from your top level organizational alignment.)
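
(To illustrate how the settings are inherited, here is a rough Python sketch that parses a DMARC TXT record and works out what applies to a sub-name without its own record. The record parsing is deliberately simple-minded, the function names are invented for this example, and the only defaults it relies on are that 'adkim' and 'aspf' are relaxed when not given.)

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, value = part.split("=", 1)
        tags[key.strip().lower()] = value.strip()
    return tags


def effective_settings(tags, is_subdomain):
    """Settings that apply when only the organizational record exists."""
    policy = tags.get("p")
    if is_subdomain:
        # sp= overrides p= for sub-names, if it is present.
        policy = tags.get("sp", policy)
    return {
        "policy": policy,
        # There is no subdomain-only version of these, so sub-names
        # get whatever the organizational record says.
        "adkim": tags.get("adkim", "r"),
        "aspf": tags.get("aspf", "r"),
    }


org = parse_dmarc("v=DMARC1; p=reject; sp=none; adkim=s")
print(effective_settings(org, is_subdomain=True))
# {'policy': 'none', 'adkim': 's', 'aspf': 'r'}
```

The sub-name gets the 'none' policy from 'sp=', but it still winds up with strict DKIM alignment.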

PS: As far as I can see this is the behavior the DMARC RFC implicitly requires for all DMARC settings other than 'p=' (which has the 'sp=' version), but I could be wrong and missing something.

One problem with (Python) docstrings is that they're local

By: cks

When I wrote about documenting my Django forms, I said that I knew I didn't want to put my documentation in docstrings, because I'd written some in the past and then not read it this time around. One of the reasons for that is that Python docstrings have to be attached to functions, or more generally, Python docstrings have to be scattered through your code. The corollary to this is that to find relevant docstrings you have to read through your code and then remember which bits of it are relevant to what you're wondering about.

When your docstring is specifically about the function you already know you want to look at, this is fine. Docstrings work perfectly well for local knowledge, for 'what is this function about' summaries that you want to read before you delve into the function. I feel they work rather less well for finding what function you want to look at (ideally you want some sort of skimmable index for that); if you have to read docstrings to find a function, you're going to be paging through a lot of your code until you hit the right docstring.

This is also why I feel docstrings are a bad fit for documenting my Django forms. Even if I attach them to the Python functions that handle each particular form, the resulting documentation is going to be mingled with my code and spread all through it. Not only is there no overview, but I'd have to skip around my code as I read about how one form interacts with another; there's no single place where I can read about the flow of forms, one leading to another.

(This is the case even if all of the form handling functions are in one spot with nothing between them, because the docstrings will be split up by the code itself and the comments in the code.)

Another issue is that sensible docstrings can only be so big, because they separate the function's 'def' statement from its actual code. You don't want those two too far apart, which pushes docstrings toward being relatively concise. My feeling is that if I have a lot to say about what the function is used for or how it relates to other things, I can't really put it in a docstring. I usually put it in a comment in front of the function (which means that some of my Python code has a mixture of comments and docstrings). The less a function can be described purely by itself (and concisely), the more its docstring is going to sprawl and the more awkward that gets.
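
(A small made-up example of the mixture I mean; the function and its behavior are hypothetical and aren't taken from any of my real code.)

```python
# Longer background goes in a comment ahead of the function, where it
# doesn't push the 'def' line away from the code. For instance, this
# is where I'd explain how this function relates to other functions,
# which callers are expected to use it, and why it works the way it
# does.
def approve_request(request):
    """Mark an account request as approved and return it."""
    request["status"] = "approved"
    return request


# The docstring stays a one-line summary that tools can show.
print(approve_request.__doc__)
```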

(Docstrings on functions are also generally seen as what I could call external documentation, written for people who might want to call the function and understand how it relates to other functions they might also use. Comments are the usual form of internal documentation that you want at hand while reading the function's code.)

It's conventional to say that docstrings are documentation for what they're on. I think it's better to say that docstrings are summaries. Some things can be described purely through summaries (with additional context that the programmer is assumed to have), but not everything can be.

(Comments before a function are also local to some degree, but they intrude less on the function's code since they don't put themselves between 'def' and the rest of things.)

Wayland has good reasons to put the window manager in the display server

By: cks

I recently ran across Isaac Freund's Separating the Wayland Compositor and Window Manager (via), which is excellent news as far as I'm concerned. But in passing, it says:

Traditionally, Wayland compositors have taken on the role of the window manager as well, but this is not in fact a necessary step to solve the architectural problems with X11. Although, I do not know for sure why the original Wayland authors chose to combine the window manager and Wayland compositor, I assume it was simply the path of least resistance. [...]

Unfortunately, I believe that there are excellent reasons to put the window manager into the display server the way Wayland has, and the Wayland people (who were also X people) were quite familiar with them and how X has had problems over the years because of its split.

One large and more or less core problem is that event handling is deeply entwined with window management. As an example, consider this sequence of (input) events:

  1. your mouse starts out over one window. You type some characters.
  2. you move your mouse over to a second window. You type some more characters.
  3. you click a mouse button without moving the mouse.
  4. you type more characters.

Your window manager is extremely involved in the decisions about where all of those input events go and whether the second window receives a mouse button click event in the third step. If the window manager is separate from whatever is handling input events, there are two possibilities. Either some things trigger synchronous delays in further event handling, or sufficiently fast typeahead and actions are in a race with the window manager. In that race, either the window manager handles changes in where future events should go fast enough, or some of your typing and other actions are misdirected to the wrong place because the window manager is lagging.

Embedding the window manager in the display server is the simple and obvious approach to ensuring that the window manager can see and react to all events without lag, and can freely intercept and modify all events as it wishes without clients having to care. The window manager can even do this using extremely local knowledge if it wants. Do you want your window manager to have key bindings that only apply to browser windows, where the same keys are passed through to other programs? An embedded window manager can easily do that (let's assume it can reliably identify browser windows).

(An outdated example of how complicated you can make mouse button bindings, never mind keyboard bindings, is my mouse button bindings in fvwm.)

X has a collection of mechanisms that try to allow window managers to manage 'focus' (which window receives keyboard input), intercept (some) keys at a window manager level, and do other things that modify or intercept events. The whole system is complex, imperfect, and limited, and a variety of these mechanisms have weird side effects on the X events that regular programs receive; you can often see this with a program such as xev. Historically, not all X programs have coped gracefully with all of the interceptions that window managers like fvwm can do.

(X also has two input event systems, just to make life more complicated.)

X's mechanisms also impose limits on what they'll allow a window manager to do. One famous example is that in X, mouse scroll wheel events always go to the X window under the mouse cursor. Even if your window manager uses 'click (a window) to make it take input', mouse scroll wheel input is special and cannot be directed to a window this way. In Wayland, a full server has no such limitations; its window manager portion can direct all events, including mouse scroll wheels, to wherever it feels like.

(This elaborates on a Fediverse post of mine.)

Cleaning old GPG RPM keys that your Fedora install is keeping around

By: cks

Approximately all RPM packages are signed by GPG keys (or maybe they're supposed to be called PGP keys), which your system stores in the RPM database as pseudo-packages (because why not). If your Fedora install has been around long enough, as mine have, you will have accumulated a drift of old keys and sometimes you either want to clean them up or something unfortunate will happen to one of those keys (I'll get to one case for it).

One basic command to see your collection of GPG keys in the RPM database is (taken from this gist):

rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'

On some systems this will give you a nice short list of keys. On others, your list may be very long.

Since Fedora 42 (cf), DNF has functionality (I believe more or less built in) that should offer to remove old GPG keys that have actually expired. This is in the 'expired PGP keys plugin', which comes from the 'libdnf5-plugin-expired-pgp-keys' package if you don't have it installed (with a brief manpage that's called 'libdnf5-expired-pgp-keys'). I believe there was a similar DNF4 plugin. However, there are two situations where this seems to not work correctly.

The first situation is now-obsolete GPG keys that haven't expired yet, for various reasons; these may be for past versions of Fedora, for example. These days, the metadata for every DNF repository you use should list a URL for its GPG keys (see the various .repo files in /etc/yum.repos.d/ and look for the 'gpgkey=' lines). So one way to clean up obsolete keys is to fetch all of the current keys for all of your current repositories (or at least the enabled ones), and then remove anything you have that isn't among the list. This process is automated for you by the 'clean-rpm-gpg-pubkey' command and package, which is mentioned in some Fedora upgrade instructions. This will generally clean out most of your obsolete keys, although rare people will have keys that are so old that it chokes on them.
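
(To make the general idea concrete, here is a rough Python sketch of the first part, pulling the 'gpgkey=' URLs out of .repo file contents, plus the final 'what do I have that isn't current' comparison. This is not how clean-rpm-gpg-pubkey is implemented; the function names are made up, the sample repository text is invented, and the handling of 'enabled=' is simplified.)

```python
import configparser

# Invented sample of what a .repo file can look like.
SAMPLE = """\
[example]
name=Example repo
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-example

[example-disabled]
name=Disabled example repo
enabled=0
gpgkey=https://example.org/old-key.asc
"""


def repo_gpgkey_urls(repo_text, enabled_only=True):
    """Collect gpgkey= URLs from the text of a .repo file."""
    # Turn off interpolation so '%' and '$releasever' are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    urls = set()
    for section in parser.sections():
        enabled = parser.get(section, "enabled", fallback="1").strip().lower()
        if enabled_only and enabled in ("0", "false", "no"):
            continue
        value = parser.get(section, "gpgkey", fallback="")
        # A gpgkey= setting can list several URLs.
        urls.update(value.replace(",", " ").split())
    return urls


def obsolete_keys(installed, current):
    """Return installed key names that aren't among the current ones."""
    return sorted(set(installed) - set(current))


print(repo_gpgkey_urls(SAMPLE))
# {'file:///etc/pki/rpm-gpg/RPM-GPG-KEY-example'}
```

The part this skips is the real work of fetching each key and matching it up against the gpg-pubkey pseudo-packages in the RPM database.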

The second situation is apparently a repository operator who is sufficiently clever to have re-issued an expired key using the same key ID and fingerprint but a new expiry date in the future; this fools RPM and related tools and everything chokes. This is unfortunate, since it will often stall all DNF updates unless you disable the repo. One repository operator who has done this is Google, for their Fedora Chrome repository. To fix this you'll have to manually remove the relevant GPG key or keys. Once you've used clean-rpm-gpg-pubkey to reduce your list of GPG keys to a reasonable level, you can use the RPM command I showed above to list all your remaining keys, spot the likely key or keys (based on who owns it, for example), and then use 'rpm -e --allmatches gpg-pubkey-d38b4796-570c8cd3' (or some other appropriate gpg-pubkey name) to manually scrub out the GPG key. Doing a DNF operation such as installing or upgrading a package from the repository should then re-import the current key.

(This also means that it's theoretically harmless to overshoot and remove the wrong key, because it will be fetched back the next time you need it.)

(When I wrote my Fediverse post about discovering clean-rpm-gpg-pubkey, I apparently thought I would remember it without further prompting. This was wrong, and in fact I didn't even remember to use it when I upgraded my home desktop. This time it will hopefully stick, and if not, I have it written down here where it will probably be easier to find.)

Making empirical decisions about web access (here in 2026)

By: cks

Recently, Denis Warburton wrote in a comment on my entry on how HTTP results today depend on what HTTP User-Agent you use:

Making decisions based on user-provided information is unwise in 2026. The originating ip address is the only source of "truth" ... and even then, that information needs to be further examined before discerning whether or not it is a valid piece of communication.

It's absolutely true that everything except the source IP address is under the control of an attacker (and it always has been), and in one sense you can't trust it. But this doesn't mean you can't use information that's under the attacker's control in making decisions about whether to allow access to something; instead, it means that you have to be thoughtful about how you use the information and what for.

In practice, web agents emit a lot of data in their HTTP headers and requests. Some of these signals are complicated, such as browser version numbers, and some of them require work to use, but this doesn't mean that there's no signal at all that can be derived from all of the data that a web agent emits. For example, consider a web agent that uses the HTTP User-Agent of:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This web agent is telling you that it's claiming to be Googlebot. Under the right circumstances this can be a valuable signal of malfeasance and worth denying access.

Similarly, a web agent that emits user agent hints while its HTTP User-Agent is claiming to be an authentic version of Firefox 147 is giving you the signal that it's not an unaltered, standard version of Firefox, because standard versions of Firefox 147 don't do that. It's most likely something built on Chromium, but in any case you might decide that this signal means it is suspicious enough to be denied access. Neither the User-Agent nor the Sec-CH-UA headers create true facts to definitively identify the browser and both could be faked by the attacker, but the inconsistency is real.
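
(Here is a minimal Python sketch of pulling these two signals out of a request's headers. The function and the signal names are invented for illustration, the string matching is deliberately crude, and what you do with the resulting signals is a separate decision that depends on your environment.)

```python
def suspicious_signals(headers):
    """Return a list of signal names found in a dict of HTTP headers."""
    # Header names are case-insensitive, so normalize them.
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")
    signals = []
    if "Googlebot" in ua:
        # Only a claim; the real Googlebot has to be verified some
        # other way before this means anything good.
        signals.append("claims-googlebot")
    if "Firefox/" in ua and "sec-ch-ua" in h:
        # The inconsistency is the signal, whichever header is fake.
        signals.append("firefox-ua-with-client-hints")
    return signals


print(suspicious_signals({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:147.0) Gecko/20100101 Firefox/147.0",
    "Sec-CH-UA": '"Chromium";v="140"',
}))
# ['firefox-ua-with-client-hints']
```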

What an attacker tells you (deliberately or accidentally) is a signal, and it's up to you to interpret and use that signal (which I think you should these days). This is an empirical thing, something that depends on the surrounding environment (for example, you have to interpret the attacker's signal in terms of its difference from the signals of legitimate visitors), what you're doing, and what you care about, but then security is always ultimately people, not math, even though tech loves to avoid this sort of empiricism (which is a bad thing).

As a pragmatic thing, it's usually easier to use attacker signals if you allow things by default rather than deny them by default. If you allow by default, your primary concern is false positives (legitimate visitors who are emitting signals you find too suspicious), rather than false negatives, because an attacker that wants to work hard enough can always obtain access. Conveniently, public web sites (such as Wandering Thoughts) are exactly such an allow by default environment, which is why these days I use a lot of signals here when deciding what to accept or block (including IP addresses and networks).

(If you need a deny by default environment with real security, you need to use something that attackers can't fake. IP addresses can be one option in the right circumstances, but they aren't the only one.)

I think dependency cooldowns would be a good idea for Go

By: cks

Via Filippo Valsorda, I recently heard about a proposal to add dependency cooldowns to Go. The general idea of dependency cooldowns is to make it so that people don't immediately update to new versions of dependencies; instead, you wait some amount of time for people to inspect the new version and so on (either through automated tooling or manual work). Since one of Go's famous features is 'minimum version selection', you might think that a cooldown would be unnecessary, since people have to manually update the version of dependencies anyway and don't automatically get them.

Unfortunately, this is not the actual observed reality. In the actual observed reality, people update dependency versions fast enough to catch out other people who change what a particular version is of a module they publish. This seems to be in part from things like 'Dependabot' automatically cruising around looking for version updates, but in general it seems clear that some amount of people will update to new versions of dependencies the moment those new versions become visible to them. And if a dependency is used widely enough, through random chance there's pretty much always going to be a developer somewhere who is running 'go list -m -u all' right after a new version of the package is released. So I feel that some sort of a cooldown would be useful in practice, even with Go's other protections.

I follow the VCS repositories of a fair number of Go projects, and a lot of their dependency updates are automated, through things like Dependabot. If these things supported dependency cooldowns and if people turned that on, we might get a lot of the benefit without Go's own mechanisms having to add code to support this. On the other hand, not everyone uses Dependabot or equivalent features (especially if people migrate away from Github, as some are) and there's always going to be people checking and doing dependency updates by hand. To support them, we need assistance from tooling.

(In theory this tooling assistance could be showing how old a version is and then leaving it up to people to notice and decide, but in practice I feel that's abrogating responsibilities. We've seen that show before; easy support and defaults matter.)

While I don't have any strong or well informed opinions on how this should be implemented in Go, I do feel that both defaults and avoiding mistakes are important. This biases me towards, say, a setting for this in your go.mod, because then that way it's automatically persistent and everyone who works on your project gets it applied automatically, unlike (for example) an environment variable that you have to make sure everyone has set.
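
(As an illustration of the basic cooldown rule, and not of the actual Go proposal or any real tool, here is a Python sketch that picks the newest version that has been public for at least the cooldown period. The function name and the input format are assumptions I made up for the example.)

```python
from datetime import datetime, timedelta, timezone


def newest_allowed(versions, cooldown, now):
    """Pick the newest version that is at least 'cooldown' old.

    versions is assumed to be a list of (version, published) pairs in
    ascending version order, with timezone-aware datetimes.
    """
    cutoff = now - cooldown
    eligible = [v for v, published in versions if published <= cutoff]
    return eligible[-1] if eligible else None


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
versions = [
    ("v1.0.0", datetime(2025, 12, 1, tzinfo=timezone.utc)),
    ("v1.1.0", datetime(2026, 1, 9, tzinfo=timezone.utc)),
]
print(newest_allowed(versions, timedelta(days=7), now))
# v1.0.0
```

With a seven day cooldown, the one day old v1.1.0 is skipped even though it's the latest version.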

(This elaborates on some badly phrased thoughts I posted on the Fediverse.)

On today's web, HTTP results depend on the HTTP User-Agent you use

By: cks

Back in the old days, search engines mostly crawled your sites with their regular, clearly identifying HTTP User-Agent headers, but once in a while they would switch up to fetching with a browser's User-Agent. What they were trying to detect was if you served one set of content to "Googlebot" but another set of content to "Firefox", and if you did they tended to penalize you; you were supposed to serve the same content to both, not SEO-bait to Googlebot and wall to wall ads to browsers. Googlebot identified itself as a standard courtesy, not so you could handle it differently.

Obviously those days are long over. It's now routine and fully accepted to serve different things to Googlebot and to regular browsers. Generally websites offer Googlebot more access and plain text, and browsers less access (even paywalls) and JavaScript encrusted content (leading to people setting their User-Agent to Googlebot to bypass paywalls). Since people give Googlebot special access, people impersonate it and other well accepted crawlers and other people (like me) block that impersonation.

This is part of an increasingly common general pattern, which is that different HTTP User-Agents get different results for the same URL. Especially, some HTTP User-Agents will get errors, HTTP redirections, or challenge pages, and other User-Agents won't; instead they'll get the real content. What this means in concrete terms is it's increasingly bad to take the results from one HTTP User-Agent and assume they apply for another. This isn't just me and Wandering Thoughts; for example, if a site has a standard configuration of Anubis, having a User-Agent that includes 'Mozilla' will cause you to get a challenge page instead of the actual page (cf).

(One of the amusing effects of this is what it does to 'link previews', which require the website displaying the preview to fetch a copy of the URL from the original site. On the Fediverse, fairly often the link preview I see is just some sort of a challenge page.)

In practice, you're probably reasonably safe if you're doing close variations of what's fundamentally the same distinctive User-Agent. But you're living dangerously if you try this with browser-like User-Agent values, either two different ones or a browser-like User-Agent and a distinctive non-browser one, because those are the ones that are most frequently forged and abused by covert web crawlers and other malware. Everyone who wants to look normal is imitating a browser, which means looking like a browser is a bad idea today.

Unfortunately, however bad an idea it is, people seem to keep trying fetches with multiple User-Agent header values and then taking a result from one User-Agent and using it in the context of another. Especially, feed reader companies seem to do it, first Feedly and now Inoreader.

You (I) should document the forms of your Django web application

By: cks

We have a long-standing Django web application to handle (Unix) account requests. Since these are requests, there is some state involved, so for a long time a request could be pending, approved, or rejected, with the extra complexity that an approved request might be incomplete and waiting on the person to pick their login. Recently I added being able to put a request into a new state, 'held', in order to deal with some local complexities where we might have a request that we didn't want to delete but also didn't want to go through to create an account.

(For instance, it's sometimes not clear if new incoming graduate students who've had to defer their arrival are going to turn up later or wind up not coming at all. So now we can put their requests on hold.)

When I initially wrote the new code, I thought that this new 'held' status was relatively weak, and in particular that professors (who approve accounts) could easily take an account request out of 'held' status and approve it. At the time I decided that this was probably a feature, since a professor might know that one of their graduate students was about to turn up after all and this way they didn't have to get us to un-hold the account request. Then the other day we sort of wanted to hold an account request even against the professor involved approving it, and because I knew that the 'held' status was weak this way, I didn't bother trying.

Well, it turns out I was wrong. Because I had forgotten how our forms worked, I hadn't realized that my new 'held' status was less 'held' and more 'frozen', and I only learned better today because I took a stab at creating a real 'frozen' status. In the current state, while it's possible for professors to deliberately un-hold a request, it takes a certain amount of work to find the one obscure place it's possible and you can't do it by accident (and it would be easy to close that possibility off if we decided to). You definitely can't accidentally approve a request that's currently held without realizing it.

(So my admittedly modest amount of work to add a 'frozen' status was sort of wasted, although it did lead to greater understanding in the end.)

Past me, immersed in the application, presumably found all of the rules about who could see what form and what they showed to be obvious (at least in context). Present me is a long distance from past me and did not remember all of those things. Brief documentation on each form would have been really quite handy, and if I'm smart I'll spend some time this time around to write some.

I'm not sure where I'll put any new forms documentation. Probably not in our views.py, which is already big enough. I could put it in urls.py, or I could write a separate README.forms file that doesn't try to embed this in code. And I know that I don't want to put it in Python docstrings, because I wrote some things in Python docstrings on the existing forms functions and then didn't read them. Even if I had read them, the existing docstrings don't entirely cover the sort of information I now know I want to know.

(I think there's a good reason for my not reading my own docstrings, but that's for another entry.)

UEFI-only booting with GRUB has gone okay on our (Ubuntu 24.04) servers

By: cks

We've been operating Ubuntu servers for a long time and for most of that time we've booted them through traditional MBR BIOS boots. Initially it was entirely through MBR and then later it was still mostly through MBR (somewhat depending on who installed a particular server; my co-workers are more tolerant of UEFI than I am). But when we built the 24.04 version of our customized install media, my co-worker wound up making it UEFI only, and so for the past two years all of our 24.04 machines have been UEFI (with us switching BIOSes on old servers into UEFI mode as we updated them). The headline news is that it's gone okay, more or less as you'd expect and hope by now.

All of our servers have mirrored system disks, and the one UEFI thing we haven't really had to deal with so far is fixing Ubuntu's UEFI boot disk redundancy stuff after one disk fails. I think we know how to do it in theory but we haven't had to go through it in practice. It will probably work out okay but it does make me a bit nervous, along with the related issue that the Ubuntu installer makes it hard to be consistent about which disk your '/boot/efi' filesystem comes from.

(In the installer, /boot/efi winds up on the first disk that you set as the boot device, but the disks aren't always presented in order so you can do this on 'the first disk' in the installer and discover that the first disk it listed was /dev/sdb.)

The Ubuntu 24.04 default bootloader is GRUB, so that's what we've wound up with even though as a UEFI-only environment we could in theory use simpler ones, such as systemd-boot. I'm not particularly enthused about GRUB but in practice it does what we want, which is to reliably boot our servers, and it has the huge benefit that it's actively supported by Ubuntu (okay, Canonical) so they're going to make sure it works right, including with their UEFI disk redundancy stuff. If Ubuntu switches default UEFI bootloaders in their server installs, I expect we'll follow along.

(I don't know if Canonical has any plans to switch away from GRUB to something else. I suspect that they'll stick with GRUB for as long as they support MBR booting, which I suspect will be a while, especially as people look more and more likely to hold on to old hardware for much longer than normally expected.)

PS: One reason I'm writing this down is that I've been unenthused about UEFI for a long time, so I'm not sure I would have predicted our lack of troubles in advance. So I'm going to admit it, UEFI has been actually okay. And in its favour, UEFI has regularized some things that used to be pretty odd in the MBR BIOS era.

(I'm still not happy about the UEFI non-story around redundant system disks, but I've accepted that hacks like the Ubuntu approach are the best we're going to get. I don't know what distributions such as Fedora are doing here; my Fedora machines are MBR based and staying that way until the hardware gets replaced, which on current trends won't be any time soon.)

The story of one of my worst programming failures

By: cks

Somewhat recently, GeePaw Hill shared the story of what he called his most humiliating experience as a skilled and successful computer programmer. It's an excellent, entertaining story with a lesson for all of us, so I urge you to read it. Today I'm going to tell the story of one of my great failures, where I may have quietly killed part of a professor's research project by developing on a too-small machine.

Once upon a time, back when I was an (advanced) undergraduate, I was hired as a part time research programmer for a Systems professor to work on one of their projects, at first with a new graduate student and then later alone (partly because the graduate student switched from Systems to HCI). One of this professor's research areas was understanding and analyzing disk IO patterns (a significant research area at the time), and my work was to add detailed IO tracing to the Ultrix kernel. Some of this was porting work the professor had done with the 4.x BSD kernel (while a graduate student and postdoc) into the closely related, BSD-derived Ultrix kernel, but we extended the original filesystem level tracing down all the way to capturing block IO traces (still specifically attributed to filesystem events).

We were working on Ultrix because my professor had a research and equipment grant from DEC. DEC was interested in this sort of information for improving the IO performance of the Ultrix kernel, and part of the benefit of working with DEC was that DEC could arrange for us to get IO traces from real customers with real workloads, instead of university research system workloads. Eventually the modified kernel worked, gathered all the data that we wanted (and gave us some insights even on our systems), and was ready for the customer site. We talked to DEC and it was decided that the best approach was that I would go down to Boston with the source code, meet with the DEC people involved, we'd build a kernel for the customer's setup, and then I'd go with the DEC people to the customer site to actually boot into it and turn the tracing on.

Very shortly after we booted the new kernel on the customer's machine and turned tracing on, the kernel panicked. It was a nice, clear panic message from my own code, basically an assertion failure, and what it said was more or less 'disk block number too large to fit into data field'. I looked at that and had a terrible sinking feeling.

This was long enough ago (with small enough disks) that having very compact trace data was extremely important, especially at the block IO layer (where we were generating a lot of trace records). As a result, I'd carefully designed the on-disk trace records to be as small as possible. As part of that I'd tried to cut down the size of fields to be only as big as necessary, and one of the fields I'd minimized was the disk block address of the IO. My minimized field was big enough for the block addresses on our Ultrix machines (donated by DEC), with not very big disks, but it was now obviously too small for the bigger disks that the company had bought from DEC for their servers. In a way I was lucky that I'd taken the precaution of putting in the size check that panicked, because otherwise we could have happily wasted time collecting corrupted traces with truncated block addresses.

(All of this was long enough ago that I can't remember how small the field was, although my mind wants to say 24 bits. If it was 24 bits, I had to be using 4 Kbyte filesystem block addresses, not 512-byte sector addresses.)
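
As a rough illustration of the arithmetic, here is a small Python sketch of that sort of range check. It assumes the 24-bit guess is right, and the function names are made up for the sketch; none of this is the original kernel code.

```python
FIELD_BITS = 24  # assumed width; the text above is not certain of this


def max_addressable_bytes(field_bits, unit_bytes):
    """Size of the largest disk whose block addresses fit in the field."""
    return (1 << field_bits) * unit_bytes


def pack_block_address(block, field_bits=FIELD_BITS):
    """Pack a block number into a fixed-width big-endian field.

    Raises OverflowError instead of silently truncating, which is the
    userspace analogue of the kernel panic in the story.
    """
    if block < 0 or block >= (1 << field_bits):
        raise OverflowError("disk block number too large to fit into data field")
    return block.to_bytes(field_bits // 8, "big")


GIB = 1024 ** 3
print(max_addressable_bytes(FIELD_BITS, 4096) // GIB, "GiB with 4 Kbyte blocks")
print(max_addressable_bytes(FIELD_BITS, 512) // GIB, "GiB with 512-byte sectors")
```

Under these assumptions, a 24-bit field covers a 64 GiB disk with 4 Kbyte blocks but only 8 GiB with 512-byte sectors, which is why the block size matters to the guess.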

Once I saw the panic message, both the mistake and the fix were obvious, and the code and so on were well structured enough that it was simple to make the change; I could almost have done it on the spot (or at least while in Boston). But, well, you only get one kernel panic from your new "we assure you this is going to work" kernel on a customer's machine, especially if you only have one evening to gather your trace data and you can't rebuild a kernel from source while at the customer's site, so the DEC people and I had to pack up and go back empty handed. Afterward, I flew back to Toronto from Boston, made the simple change, and tested everything. But I never went back to Boston for another visit with DEC, and I don't think that part of my professor's research projects went anywhere much after that.

(My visit to Boston and its areas did feature getting driven around at somewhat unnervingly fast speeds on the Massachusetts Turnpike in the sports car of one of the DEC people involved.)

So that's the story of how I may have quietly killed one of my professor's research projects by developing on a too-small machine.

(That's obviously not the only problem. When I was picking the field size, I could have reached out somehow to ask how big DEC's disks got, or maybe ran the field size past my professor to see if it made sense. But I was working alone and being trusted with all of this, and I was an undergraduate, although I had significant professional programming experience by then.)

Sidebar: Fixing an earlier spectacular failure

(All of the following is based on my fallible memory.)

The tracing code worked by adding trace records to a buffer in memory and then writing out the buffer to the trace file when it was necessary. The BSD version of the code that I started with (which traced only filesystem level IO) did this synchronously, created trace records even for writing out the trace buffer, and didn't protect itself against being called again. A recursive call would deadlock but usually it all worked because you didn't add too many new trace records while writing out the buffer.

(Basically, everything that added a trace record to the buffer checked to see if the buffer was too full and if it was, immediately called the 'flush the trace buffer' code.)
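
The 'flush when full' pattern can be sketched in Python rather than kernel C. Everything here is hypothetical (the class, the names, the list standing in for the trace file), and unlike the BSD code described above, this version has a simple re-entrancy guard so that the sketch terminates instead of deadlocking.

```python
class TraceBuffer:
    """Toy synchronous trace buffer; 'sink' is a list standing in for the trace file."""

    def __init__(self, capacity, sink):
        self.capacity = capacity
        self.sink = sink
        self.records = []
        self.flushing = False  # the guard the original code is described as lacking
        self.dropped = 0

    def add(self, record):
        if self.flushing and len(self.records) >= self.capacity:
            # We are already writing the buffer out and have no room left.
            # Without a guard, this is where a recursive flush would happen.
            self.dropped += 1
            return
        self.records.append(record)
        if len(self.records) >= self.capacity and not self.flushing:
            self.flush()

    def flush(self):
        self.flushing = True
        try:
            batch, self.records = self.records, []
            self.sink.extend(batch)
            # Writing the buffer out is itself traced IO, so it adds a record.
            self.add("trace-file write")
        finally:
            self.flushing = False


sink = []
buf = TraceBuffer(3, sink)
for rec in ("read a", "read b", "read c"):
    buf.add(rec)
print(sink, buf.records)
```

With only filesystem level records, few new records arrive during a flush, so the full-buffer case inside flush() is rare. A much higher record rate makes it common, which is the situation where an unguarded version falls over.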

This approach blew up spectacularly when I added block IO tracing; the much higher volume of records being added made deadlocks relatively common. The whole approach to writing out the trace buffer had to change completely, into a much more complex one with multiple processes involved and genuinely asynchronous writeout. I still have a vivid memory of making this relatively significant restructuring and then doing an RCS ci with a commit message that included a long, then-current computing quote about replacing one set of code with known bugs with a new set of code with new unknown ones.

(At this remove I have no idea what the exact quote was and I can't find it in a quick online search. And unfortunately the code and its RCS history is long since gone.)

Power glitches can leave computer hardware in weird states

By: cks

Late Friday night, the university's downtown campus experienced some sort of power glitch or power event. A few machines rebooted, a number of machines dropped out of contact for a bit (which probably indicates some switches restarting), and most significantly, some of our switches wound up in a weird, non-working state despite being powered on. This morning we cured the situation by fully power cycling all of them.

This isn't the first time we've seen brief power glitches leave things in unusual states. In the past we've seen it with servers, with BMCs (IPMIs), and with switches. It's usually not every machine, either; some machines won't notice and some will. When we were having semi-regular power glitches, there were definitely some models of server that were more prone to problems than others, but even among those models it usually wasn't universal.

It's fun to speculate about reasons why some particular servers of a susceptible model would survive and others not, but that's somewhat beside today's point, which is that power glitches can get your hardware into weird states (and your hardware isn't broken when and because this happens; it can happen to hardware that's in perfectly good order). We'd like to think that the computers around us are binary, either shut off entirely or working properly, but that clearly isn't the case. A power glitch like this peels back the comforting illusion to show us the unhappy analog truth underneath. Modern computers do a lot of work to protect themselves from such analog problems, but obviously it doesn't always work completely.

(My wild speculation is that the power glitch has shifted at least part of the overall system into a state that's normally impossible, and either this can't be recovered from or the rest of the system doesn't realize that it has to take steps to recover, for example forcing a full restart. See also flea power, where a powered off system still retains some power, and sometimes this matters.)

PS: We've also had a few cases where power cycling the hardware wasn't enough, which is almost certainly flea power at work.

PPS: My steadily increasing awareness of the fundamentally analog nature of a lot of what I take as comfortably digital has come in part from exposure on the Fediverse to people who deal with fun things like differential signaling for copper Ethernet, USB, and PCIe, and the spooky world of DDR training, where very early on your system goes to some effort to work out the signal characteristics of your particular motherboard, RAM, and so on so that it can run the RAM as fast as possible (cf).

(Never mind all of the CPU errata about unusual situations that aren't quite handled properly.)

If there are URLs in your HTTP User-Agent, they should exist and work

By: cks

One of the things people put in their HTTP User-Agent header for non-browser software is a URL for their software, project, or whatever (I'm all for this). This is a good thing, because it allows people operating web servers to check out who and what you are and decide for themselves if they're going to allow it. Increasingly (and partly for social reasons), I block many 'generic' User-Agent values that come to my attention, for example through their volume.

(I don't block all of them, but if your User-Agent shows up and I can't figure out what it is and whether or not it's legitimate and used by real people, that's probably a block.)

However, there's an important and obvious thing about any URLs in your HTTP User-Agent, which is that they should actually work. The domain or host should exist, the URL should exist in the web server, and the URL's contents should actually explain the software, project, or organization involved. Plus, if you use an HTTPS website, the TLS certificate should be valid.

(A related thing is a generic URL that doesn't give me anything to go on. For example, your URL on a code forge, and either it's not obvious which one of your repositories is doing things or you don't have any public repositories.)

For me, a non-working URL is much more suspicious than a missing URL. HTTP User-Agents without URLs are reasonably common (especially in feed readers), so I don't find them immediately suspicious. Non-working URLs in mysterious User-Agents certainly look like you're attempting to distract me with the appearance of a proper web agent but without the reality of it. If a User-Agent with such a non-working URL comes to my attention, I'm very likely to block it in some way (unless it's very clear that it's a legitimate program used by real people, and it merely has bad habits with its User-Agent).

You would think that people wouldn't make this sort of mistake, but I regret to say that I've seen it repeatedly, in all of the variations. One interesting version I've seen is User-Agent strings with the various 'example.<TLD>' domains in their URLs. I suspect that this comes from software that has some sort of 'operator URL' setting and provides a default value if you don't set one explicitly. I've also seen .lan and .local URLs in User-Agents, which takes somewhat more creativity.

As usual, my view is that software shouldn't provide this sort of default value; instead, it should refuse to work until you configure your own value. However, this makes it slightly more annoying to use, so it will be less popular than more accommodating software. Of course, we can change that calculation by blocking everything that mentions 'example.com', 'example.org', 'example.net' and so on in its User-Agent.
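
The cheap, offline part of this check can be sketched in a few lines of Python. This is only an illustration with a made-up function name, and it covers just the cases mentioned above ('example.<TLD>', .lan, and .local); whether a host resolves, the URL exists, and the TLS certificate is valid would need real network requests that the sketch doesn't attempt.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose: grab anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s;,()<>\"']+")


def suspicious_urls(user_agent):
    """Return (url, reason) pairs for URLs that can't be real public sites."""
    problems = []
    for url in URL_RE.findall(user_agent):
        try:
            host = (urlsplit(url).hostname or "").lower().rstrip(".")
        except ValueError:
            problems.append((url, "unparseable URL"))
            continue
        labels = host.split(".")
        if len(labels) >= 2 and labels[-2] == "example":
            problems.append((url, "placeholder example domain"))
        elif labels[-1] in ("lan", "local"):
            problems.append((url, "non-public domain"))
        elif len(labels) < 2:
            problems.append((url, "not a fully qualified host name"))
    return problems


print(suspicious_urls("MyBot/1.0 (+https://example.com/bot)"))
print(suspicious_urls("Fetcher/2.1 (+http://crawler.lan/info; admin)"))
```

A User-Agent with no URL at all returns an empty list here, in line with a missing URL being less suspicious than a non-working one.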

Restricting IP address access to specific ports in eBPF: a sketch

By: cks

The other day I covered how I think systemd's IPAddressAllow and IPAddressDeny restrictions work, which unfortunately only allows you to limit this to specific (local) ports only if you set up the sockets for those ports in a separate systemd.socket unit. Naturally this raises the question of whether there is a good, scalable way to restrict access to specific ports in eBPF that systemd (or other interested parties) could use. I think the answer is yes, so here is a sketch of how I think you'd do this.

Why we care about a 'scalable' way to do this is because systemd generates and installs its eBPF programs on the fly. Since tcpdump can do this sort of cross-port matching, we could write an eBPF program that did it directly. But such a program could get complex if we were matching a bunch of things, and that complexity might make it hard to generate on the fly (or at least make it complex enough that systemd and other programs didn't want to). So we'd like a way that still allows you to generate a simple eBPF program.

Systemd uses cgroup socket SKB eBPF programs, which attach to a cgroup and filter all network packets on ingress or egress. As far as I can understand from staring at code, these are implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls a LPM (Longest Prefix Match) map. The normal way to use an LPM map is to use the CIDR prefix length and the start of the CIDR network as the key (for individual IPv4 addresses, the prefix length is 32), and then match against them, so this is what systemd's cgroup program does. This is a nicely scalable way to handle the problem; the eBPF program itself is basically constant, and you have a couple of eBPF maps (for the allow and deny sides) that systemd populates with the relevant information from IPAddressAllow and IPAddressDeny.

However, there's nothing in eBPF that requires the keys to be just CIDR prefixes plus IP addresses. A LPM map key has to start with a 32-bit prefix, but the size of the rest of the key can vary. This means that we can make our keys be 16 bits longer and stick the port number in front of the IP address (and increase the CIDR prefix size appropriately). So to match packets to port 22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix length then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm doing the math and understanding the structure right). When you query this LPM map, you put the appropriate port number in front of the IP address.
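
To check the byte layout, here is a Python sketch that builds such a key and emulates the prefix comparison in userspace. It assumes the usual LPM trie key layout as I understand it (a native-endian 32-bit prefix length followed by data in network byte order), and the function names are made up; nothing here touches real eBPF maps.

```python
import ipaddress
import struct


def port_ip_lpm_key(port, cidr):
    """Build a 'port + address' LPM key: u32 prefix length, then the data bytes."""
    net = ipaddress.ip_network(cidr)
    prefixlen = 16 + net.prefixlen  # the 16 port bits always have to match
    data = struct.pack("!H", port) + net.network_address.packed
    return struct.pack("=I", prefixlen) + data


def key_matches(entry_key, query_key):
    """Emulate one LPM comparison: do the first prefixlen bits of data agree?

    Assumes both keys are the same length (that is, the same address family).
    """
    (prefixlen,) = struct.unpack("=I", entry_key[:4])
    nbits = 8 * (len(entry_key) - 4)
    shift = nbits - prefixlen
    entry = int.from_bytes(entry_key[4:], "big")
    query = int.from_bytes(query_key[4:], "big")
    return (entry >> shift) == (query >> shift)


entry = port_ip_lpm_key(22, "128.100.0.0/16")
print(entry[4:].hex(" "))
print(key_matches(entry, port_ip_lpm_key(22, "128.100.3.20/32")))
print(key_matches(entry, port_ip_lpm_key(80, "128.100.3.20/32")))
```

Because the port comes first in the data, a packet to any other port fails to match within the first 16 bits, no matter what its source address is.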

This does mean that each separate port with a separate set of IP address restrictions needs its own set of map entries. If you wanted a set of ports to all have a common set of restrictions, you could use a normally structured LPM map and a second plain hash map where the keys are port numbers. Then you check the port and the IP address separately, rather than trying to combine them in one lookup. And there are more complex schemes if you need them.
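
The two-map variant can be sketched as plain Python, with a set standing in for the hash map of ports and a list of networks standing in for the LPM map. The names are hypothetical, and letting traffic to unlisted ports through is an assumption of this sketch rather than something the scheme requires.

```python
import ipaddress

# Stand-in for the plain hash map keyed by port number.
restricted_ports = {22, 2222}

# Stand-in for the normally structured LPM map of allowed networks.
allowed_nets = [
    ipaddress.ip_network("128.100.0.0/16"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def allow_packet(dst_port, src_addr):
    """Check the port and the address separately, as two independent lookups."""
    if dst_port not in restricted_ports:
        return True  # assumption: ports without restrictions are left alone
    addr = ipaddress.ip_address(src_addr)
    return any(addr in net for net in allowed_nets)


print(allow_packet(22, "128.100.3.20"))
print(allow_packet(22, "203.0.113.5"))
print(allow_packet(443, "203.0.113.5"))
```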

Which scheme you'd use depends on how you expect port based access restrictions to be used. Do you expect several different ports, each with its own set of IP access restrictions (or only one port)? Then my first scheme is only a minor change from systemd's current setup, and it's easy to extend it to general IP address controls as well (just use a port number of zero to mean 'this applies to all ports'). If you expect sets of ports to all use a common set of IP access controls, or several sets of ports with different restrictions for each set, then you might want a scheme with more maps.

(In theory you could write this eBPF program and set up these maps yourself, then use systemd resource control features to attach them to your .service unit. In practice, at that point you probably should write host firewall rules instead; it's likely to be simpler. But see this blog post and the related VCS repository, although that uses a more hard-coded approach.)

Your terminal program has to be where xterm's ziconbeep feature is handled

By: cks

I recently wrote about things that make me so attached to xterm. One of those things is xterm's ziconbeep feature, which causes xterm to visibly and perhaps audibly react when it's iconified or minimized and gets output. A commentator suggested that this feature should ideally be done in the window manager, where it could be more general. Unfortunately we can't do the equivalent of ziconbeep in the window manager, or at least we can't do all of it.

A window manager can sound an audible alert when a specific type of window changes its title in a certain way. This would give us the 'beep' part of ziconbeep in a general way, although we're treading toward a programmable window manager. But then, Gnome Shell now does a lot of stuff in JavaScript and its extensions are written in JS and the whole thing doesn't usually blow up. So we've got prior art for writing an extension that reacts to window title changes and does stuff.

What the window manager can't really do is reliably detect when the window has new output, in order to trigger any beeping and change the visible window title. As far as I know, neither X nor Wayland give you particularly good visibility into whether the program is rendering things, and in some ways of building GUIs, you're always drawing things. In theory, a program might opt to detect that it's been minimized and isn't visible and so not render any updates at all (although it will be tracking what to draw for when it's not minimized), but in practice I think this is unfashionable because it gets in the way of various sorts of live previews of minimized windows (where you want the window's drawing surface to reflect its current state).

Another limitation of this as a general window manager feature is that the window manager doesn't know what changes in the appearance of a window are semantically meaningful and which ones are happening because, for example, you just changed some font preference and the program is picking up on that. Only the program itself knows what's semantically meaningful enough to signal for people's attention. A terminal program can have a simple definition but other programs don't necessarily; your mail client might decide that only certain sorts of new email should trigger a discreet 'pay attention to me' marker.

(Even in a terminal program you might want more control over this than xterm gives you. For example, you might want the terminal program to not trigger 'zicon' stuff for text output but instead to do it when the running program finishes and you return to the shell prompt. This is best done by being able to signal the terminal program through escape sequences.)

How I think systemd IP address restrictions on socket units works

By: cks

Among the systemd resource controls are IPAddressAllow= and IPAddressDeny=, which allow you to limit what IP addresses your systemd thing can interact with. This is implemented with eBPF. A limitation of these as applied to systemd .service units is that they restrict all traffic, both inbound connections and things your service initiates (like, say, DNS lookups), while you may want only a simple inbound connection filter. However, you can also set these on systemd.socket units. If you do, your IP address restrictions apply only to the socket (or sockets), not to the service unit that it starts. To quote the documentation:

Note that for socket-activated services, the IP access list configured on the socket unit applies to all sockets associated with it directly, but not to any sockets created by the ultimately activated services for it.

So if you have a systemd socket activated service, you can control who can access the socket without restricting who the service itself can talk to.

In general, systemd IP access controls are done through eBPF programs set up on cgroups. If you set up IP access controls on a socket, such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs attached to the ssh.socket cgroup (and there is a ssh.socket cgroup, perhaps because of the eBPF programs):

# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID  AttachType      AttachFlags  Name
12  cgroup_inet_ingress   multi  sd_fw_ingress
11  cgroup_inet_egress    multi  sd_fw_egress

However, if you look there are no processes or threads in the ssh.socket cgroup, which is not really surprising but also means there is nothing there for these eBPF programs to apply to. And if you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.

What I think must be going on is that the eBPF filtering program is connected to the SSH socket itself. Since I can't find any relevant looking uses in the systemd code of the `SO_ATTACH_*' BPF related options from socket(7) (which would be used with setsockopt(2) to directly attach programs to a socket), I assume that what happens is that if you create or perhaps start using a socket within a cgroup, that socket gets tied to the cgroup and its eBPF programs, and this attachment stays when the socket is passed to another program in a different cgroup.

(I don't know if there's any way to see what eBPF programs are attached to a socket or a file descriptor for a socket.)

If this is what's going on, it unfortunately means that there's no way to extend this feature of socket units to get per-port IP access control in .service units. Systemd isn't writing special eBPF filter programs for socket units that only apply to those exact ports, which you could in theory reuse for a service unit; instead, it's arranging to connect (only) specific sockets to its general, broad IP access control eBPF programs. Programs that make their own listening sockets won't be doing anything to get eBPF programs attached to them (and only them), so we're out of luck.

(One could experiment with relocating programs between cgroups, with the initial cgroup in which the program creates its listening sockets restricted and the other not, but I will leave that up to interested parties.)

Sometimes, non-general solutions are the right answer

By: cks

I have a Python program that calculates and prints various pieces of Linux memory information on a per-cgroup basis. In the beginning, its life was simple; cgroups had a total memory use that was split between 'user' and '(filesystem) cache', so the program only needed to display either one field or a primary field plus a secondary field. Then I discovered that there was additional important (ie, large) kernel memory use in cgroups and added the ability to report it as an additional option for the secondary field. However, this wasn't really ideal, because now I had a three-way split and I might want to see all three things at once.

A while back I wrote up my realization about flexible string formatting with named arguments. This sparked all sorts of thoughts about writing a general solution for my program that could show any number of fields. Recently I took a stab at implementing this and rapidly ran into problems figuring out how I wanted to do it. I had multiple things that could be calculated and presented, I had to print not just the values but also a header with the right field names, I'd need to think about how I structured argparse argument groups in light of argparse not supporting nested groups, and so on. At a minimum this wasn't going to be a quick change; I was looking at significantly rewriting how the program printed its output.

The other day, I had an obvious realization: while it would be nice to have a fully general solution that could print any number of additional fields, which would meet my needs now and in the future, all that I needed right now was an additional three-field version with the extra fields hard-coded and the whole thing selected through a new command line argument. And this command line argument could drop right into the existing argparse exclusive group for choosing the second field, even though this feels inelegant.

(The fields I want to show are added with '-c' and '-k' respectively in the two field display, so the morally correct way to select both at once would be '-ck', but currently they're exclusive options, which is enforced by argparse. So I added a third option, literally '-b' for 'both'.)
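
A minimal sketch of what this looks like in argparse follows. The option letters are from the text, but the parser layout, the 'extra' destination, and the 'cache' and 'kernel' field names are guesses made for the illustration, not the real program.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="per-cgroup memory report (sketch)")
    # One exclusive group picks what is shown beside the primary field.
    second = parser.add_mutually_exclusive_group()
    second.add_argument("-c", dest="extra", action="store_const", const="cache",
                        help="also show cache memory")
    second.add_argument("-k", dest="extra", action="store_const", const="kernel",
                        help="also show kernel memory")
    # The inelegant but quick option: a hard-coded 'both' in the same group.
    second.add_argument("-b", dest="extra", action="store_const", const="both",
                        help="show both cache and kernel memory")
    return parser


def columns(extra):
    """Hard-coded column sets for the one-, two-, and three-field displays."""
    cols = ["total"]
    if extra == "both":
        cols += ["cache", "kernel"]
    elif extra is not None:
        cols.append(extra)
    return cols


args = build_parser().parse_args(["-b"])
print(columns(args.extra))
```

Since '-b' lives in the same exclusive group, argparse rejects combinations like '-c -k' or '-b -c' on its own, with no extra checking code.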

Actually implementing this hard-coded version was a bit annoying for structural reasons, but I put the whole thing together in not very long; certainly it was much faster than a careful redesign and rewrite (in an output pattern I haven't used before, no less). It's not necessarily the right answer for the long term, but it's definitely the right answer for now (and I'm glad I talked myself into doing it).

(I'm definitely tempted to go back and restructure the whole output reporting to be general. But now there's no rush to it; it's not blocking a feature I want, it's a cleanup.)

A taxonomy of text output (from tools that want to be too clever)

By: cks

One of my long standing gripes with Debian and Ubuntu is, well, I'll quote myself on the Fediverse:

I understand that Debian wants me to use 'apt' instead of apt-get, but the big reason I don't want to is because you can't turn off that progress bar at the bottom of your screen (or at least if you can it's not documented). That curses progress bar is something that I absolutely don't want (and it would make some of our tooling explode, yes we have tooling around apt-get).

Over time, I've developed opinions on what I want to see tools do for progress reports and other text output, and what I feel is increasingly too clever in tools that makes them more and more inconvenient for me. Today I'm going to try to run down that taxonomy, from best to worst.

  1. Line by line output in plain text with no colours.
  2. Represent progress by printing successive dots (or other characters) on the line until finally you print a newline. This is easy to capture and process later, since the end result is a newline terminated line with no control characters.

  3. Reporting progress by printing dots (or other characters) and then backspacing over them to erase them later. Pagers like less have some ability to handle backspaces, but this will give you heartburn in your own programs.

  4. Reporting progress by repeatedly printing a line, backspacing over it, and reprinting it. This produces a lot more output, but I think less and anything that already deals with backspacing over things will generally be able to handle this. I believe apt-get does this.

  5. Any sort of line output with colours (which don't work in my environment, and when they do work they're usually unreadable). Any sort of terminal codes in the output make it complicated to capture the output with tools like script and then look over them later with pagers like less, although less can process a limited amount of terminal codes, including colours.

  6. Progress bar animation on one line with cursor controls and other special characters. This looks appealing but generates a lot more output and is increasingly hard for programs like less to display, search, or analyze and process. However, your terminal program of choice is probably still going to see this as line by line output and preserve various aspects of scrollback and so on.

  7. Progress output that moves the cursor and the output from its normal line to elsewhere on screen, such as at the bottom (as 'apt autoremove' and other bits of 'apt' do). Now you have a full screen program; viewing, reconstructing, and searching its output later is extremely difficult, and its output will blow up increasingly spectacularly if it's wrong about your window size (including if you resize things while it's running) or what terminal sequences your window responds to. Terminal programs and terminal environments such as tmux or screen may well throw up their hands at doing anything smart with the output, since you look much like a full screen editor, a pager, or programs like top. In some environments this may damage or destroy terminal scrollback.

    An additional reason I dislike this style is that it causes output to not appear at the current line. When I run your command line program, I want your program to print its output right below where I started it, in order, because that's what everything else does. I don't want the output jumping around the screen to random other locations. The only programs I accept that from are genuine full screen programs like top. Programs that insist on displaying things at random places on the screen are not really command line programs, they are TUIs cosplaying being CLIs.

  8. Actual full screen output, as a text UI, with the program clearing the screen and printing status reports all over the place. Fortunately I don't think I've seen any 'command line' programs do this; anything that does tends to be clearly labeled as a TUI program, and people mostly don't provide TUIs for command line tools (partly because it's usually more work).

My strong system administrator's opinion is that if you're tempted to do any of these other than the first, you should provide a command line switch to turn these off. Also, you should detect unusual settings of the $TERM environment variable, like 'dumb' or perhaps 'vt100', and automatically disable your smart output. And you should definitely disable your smart output if $TERM isn't set or you're not outputting to a (pseudo-)terminal.
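
As a sketch of how little code these checks take, here is one way to write them in Python. The function name and the 'disabled_by_flag' parameter (standing in for whatever command line switch you provide) are made up for the illustration, and treating 'vt100' as dumb follows the 'perhaps' above rather than any standard.

```python
import os
import sys

# $TERM values that turn smart output off; "" covers an empty $TERM.
DUMB_TERMS = {"", "dumb", "vt100"}


def fancy_output_ok(stream=None, environ=None, disabled_by_flag=False):
    """Decide whether progress bars, colours and so on are acceptable."""
    stream = sys.stdout if stream is None else stream
    environ = os.environ if environ is None else environ
    if disabled_by_flag:
        return False  # the explicit switch always wins
    if environ.get("TERM", "") in DUMB_TERMS:
        return False  # unset, empty, or a terminal type we treat as dumb
    try:
        return bool(stream.isatty())  # pipes, files and so on get plain output
    except (AttributeError, ValueError):
        return False  # not a file-like object, or already closed


if fancy_output_ok():
    print("smart output would be allowed here")
else:
    print("plain line by line output only")
```

Every way of failing here ends in plain output, so a program that gets any of these checks wrong still lands on the first style in the list.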

(Programs that insist on fancy output no matter what make me very unhappy.)

Log messages are mostly for the people operating your software

By: cks

I recently read Evan Hahn's The two kinds of error (via), which talks very briefly in passing about logging, and it sparked a thought. I've previously written my system administrator's view of what an error log level should mean, but that entry leaves out something fundamental about log messages, which is that under most circumstances, log messages are for the people operating your software (I've sort of said this before in a different context). When you're about to add a non-debug log message, one of the questions you should ask is what does someone running your program get out of seeing the message.

Speaking from my own experience, it's very easy to write log messages (and other messages) that are aimed at you, the person developing the program, script, or what have you. They're useful for debugging and for keeping track of the state of the program, and it's natural to write them that way since you're immersed in the program and have all of the context (this is especially a problem for infrequent error messages, which I've learned to make as verbose as possible, and a similar thing applies for infrequently logged messages). But if your software is successful (especially if it gets distributed to other people), most of the people running it won't be the developers, they'll only be operating it.

(This can include a future version of you when you haven't touched this piece of software for months.)

If you want your log messages to be useful for anything other than being mailed to you as part of a 'can you diagnose this' message, they need to be useful for the people operating the software. This doesn't mean 'only report errors that they can fix and need to', although that's part of it. It also means making the information you provide through logs be things that are useful and meaningful to people operating your software, and that they can understand without a magic decoder ring.

If people operating your software won't get anything out of seeing a log message, you probably shouldn't log it by default in the first place (or you need to reword it so that people will get something from it). In Evan Hahn's terminology, this applies to the log messages for both expected errors and unexpected errors, although if the program aborts, it should definitely tell system administrators why it did.

For a system administrator, log messages about expected errors let us diagnose what went wrong to cause something to fail, and how interested we are in them depends partly on how common they are. However, how common they are isn't the only thing. MTAs often have what would be considered relatively verbose logs of message processing and will log every expected error like 'couldn't do a DNS lookup' or 'couldn't connect to a remote machine', even though they can happen a lot. This is very useful because one thing we sometimes care a lot about is what happened to and with a specific email message.

The things that make me so attached to xterm as my terminal program

By: cks

I've said before in various contexts (eg) that I'm very attached to the venerable xterm as my terminal (emulator) program, and I'm not looking forward to the day that I may have to migrate away from it due to Wayland (although I probably can keep running it under XWayland, now that I think about it). But I've never tried to write down a list of the things that make me so attached to it over other alternatives like urxvt, much less more standard ones like gnome-terminal. Today I'm going to try to do that, although my list is probably going to be incomplete.

  • Xterm's ziconbeep feature, which I use heavily. Urxvt can have an equivalent but I don't know if other terminal programs do.

  • I routinely use xterm's very convenient way of making large selections, which is supported in urxvt but not in gnome-terminal (and it can't be since gnome-terminal uses mouse button 3 for its own purposes).

  • The ability to turn off all terminal colours, because they often don't work in my preferred terminal colours. Other terminal programs have somewhat different and sometimes less annoying colours, but it's still far too easy for programs to display things in unreadable colours.

    Yes, I can set my shell environment and many programs to not use colours, but I can't set all of them; some modern programs simply always use colours on terminals. Xterm can be set to completely ignore them.

  • I'm very used to xterm's specific behavior when it comes to what is a 'word' for double-click selection. You can read the full details in the xterm manual page's section on character classes. I'm not sure if it's possible to fully emulate this behavior in other terminal programs; I once made an incomplete attempt in urxvt, while gnome-terminal is quite different and has few or no options for customizing that behavior (in the Gnome way). Generally the modern double click selection behavior is too broad for me.

    (For instance, I'm extremely attached to double-click selecting only individual directories in full paths, rather than the entire thing. I can always swipe to select an entire path, but if I can't pick out individual path elements with a double click my only choice is character by character selection, which is a giant pain.)

    Based on a quick experiment, I think I can make KDE's konsole behave more or less the way I want by clearing out its entire set of "Word characters" in profiles. I think this isn't quite how xterm behaves but it's probably close enough for my reflexes.

  • Xterm doesn't treat text specially because of its contents, for example by underlining URLs or worse, hijacking clicks on them to do things. I already have well evolved systems for dealing with things like URLs and I don't want my terminal emulator to provide any 'help'. I believe that KDE's konsole can turn this off, but gnome-terminal doesn't seem to have any option for it.

  • Many of xterm's behaviors can be controlled from command line switches. Some other terminal emulators (like gnome-terminal) force you to bundle these behaviors together as 'profiles' and only let you select a profile. Similarly, a lot of xterm's behavior can be temporarily changed on the fly through its context menus, without having to change the profile's settings (and then change them back).

  • Every xterm window is a completely separate program that starts from scratch, and xterm is happy to run on remote servers without complications; this isn't something I can say for all other competitors. Starting from scratch also means things like not deciding to place yourself where your last window was, which is konsole's behavior (and infuriates me).

Of these, the hardest two to duplicate are probably xterm's double click selection behavior of what is a word and xterm's large selection behavior. The latter is hard because it requires the terminal program to not use mouse button 3 for a popup menu.

I use some other xterm features, like key binding, including duplicating windows, but I could live without them, especially if the alternate terminal program directly supports modern cut and paste in addition to xterm's traditional style. And I'm accustomed to a few of xterm's special control characters, especially Ctrl-space, but I think this may be pretty universally supported by now (Ctrl-space is in gnome-terminal).

There are probably things that other terminal programs like konsole, gnome-terminal and so on do that I don't want them to (and that xterm doesn't). But since I don't use anything other than xterm (and a bit of gnome-terminal and once in a while a bit of urxvt), I don't know what those undesired features are. Experimenting with konsole for this entry taught me some things I definitely don't want, such as it automatically placing itself where it was before (including placing a new konsole window on top of one of the existing ones, if you have multiple ones).

(This elaborates on a comment I made elsewhere.)

Sometimes the simplest version of a text table is printed from a command

By: cks

Back when we had just started with our current metrics and dashboards adventure, I wrote about how sometimes the simplest version of a graph is a text table. Today I will extend that further: sometimes the simplest version of a text table is to have a command that prints it out, rather than making people look at a web page.

We recently had a major power outage at work, and in the aftermath not all of our machines came back. One of my co-workers is an extreme early bird and he came in to the university about as early as it's possible to on the TTC, and started work on troubleshooting what was going on. One of the things he needed to know was what machines were still down, so he could figure out any common elements to them (and see what machines were stubbornly not coming back on even though they ought to be).

We have Grafana dashboards for this, and the information about what machines are down is present in some of them in tabular form. But it's a table embedded in a widget in a web page, and you need a browser to look at it, which you may not have from the server console of some server you just powered up. Since I like command line tools, at one point I wrote some little scripts that make queries to our Prometheus server with curl and run the result through 'jq' to extract things. One of them is called 'promdownhosts' and it prints out what you'd expect. Initially this was just something I used, but several years ago I mentioned my collection of these scripts to my co-workers and we wound up making them group scripts in a central location.

(I initially wrote this script and a few others for use during our planned power outages and other downtimes, because it was a convenient way of seeing what we hadn't yet turned on or might have missed.)

Early in the morning of that Tuesday, bringing machines back up after the power outage and finding dead PDUs, my co-worker used the 'promdownhosts' script extensively to troubleshoot things. One of the nice aspects of it being a script was that he could put the names of uninteresting machines in a file and then exclude them easily with things like 'promdownhosts | fgrep -v -f /tmp/ignore-these' (something that's much harder to do in a web page dashboard interface, especially if the designer hasn't thought of that). And in general, the script made (and makes) this information quite readily accessible in a compact format that was quick to skim and definitely free of distractions.
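
The exclusion trick can be sketched as follows. The promdownhosts function here is only a stand-in that prints made-up host names (the real script queries Prometheus), and the temporary file plays the role of /tmp/ignore-these. I've written 'grep -F', which is the current spelling of fgrep.

```shell
# Stand-in for the real script; these host names are made up.
promdownhosts() {
    printf '%s\n' apps1 apps2 labpc7 testvm3
}

# The list of uninteresting machines, one per line.
ignore_file=$(mktemp)
printf '%s\n' labpc7 testvm3 >"$ignore_file"

# -F: fixed strings, -v: invert the match, -f: read patterns from a file.
remaining=$(promdownhosts | grep -F -v -f "$ignore_file")
rm -f "$ignore_file"
echo "$remaining"
```

Since -F matches substrings, an entry like 'labpc7' would also hide a hypothetical 'labpc70'; adding -x restricts the match to whole lines if that matters.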

Not everything can be presented this way, in a list or a table printed out in plain text from a command line tool. Sometimes tables on a web page are the better option, and it's good to have options in general; sometimes we want to look at this information along with other information too. As I've found out the hard way sometimes, there's only so much information you can cram into a plain text table before the result is increasingly hard to read.

(I have a command that summarizes our current Prometheus alerts and its output is significantly harder to read because I need it to be compact and there's more information to present. It's probably only really suitable for my use because I understand all of its shorthand notations, including the internal Prometheus names for our alerts.)

On the Bourne shell's distinction between shell variables and exported ones

By: cks

One of the famous things that people run into with the Bourne shell is that it draws a distinction between plain shell variables and special exported shell variables, which are put into the environment of processes started by the shell. This distinction is a source of frustration when you set a variable, run a program, and the program doesn't have the variable available to it:

$ GODEBUG=...
$ go-program
[doesn't see your $GODEBUG setting]

It's also a source of mysterious failures, because more or less all of the environment variables that are present automatically become exported shell variables. So whether or not 'GODEBUG=..; echo running program; go-program' works can depend on whether $GODEBUG was already set when your shell started. The environment variables of regular shell sessions are usually fairly predictable, but the environment variables present when shell scripts get run can be much more varied. This makes it easy to write a shell script that only works right for you, because in your environment it runs with certain environment variables set and so they automatically become exported shell variables.
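
Both effects can be seen in a small sketch, using DEMO_VAR as a made-up variable name and a child 'sh' standing in for go-program:

```shell
# Report what a child process sees for DEMO_VAR (a made-up name).
child_view() {
    sh -c 'echo "${DEMO_VAR-unset}"'
}

unset DEMO_VAR
DEMO_VAR=plain                  # a plain shell variable, not exported
plain_result=$(child_view)

export DEMO_VAR                 # now deliberately exported
exported_result=$(child_view)

# A variable that arrives through the environment is already exported,
# so a plain reassignment in that shell still reaches its children.
inherited_result=$(DEMO_VAR=from-env sh -c 'DEMO_VAR=changed; env | grep "^DEMO_VAR="')

echo "plain: $plain_result"
echo "exported: $exported_result"
echo "inherited: $inherited_result"
```

The last case is the one behind the mysterious failures: the inner shell never ran 'export', but its assignment is visible to 'env' anyway because DEMO_VAR was in its environment when it started.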

I've told you all of that because despite these pains, I believe that the Bourne shell made the right choice here, in addition to a pragmatically necessary choice at the time it was created, in V7 (Research) Unix. So let's start with the pragmatics.

The Bourne shell was created alongside environment variables themselves, and on the comparatively small machines that V7 ran on, you didn't have much room for the combination of program arguments and the new environment. If either grew too big, you got 'argument list too long' when you tried to run programs. This made it important to minimize and control the size of the environment that the shell gave to new processes. If you want to do that without limiting the use of shell variables so much, a split between plain shell variables and exported ones makes sense and requires only a minor bit of syntax (in the form of 'export').

Both machines and exec() size limits are much larger now, so you might think that getting rid of the distinction is a good thing. The Bell Labs Research Unix people thought so, so they did do this in Tom Duff's rc shell for V10 Unix and Plan 9. Having used both the Bourne shell and a version of rc for many years, I both agree and disagree with them.

For interactive use, having no distinction between shell variables and exported shell variables is generally great. If I set $GODEBUG, $PYTHONPATH, or any number of any other environment variables that I want to affect programs I run, I don't have to remember to do a special 'export' dance; it just works. This is a sufficiently nice (and obvious) thing that it's an option for the POSIX 'sh', in the form of 'set -a' (and this set option is present in more or less all modern Bourne shells, including Bash).

('Set -a' wasn't in the V7 sh, but I haven't looked to see where it came from. I suspect that it may have come from ksh, since POSIX took a lot of the specification for their 'sh' from ksh.)
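
A minimal sketch of what 'set -a' changes, with DEMO_A and DEMO_B as made-up variable names:

```shell
unset DEMO_A DEMO_B

set -a                  # from here on, every assignment is exported
DEMO_A=auto
set +a                  # back to the normal Bourne shell behaviour
DEMO_B=manual

# Ask a child process what it can see of each variable.
seen_a=$(sh -c 'echo "${DEMO_A-unset}"')
seen_b=$(sh -c 'echo "${DEMO_B-unset}"')
echo "DEMO_A: $seen_a, DEMO_B: $seen_b"
```

DEMO_A reaches the child without any 'export', while DEMO_B, set after 'set +a', stays local to the shell.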

For shell scripting, however, not having a distinction is messy and sometimes painful. If I write an rc script, every shell variable that I use to keep track of something will leak into the environment of programs that I run. The shell variables for intermediate results, the shell variables for command line options, the shell variables used for for loops, you name it, it all winds up in the environment unless I go well out of my way to painfully scrub them all out. For shell scripts, it's quite useful to have the Bourne shell's strong distinction between ordinary shell variables, which are local to your script, and exported shell variables, which you deliberately act to make available to programs.

(This comes up for shell scripts and not for interactive use because you commonly use a lot more shell variables in shell scripts than you do in interactive sessions.)

For a new Unix shell today that's made primarily or almost entirely for interactive use, automatically exporting shell variables into the environment is probably the right choice. If you wanted to be slightly more selective, you could make it so that shell variables with upper case names are automatically exported and everything else can be manually exported. But for a shell that's aimed at scripting, you want to be able to control and limit variable scope, only exporting things that you explicitly want to.

How to redirect a Bash process substitution into a while loop

By: cks

In some sorts of shell scripts, you often find yourself wanting to work through a bunch of input in the shell; some examples of this for me are here and here. One of the tools for this is a 'while read -r ...' loop, using the shell's builtin read to pull in one or more fields of data (hopefully not making a mistake). Suppose, not hypothetically, that you have a situation where you want to use such a 'while read' loop to accumulate some information from the input, setting shell variables, and then using them later. The innocent and non-working way to write this is:

accum=""
sep=""
some-program |
while read -r avalue; do
   accum="$accum$sep$avalue"
   sep=" or "
done

# Now we want to use $accum

(The recent script where I ran into this issue does much more complex things in the while loop that can't easily be done in other ways.)

This doesn't work because the 'while' is actually happening in a subshell, so the shell variables it sets are lost at the end. To make this work we have to wrap everything from the 'while ...' onward up into a subshell, with that part looking like:

some-program |
(
while read -r avalue; do
   accum="$accum$sep$avalue"
   sep=" or "
done
[...]
)

(You can't get around this with '{ while ...; ... done; }', Bash will still put the 'while' in a subshell.)

The way around this starts with how you can use a file redirection with a while loop (it goes on the 'done'):

some-program >/some/file
while read -r avalue; do
  [...]
done </some/file
# $accum is still set

So far this is all generic Bourne shell things. Bash has a special feature of process substitution, which allows us to use a process instead of a file, using the otherwise illegal syntax '<(...)'. This is great and exactly what we want to avoid creating a temporary file and then having to clean it up. So the innocent and obvious way to try to write things is this:

while read -r avalue; do
  [...]
done <(some-program)

If you try this, you will get the sad error message from Bash of:

line N: syntax error near unexpected token `<(some-program)'
line N: `done <(some-program)'

This is not a helpful error message. I will start by telling you the cure, and then what is going on at a narrow technical level to produce this error message. The cure is:

while read -r avalue; do
  [...]
done < <(some-program)

Note that you must have a space between the two <'s; writing this as '<<(some-program)' will get you a similar syntax error.
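
Put together as something you can run, a sketch of the failing and the working versions looks like this. It needs Bash (process substitution isn't POSIX sh syntax) and assumes Bash's default settings; printf and its three colour words are stand-ins for 'some-program' and its output.

```shell
# Bash only: '<(...)' is not POSIX sh syntax.
# printf stands in for 'some-program'.

# Pipeline version: the loop runs in a subshell, so accum is lost.
accum=""; sep=""
printf '%s\n' red green blue |
while read -r avalue; do
    accum="$accum$sep$avalue"
    sep=" or "
done
piped_result=$accum

# Process substitution version: the loop runs in the current shell.
accum=""; sep=""
while read -r avalue; do
    accum="$accum$sep$avalue"
    sep=" or "
done < <(printf '%s\n' red green blue)
procsub_result=$accum

echo "pipeline: '$piped_result'"
echo "process substitution: '$procsub_result'"
```

The first result is empty and the second is 'red or green or blue', because only the second loop sets accum in the shell that later reads it.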

The technical reason for this error is that although it looks like redirection, process substitution is a form of substitution, like '$var' (it's in the name, but you, like me, may not know what Bash calls it off the top of your head). The result of process substitution will be, for example, a /dev/fd/N name (and a subprocess that is running our 'some-program' and feeding into the other end of the file descriptor). We can see this directly:

$ echo <(cat /dev/null)
/dev/fd/63

(Your number may vary.)

You can't write 'while ...; done /dev/fd/63'. That's a syntax error. Even though the pre-substitution version looks like redirection, it's not, so it's not accepted.

That '<(...)' is actually a substitution is why our revised version works. Reading '< <(some-program)' right to left, the '<(some-program)' is process substitution, and it (along with other shell expansions) is done first, before redirections. After substitution this looks like '< /dev/fd/NN', which is acceptable syntax. If we leave out the space and write this as '<<(some-program)', the shell throws up its hands at the '<<' bit.

(So from Bash's perspective, this is very similar to 'file=/some/file; while ... ; done < $file', which is perfectly legal.)

PS: Before I wrote this entry, I didn't know how to get around the 'done <(some-program)' syntax error. Until the penny dropped about the difference between redirections and process substitution, I thought that Bash simply forbade this to make its life easier.

With disk caches, you want to be able to attribute hits and misses

By: cks

Suppose that you have a disk or filesystem cache in memory (which you do, since pretty much everything has one these days). Most disk caches will give you simple hit and miss information as part of their basic information, but if you're interested in the performance of your disk cache (or in improving it), you want more information. The problem with disk caches is that there are a lot of different sources and types of disk IO, and you can have hit rates that are drastically different between them. Your hit rate for reading data from files may be modest, while your hit rate on certain sorts of metadata may be extremely high. Knowing this is important because it means that your current good performance on things involving that metadata is critically dependent on that hit rate.

(Well, it may be, depending on what storage media you're using and what its access speeds are like. A lot of my exposure to this dates from the days of slow HDDs.)

This potential vast difference is why you want more detailed information in both cache metrics and IO traces. The more narrowly you can attribute IO and the more you know about it, the more useful things you can potentially tell about the performance of your system and what matters to it. This is not merely 'data' versus 'metadata', and synchronous versus asynchronous; ideally you want to know the sort of metadata read being done, and whether the file data being read is synchronous or not, and whether this is a prefetching read or a 'demand' read that really needs the data.
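
To make the arithmetic concrete, here is a small sketch with entirely hypothetical counters (they aren't from any real system, and the class names are made up). It shows how a single overall hit rate can hide very different per-class hit rates:

```shell
# Hypothetical counters, one line per class: name, hits, misses.
cache_stats() {
    cat <<'EOF'
file-data 600 400
dir-metadata 9900 100
EOF
}

# Print a hit rate per class, then the single combined figure.
hit_rates() {
    awk '{ h += $2; m += $3
           printf "%s %.1f%%\n", $1, 100 * $2 / ($2 + $3) }
         END { printf "overall %.1f%%\n", 100 * h / (h + m) }'
}

cache_stats | hit_rates
```

With these numbers the overall figure is 95.5%, which looks healthy, but it is mostly the 99.0% metadata hit rate talking; file data is only at 60.0%.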

A lot of the time, operating systems are not set up to pass this information down through all of the layers of IO from the high level filesystem code that knows what it's asking for to the disk driver code that's actually issuing the IOs. Part of the reason for this is that it's a lot of work to pass all of this data along, which means extra CPU and memory on what is an increasingly hot path (especially with modern NVMe based storage). These days you may get some of these fine grained details in metrics and perhaps IO traces (eg, for (Open)ZFS), but probably not all the way to types of metadata.

Of course, disk and filesystem caches (and IO) aren't the only place that this can come up. Any time you have a cache that stores different types of things that are potentially queried quite differently, you can have significant divergence in the types of activity and the activity rates (and cache hit rates) that you're experiencing. Depending on the cache, you may be able to get detailed information from it or you may need to put more detailed instrumentation into the code that queries your somewhat generic cache.

Modern general observability features in operating systems can sometimes let you gather some of this detailed attribution yourself (if the OS doesn't already provide it). However, it's not a certain thing and there are limits; for example, you may have trouble tracing and tracking IO once it gets dispatched asynchronously inside the OS (and most OSes turn IO into asynchronous operations before too long).

Systemd resource controls on user.slice and system.slice work fine

By: cks

We have a number of systems where we traditionally set strict overcommit handling, and for some time this has caused us some heartburn. Some years ago I speculated that we might want to use resource controls on user.slice or system.slice if they worked, and then recently in a comment here I speculated that this was the way to (relatively) safely limit memory use if it worked.

Well, it does (as far as I can tell, without deep testing). If you want to limit how much of the system's memory people who log in can use so that system services don't explode, you can set MemoryMin= on system.slice to guarantee some amount of memory to it and all things under it. Alternately, you can set MemoryMax= on user.slice, collectively limiting all user sessions to that amount of memory. In either case my view is that you might want to set MemorySwapMax= on user.slice so that user sessions don't spend all of their time swapping. Which one you set things on depends on which is easier and you trust more; my inclination is MemoryMax, although that means you need to dynamically size it depending on this machine's total memory.

(If you want to limit user memory use you'll need to make sure that things like user cron jobs are forced into user sessions, rather than running under cron.service in system.slice.)
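
Sizing MemoryMax from the machine's total memory could be sketched like this. The 75% share is an arbitrary assumption for illustration, and the sketch only prints the systemctl command instead of running it, so that you can check the number first.

```shell
# total_mem_kb: MemTotal from /proc/meminfo, in kB (Linux only).
total_mem_kb() {
    awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo
}

# user_max_kb TOTAL_KB PERCENT: the share of memory to allow user.slice.
user_max_kb() {
    echo $(( $1 * $2 / 100 ))
}

# The 75% figure is an arbitrary assumption, not a recommendation.
# This only prints the command; it changes nothing on the system.
if [ -r /proc/meminfo ]; then
    echo "systemctl set-property user.slice MemoryMax=$(user_max_kb "$(total_mem_kb)" 75)K"
fi
```

The 'K' suffix works because /proc/meminfo reports its 'kB' figures in units of 1024 bytes, which is also how systemd reads a K suffix.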

Of course this is what you should expect, given systemd's documentation and the kernel documentation. On the other hand, the Linux kernel cgroup and memory system is sufficiently opaque and ever changing that I feel the need to verify that things actually do work (in our environment) as I expect them to. Sometimes there are surprises, or settings that nominally work but don't really affect things the way I expect.

This does raise the question of how much memory you want to reserve for the system. It would be nice if you could use systemd-cgtop to see how much memory your system.slice is currently using, but unfortunately the number it will show is potentially misleadingly high. This is because the memory attributed to any cgroup includes (much) more than program RAM usage. For example, on our systems it seems typical for system.slice to be using under a gigabyte of 'user' RAM but also several gigabytes of filesystem cache and other kernel memory. You probably want to allow for some of that in what memory you reserve for system.slice, but maybe not all of the current usage.

(You can get the current version of the 'memdu' program I use as memdu.py.)
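
This is not memdu itself, but as a rough sketch of where such a breakdown comes from: on a cgroup v2 system, each cgroup has a memory.stat file whose 'anon' and 'file' lines give program memory and page cache in bytes. The path below assumes the unified cgroup hierarchy is mounted at /sys/fs/cgroup.

```shell
# summarize_memstat: read cgroup v2 memory.stat lines on stdin and print
# the anon (program) and file (page cache) figures in MiB.
summarize_memstat() {
    awk '$1 == "anon" || $1 == "file" { printf "%s %d MiB\n", $1, $2 / 1048576 }'
}

# Assumed location; it differs on systems without the unified hierarchy.
statfile=/sys/fs/cgroup/system.slice/memory.stat
if [ -r "$statfile" ]; then
    summarize_memstat <"$statfile"
fi
```

This deliberately ignores the other kernel memory lines in memory.stat, so it understates the total that the cgroup is charged for.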

Gnome, GSettings, gconf, and which one you want

By: cks

On the Fediverse a while back, I said:

Ah yes, GNOME, it is of course my mistake that I used gconf-editor instead of dconf-editor. But at least now Gnome-Terminal no longer intercepts F11, so I can possibly use g-t to enter F11 into serial consoles to get the attention of a BIOS. If everything works in UEFI land.

Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern Gnome program, especially a standard Gnome program like gnome-terminal, it will use GSettings and you will want to use dconf-editor to modify its settings outside of whatever Preferences dialogs it gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command line.

(This can include Gnome-derived desktop environments like Cinnamon, which has updated to using GSettings.)

If the program you're using hasn't been updated to the latest things that Gnome is doing, for example Thunderbird (at least as of 2024), then it will still be using GConf. You need to edit its settings using gconf-editor or gconftool-2, or possibly you'll need to look at the GConf version of general Gnome settings. I don't know if there's anything in Gnome that synchronizes general Gnome GSettings settings into GConf settings for programs that haven't yet been updated.

(This is relevant for programs, like Thunderbird, that use general Gnome settings for things like 'how to open a particular sort of thing'. Although I think modern Gnome may not have very many settings for this because it always goes to the GTK GIO system, based on the Arch Wiki's page on Default Applications.)

Because I've made this mistake between gconf-editor and dconf-editor more than once, I've now created a personal gconf-editor cover script that prints an explanation of the situation when I run it without a special --really argument. Hopefully this will keep me sorted out the next time I run gconf-editor instead of dconf-editor.
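
A cover script along these lines could be sketched as a shell function. The location of the real gconf-editor and the wording of the messages are assumptions for illustration, not what my actual script says.

```shell
# Path to the real program; this location is an assumption.
REAL_GCONF_EDITOR=/usr/bin/gconf-editor

gconf_editor_cover() {
    if [ "${1-}" = "--really" ]; then
        shift
        "$REAL_GCONF_EDITOR" "$@"      # pass any other arguments through
        return
    fi
    echo "gconf-editor edits old GConf settings." >&2
    echo "Modern Gnome programs use GSettings; try dconf-editor or gsettings." >&2
    echo "Use 'gconf-editor --really' if you do want GConf." >&2
    return 1
}
```

Returning a failure status when it only prints the explanation means that anything scripted around it will notice that the real program didn't run.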

PS: Probably I want to use gsettings instead of dconf-editor and dconf as much as possible, since gsettings works through the GSettings layer and so apparently has more safety checks than dconf-editor and dconf do.

PPS: Don't ask me what the equivalents are for KDE. KDE settings are currently opaque to me.
