Syndication feed fetchers, HTTP redirects, and conditional GET

By: cks

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification-compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make the re-issued request to B a conditional GET. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
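
As a concrete illustration, here is a minimal sketch in Go (entirely my own invention, not any particular feed reader's code) of keeping conditional GET validators keyed by the exact URL being fetched and following temporary redirects by hand, so that validators never leak from one URL to another:

package main

import (
    "fmt"
    "net/http"
)

// validators is the conditional GET state we remember for one exact URL.
type validators struct {
    etag         string
    lastModified string
}

// saved conditional GET state, keyed by the URL it was observed on.
var saved = map[string]validators{}

// noRedirects hands redirect responses back to us instead of silently
// following them, so we can decide what to do ourselves.
var noRedirects = &http.Client{
    CheckRedirect: func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    },
}

// fetch requests url, sending If-None-Match / If-Modified-Since only if we
// have values saved for that specific URL, and follows temporary redirects
// itself with the target URL's own saved state (if any).
func fetch(url string, redirectsLeft int) (*http.Response, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    if v, ok := saved[url]; ok {
        if v.etag != "" {
            req.Header.Set("If-None-Match", v.etag)
        }
        if v.lastModified != "" {
            req.Header.Set("If-Modified-Since", v.lastModified)
        }
    }
    resp, err := noRedirects.Do(req)
    if err != nil {
        return nil, err
    }
    switch resp.StatusCode {
    case http.StatusFound, http.StatusSeeOther, http.StatusTemporaryRedirect:
        if redirectsLeft == 0 {
            return resp, nil
        }
        loc, lerr := resp.Location()
        resp.Body.Close()
        if lerr != nil {
            return nil, lerr
        }
        // Re-request the target with *its* saved validators (if any),
        // never the original URL's.
        return fetch(loc.String(), redirectsLeft-1)
    case http.StatusOK:
        // Remember this URL's validators for next time.
        saved[url] = validators{
            etag:         resp.Header.Get("ETag"),
            lastModified: resp.Header.Get("Last-Modified"),
        }
    }
    return resp, nil
}

func main() {
    resp, err := fetch("https://example.org/feed", 5)
    if err == nil {
        fmt.Println(resp.Status)
        resp.Body.Close()
    }
}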

Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

  • You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.

  • You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.

  • You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed but the target of the HTTP redirection may be 'more recent' than it for various reasons (even if it's a valid feed; if it's not a valid feed then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

By: cks

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. It's very much an issue if you make the same mistake with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document in the way that ETag values often are.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that are URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, except perhaps when the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically, you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, by repeating it while following a HTTP redirection) than you would be doing the same with a Last-Modified value. It's still technically wrong, though, and it may cause problems someday.
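
As a tiny sketch of the content-hash approach (the general idea only, not DWiki's actual code), in Go:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// contentETag derives a strong ETag from the rendered page content, so two
// URLs only share an ETag if they serve byte-identical content.
func contentETag(body []byte) string {
    sum := sha256.Sum256(body)
    return `"` + hex.EncodeToString(sum[:]) + `"` // ETags are quoted strings
}

func main() {
    fmt.Println(contentETag([]byte("<html>...</html>")))
}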

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

Go's builtin 'new()' function will take an expression in Go 1.26

By: cks

An interesting little change recently landed in the development version of Go, and so will likely appear in Go 1.26 when it's released. The change is that the builtin new() function will be able to take an expression, not just a type. This change stems from the proposal in issue 45624, which dates back to 2021 (and earlier for earlier proposals). The new specification language is covered in, for example, this comment on the issue. An example is in the current development documentation for the release notes, but it may not sound very compelling.

A variety of uses came up in the issue discussion, some of which were a surprise to me. One case that's apparently surprisingly common is to start with a pointer and want to make another pointer to a (shallow) copy of its value. With the change to 'new()', this is:

np = new(*p)

Today you can write this as a generic function (apparently often called 'ref()'), or do it with a temporary variable, but in Go 1.26 this will (probably) be a built-in feature, and perhaps the Go compiler will be able to optimize it in various ways. This sort of thing is apparently more common than you might expect.
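
For illustration, such a generic helper is only a few lines; this is my own sketch of it, with 'ref' being the name the issue discussion tends to use:

package main

import "fmt"

// ref returns a pointer to a (shallow) copy of its argument; this is the
// kind of helper that new(expr) makes unnecessary.
func ref[T any](v T) *T {
    return &v
}

func main() {
    p := ref(42)  // *int pointing at a copy of 42
    np := ref(*p) // another pointer to a (shallow) copy of *p
    fmt.Println(*p, *np)
}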

Another obvious use for the new capability is if you're computing a new value and then creating a pointer to it. Right now, this has to be written using a temporary variable:

t := <some expression>
p := &t

With 'new(expr)' this can be written as one line, without a temporary variable (although as before a 'ref()' generic function can do this today).

The usage example from the current documentation is a little bit peculiar, at least as far as providing a motivation for this change. In a slightly modified form, the example is:

type Person struct {
    Name string `json:"name"`
    Age  *int   `json:"age"` // age if known; nil otherwise
}

func newPerson(name string, age int) *Person {
    return &Person{
        Name: name,
        Age:  new(age),
    }
}

The reason this is a bit peculiar is that today you can write 'Age: &age' and it works the same way. Well, at a semantic level it works the same way. The theoretical but perhaps not practical complication is inlining combined with escape analysis. If newPerson() is inlined into a caller, then the caller's variable for the 'age' parameter may be unused after the (inlined) call to newPerson, and so could get mapped to 'Age: &callervar', which in turn could force escape analysis to put that variable in the heap, which might be less efficient than keeping the variable in the stack (or registers) until right at the end.

A broad language reason is that allowing new() to take an expression removes the special privilege that structs and certain other compound data structures have had, where you could construct pointers to initialized versions of them. Consider:

type ints struct { i int }
[...]
t := 10
ip := &t
isp := &ints{i: 10}

You can create a pointer to the int wrapped in a struct on a single line with no temporary variable, but a pointer to a plain int requires you to materialize a temporary variable. This is a bit annoying.

A pragmatic part of adding this is that people appear to write and use equivalents of new(value) a fair bit. The popularity of an expression is not necessarily the best reason to add a built-in equivalent to the language, but it does suggest that this feature will get used (or will eventually get used, since the existing uses won't exactly get converted instantly for all sorts of reasons).

This strikes me as a perfectly fine change for Go to make. The one thing that's a little bit non-ideal is that 'new()' of constant numbers has less type flexibility than the constant numbers themselves. Consider:

var ui uint
var uip *uint

ui = 10       // okay
uip = new(10) // type mismatch error

The current error that the compiler reports is 'cannot use new(10) (value of type *int) as *uint value in assignment', which is at least relatively straightforward.
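
In this example, the fix presumably looks like:

uip = new(uint(10)) // give the constant an explicit type first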

(You fix it by casting ('converting') the untyped constant number to whatever you need. The now more relevant than before 'default type' of a constant is covered in the specification section on Constants.)

The broad state of ZFS on Illumos, Linux, and FreeBSD (as I understand it)

By: cks

Once upon a time, Sun developed ZFS and put it in Solaris, which was good for us. Then Sun open-sourced Solaris as 'OpenSolaris', including ZFS, although not under the GPL (a move that made people sad and Scott McNealy is on record as regretting). ZFS development continued in Solaris and thus in OpenSolaris until Oracle bought Sun and soon afterward closed Solaris source again (in 2010); while Oracle continued ZFS development in Oracle Solaris, we can ignore that. OpenSolaris was transmogrified into Illumos, and various Illumos distributions formed, such as OmniOS (which we used for our second generation of ZFS fileservers).

Well before Oracle closed Solaris, separate groups of people ported ZFS into FreeBSD and onto Linux, where the effort was known as "ZFS on Linux". Since the Linux kernel community felt that ZFS's license wasn't compatible with the kernel's license, ZoL was an entirely out of (kernel) tree effort, while FreeBSD was able to accept ZFS into their kernel tree (I believe all the way back in 2008). Both ZFS on Linux and FreeBSD took changes from OpenSolaris into their versions up until Oracle closed Solaris in 2010. After that, open source ZFS development split into three mostly separate strands.

(In theory OpenZFS was created in 2013. In practice I think OpenZFS at the time was not doing much beyond coordination of the three strands.)

Over time, a lot more people wanted to build machines using ZFS on top of FreeBSD or Linux (including us) than wanted to keep using Illumos distributions. Not only was Illumos a different environment, but Illumos and its distributions didn't see the level of developer activity that FreeBSD and Linux did, which resulted in driver support issues and other problems (cf). For ZFS, the consequence of this was that many more improvements to ZFS itself started happening in ZFS on Linux and in FreeBSD (I believe to a lesser extent) than were happening in Illumos or OpenZFS, the nominal upstream. Over time the split of effort between Linux and FreeBSD became an obvious problem and eventually people from both sides got together. This resulted in ZFS on Linux v2.0.0 becoming 'OpenZFS 2.0.0' in 2020 (see also the Wikipedia history) and also becoming portable to FreeBSD, where it became the FreeBSD kernel ZFS implementation in FreeBSD 13.0 (cf).

The current state of OpenZFS is that it's co-developed for both Linux and FreeBSD. The OpenZFS ZFS repository routinely has FreeBSD specific commits, and as far as I know OpenZFS's test suite is routinely run on a variety of FreeBSD machines as well as a variety of Linux ones. I'm not sure how OpenZFS work propagates into FreeBSD itself, but it does (some spelunking of the FreeBSD source repository suggests that there are periodic imports of the latest changes). On Linux, OpenZFS releases and development versions propagate to Linux distributions in various ways (some of them rather baroque), including people simply building their own packages from the OpenZFS repository.

Illumos continues to use and maintain its own version of ZFS, which it considers separate from OpenZFS. There is an incomplete Illumos project discussion on 'consuming' OpenZFS changes (via, also), but my impression is that very few changes move from OpenZFS to Illumos. My further impression is that there is basically no one on the OpenZFS side who is trying to push changes into Illumos; instead, OpenZFS people consider it up to Illumos to pull changes, and Illumos people aren't doing much of that for various reasons. At this point, if there's an attractive ZFS change in OpenZFS, the odds of it appearing in Illumos on a timely basis appear low (to put it one way).

(Some features have made it into Illumos, such as sequential scrubs and resilvers, which landed in issue 10405. This feature originated in what was then ZoL and was ported into Illumos.)

Even if Illumos increases the pace of importing features from OpenZFS, I don't ever expect it to be on the leading edge and I think that's fine. There have definitely been various OpenZFS features that needed some time before they became fully ready for stable production use (even after they appeared in releases). I think there's an ecological niche for a conservative ZFS that only takes solidly stable features, and that fits Illumos's general focus on stability.

PS: I'm out of touch with the Illumos world these days, so I may have mis-characterized the state of affairs there. If so, I welcome corrections and updates in the comments.

If-None-Match values must come from the actual URL itself

By: cks

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.

This probably clashes with many low level HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, such high level APIs are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.
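
As a sketch of what this can look like when you do have control (using Go's net/http here, with names of my own choosing), a client can use the CheckRedirect hook to actually see redirects, note permanent ones so the stored feed URL can be updated, and drop conditional GET headers so they aren't replayed against a different URL:

package main

import (
    "fmt"
    "net/http"
)

// newFeedClient returns an http.Client whose CheckRedirect hook lets us see
// each redirect instead of having it hidden: permanent redirects are
// reported so the caller can update its stored feed URL, and conditional
// GET headers are removed from the re-issued request so they aren't
// replayed against a different URL.
func newFeedClient(sawPermanent func(from, to string)) *http.Client {
    return &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            if len(via) >= 5 {
                return http.ErrUseLastResponse
            }
            // req.Response is the redirect response that caused this
            // follow-up request to be created.
            if r := req.Response; r != nil &&
                (r.StatusCode == http.StatusMovedPermanently ||
                    r.StatusCode == http.StatusPermanentRedirect) {
                sawPermanent(via[len(via)-1].URL.String(), req.URL.String())
            }
            // Never carry the original URL's validators to the target.
            req.Header.Del("If-None-Match")
            req.Header.Del("If-Modified-Since")
            return nil
        },
    }
}

func main() {
    client := newFeedClient(func(from, to string) {
        fmt.Println("feed moved permanently:", from, "->", to)
    })
    _ = client // use client.Do(...) for the actual feed request
}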

A hack: outsourcing web browser/client checking to another web server

By: cks

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of a service is geoblocking that's maintained by someone else who's paid to be on top of it and be accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.
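
To illustrate the sort of 'quite simple implementation' I have in mind (this is purely a sketch under my own assumptions, with invented names, not something I've deployed), the verifier site could hand the browser back to the main site with an expiry time, the client IP it saw, and an HMAC over both, which the main site then checks before setting its own cookie:

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "strconv"
    "time"
)

// A secret shared between the main site and the verification site.
var sharedSecret = []byte("something long, random, and shared by both sites")

// sign computes the HMAC tag over an expiry time and the client IP that
// the verifier saw.
func sign(expiry int64, clientIP string) string {
    m := hmac.New(sha256.New, sharedSecret)
    fmt.Fprintf(m, "%d|%s", expiry, clientIP)
    return hex.EncodeToString(m.Sum(nil))
}

// verify is what the main site does with the expiry, the tag, and its own
// view of the client IP before accepting the visitor as 'checked'.
func verify(expiryStr, tag, clientIP string, now time.Time) bool {
    expiry, err := strconv.ParseInt(expiryStr, 10, 64)
    if err != nil || now.Unix() > expiry {
        return false
    }
    want := sign(expiry, clientIP)
    return hmac.Equal([]byte(want), []byte(tag))
}

func main() {
    exp := time.Now().Add(5 * time.Minute).Unix()
    tag := sign(exp, "192.0.2.1")
    fmt.Println(verify(strconv.FormatInt(exp, 10), tag, "192.0.2.1", time.Now()))
}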

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

Unix mail programs have had two approaches to handling your mail

By: cks

Historically, Unix mail programs (what we call 'mail clients' or 'mail user agents' today) have had two different approaches to handling your email, what I'll call the shared approach and the exclusive approach, with the shared approach being the dominant one. To explain the shared approach, I have to back up to talk about what Unix mail transfer agents (MTAs) traditionally did. When a Unix MTA delivered email to you, at first it delivered email into a single file in a specific location (such as '/usr/spool/mail/<login>') in a specific format, initially mbox; even then, this could be called your 'inbox'. Later, when the maildir mailbox format became popular, some MTAs gained the ability to deliver to maildir format inboxes.

(There have been a number of Unix mail spool formats over the years, which I'm not going to try to get into here.)

A 'shared' style mail program worked directly with your inbox in whatever format it was in and whatever location it was in. This is how the V7 'mail' program worked, for example. Naturally these programs didn't have to work on your inbox; you could generally point them at another mailbox in the same format. I call this style 'shared' because you could use any number of different mail programs (mail clients) on your mailboxes, provided that they all understood the format and also that they all agreed on how to lock your mailbox against modifications, including against your system's MTA delivering new email right at the point where your mail program was, for example, trying to delete some.

(Locking issues are one of the things that maildir was designed to help with.)

An 'exclusive' style mail program (or system) was designed to own your email itself, rather than try to share your system mailbox. Of course it had to access your system mailbox a bit to get at your email, but broadly the only thing an exclusive mail program did with your inbox was pull all your new email out of it, write it into the program's own storage format and system, and then usually empty out your system inbox. I call this style 'exclusive' because you generally couldn't hop back and forth between mail programs (mail clients) and would be mostly stuck with your pick, since your main mail program was probably the only one that could really work with its particular storage format.

(Pragmatically, only locking your system mailbox for a short period of time and only doing simple things with it tended to make things relatively reliable. Shared style mail programs had much more room for mistakes and explosions, since they had to do more complex operations, at least on mbox format mailboxes. Being easy to modify is another advantage of the maildir format, since it outsources a lot of the work to your Unix filesystem.)

This shared versus exclusive design choice turned out to have some effects when mail moved to being on separate servers and accessed via POP and then later IMAP. My impression is that 'exclusive' systems coped fairly well with POP, because the natural operation with POP is to pull all of your new email out of the server and store it locally. By contrast, shared systems coped much better with IMAP than exclusive ones did, because IMAP is inherently a shared mail environment where your mail stays on the IMAP server and you manipulate it there.

(Since IMAP is the dominant way that mail clients/user agents get at email today, my impression is that the 'exclusive' approach is basically dead at this point as a general way of doing mail clients. Almost no one wants to use an IMAP client that immediately moves all of their email into a purely local data storage of some sort; they want their email to stay on the IMAP server and be accessible from and by multiple clients and even devices.)

Most classical Unix mail clients are 'shared' style programs, things like Alpine, Mutt, and the basic Mail program. One major 'exclusive' style program, really a system, is (N)MH (also). MH is somewhat notable because in its time it was popular enough that a number of other mail programs and mail systems supported its basic storage format to some degree (for example, procmail can deliver messages to MH-format directories, although it doesn't update all of the things that MH would do in the process).

Another major source of 'exclusive' style mail handling systems is GNU Emacs. I believe that both rmail and GNUS normally pull your email from your system inbox into their own storage formats, partly so that they can take exclusive ownership and don't have to worry about locking issues with other mail clients. GNU Emacs has a number of mail reading environments (cf, also) and I'm not sure what the others do (apart from MH-E, which is a frontend on (N)MH).

(There have probably been other 'exclusive' style systems. Also, it's a pity that as far as I know, MH never grew any support for keeping its messages in maildir format directories, which are relatively close to MH's native format.)

Maybe I should add new access control rules at the front of rule lists

By: cks

Not infrequently I wind up maintaining slowly growing lists of filtering rules to either allow good things or weed out bad things. Not infrequently, traffic can potentially match more than one filtering rule, either because it has multiple bad (or good) characteristics or because some of the match rules overlap. My usual habit has been to add new rules to the end of my rule lists (or the relevant section of them), so the oldest rules are at the top and the newest ones are at the bottom.

After writing about how access control rules need some form of usage counters, it's occurred to me that maybe I want to reverse this, at least in typical systems where the first matching rule wins. The basic idea is that the rules I'm most likely to want to drop are the oldest rules, but by having them first I'm hindering my ability to see if they've been made obsolete by newer rules. If an old rule matches some bad traffic, a new rule matches all of the bad traffic, and the new rule is last, any usage counters will show a mix of the old rule and the new rule, making it look like the old rule is still necessary. If the order was reversed, the new rule would completely occlude the old rule and usage counters would show me that I could weed the old rule out.

(My view is that it's much less likely that I'll add a new rule at the bottom that's completely ineffectual because everything it matches is already matched by something earlier. If I'm adding a new rule, it's almost certainly because something isn't being handled by the collection of existing rules.)

Another possible advantage to this is that it will keep new rules at the top of my attention, because when I look at the rule list (or the section of it) I'll probably start at the top. Currently, the top is full of old rules that I usually ignore, but if I put new rules first I'll naturally see them right away.

(I think that most things I deal with are 'first match wins' systems. A 'last match wins' system would naturally work right here, but it has other confusing aspects. I also have the impression that adding new rules at the end is a common thing, but maybe it's just in the cultural water here.)

Our Django model class fields should include private, internal names

By: cks

Let me tell you about a database design mistake I made in our Django web application for handling requests for Unix accounts. Our current account request app evolved from a series of earlier systems, and one of the things that these earlier systems asked people for was their 'status' with the university; were they visitors, graduate students, undergraduate students, (new) staff, and so on. When I created the current system I copied this, and so the database schema includes a 'Status' model class. The only thing I put in this model class was a text field that people picked from in our account request form, and I didn't really think of the text there as what you could call load bearing. It was just a piece of information we asked people for because we'd always asked people for it, and faithfully duplicating the old CGI was the easy way to implement the web app.

Before too long, it turned out that we wanted to do some special things if people were graduate students (for example, notifying the department's administrative people so they could update their records to include the graduate student's Unix login and email address here). The obvious simple way to implement this was to do a text match on the value of the 'status' field for a particular person; if their 'status' was "Graduate Student", we knew they were a graduate student and we could do various special things. Over time, this knowledge of what the people-visible "Graduate Student" status text was wormed its way into a whole collection of places around our account systems.

For reasons beyond the scope of this entry, we now (recently) want to change the people-visible text to be not exactly "Graduate Student" any more. Now we have a problem, because a bunch of places know that exact text (in fact I'm not sure I remember where all of those places are).

The mistake I made, way back when we first wanted things to know that an account or account request was a 'graduate student', was in not giving our 'Status' model an internal 'label' field, one not shown to people, in addition to the text that is shown to them. You can practically guarantee that anything you show to people will want to change sooner or later, so just as you shouldn't make actual people-exposed fields into primary or foreign keys, none of your code should care about their value. The correct solution is an additional field that acts as the internal label of a Status (with values that make sense to us), and then using this internal label any time the code wants to match on or find the 'Graduate Student' status.

(In theory I could use Django's magic 'id' field for this, since we're having Django create automatic primary keys for everything, including the Status model. In practice, the database IDs are completely opaque and I'd rather have something less opaque in code instead of everything knowing that ID '14' is the Graduate Student status ID.)

Fortunately, I've had a good experience with my one Django database migration so far, so this is a fixable problem. Threading the updates through all of the code (and finding all of the places that need updates, including in outside programs) will be a bit of work, but that's what I get for taking the quick hack approach when this first came up.

(I'm sure I'm not the only person to stub my toe this way, and there's probably a well known database design principle involved that would have told me better if I'd known about it and paid attention at the time.)

These days, systemd can be a cause of restrictions on daemons

By: cks

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like access user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.
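
As a sketch of one way to do that testing without editing the packaged unit file (the directives here are only illustrative; use whichever ones the real .service file actually sets), a drop-in created with 'systemctl edit <daemon>.service' can turn the restrictions off:

[Service]
# Boolean sandboxing settings get overridden to 'no' ...
ProtectHome=no
PrivateTmp=no
ProtectSystem=no
# ... while list-valued settings such as the IP address restrictions are
# reset with an empty assignment.
IPAddressDeny=
IPAddressAllow=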

Some stuff on how Linux consoles interact with the mouse

By: cks

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

Free and open source software is incompatible with (security) guarantees

By: cks

If you've been following the tech news, one of the recent things that's happened is that there has been another incident where a bunch of popular and widely used packages on a popular package repository for a popular language were compromised, this time with a self-replicating worm. This is very inconvenient to some people, especially to companies in Europe, for some reason, and so some people have been making the usual noises. On the Fediverse, I had a hot take:

Hot take: free and open source is fundamentally incompatible with strong security *guarantees*, because FOSS is incompatible with strong guarantees about anything. It says so right there on the tin: "without warranty of any kind, either expressed or implied". We guarantee nothing by default, you get the code, the project, everything, as-is, where-is, how-is.

Of course companies find this inconvenient, especially with the EU CRA looming, but that's not FOSS's problem. That's a you problem.

To be clear here: this is not about the security and general quality of FOSS (which is often very good), or the responsiveness of FOSS maintainers. This is about guarantees, firm (and perhaps legally binding) assurances of certain things (which people want for software in general). FOSS can provide strong security in practice but it's inimical to FOSS's very nature to provide a strong guarantee of that or anything else. The thing that makes most of FOSS possible is that you can put out software without that guarantee and without legal liability.

An individual project can solemnly say it guarantees its security, and if it does so it's an open legal question whether that writing trumps the writing in the license. But in general a core and absolutely necessary aspect of free and open source is that warranty disclaimer, and that warranty disclaimer cuts across any strong guarantees about anything, including security and lack of bugs.

Are the compromised packages inconvenient to a lot of companies? They certainly are. But neither the companies nor commentators can say that the compromise violated some general strong security guarantee about packages, because there is and never will be such a guarantee with FOSS (see, for example, Thomas Depierre's I am not a supplier, which puts into words a sentiment a lot of FOSS people have).

(But of course the companies and sympathetic commentators are framing it that way because they are interested in the second vision of "supply chain security", where using FOSS code is supposed to magically absolve companies of the responsibility that people want someone to take.)

The obvious corollary of this is that widespread usage of FOSS packages and software, especially with un-audited upgrades of package versions (however that happens), is incompatible with having any sort of strong security or quality guarantee about the result. The result may have strong security and high quality, but if so, those come without guarantees; you've just been lucky. If you want guarantees, you will have to arrange them yourself and it's very unlikely you can achieve strong guarantees while using the typical ever-changing pile of FOSS code.

(For example, if dependencies auto-update before you can audit them and their changes, or faster than you can keep up, you have nothing in practice.)

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

Access control rules need some form of usage counters

By: cks

Today, for reasons outside the scope of this entry, I decided to spend some time maintaining and pruning the access control rules for Wandering Thoughts, this blog. Due to the ongoing crawler plague (and past abuses), Wandering Thoughts has had to build up quite a collection of access control rules, which are mostly implemented as a bunch of things in an Apache .htaccess file (partly 'Deny from ...' for IP address ranges and partly as rewrite rules based on other characteristics). The experience has left me with a renewed view of something, which is that systems with access control rules need some way of letting you see which rules are still being used by your traffic.

It's in the nature of systems with access control rules to accumulate more and more rules over time. You hit another special situation, you add another rule, perhaps to match and block something or perhaps to exempt something from blocking. These rules often interact in various ways, and over time you'll almost certainly wind up with a tangled thicket of rules (because almost no one goes back to carefully check and revisit all existing rules when they add a new one or modify an existing one). The end result is a mess, and one of the ways to reduce the mess is to weed out rules that are now obsolete. One way a rule can be obsolete is that it's not used any more, and often these are the easiest rules to drop once you can recognize them.

(A rule that's still being matched by traffic may be obsolete for other reasons, and rules that aren't currently being matched may still be needed as a precaution. But it's a good starting point.)

If you have the necessary log data, you can sometimes establish if a rule was actually ever used by manually checking your logs. For example, if you have logs of rejected traffic (or logs of all traffic), you can search it for an IP address range to see if a particular IP address rule ever matched anything. But this requires tedious manual effort and that means that only determined people will go through it, especially regularly. The better way is to either have this information provided directly, such as by counters on firewall rules, or to have something in your logs that makes deriving it easy.

(An Apache example would be to augment any log line that was matched by some .htaccess rule with a name or a line number or the like. Then you could readily go through your logs to determine which rules were matched and how often.)
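
For what it's worth, here's a sketch of how that might look in Apache (the 'acl_rule' variable name and the specific patterns are made up for illustration): mod_rewrite can tag a request with an environment variable as part of blocking it, mod_setenvif can do the same for address based matches, and the log format can then include the tag:

# Tag and block requests matching a rewrite-based rule.
RewriteRule ^some/bad/path - [E=acl_rule:bad-path,F]
# Tag requests from a blocked address range.
SetEnvIf Remote_Addr "^192\.0\.2\." acl_rule=block-192-0-2
# Log which rule (if any) matched; depending on internal redirects to
# error documents, the variable may show up as REDIRECT_acl_rule instead.
LogFormat "%h %l %u %t \"%r\" %>s %b %{acl_rule}e" aclcombined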

The next time I design an access control rule system, I'm hopefully going to remember this and put something in its logging to (optionally) explain its decisions.

(Periodically I write something that has an access control rule system of some sort. Unfortunately all of mine to date have been quiet on this, so I'm not at all without sin here.)

The idea of /usr/sbin has failed in practice

By: cks

One of the changes in Fedora Linux 42 is unifying /usr/bin and /usr/sbin, by moving everything in /usr/sbin to /usr/bin. To some people, this probably smacks of anathema, and to be honest, my first reaction was to bristle at the idea. However, the more I thought about it, the more I had to concede that the idea of /usr/sbin has failed in practice.

We can tell /usr/sbin has failed in practice by asking how many people routinely operate without /usr/sbin in their $PATH. In a lot of environments, the answer is that very few people do, because sooner or later you run into a program that you want to run (as yourself) to obtain useful information or do useful things. Let's take FreeBSD 14.3 as an illustrative example (to make this not a Linux biased entry); looking at /usr/sbin, I recognize iostat, manctl (you might use it on your own manpages), ntpdate (which can be run by ordinary people to query the offsets of remote servers), pstat, swapinfo, and traceroute. There are probably others that I'm missing, especially if you use FreeBSD as a workstation and so care about things like sound volumes and keyboard control.

(And if you write scripts and want them to send email, you'll care about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin. There's also DTrace, but I don't know if you can DTrace your own binaries as a non-root user on FreeBSD.)

For a long time, there has been no strong organizing principle to /usr/sbin that would draw a hard line and create a situation where people could safely leave it out of their $PATH. We could have had a principle of, for example, "programs that don't work unless run by root", but no such principle was ever followed for very long (if at all). Instead programs were more or less shoved in /usr/sbin if developers thought they were relatively unlikely to be used by normal people. But 'relatively unlikely' is not 'never', and shortly after people got told to 'run traceroute' and got 'command not found' when they tried, /usr/sbin (probably) started appearing in $PATH.

(And then when you asked 'how does my script send me email about something', people told you about /usr/sbin/sendmail and another crack appeared in the wall.)

If /usr/sbin is more of a suggestion than a rule and it appears in everyone's $PATH because no one can predict which programs you want to use will be in /usr/sbin instead of /usr/bin, I believe this means /usr/sbin has failed in practice. What remains is an unpredictable and somewhat arbitrary division between two directories, where which directory something appears in operates mostly as a hint (a hint that's invisible to people who don't specifically look where a program is).

(This division isn't entirely pointless and one could try to reform the situation in a way short of Fedora 42's "burn the entire thing down" approach. If nothing else the split keeps the size of both directories somewhat down.)

PS: The /usr/sbin like idea that I think is still successful in practice is /usr/libexec. Possibly a bunch of things in /usr/sbin should be relocated to there (or appropriate subdirectories of it).

My machines versus the Fedora selinux-policy-targeted package

By: cks

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, it generated a huge amount of repeated output from a script run by selinux-policy-targeted, repeatedly reporting "Old compiled fcontext format, skipping" for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collective installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and put them back later (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully making it so that selinux-policy-targeted has less or nothing to complain about. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is less certain a fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some notes on (Tony Finch's) exponential rate limiting in practice

By: cks

After yesterday's entry where I discovered it, I went and implemented Tony Finch's exponential rate limiting for HTTP request rate limiting in DWiki, the engine underlying this blog, replacing the more brute force and limited version I had initially implemented. I chose exponential rate limiting over GCRA or leaky buckets because I found it much easier to understand how to set the limits (partly because I'm somewhat familiar with the whole thing from Exim). Exponential rate limiting needed me to pick a period of time and a number of (theoretical) requests that can be made in that time interval, which was easy enough; GCRA 'rate' and 'burst' numbers were less clear to me. However, exponential rate limiting has some slightly surprising things that I want to remember.

(Exponential ratelimits don't have a 'burst' rate as such but you can sort of achieve this by your choice of time intervals.)

In my original simple rate limiting, any rate limit record that had a time outside of my interval was irrelevant and could be dropped in order to reduce space usage (my current approach uses basically the same hack as my syndication feed ratelimits, so I definitely don't want to let its space use grow without bound). This is no longer necessarily true in exponential rate limiting, depending on how big of a rate the record (the source) had built up before it took a break. This old rate 'decays' at a rate I will helpfully put in a table for my own use:

Time since last seen Old rate multiplied by
1x interval 0.37
2x interval 0.13
3x interval 0.05
4x interval 0.02

(This is, eg, 'exp(-1)' for the case where we last saw the source one 'interval' ago.)

Where this becomes especially relevant is if you opt for 'strict' rate limiting instead of 'leaky', where every time the source makes a request you increase its recorded rate even if you reject the request for being rate limited. A high-speed source that insists on hammering you for a while can build up a very large current rate under a strict rate limit policy, and that means its old past behavior can affect it (ie, possibly cause it to be rate limited) well beyond your nominal rate limit interval. Especially with 'strict' rate limiting, you could opt to cap the maximum age a valid record could have and drop everything that you last saw over, say, 3x your interval ago; this would be generous to very high rate old sources, but not too generous (since their old rate would be reduced to 0.05 or less of what it was even if you counted it).
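(A cleanup pass for that could be as simple as the following sketch. The names are my own, and it assumes my ad-hoc record layout of a (last seen timestamp, rate) pair per source, which isn't necessarily how you'd store things in practice:)

    def expire_old_records(records, now, interval, maxage=3.0):
        # Drop sources we haven't seen for more than 'maxage' intervals.
        # Their old rate would have decayed to at most exp(-maxage), about
        # 0.05 of its former value, so forgetting it is only mildly generous.
        for source, (last, rate) in list(records.items()):
            if now - last > maxage * interval:
                del records[source]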

As far as I can see, the behavior with leaky rate limiting and a cost of 1 (for the simple case of all HTTP requests having the same cost) is that if the client keeps pounding away at you, one of its requests will get through on a semi-regular basis. The client will make a successful request, the request will push its rate just over your limit, it will get rate limited some number of times, then enough time will have passed since its last successful request that its new request will be just under the rate limit and succeed. In some environments, this is fine and desired. However, my current goal is to firmly cut off clients that are making requests too fast, so I don't want this; instead, I implemented the 'strict' behavior, where you don't get through at all until your request rate has decayed low enough, which requires a long enough gap since your last request.

Mathematically, a client that makes requests with little or no gap between them (to the precision of your timestamps) can wind up increasing its rate by slightly over its 'cost' per request. If I'm understanding the math correctly, how much over the cost is capped by Tony Finch's 'max(interval, 1.0e-10)' step, with 1.0e-10 being a small but non-zero number that you can move up or down depending on, eg, your language and its floating point precision. Having looked at it, in Python the resulting factor with 1.0e-10 is '1.000000082740371', so you and I probably don't need to worry about this. If the client doesn't make requests quite that fast, its rate will go up each time by slightly less than the 'cost' you've assigned. In Python, a client that makes a request every millisecond has a factor for this of '0.9995001666249781' of the cost; slower request rates make this factor smaller.
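Putting this together, here's a rough Python sketch of how I understand the update step, using the same per-source (last seen timestamp, rate) records as above. The names, the constants, and details like exactly when you compare against the limit are my own choices rather than Tony Finch's actual code:

    import math
    import time

    PERIOD = 3600.0     # the rate limit interval, in seconds (assumed value)
    LIMIT = 20.0        # allowed requests per PERIOD (assumed value)
    STRICT = True       # True: count rejected requests against the rate too

    records = {}        # source -> (last seen timestamp, current rate)

    def allow(source, cost=1.0):
        now = time.time()
        last, rate = records.get(source, (now, 0.0))
        # elapsed time in units of PERIOD, clamped to avoid dividing by zero
        x = max(now - last, 1.0e-10) / PERIOD
        decay = math.exp(-x)
        # the old rate decays exponentially; this request adds (roughly) 'cost'
        rate = rate * decay + cost * (1.0 - decay) / x
        ok = rate <= LIMIT
        if ok or STRICT:
            # 'leaky' only records the new rate for allowed requests,
            # while 'strict' records it for every request
            records[source] = (now, rate)
        return ok

Setting STRICT to False gives the 'leaky' behavior described above, where a persistent client still gets the occasional request through.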

This is probably mostly relevant if you're dumping or reporting the calculated rates (for example, when a client hits the rate limit) and get puzzled by the odd numbers that may be getting reported.

I don't know how to implement proper ratelimiting (well, maybe I do now)

By: cks

In theory I have a formal education as a programmer (although it was a long time ago). In practice my knowledge from it isn't comprehensive, and every so often I run into an area where I know there's relevant knowledge and algorithms but I don't know what they are and I'm not sure how to find them. Today's area is scalable rate-limiting with low storage requirements.

Suppose, not hypothetically, that you want to ratelimit a collection of unpredictable sources and not use all that much storage per source. One extremely simple and obvious approach is to store, for each source, a start time and a count. Every time the source makes a request, you check to see if the start time is within your rate limit interval; if it is, you increase the count (or ratelimit the source), and if it isn't, you reset the start time to now and the count to 1.

(Every so often you can clean out entries with start times before your interval.)
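In code, the whole thing is only a few lines; this is a minimal sketch with placeholder names and numbers, not what DWiki actually does:

    import time

    INTERVAL = 20 * 60      # rate limit interval in seconds (placeholder)
    MAXCOUNT = 30           # allowed requests per interval (placeholder)

    records = {}            # source -> (start time, count)

    def allow(source):
        now = time.time()
        start, count = records.get(source, (None, 0))
        if start is None or now - start > INTERVAL:
            # outside the interval: forget the past and start over
            records[source] = (now, 1)
            return True
        records[source] = (start, count + 1)
        return count + 1 <= MAXCOUNT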

The disadvantage of this simple approach is that it completely forgets about the past history of each source periodically. If your rate limit intervals are 20 minutes, a prolific source gets to start over from scratch every 20 minutes and run up its count until it gets rate limited again. Typically you want rate limiting not to forget about sources so fast.

I know there are algorithms that maintain decaying averages or moving (rolling) averages. The Unix load average is maintained this way, as is Exim ratelimiting. The Unix load average has the advantage that it's updated on a regular basis, which makes the calculation relatively simple. Exim has to deal with erratic updates that are unpredictable intervals from the previous update, and the comment in the source is a bit opaque to me. I could probably duplicate the formula in my code but I'd have to do a bunch of work to convince myself the result was correct.

(And now I've found Tony Finch's exponential rate limiting (via), which I'm going to have to read carefully, along with the previous GCRA: leaky buckets without the buckets.)

Given that rate limiting is such a common thing these days, I suspect that there are a number of algorithms for this with various different choices about how the limits work. Ideally, it would be possible to readily find writeups of them with internet searches, but of course as you know internet search is fairly broken these days.

(For example you can find a lot of people giving high level overviews of rate limiting without discussing how to actually implement it.)

Now that I've found Tony Finch's work I'm probably going to rework my hacky rate limiting code to do things better, because my brute force approach is using the same space as leaky buckets (as covered in Tony Finch's article) with inferior results. This shows the usefulness of knowing algorithms instead of just coding away.

(Improving the algorithm in my code will probably make no practical difference, but sometimes programming is its own pleasure.)

ZFS snapshots aren't as immutable as I thought, due to snapshot metadata

By: cks

If you know about ZFS snapshots, you know that one of their famous properties is that they're immutable; once a snapshot is made, its state is frozen. Or so you might casually describe it, but that description is misleading. What is frozen in a ZFS snapshot is the state of the filesystem (or zvol) that it captures, and only that. In particular, the metadata associated with the snapshot can and will change over time.

(When I say it this way it sounds obvious, but for a long time my intuition about how ZFS operated was led astray by the assumption that all aspects of a snapshot had to be immutable once it was made, which left me trying to figure out how ZFS worked around that.)

One visible place where ZFS updates the metadata of a snapshot is to maintain information about how much unique space the snapshot is using. Another is that when a ZFS snapshot is deleted, other ZFS snapshots may require updates to adjust the list of snapshots (every snapshot points to the previous one) and the ZFS deadlist of blocks that are waiting to be freed.

Mechanically, I believe that various things in a dsl_dataset_phys_t are mutable, with the exception of things like the creation time and the creation txg, and also the block pointer, which points to the actual filesystem data of the snapshot. Things like the previous snapshot information have to be mutable (you might delete the previous snapshot), and things like the deadlist and the unique bytes are mutated as part of operations like snapshot deletion. The other things I'm not sure of.

(See also my old entry on a broad overview of how ZFS is structured on disk. A snapshot is a 'DSL dataset' and it points to the object set for that snapshot. The root directory of a filesystem DSL dataset, snapshot or otherwise, is at a fixed number in the object set; it's always object 1. A snapshot freezes the object set as of that point in time.)

PS: Another mutable thing about snapshots is their name, since 'zfs rename' can change that. The manual page even gives an example of using (recursive) snapshot renaming to keep a rolling series of daily snapshots.

How I think OpenZFS's 'written' and 'written@<snap>' dataset properties work

By: cks

Yesterday I wrote some notes about ZFS's 'written' dataset property, where the short summary is that 'written' reports the amount of space written in a snapshot (ie, that wasn't in the previous snapshot), and 'written@<snapshot>' reports the amount of space written since the specified snapshot (up to either another snapshot or the current state of the dataset). In that entry, I left un-researched the question of how ZFS actually gives us those numbers; for example, if there was a mechanism in place similar to the complicated one for 'used' space. I've now looked into this and as far as I can see the answer is that ZFS determines this information on the fly.

The guts of the determination are in dsl_dataset_space_written_impl(), which has a big comment that I'm going to quote wholesale:

Return [...] the amount of space referenced by "new" that was not referenced at the time the bookmark corresponds to. "New" may be a snapshot or a head. The bookmark must be before new, [...]

The written space is calculated by considering two components: First, we ignore any freed space, and calculate the written as new's used space minus old's used space. Next, we add in the amount of space that was freed between the two time points, thus reducing new's used space relative to old's. Specifically, this is the space that was born before zbm_creation_txg, and freed before new (ie. on new's deadlist or a previous deadlist).

(A 'bookmark' here is an internal ZFS thing.)

When this talks about 'used' space, this is not the "used" snapshot property; this is the amount of space the snapshot or dataset refers to, including space shared with other snapshots. If I'm understanding the code and the comment right, the reason we add back in freed space is because otherwise you could wind up with a negative number. Suppose you wrote a 2 GB file, made one snapshot, deleted the file, and then made a second snapshot. The difference in space referenced between the two snapshots is slightly less than negative 2 GB, but we can't report that as 'written', so we go through the old stuff that got deleted and add its size back in to make the number positive again.
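To make that arithmetic concrete with made-up numbers (this is just my paraphrase of the comment, not actual ZFS code):

    GB = 1024 ** 3

    old_referenced = 3 * GB   # snapshot one: 1 GB of other data plus the 2 GB file
    new_referenced = 1 * GB   # snapshot two: the 2 GB file has since been deleted
    freed_between = 2 * GB    # space born before the bookmark, freed before "new"

    # The naive difference is negative, which can't be reported as 'written':
    naive = new_referenced - old_referenced                      # -2 GB
    # Adding back the freed space gives a small, non-negative 'written' value
    # (in reality a bit more than 0, for any newly written metadata):
    written = new_referenced - old_referenced + freed_between    # 0 with these numbers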

To determine the amount of space that's been freed between the bookmark and "new", the ZFS code walks backward through all snapshots from "new" to the bookmark, calling another ZFS function to determine how much relevant space got deleted. This uses the ZFS deadlists that ZFS is already keeping track of to know when it can free an object.

This code is used both for 'written@<snap>' and 'written'; the only difference between them is that when you ask for 'written', the ZFS kernel code automatically finds the previous snapshot for you.

Some notes on OpenZFS's 'written' dataset property

By: cks

ZFS snapshots and filesystems have a 'written' property, and a related 'written@snapshot' one. These are documented as:

written
The amount of space referenced by this dataset, that was written since the previous snapshot (i.e. that is not referenced by the previous snapshot).

written@snapshot
The amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot. [...]

(Apparently I never noticed the 'written' property before recently, despite it being there from very long ago.)

The 'written' property is related to the 'used' property, and it's both more confusing and less confusing as it relates to snapshots. Famously (but not famously enough), for snapshots the used property ('USED' in the output of 'zfs list') only counts space that is exclusive to that snapshot. Space that's only used by snapshots but that is shared by more than one snapshot is in 'usedbysnapshots'.

To understand 'written' better, let's do an experiment: we'll make a snapshot, write a 2 GByte file, make a second snapshot, write another 2 GByte file, make a third snapshot, and then delete the first 2 GB file. Since I've done this, I can tell you the results.

If there are no other snapshots of the filesystem, the first snapshot's 'written' value is the full size of the filesystem at the time it was made, because everything was written before it was made. The second snapshot's 'written' is 2 GBytes, the data file we wrote between the first and the second snapshot. The third snapshot's 'written' is another 2 GB, for the second file we wrote. However, at the end, after we delete one of the data files, the filesystem's 'written' is small (certainly not 2 GB), and so would be the 'written' of a fourth snapshot if we made one.

The reason the filesystem's 'written' is so small is that ZFS is counting concrete on-disk (new) space. Deleting a 2 GB file frees up a bunch of space but it doesn't require writing very much to the filesystem, so the 'written' value is low.

If we look at the 'used' values for all three snapshots, they're all going to be really low. This is because neither 2 GByte data file is unique to a single snapshot: the first file is shared between the second and third snapshots (so it shows up in 'usedbysnapshots' rather than in either snapshot's 'used'), and the second file is shared between the third snapshot and the live filesystem.

(ZFS has a somewhat complicated mechanism to maintain all of this information.)

There is one interesting 'written' usage that appears to show you deleted space, but it is a bit tricky. The manual page implies that the normal usage of 'written@<snapshot>' is to ask for it for the filesystem itself; however, in experimentation you can ask for it for a snapshot too. So take the three snapshots above, and the filesystem after deleting the first data file. If you ask for 'written@first' for the filesystem, you will get 2 GB, but if you ask for 'written@first' for the third snapshot, you will get 4 GB. What the filesystem appears to be reporting is how much still-live data has been written between the first snapshot and now, which is only 2 GB because we deleted the other 2 GB. Meanwhile, all four GB are still alive in the third snapshot.

My conclusion from looking into this is that I can use 'written' as an indication of how much new data a snapshot has captured, but I can't use it as an indication of how much changed in a snapshot. As I've seen, deleting data is a potentially big change but a small 'written' value. If I'm understanding 'written' correctly, one useful thing about it is that it shows roughly how much data an incremental 'zfs send' of just that snapshot would send. Under some circumstances it will also give you an idea of how much data your backup system may need to back up; however, this works best if people are creating new files (and deleting old ones), instead of updating or appending to existing files (where ZFS only updates some blocks but a backup system probably needs to re-save the whole thing).

Why Firefox's media autoplay settings are complicated and imperfect

By: cks

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling media playback simple and blocking autoplay reliable; the browser would ignore the autoplay attribute, and the person using the browser would trigger playback by interacting with controls that the browser itself provided, so the browser could know for sure that a person had directly clicked on them and that the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know something about how you interacted with a web page, such as whether you clicked, scrolled, or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

If Firefox only blocked the actual autoplay attribute and allowed JavaScript to trigger media playback any time it wanted, that would be a pretty obviously bad 'Block Autoplay' experience, so Firefox must try harder. Firefox's approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' are drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript event handler. However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

We can't expect people to pick 'good' software

By: cks

One of the things I've come to believe in (although I'm not consistent about it) is that we can't expect people to pick software that is 'good' in a technical sense. People certainly can and do pick software that is good in that it works nicely, has a user interface that works for them, and so on, which is to say all of the parts of 'good' that they can see and assess, but we can't expect people to go beyond that, to dig deeply into the technical aspects to see how good their choice of software is. For example, how efficiently an IMAP client implements various operations at the protocol level is more or less invisible to most people. Even if you know enough to know about potential technical quality aspects, realistically you have to rely on any documentation the software provides (if it provides anything). Very few people are going to set up an IMAP server test environment and point IMAP clients at it to see how they behave, or try to read the source code of open source clients.

(Plus, you have to know a lot to set up a realistic test environment. A lot of modern software varies its behavior in subtle ways depending on the surrounding environment, such as the server (or client) at the other end, what your system is like, and so on. To extend my example, the same IMAP client may behave differently when talking to two different IMAP server implementations.)

Broadly, the best we can do is get software to describe important technical aspects of itself, to document them even if the software doesn't, and to explain to people why various aspects matter and thus what they should look for if they want to pick good software. I think this approach has seen some success in, for example, messaging apps, where 'end to end encrypted' and similar things have become a technical quality measure that's typically relatively legible to people. Other technical quality measures in other software are much less legible to people in general, including in important software like web browsers.

(One useful way to make technical aspects legible is to create some sort of scorecard for them. Although I don't think it was built for this purpose, there's caniuse for browsers and their technical quality for various CSS and HTML5 features.)

To me, one corollary to this is that there's generally no point in yelling at people (in various ways) or otherwise punishing them because they picked software that isn't (technically) good. It's pretty hard for a non-specialist to know what is actually good or who to trust to tell them what's actually good, so it's not really someone's fault if they wind up with not-good software that does undesirable things. This doesn't mean that we should always accept the undesirable things, but it's probably best to either deal with them or reject them as gracefully as possible.

(This definitely doesn't mean that we should blindly follow Postel's Law, because a lot of harm has been done to various ecosystems by doing so. Sometimes you have to draw a line, even if it affects people who simply had bad luck in what software they picked. But ideally there's a difference between drawing a line and yelling at people about them running into the line.)

Our too many paths to 'quiet' Prometheus alerts

By: cks

One of the things our Prometheus environment has is a notion of different sorts of alerts, and in particular of less important alerts that should go to a subset of people (ie, me). There are various reasons for this, including that the alert is in testing, or it concerns a subsystem that only I should have to care about, or that it fires too often for other people (for example, a reboot notification for a machine we routinely reboot).

For historical reasons, there are at least four different ways that this can be done in our Prometheus environment:

  • a special label can be attached to the Prometheus alert rule, which is appropriate if the alert rule itself is in testing or otherwise is low priority.

  • a special label can be attached to targets in a scrape configuration, although this has some side effects that can be less than ideal. This affects all alerts that trigger based on metrics from, for example, the Prometheus host agent (for that host).

  • our Prometheus configuration itself can apply alert relabeling to add the special label for everything from a specific host, as indicated by a "host" label that we add. This is useful if we have a lot of exporters being scraped from a particular host, or if I want to keep metric continuity (ie, the metrics not changing their label set) when a host moves into production.

  • our Alertmanager configuration can specifically route certain alerts about certain machines to the 'less important alerts' destination.

The drawback of these assorted approaches is that now there are at least three places to check and possibly to update when a host moves from being a testing host into being a production host. A further drawback is that some of these (the first two) are used a lot more often than others (the last two). When you have multiple things, some of which are infrequently used, and fallible humans have to remember to check them all, you can guess what can happen next.

And that is the simple version of why alerts about one of our fileservers wouldn't have gone to everyone here for about the past year.

How I discovered the problem was that I got an alert about one of the fileserver's Prometheus exporters restarting, and decided that I should update the alert configuration to make it so that alerts about this service restarting only went to me. As I was in the process of doing this, I realized that the alert already had only gone to me, despite there being no explicit configuration in the alert rule or the scrape configuration. This set me on an expedition into the depths of everything else, where I turned up an obsolete bit in our general Prometheus configuration.

On the positive side, now I've audited our Prometheus and Alertmanager configurations for any other things that shouldn't be there. On the negative side, I'm now not completely sure that there isn't a fifth place that's downgrading (some) alerts about (some) hosts.

Could NVMe disks become required for adequate performance?

By: cks

It's not news that full speed NVMe disks are extremely fast, as well as extremely good at random IO and doing a lot of IO at once. In fact they have performance characteristics that upset general assumptions about how you might want to design systems, at least for reading data from disk (for example, you want to generate a lot of simultaneous outstanding requests, either explicitly in your program or implicitly through the operating system). I'm not sure how much write bandwidth normal NVMe drives can really deliver for sustained write IO, but I believe that they can absorb very high write rates for a short period as you flush out a few hundred megabytes or more. This is a fairly big sea change from even SATA SSDs (and I believe SAS SSDs), never mind HDDs.

About a decade ago, I speculated that everyone was going to be forced to migrate to SATA SSDs because developers would build programs that required SATA SSD performance. It's quite common for developers to build programs and systems that run well on their hardware (whether that's laptops, desktops, or servers, cloud or otherwise), and developers often use the latest and best. These days, that's going to have NVMe SSDs, and so it wouldn't be surprising if developers increasingly developed for full NVMe performance. Some of this may be inadvertent, in that the developer doesn't realize what the performance impact of their choices is on systems with less speedy storage. Some of this will likely be deliberate, as developers choose to optimize for NVMe performance or even develop systems that only work well with that level of performance.

This is a potential problem because there are a number of ways to not have that level of NVMe performance. Most obviously, you can simply not have NVMe drives; instead you may be using SATA SSDs (as we mostly are, including in our fileservers), or even HDDs (as we are in our Prometheus metrics server). Less obviously, you may have NVMe drives but be driving them in ways that don't give you the full NVMe bandwidth. For instance, you might have a bunch of NVMe drives behind a 'tri-mode' HBA, or have (some of) your NVMe drives hanging off the chipset with shared PCIe lanes to the CPU, or have to drive some of your NVMe drives with fewer than x4 PCIe because of limits on slots or lanes.

(Dedicated NVMe focused storage servers will be able to support lots of NVMe devices at full speed, but such storage servers are likely to be expensive. People will inevitably build systems with lower end setups, us included, and I believe that basic 1U servers are still mostly SATA/SAS based.)

One possible reason for optimism is that in today's operating systems, it can take careful system design and unusual programming patterns to really push NVMe disks to high performance levels. This may make it less likely that software accidentally winds up being written so it only performs well on NVMe disks; if it happens, it will be deliberate and the project will probably tell you about it. This is somewhat unlike the SSD/HDD situation a decade ago, where the difference in (random) IO operations per second was both massive and easily achieved.

(This entry was sparked in part by reading this article (via), which I'm not taking a position on.)

HTTP headers that tell syndication feed fetchers how soon to come back

By: cks

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can alternatively set an Expires header, which gives an absolute expiry time, but you shouldn't set both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.
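In code, this doesn't take much; here's a generic Python sketch with made-up numbers (and a deliberately simplified single policy), not DWiki's actual code:

    FEED_MAX_AGE = 3600     # advertised retry interval in seconds (one hour)
    RETRY_AFTER = 1800      # constant Retry-After for 429 responses (half an hour)

    def feed_response_headers(rate_limited):
        if rate_limited:
            # HTTP 429: tell the client how long to wait before retrying
            return ('429 Too Many Requests',
                    [('Retry-After', str(RETRY_AFTER))])
        # successful responses (200, and also 304 if you can manage it)
        # advertise the desired polling interval
        return ('200 OK',
                [('Cache-Control', 'max-age=%d' % FEED_MAX_AGE)])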

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or if only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) or succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.

Mapping from total requests per day to average request rates

By: cks

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests Theoretical interval (rate)
6 Four hours
12 Two hours
24 One hour
32 45 minutes
48 30 minutes
96 15 minutes
144 10 minutes
288 5 minutes
360 4 minutes
480 3 minutes
720 2 minutes
1440 One minute
2880 30 seconds
5760 15 seconds
8640 10 seconds
17280 5 seconds
43200 2 seconds
86400 One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)
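(If I ever want different numbers, the table is trivial to regenerate, for example with a bit of Python; the formatting here is just one option:)

    counts = [6, 12, 24, 32, 48, 96, 144, 288, 360, 480,
              720, 1440, 2880, 5760, 8640, 17280, 43200, 86400]
    for n in counts:
        # 86400 seconds in a day, evenly divided among n requests
        print("%5d requests/day -> one every %g seconds" % (n, 86400 / n))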

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

In GNU Emacs, I should remember that the basics still work

By: cks

Over on the Fediverse, I said something that has a story attached:

It sounds obvious to say it, but I need to remember that I can always switch buffers in GNU Emacs by just switching buffers, not by using, eg, the MH-E commands to switch (back) to another folder. The MH-E commands quite sensibly do additional things, but sometimes I don't want them.

GNU Emacs has a spectrum of things that range from assisting your conventional editing (such as LSP clients) to what are essentially nearly full-blown applications that happen to be embedded in GNU Emacs, such as magit and MH-E and the other major modes for reading your email (or Usenet news, or etc). One of my personal dividing lines is to what extent the mode takes over from regular Emacs keybindings and regular Emacs behaviors. On this scale, MH-E is quite high on the 'application' side; in MH-E folder buffers, you mostly do things through custom keybindings.

(Well, sort of. This is actually overselling the case because I use regular Emacs buffer movement and buffer searching commands routinely, and MH-E uses Emacs marks to select ranges of messages, which you establish through normal Emacs commands. But actual MH-E operations, like switching to another folder, are done through custom keybindings that involve MH-E functions.)

My dominant use of GNU Emacs at the moment is as a platform for MH-E. When I'm so embedded in an MH-E mindset, it's easy to wind up with a form of tunnel vision, where I think of the MH-E commands as the only way to do something like 'switch to another (MH) folder'. Sometimes I do need or want to use the MH-E commands, and sometimes they're the easiest way, but part of the power of GNU Emacs as a general purpose environment is that ultimately, MH-E's displays of folders and messages, the email message I'm writing, and so on, are all just Emacs buffers being displayed in Emacs windows. I don't have to switch between these things through MH-E commands if I don't want to; I can just switch buffers with 'C-x b'.

(Provided that the buffer already exists. If the buffer doesn't exist, I need to use the MH-E command to create it.)

Sometimes the reason to use native Emacs buffer switching is that there's no MH-E binding for the functionality, for example to switch from a mail message I'm writing back to my inbox (either to look at some other message or to read new email that just came in). Sometimes it's because, for example, the MH-E command to switch to a folder wants to rescan the MH folder, which forces me to commit or discard any pending deletions and refilings of email.

One of the things that makes this work is that MH-E uses a bunch of different buffers for things. For example, each MH folder gets its own separately named buffer, instead of MH-E simply loading the current folder (whatever it is) into a generic 'show a folder' buffer. Magit does something similar with buffer naming, where its summary buffer isn't called just 'magit' but 'magit: <directory>' (I hadn't noticed that until I started writing this entry, but of course Magit would do it that way as a good Emacs citizen).

Now that I've written this, I've realized that a bit of my MH-E customization uses a fixed buffer name for a temporary buffer, instead of a buffer name based on the current folder. I'm in good company on this, since a number of MH-E status display commands also use fixed-name buffers, but perhaps I should do better. On the other hand, using a fixed buffer name does avoid having a bunch of these buffers linger around just because I used my command.

(This is using with-output-to-temp-buffer, and a lot of use of it in GNU Emacs' standard Lisp is using fixed names, so maybe my usage here is fine. The relevant Emacs Lisp documentation doesn't have style and usage notes that would tell me either way.)

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is all our responsibility. The fewer weird local things we have, the better, especially since we have so many as it is.)

I wish Firefox had some way to permanently mute a website

By: cks

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.

Argparse will let you have multiple long (and short) options for one thing

By: cks

Argparse is the standard Python module for handling (Unix style) command line options, in the expected way (which not all languages follow). Or at least more or less the expected way; people are periodically surprised that by default argparse allows you to abbreviate long options (although you can safely turn that off if you assume Python 3.8 or later and you remember this corner case).

What I think of as the typical language API for specifying short and long options allows you to specify (at most) one of each; this is the API of, for example, the Go package I use for option handling. When I've written Python programs using argparse, I've followed this usage without thinking very much about it. However, argparse doesn't actually require you to restrict yourself this way. The add_argument() method accepts a list of option strings, and although the documentation's example shows a single short option and a single long option, you can give it more than one of each and it will work.

So yes, you can perfectly reasonably create an argparse option that can be invoked as either '--ns' or '--no-something', so that on the one hand you have a clear canonical version and on the other hand you have something short for convenience. If I'm going to do this (and sometimes I am), the thing I want to remember is that argparse's help output will report these options in the order I gave them to add_argument(), so I probably want to list the long one first, as the canonical and clearest form. In other words:

parser.add_argument("--no-something", "--ns", ....)

so that the -h output I get says:

--no-something, --ns     Don't do something

(If you have multiple '--no-...' options, abbreviated options aren't as compact as this '--ns' style. Of course it's a little bit unusual to have several long options that mean the same thing, but my view is that long options are sort of a zoo anyway and you might as well be convenient.)

Having multiple short (single letter) options for the same thing is also possible but much less in the Unix style, so I'm not sure I'd ever use it. One plausible use is mapping old short options to your real ones for compatibility (or just options that people are accustomed to using for some particular purpose from other programs, and keep using with yours).
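For completeness, here's a minimal runnable version of the earlier example; the 'store_true' action is my own assumption for illustration:

    import argparse

    parser = argparse.ArgumentParser()
    # List the long, canonical option first so it leads in the -h output.
    parser.add_argument("--no-something", "--ns", action="store_true",
                        help="Don't do something")
    args = parser.parse_args(["--ns"])
    # The attribute name comes from the first long option string, with
    # dashes turned into underscores.
    print(args.no_something)    # True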

(This is probably not news to anyone who's really used argparse. I'm partly writing this down so that I'll remember it in the future.)

You can only customize GNU Emacs so far due to primitives

By: cks

GNU Emacs is famous as an editor written largely in itself, well, in Emacs Lisp, with a C core for some central high performance things and things that have to be done in C (called 'primitives' in Emacs jargon). It's perhaps popular to imagine that the overall structure of this is that the C parts of GNU Emacs expose a minimal and direct API that's mostly composed of primitive operations, so that as much of Emacs as possible can be implemented in Emacs Lisp. Unfortunately, this isn't really the case, or at least not necessarily as you'd like it, and one consequence of this is to limit the amount of customization you can feasibly do to GNU Emacs.

An illustration of this is in how GNU Emacs de-iconifies frames in X. In a minimal C API version of GNU Emacs, there might be various low level X primitives, including 'x-deiconify-frame', and the Emacs Lisp code for frame management would call these low level X primitives when running under X, and other primitives when running under Windows, and so on. In the actual GNU Emacs, deiconification of frames happens at multiple points and the exposed primitives are things like raise-frame and make-frame-visible. As their names suggest, these primitives aren't there to give Emacs Lisp code access to low level X operations, they're there to do certain higher level logical things.

This is a perfectly fair and logical decision by the GNU Emacs developers. To put it one way, GNU Emacs is opinionated. It and its developers have a certain model of how it works and how things should behave, what it means for the program to be 'GNU Emacs' as opposed to a hypothetical editor construction kit, and what the C code does is a reflection of that. To the Emacs developers, 'make a frame visible' is a sensible thing to do and is best done in C, so they did it that way.

(Buffers are another area where Emacs is quite opinionated on how it wants to work. This sometimes gets awkward, as anyone who's wrestled with temporarily displaying some information from Emacs Lisp may have experienced.)

The drawback of this is that sometimes you can only easily customize GNU Emacs in ways that line up with how the developers expected, since you can't change the inside of C level primitives. If the operation you want to hook, modify, block, or otherwise fiddle with matches up with how GNU Emacs sees things, all is probably good. But if your concept of 'an operation' doesn't match up with how GNU Emacs sees it, you may find that what you want to touch is down inside the C layer and isn't exposed as a separate primitive.

(Even if it is exposed as a primitive in its own right, you can have problems, because when you advise a primitive, this doesn't affect calls to the primitive from other C functions. If there was a separate 'x-deiconify-frame' primitive, I could hook it for calls from Lisp, but not a call from 'make-frame-visible' if that was still a primitive. So to really have effective hooking of a primitive, you need it to be only called from Lisp code (at least for cases you care about).)

PS: This can lead to awkward situations even when everything you want to modify is in Emacs Lisp code, because the specific bit you want to change may be in the middle of a large function. Of course with Emacs Lisp you can always redefine the function, copying its code and modifying it to taste, but there are still drawbacks. You can make this somewhat more reliable in the face of changes (via a comment on this entry), but it's still not great.

The Bash Readline bindings and settings that I want

By: cks

Normally I use Bash (and Readline in general) in my own environment, where I have a standard .inputrc set up to configure things to my liking (although it turns out that one particular setting doesn't work now (and may never have), and I didn't notice). However, sometimes I wind up using Bash in foreign environments, for example if I'm su'd to root at the moment, and when that happens the differences can be things that I get annoyed by. I spent a bit of today running into this again and being irritated enough that this time I figured out how to fix it on the fly.

The general Bash command to do readline things is 'bind', and I believe it accepts all of the same syntax as readline init files do, both for keybindings and for turning off (mis-)features like bracketed paste (which we dislike enough that turning it off for root is a standard feature of our install framework). This makes it convenient if I forget the exact syntax, because I can just look at my standard .inputrc and copy lines from it.

What I want to do is the following:

  • Switch Readline to the Unix word erase behavior I want:

    set bind-tty-special-chars off
    Control-w: backward-kill-word

    Both of these are necessary because without the first, Bash will automatically bind Ctrl-w (my normal word-erase character) to 'unix-word-rubout' and not let you override that with your own binding.

    (This is the difference that I run into all the time, because I'm very used to being able to use Ctrl-W to delete only the most recent component of a path. I think this partly comes from habit and partly because you tab-complete multi-component paths a component at a time, so if I mis-completed the latest component I want to Ctrl-W just it. M-Del is a standard Readline binding for this, but it's less convenient to type and not something I remember.)

  • Make readline completion treat symbolic links to directories as if they were directories:

    set mark-symlinked-directories on

    When completing paths and so on, I mostly don't bother thinking about the difference between an actual directory (such as /usr/bin) and a symbolic link to a directory (such as /bin on modern Linuxes). If I type '/bi<TAB>' I want this to complete to '/bin/', not '/bin', because it's basically guaranteed that I will go on to tab-complete something in '/bin/'. If I actually want the symbolic link, I'll delete the trailing '/' (which does happen every so often, but much less frequently than I want to tab-complete through the symbolic link).

  • Make readline forget any random edits I did to past history lines when I hit Return to finally do something:

    set revert-all-at-newline on

    The behavior I want from readline is that past history is effectively immutable. If I edit some bit of it and then abandon the edit by moving to another command in the history (or just start a command from scratch), the edited command should revert to being what I actually typed back when I executed it no later than when I hit Return on the current command and start a new one. It infuriates me when I cursor-up (on a fresh command) and don't see exactly the past commands that I typed.

    (My notes say I got this from Things You Didn't Know About GNU Readline.)

This is more or less in the order I'm likely to fix them. The different (and to me wrong) behavior of C-w is a relatively constant irritation, while the other two are less frequent.
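
Concretely, fixing things on the fly in a foreign Bash session is just a matter of feeding the same lines to 'bind' (this is the quoting I believe works; adjust to taste):

# bind 'set bind-tty-special-chars off'
# bind '"\C-w": backward-kill-word'
# bind 'set mark-symlinked-directories on'
# bind 'set revert-all-at-newline on'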

(If this irritates me enough on a particular system, I can probably do something in root's .bashrc, if only to add an alias to use 'bind -f ...' on a prepared file. I can't set these in /root/.inputrc, because my co-workers don't particularly agree with my tastes on these and would probably be put out if standard readline behavior they're used to suddenly changed on them.)
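
A minimal sketch of that .bashrc bit, with a made up alias name and a hypothetical path for the prepared file, would be:

alias fixreadline='bind -f /root/.inputrc-cks'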

(In other Readline things I want to remember, there's Readline's support for fishing out last or first or Nth arguments from earlier commands.)

Why Wandering Thoughts has fewer comment syndication feeds than yesterday

By: cks

Over on the Fediverse I said:

My techblog used to offer Atom syndication feeds for the comments on individual entries. I just turned that off because it turns out to be a bad idea on the modern web when you have many years of entries. There are (were) any number of 'people' (feed things) that added the comment feeds for various entries years ago and then never took them out, despite those entries being years old and in some cases never having gotten comments in the first place.

DWiki, the engine behind Wandering Thoughts, is nothing if not general. Syndication feeds, for example, are a type of 'view' over a directory hierarchy, and are available for both pages and comments. A regular (page) syndication feed view can only be done over (on) a directory, because if it was applied to an individual page the feed would only ever contain that page. However, when I wrote DWiki it was obvious that a comment syndication feed for a particular page made sense; it would give you all of the comments 'under' that page (ie, on it). And so for almost all of the time that Wandering Thoughts has been in operation, you could have looked down to the bottom of an entry's page (on the web) and seen in small type 'Atom Syndication: Recent Comments' (with the 'recent comments' being a HTML link giving you the URL of that page's comment feed).

(The comment syndication feed for a directory is all comments on all pages underneath the directory.)

That's gone now, because I decided that it didn't make sense in what Wandering Thoughts has become and because I was slowly accumulating feed readers that were pulling the comment syndication feeds for more and more entries. This is exactly the behavior I should have expected from feed readers from the start; once someone puts a feed in, that feed is normally forever even if it's extremely inactive or has never had an entry. The feed reader will dutifully poll every feed for years to come (well, certainly every feed that responds with HTTP success and a valid syndication feed, which all of my comment feeds did).

(There weren't very many pages having their comment syndication feeds hit, but there were enough that I kept noticing them, especially when I added things like hacky rate limiting for feed fetching. I actually put in some extra hacks to deal with how requests for these feeds interacted with my rate limiting.)

There are undoubtedly places on the Internet where discussion (in the form of comments) continues on for years on certain pages, and so a comment feed for an individual page could make sense; you really might keep up (in your feed reader) with a slow moving conversation that lasts years. Other places on the Internet put definite cut-offs on further discussion (comments) on individual pages, which provides a natural deadline to turn off the page's comment syndication feed. But neither of those profiles describes Wandering Thoughts, where my entries remain open for comments more or less forever (and sometimes people do comment on quite old entries), but comments and discussions don't tend to go on for very long.

Of course, the other thing this change does is stop (LLM) web crawlers from trying to crawl all of those comment syndication feed URLs. You can't crawl URLs that aren't advertised any more and no longer exist (well, sort of; they technically exist, but the code for handling them arranges to return 404s if the new 'no comment feeds for actual pages' configuration option is turned on).

Giving up on Android devices using IPv6 on our general-access networks

By: cks

We have a couple of general purpose, general access networks that anyone can connect their devices to; one is a wired network (locally, it's called our 'RED' network after the colour of the network cables used for it), and the other is a departmental wireless network that's distinct from the centrally run university-wide network. However, both of these networks have a requirement that we need to be able to more or less identify who is responsible for a machine on them. Currently, this is done through (IPv4) DHCP and registering the Ethernet address of your device. This is a problem for any IPv6 deployment, because the Android developers refuse to support DHCPv6.

We're starting to look more seriously at IPv6, including sort of planning out how our IPv6 subnets will probably work, so I came back to thinking about this issue recently. My conclusion and decision was to give up on letting Android devices use IPv6 on our networks. We can't use SLAAC (StateLess Address AutoConfiguration) because that doesn't require any sort of registration, and while Android devices apparently can use IPv6 Prefix Delegation, that would consume /64s at a prodigious rate using reasonable assumptions. We'd also have to build a system to do it. So there's no straightforward answer, and while I can think of potential hacks, I've decided that none of them are particularly good options compared to the simple choice to not support IPv6 for Android by way of only supporting DHCPv6.

(Our requirement for registering a fixed Ethernet address also means that any device that randomizes its wireless Ethernet address on every connection has to turn that off. Hopefully all such devices actually have such an option.)

I'm only a bit sad about this, because you can only hope that a rock rolls uphill for so long before you give up. IPv6 is still not a critical thing in my corner of the world (as shown by how no one is complaining to us about the lack of it), so some phones continuing to not have IPv6 is not likely to be a big deal to people here.

(Android devices that can be connected to wired networking will be able to get IPv6 on some research group networks. Some research groups ask for their network to be open and not require pre-registration of devices (which is okay if it only exists in access-controlled space), and for IPv6 I expect we'll do this by turning on SLAAC on the research group's network and calling it a day.)

Connecting M.2 drives to various things (and not doing so)

By: cks

As a result of discovering that (M.2) NVMe SSDs seem to have become the dominant form of SSDs, I started looking into what you could connect M.2 NVMe SSDs to. Especially I started looking to see if you could turn M.2 NVMe SSDs into SATA SSDs, so you could connect high capacity M.2 NVMe SSDs to, for example, your existing stock of ZFS fileservers (which use SATA SSDs). The short version is that as far as I can tell, there's nothing that does this, and once I started thinking about it I wasn't as surprised as I might be.

What you can readily find is passive adapters from M.2 NVMe or M.2 SATA to various other forms of either NVMe or SATA, depending. For example, there are M.2 NVMe to U.2 cases, and M.2 SATA to SATA cases; these are passive because they're just wiring things through, with no protocol conversion. There are also some non-passive products that go the other way; they're an M.2 'NVMe' 2280 card that has four SATA ports on it (and presumably a PCIe SATA controller). However, the only active M.2 NVMe product (one with protocol conversion) that I can find is M.2 NVMe to USB, generally in the form of external enclosures.

(NVMe drives are PCIe devices, so an 'M.2 NVMe' connector is actually providing some PCIe lanes to the M.2 card. Normally these lanes are connected to an NVMe controller, but I don't believe there's any intrinsic reason that you can't connect them to other PCIe things. So you can have 'PCIe SATA controller on an M.2 PCB' and various other things.)

When I thought about it, I realized the problem with my hypothetical 'obvious' M.2 NVMe to SATA board (and case): since it involves protocol conversion (between NVMe and SATA), someone would have to make the controller chipset for it. You can't make a M.2 NVMe to SATA adapter until someone goes to the expense of designing and fabricating (and probably programming) the underlying chipset, and presumably no one has yet found it commercially worthwhile to do so. Since (M.2) NVMe to USB adapters exist, protocol conversion is certainly possible, and since such adapters are surprisingly inexpensive, presumably there's enough demand to drive down the price of the underlying controller chipsets.

(These chipsets are, for example, the Realtek RTL9210B-CG or the ASMedia ASM3242.)

Designing a chipset is not merely expensive, it's very expensive, which to me explains why there aren't any high-priced options for connecting a NVMe drive up via SATA, the way there are high-priced options for some uncommon things (like connecting multiple NVMe drives to a single PCIe slot without PCIe bifurcation, which can presumably be done with the right existing PCIe bridge chipset).

(Since I checked, there also don't currently seem to be any high capacity M.2 SATA SSDs (which in theory could just be a controller chipset swap from the M.2 NVMe version). If they existed, you could use a passive M.2 SATA to 2.5" SATA adapter to get them into the form factor you want.)

It seems like NVMe SSDs have overtaken SATA SSDs for high capacities

By: cks

For a long time, NVMe SSDs were the high end option; as the high end option they cost more than SATA SSDs of the same capacity, and SATA SSDs were generally available in higher capacity than NVMe SSDs (at least at prices you wanted to pay). This is why my home desktop wound up with a storage setup with a mirrored pair of 2 TB NVMe SSDs (which felt pretty indulgent) and a mirrored pair of 4 TB SATA SSDs (which felt normal-ish). Today, for reasons outside the boundary of this entry, I wound up casually looking to see how available large SSDs were. What I expected to find was that large-capacity SATA SSDs would now be reasonably available and not too highly priced, while NVMe SSDs would top out at perhaps 4TB and high prices.

This is not what I found, at least at some large online retailers. Instead, SATA SSDs seem to have almost completely stagnated at 4 TB, with capacities larger than that only available from a few specialty vendors at eye-watering prices. By contrast, 8 TB NVMe SSDs seem readily available at somewhat reasonable prices from mainstream drive vendors like WD (they aren't inexpensive but they're not unreasonable given the prices of 4 TB NVMe, which is roughly the price I remember 4 TB SATA SSDs being at). This makes me personally sad, because my current home desktop has more SATA ports than M.2 slots or even PCIe x1 slots.

(You can get PCIe x1 cards that mount a single NVMe SSD, and I think I'd get somewhat better than SATA speeds out of them. I have one to try out in my office desktop, but I haven't gotten around to it yet.)

At one level this makes sense. Modern motherboards have a lot more M.2 slots than they used to, and I speculated several years ago that M.2 NVMe drives would eventually be cheaper to make than 2.5" SSDs. So in theory I'm not surprised that probable consumer (lack of) demand has basically extinguished SATA SSDs above 4 TB. In practice, I am surprised and it feels disconcerting for NVMe SSDs to now look like the 'mainstream' choice.

(This is also potentially inconvenient for work, where we have a bunch of ZFS fileservers that currently use 4 TB 2.5" SATA SSDs (an update from their original 2 TB SATA SSDs). If there are no reasonably priced SATA SSDs above 4 TB, then our options for future storage expansion become more limited. In the long run we may have to move to U.2 to get hotswappable 4+ TB SSDs. On the other hand, apparently there are inexpensive M.2 to U.2 adapters, and we've done worse sins with our fileservers.)

Websites and web developers mostly don't care about client-side problems

By: cks

In response to my entry on the fragility of the web in the face of the crawler plague, Jukka said in a comment:

While I understand the server-side frustrations, I think the corresponding client-side frustrations have largely been lacking from the debates around the Web.

For instance, CloudFlare now imposes heavy-handed checks that take a few seconds to complete. [...]

This is absolutely true but it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops. If we can't get web developers to care about common or majority experiences for their UI, which in some sense has relatively little on the line, the odds of web site operators caring when their servers are actually experiencing problems (or at least annoyances) are basically nil.

Much like browsers have most of the power in various relationships with, for example, TLS certificate authorities, websites have most of the power in their relationship to clients (ie, us). If people don't like what a website is doing, their only option is generally a boycott. Based on the available evidence, any boycotts over things like CAPTCHA challenges have been ineffective so far. Github can afford to give people a UI with terrible performance because the switching costs are sufficiently high that they know most people won't switch.

(Another view is that the server side mostly doesn't notice or know that they're losing people; the lost people are usually invisible, with websites only having much visibility into the people who stick around. I suspect that relatively few websites do serious measurement of how many people bounce off or stop using them.)

Thus, in my view, it's not so much that client-side frustrations have been 'lacking' from debates around the web, which makes it sound like client side people haven't been speaking up, as that they've been actively ignored because, roughly speaking, no one on the server side cares about client-side frustrations. Maybe they vaguely sympathize, but they care a lot more about other things. And it's the web server side who decides how things operate.

(The fragility exposed by LLM crawler behavior demonstrates that clients matter in one sense, but it's not a sense that encourages website operators to cooperate or listen. Rather the reverse.)

I'm in no position to throw stones here, since I'm actively making editorial decisions that I know will probably hurt some real clients. Wandering Thoughts has never been hammered by crawler load the way some sites have been; I merely decided that I was irritated enough by the crawlers that I was willing to throw a certain amount of baby out with the bathwater.

Getting the Cinnamon desktop environment to support "AppIndicator"

By: cks

The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon desktop appeared to support it, except they weren't working for me. Now I actually understand what's going on, more or less, and how to solve my problem of a program complaining that it needed AppIndicator.

Cinnamon directly implements the AppIndicator notification protocol in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as part of your (Cinnamon) session. However, it has a little feature, namely that it will exit if no one is asking it to do anything:

XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s

In a normally functioning Cinnamon environment, something will soon show up to be an active monitor and stop xapp-sn-watcher from exiting:

Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown

This something is a standard Cinnamon desktop applet. In System Settings → Applets, it's way down at the bottom and is called "XApp Status Applet". If you've accidentally wound up with it not turned on, xapp-sn-watcher will (probably) not have a monitor active after 30 seconds, and then it will exit (and in the process of exiting, it will log alarming messages about failed GLib assertions). Not having this xapp-status applet turned on was my problem, and turning it on fixed things.

(I don't know how it got turned off. It's possible I went through the standard applets at some point and turned some of them off in an excess of ignorant enthusiasm.)

As I found out from leigh scott in my Fedora bug report, the way to get this debugging output from xapp-sn-watcher is to run 'gsettings set org.x.apps.statusicon sn-watcher-debug true'. This will cause xapp-sn-watcher to log various helpful and verbose things to your ~/.xsession-errors (although apparently not the fact that it's actually exiting; you have to deduce that from the timestamps stopping 30 seconds later and that being the timestamps on the GLib assertion failures).
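
For my future self, the sequence is roughly this (the gsettings key is as above; where the messages land may differ on other setups):

$ gsettings set org.x.apps.statusicon sn-watcher-debug true
$ tail -f ~/.xsession-errors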

(I don't know why there's both a program and an applet involved in this and I've decided not to speculate.)

The current (2025) crawler plague and the fragility of the web

By: cks

These days, more and more people are putting more and more obstacles in the way of the plague of crawlers (many of them apparently doing it for LLM 'AI' purposes), me included. Some of these obstacles involve attempting to fingerprint unusual aspects of crawler requests, such as using old browser User-Agents or refusing to accept compressed things in an attempt to avoid gzip bombs; other obstacles may involve forcing visitors to run JavaScript, using CAPTCHAs, or relying on companies like Cloudflare to block bots with various techniques.

On the one hand, I sort of agree that these 'bot' (crawler) defenses are harmful to the overall ecology of the web. On the other hand, people are going to do whatever works for them for now, and none of the current alternatives are particularly good. There's a future where much of the web simply isn't publicly available any more, at least not to anonymous people.

One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.

(This isn't new fragility; the fragility was always there.)

Unfortunately, I don't see a technical way out from this (and I'm not sure I see any realistic way in general). There's no magic wand that we can wave to make all of the existing websites, web apps, and so on not get impaired by LLM crawlers when the crawlers persist in visiting everything despite being told not to, and on top of that we're not going to make bandwidth free. Instead I think we're looking at a future where the web ossifies for and against some things, and more and more people see catgirls.

(I feel only slightly sad about my small part in ossifying some bits of the web stack. Another part of me feels that a lot of web client software has gotten away with being at best rather careless for far too long, and now the consequences are coming home to roost.)

What an "AppIndicator" is in Linux desktops and some notes on it

By: cks

Suppose, not hypothetically, that you start up some program on your Fedora 42 Cinnamon desktop and it helpfully tells you "<X> requires AppIndicator to run. Please install the AppIndicator plugin for your desktop". You are likely confused, so here are some notes.

'AppIndicator' itself is the name of an application notification protocol, apparently originally from KDE, and some desktop environments may need a (third party) extension to support it, such as the Ubuntu one for GNOME Shell. Unfortunately for me, Cinnamon is not one of those desktops. It theoretically has native support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher, part of Cinnamon's xapps package.

The actual 'AppIndicator' protocol is done over D-Bus, because that's the modern way. Since this started as a KDE thing, the D-Bus name is 'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is found in /usr/share/dbus-1/services, but not all names are mentioned there and 'org.kde.StatusNotifierWatcher' is one of the missing ones. In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the D-Bus name in its 'Comment=', but that's probably not something you can count on to find what your desktop is (theoretically) using to provide a given D-Bus name. I found xapp-sn-watcher somewhat through luck.

There are probably a number of ways to see what D-Bus names are currently registered and active. The one that I used when looking at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as I know, there's no easy way to go from an error message about 'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher'; in my case I read the source of the thing complaining which was helpfully in Python.

(I used the error message to find the relevant section of code, which showed me what it wasn't finding.)
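
As a quick check for whether anything currently provides the name (the grep is just a convenience, and the exact output formatting may vary):

$ dbus-send --session --print-reply --dest=org.freedesktop.DBus \
    /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i statusnotifier
      string "org.kde.StatusNotifierWatcher"

If this prints nothing, nothing on your session bus has registered org.kde.StatusNotifierWatcher.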

I have no idea how to actually fix the problem, or if there is a program that implements org.kde.StatusNotifierWatcher as a generic, more or less desktop independent program the way that stalonetray does for system tray stuff (or one generation of system tray stuff, I think there have been several iterations of it, cf).

(Yes, I filed a Fedora bug, but I believe Cinnamon isn't particularly supported by Fedora so I don't expect much. I also built the latest upstream xapps tree and it also appears to fail in the same way. Possibly this means something in the rest of the system isn't working right.)

Some notes on DMARC policy inheritance and a gotcha

By: cks

When you use DMARC, you get to specify a policy that people should apply to email that claims to be from your domain but doesn't pass DMARC checks (people are under no obligation to pay attention to this and they may opt to be stricter). These policies are set in DNS TXT records, and in casual use we can say that the policies of subdomains in your domain can be 'inherited'. This recently confused me and now I have some answers.

Your top level domain can specify a separate policy for itself (eg 'user@example.org') and subdomains (eg 'user@foo.example.org'); these are the 'p=' and 'sp=' bits in a DMARC DNS TXT record. Your domain's subdomain policy is used only for subdomains that don't set a policy themselves; an explicitly set subdomain policy overrides the domain policy, for better or worse. If your organization wants to force some minimum DMARC policy, you can't do it with a simple DNS record; you have to somehow forbid subdomains from publishing their own conflicting DMARC policies in your DNS.
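
As a concrete sketch in DNS zone file form (using standard example names; the specific policy choices here are arbitrary):

_dmarc.example.org.      IN TXT "v=DMARC1; p=quarantine; sp=reject"
_dmarc.foo.example.org.  IN TXT "v=DMARC1; p=none"

Here mail from 'user@example.org' gets 'p=quarantine', mail from a subdomain with no DMARC record of its own gets the top level 'sp=reject', and foo.example.org's own 'p=none' overrides that for 'user@foo.example.org', whether or not that's what the overall organization wanted.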

The flipside of this is that it's not as bad as it could be to set a strict subdomain policy in your domain DMARC record, because subdomains that care can override it (and may already be doing so implicitly if they've published DMARC records themselves).

However, strictly speaking DMARC policies aren't inherited as we usually think about it. Instead, as I once knew but forgot since then, people using DMARC will check for an applicable policy in only two places: on the direct domain or host name that they care about, and on your organization's top level domain. What this means in concrete terms is that if example.org and foo.example.org both have DMARC records and someone sends email as 'user@bar.foo.example.org', the foo.example.org DMARC record won't be checked. Instead, people will look for DMARC only at 'bar.foo.example.org' (where any regular 'p=' policy will be used) and at 'example.org' (where the subdomain policy, 'sp=', will be used).
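
In other words, for mail claiming to be from 'user@bar.foo.example.org', a DMARC verifier will only ever make these two lookups:

$ dig +short TXT _dmarc.bar.foo.example.org
$ dig +short TXT _dmarc.example.org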

(As a corollary, a 'sp=' policy setting in the foo.example.org DMARC record will never be used.)

One place this gets especially interesting is if people send email using the domain 'nonexistent.foo.example.org' in the From: header (either from inside or outside your organization). Since this host name isn't in DNS, it has no DMARC policy of its own, and so people will go straight to the 'example.org' subdomain policy without even looking at the policy of 'foo.example.org'.

(Since traditional DNS wildcard records can only wildcard the leftmost label and DMARC records are looked up on a special '_dmarc.' DNS sub-name, it's not simple to give arbitrary names under your subdomain a DMARC policy.)

How not to check or poll URLs, as illustrated by Fediverse software

By: cks

Over on the Fediverse, I said some things:

[on April 27th:]
A bit of me would like to know why the Akkoma Fediverse software is insistently polling the same URL with HEAD then GET requests at five minute intervals for days on end. But I will probably be frustrated if I turn over that rock and applying HTTP blocks to individual offenders is easier.

(I haven't yet blocked Akkoma in general, but that may change.)

[the other day:]
My patience with the Akkoma Fediverse server software ran out so now all attempts by an Akkoma instance to pull things from my techblog will fail (with a HTTP redirect to a static page that explains that Akkoma mis-behaves by repeatedly fetching URLs with HEAD+GET every few minutes). Better luck in some future version, maybe, although I doubt the authors of Akkoma care about this.

(The HEAD and GET requests are literally back to back, with no delay between them that I've ever observed.)

Akkoma is derived from Pleroma and I've unsurprisingly seen Pleroma also do the HEAD then GET thing, but so far I haven't seen any Pleroma server showing up with the kind of speed and frequency that (some) Akkoma servers do.

These repeated HEADs and GETs are for Wandering Thoughts entries that haven't changed. DWiki is carefully written to supply valid HTTP Last-Modified and ETag, and these values are supplied in replies to both HEAD and GET requests. Despite all of this, Akkoma is not doing conditional GETs and is not using the information from the HEAD to avoid doing a GET if neither header has changed its value from the last time. Since Akkoma is apparently completely ignoring the result of its HEAD request, it might as well not make the HEAD request in the first place.

If you're going to repeatedly poll a URL, especially every five or ten minutes, and you want me to accept your software, you must do conditional GETs. I won't like you and may still arrange to give you HTTP 429s for polling so fast, but I most likely won't block you outright. Polling every five or ten minutes without conditional GET is completely unacceptable, at least to me (other people probably don't notice or care).
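
For illustration, a minimal conditional GET with curl looks something like this (the URL and the header values are placeholders for whatever you saved from your last successful fetch):

$ curl -s -D - -o /dev/null \
    -H 'If-None-Match: "<saved ETag>"' \
    -H 'If-Modified-Since: <saved Last-Modified>' \
    https://example.org/some/entry

If nothing has changed, the server should answer with a bodiless '304 Not Modified', which is the whole point.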

My best guess as to why Akkoma is polling the URL at all is that it's for "link previews". If you link to something in a Fediverse post, various Fediverse software will do the common social media thing of trying to embed some information about the target of the URL into the post as it presents it to local people; for plain links with no special handling, this will often show the page title. As far as the (rapid) polling goes, I can only guess that Akkoma has decided that it is extremely extra special and it must update its link preview information very rapidly should the linked URL do something like change the page title. However, other Fediverse server implementations manage to do link previews without repeatedly polling me (much less the HEAD then immediately a GET thing).

(On the global scale of things this amount of traffic is small beans, but it's my DWiki and I get to be irritated with bad behavior if I want to, even if it's small scale bad behavior.)

Getting Linux nflog and tcpdump packet filters to sort of work together

By: cks

So, suppose that you have a brand new nflog version of OpenBSD's pflog, so you can use tcpdump to watch dropped packets (or in general, logged packets). And further suppose that you specifically want to see DNS requests to your port 53. So of course you do:

# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented

Perhaps we can get clever by reading from the interface in one tcpdump and sending it to another to be interpreted, forcing the pcap filter to be handled entirely in user space instead of the kernel:

# tcpdump --immediate-mode -w - -U -i nflog:30 | tcpdump -r - 'port 53'
tcpdump: listening on nflog:30, link-type NFLOG (Linux netfilter log messages), snapshot length 262144 bytes
reading from file -, link-type NFLOG (Linux netfilter log messages), snapshot length 262144
tcpdump: NFLOG link-layer type filtering not implemented

Alas we can't.

As far as I can determine, what's going on here is that the netfilter log system, 'NFLOG', uses a 'packet' format that isn't the same as any of the regular formats (Ethernet, PPP, etc) and adds some additional (meta)data about the packet to every packet you capture. I believe the various attributes this metadata can contain are listed in the kernel's nfnetlink_log.h.

(I believe it's not technically correct to say that this additional stuff is 'before' the packet; instead I believe the packet is contained in a NFULA_PAYLOAD attribute.)

Unfortunately for us, tcpdump (or more exactly libpcap) doesn't know how to create packet capture filters for this format, not even ones that are interpreted entirely in user space (as happens when tcpdump reads from a file).

I believe that you have two options. First, you can use tshark with a display filter, not a capture filter:

# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]

(Tshark capture filters are subject to the same libpcap inability to work on NFLOG formatted packets as tcpdump has.)

Alternately and probably more conveniently, you can tell tcpdump to use the 'IPV4' datalink type instead of the default, as mentioned in (opaque) passing in the tcpdump manual page:

# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
  NFLOG (Linux netfilter log messages)
  IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]

Of course this is only applicable if you're only doing IPv4. If you have some IPv6 traffic that you want to care about, I think you have to use tshark display filters (which means learning how to write Wireshark display filters, something I've avoided so far).
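
My guess (untested) is that the IPv6 version of the earlier tshark command would look like this, since 'ipv6' is the Wireshark display filter name for IPv6 traffic:

# tshark -i nflog:30 -Y 'ipv6 and (udp.port == 53 or tcp.port == 53)'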

I think there is some potentially useful information in the extra NFLOG data, but to get it or to filter on it I think you'll need to use tshark (or Wireshark) and consult the NFLOG display filter reference, although that doesn't seem to give you access to all of the NFLOG stuff that 'tshark -i nflog:30 -V' will print about packets.

(Or maybe the trick is that you need to match 'nflog.tlv_type == <whatever> and nflog.tlv_value == <whatever>'. I believe that some NFLOG attributes are available conveniently, such as 'nflog.prefix', which corresponds to NFULA_PREFIX. See packet-nflog.c.)

PS: There's some information on the NFLOG format in the NFLOG linktype documentation and tcpdump's supported data link types in the link-layer header types documentation.

An interesting thing about people showing up to probe new DNS resolvers

By: cks

Over on the Fediverse, I said something:

It appears to have taken only a few hours (or at most a few hours) from putting a new resolving DNS server into production to seeing outside parties specifically probing it to see if it's an open resolver.

I assume people are snooping activity on authoritative DNS servers and going from there, instead of spraying targeted queries at random IPs, but maybe they are mass scanning.

There turn out to be some interesting aspects to these probes. This new DNS server has two network interfaces, both firewalled off from outside queries, but only one is used as the source IP on queries to authoritative DNS servers. In addition, we have other machines on both networks, with firewalls, so I can get a sense of the ambient DNS probes.

Out of all of these various IPs, the IP that the new DNS server used for querying authoritative DNS servers, and only that IP, very soon saw queries that were specifically tuned for it:

124.126.74.2.54035 > 128.100.X.Y.53: 16797 NS? . (19)
124.126.74.2.7747 > 128.100.X.Y.7: UDP, length 512
124.126.74.2.54035 > 128.100.X.Y.53: 17690 PTR? Y.X.100.128.in-addr.arpa. (47)

This was a consistent pattern from multiple IPs; they all tried to query for the root zone, tried to check the UDP echo port, and then tried a PTR query for the machine's IP itself. Nothing else saw this pattern; not the machine's other IP on a different network, not another IP on the same network, and so on. This pattern, and its absence for other IPs, is what's led me to assume that people are somehow identifying probe targets based on what source IPs they see making upstream queries.

(There are a variety of ways that you could do this without having special access to DNS servers. APNIC has long used web ad networks and special captive domains and DNS servers for them to do various sorts of measurements, and you could do similar things to discover who was querying your captive DNS servers.)

How you want to have the Unbound DNS server listen on all interfaces

By: cks

Suppose, not hypothetically, that you have an Unbound server with multiple network interfaces, at least two (which I will call A and B), and you'd like Unbound to listen on all of the interfaces. Perhaps these are physical interfaces and there are client machines on both, or perhaps they're virtual interfaces and you have virtual machines on them. Let's further assume that these are routed networks, so that in theory people on A can talk to IP addresses on B and vice versa.

The obvious and straightforward way to have Unbound listen on all of your interfaces is with a server stanza like this:

server:
  interface: 0.0.0.0
  interface: ::0
  # ... probably some access-control statements

This approach works 99% of the time, which is probably why it appears all over the documentation. The other 1% of the time is when a DNS client on network A makes a DNS request to Unbound's IP address on network B; when this happens, the network A client will not get any replies. Well, it won't get any replies that it accepts. If you use tcpdump to examine network traffic, you will discover that Unbound is sending replies to the client on network A using its network A IP address as the source address (which is the default behavior if you send packets to a network you're directly attached to; you normally want to use your IP on that network as the source IP). This will fail with almost all DNS client libraries because DNS clients reject replies from unexpected sources, which is to say any IP other than the IP they sent their query to.

(One way this might happen is if the client moves from network B to network A without updating its DNS configuration. Or you might be testing to see if Unbound's network B IP address answers DNS requests.)
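
You can see the failure in action with dig from a client on network A, because dig helpfully reports the mismatched source address (the addresses here are placeholders):

$ dig @<my-B-IP> example.org A
;; reply from unexpected source: <my-A-IP>#53, expected <my-B-IP>#53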

The other way to listen on all interfaces in modern Unbound is to use 'interface-automatic: yes' (in server options), like this:

server:
  interface-automatic: yes

The important bit of what interface-automatic does for you is mentioned in passing in its documentation, and I've emphasized it here:

Listen on all addresses on all (current and future) interfaces, detect the source interface on UDP queries and copy them to replies.

As far as I know, you can't get this 'detect the source interface' behavior for UDP queries in any other way if you use 'interface: 0.0.0.0' to listen on everything. You get it if you listen on specific interfaces, perhaps with 'ip-transparent: yes' for safety:

server:
  interface: 127.0.0.1
  interface: ::1
  interface: <network A>.<my-A-IP>
  interface: <network B>.<my-B-IP>
  # insure we always start
  ip-transparent: yes

Since 'interface-automatic' is marked as an experimental option I'd love to be wrong, but I can't spot an option in skimming the documentation and searching on some likely terms.

(I'm a bit surprised that Unbound doesn't always copy the IP address it received UDP packets on and use that for replies, because I don't think things work if you have the wrong IP there. But this is probably an unusual situation and so it gets papered over, although now I'm curious how this interacts with default routes.)

Another reason to use expendable email addresses for everything

By: cks

I'm a long time advocate of using expendable email addresses any time you have to give someone an email address (and then making sure you can turn them off or more broadly apply filters to them). However, some of the time I've trusted the people who were asking for an email address, didn't have an expendable address already prepared for them, and gave them my regular email address. Today I discovered (or realized) another reason to not do this and to use expendable addresses for absolutely everything, and it's not the usual reason of "the people you gave your email address to might get compromised and have their address collection extracted and sold to spammers". The new problem is mailing service providers, such as Mailchimp.

It's guaranteed that some number of spammers make use of big mailing service providers, so you will periodically get spam email from such MSPs to any exposed email address, most likely including your real, primary one. At the same time, these days it's quite likely that anyone you give your email address to will at some point wind up using an MSP, if only to send out a cheerful notification of, say, "we moved from street address A to street address B, please remember for planning your next appointment" (because if you want to send out such a mass mailing, you basically have to outsource it to an MSP to get it done, even if you normally use, eg, GMail for your regular organizational activities).

If you've given innocent trustworthy organizations your main email address, it's potentially dangerous or impossible to block a particular MSP from sending email to it. In searching your email archive, you may find that such an organization is already using the MSP to send you stuff that you want, or for big MSPs you might decide that the odds are too bad. But if you've given separate expendable email addresses to all such organizations, you know that they're not going to be sending anything to your main email address, including through some MSP that you've just got spam from, and it's much safer to block that MSP's access to your main email address.

This issue hadn't occurred to me back when I apparently gave one organization my main email address, but it became relevant recently. So now I'm writing it down, if only for my future self as a reminder of why I don't want to do that.

Implementing a basic equivalent of OpenBSD's pflog in Linux nftables

By: cks

OpenBSD's and FreeBSD's PF system has a very convenient 'pflog' feature, where you put in a 'log' bit in a PF rule and this dumps a copy of any matching packets into a pflog pseudo-interface, where you can both see them with 'tcpdump -i pflog0' and have them automatically logged to disk by pflogd in pcap format. Typically we use this to log blocked packets, which gives us both immediate and after the fact visibility of what's getting blocked (and by what rule, also). It's possible to mostly duplicate this in Linux nftables, although with more work and there's less documentation on it.

The first thing you need is nftables rules with one or two log statements of the form 'log group <some number>'. If you want to be able to both log packets for later inspection and watch them live, you need two 'log group' statements with different numbers; otherwise you only need one. You can use different (group) numbers on different nftables rules if you want to be able to, say, look only at accepted but logged traffic or only dropped traffic. In the end this might wind up looking something like:

tcp dport ssh counter log group 30 log group 31 drop;

As the nft manual page will tell you, this uses the kernel 'nfnetlink_log' to forward the 'logs' (packets) to a netlink socket, where at most one process can subscribe to a particular group to receive those logs (ie, those packets). If we want to both log the packets and be able to tcpdump them, we need two groups so we can have ulogd getting one and tcpdump getting the other.

To see packets from any particular log group, we use the special 'nflog:<N>' pseudo-interface that's hopefully supported by your Linux version of tcpdump. This is used as 'tcpdump -i nflog:30 ...' and works more or less like you'd want it to. However, as far as I know there's no way to see meta-information about the nftables filtering, such as what rule was involved or what the decision was; you just get the packet.

To log the packets to disk for later use, the default program is ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic as OpenBSD's and FreeBSD's pf logging; instead you have to configure it in /etc/ulogd.conf, and on Ubuntu make sure you have the 'ulogd2-pcap' package installed (along with ulogd2 itself). Based merely on getting it to work, what you want in /etc/ulogd.conf is the following three bits:

# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP

# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6

# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0

(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically rotated by logrotate, although I think by default it's only weekly, so you might want to make it daily.)

The ulogd documentation suggests that you will need to capture IPv4 and IPv6 traffic separately, but I've only used this on IPv4 traffic so I don't know. This may imply that you need separate nftables rules to log (and drop) IPv6 traffic so that you can give it a separate group number for ulogd (I'm not sure if it needs a separate one for tcpdump or if tcpdump can sort it out).
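
If you do need IPv6, my guess (untested) based on the addressfamily comment above is that you'd add a second stack fed from its own nftables log group, something like the following, where group 32 and the file name are made up:

stack=log32:NFLOG,base1:BASE,pcap32:PCAP

[log32]
group=32
addressfamily=10

[pcap32]
file="/var/log/ulog/ulogd-v6.pcap"
sync=0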

Ulogd can also log to many different things besides PCAP format, including JSON and databases. It's possible that there are ways to enrich the ulogd pcap logs, or maybe just the JSON logs, with additional useful information such as the network interface involved and other things. I find the ulogd documentation somewhat opaque on this (and also it's incomplete), and I haven't experimented.

(According to this, the JSON logs can be enriched or maybe default to that.)

Given the assorted limitations and other issues with ulogd, I'm tempted to not bother with it and only have our nftables setups support live tcpdump of dropped traffic with a single 'log group <N>'. This would save us from the assorted annoyances of ulogd2.

PS: One reason to log to pcap format files is that then you can use all of the tcpdump filters that you're already familiar with in order to narrow in on (blocked) traffic of interest, rather than having to put together a JSON search or something.

The 'nft' command may not show complete information for iptables rules

By: cks

These days, nftables is the Linux network firewall system that you want to use, and especially it's the system that Ubuntu will use by default even if you use the 'iptables' command. The nft command is the official interface to nftables, and it has a 'nft list ruleset' sub-command that will list your NFT rules. Since iptables rules are implemented with nftables, you might innocently expect that 'nft list ruleset' will show you the proper NFT syntax to achieve your current iptables rules.

Well, about that:

# iptables -vL INPUT
[...] target prot opt in  out  source   destination         
[...] ACCEPT tcp  --  any any  anywhere anywhere    match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
      ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]

As they say, "yeah no". As the documentation tells you (eventually), somewhat reformatted:

xt TYPE NAME

TYPE := match | target | watcher

This represents an xt statement from xtables compat interface. It is a fallback if translation is not available or not complete. Seeing this means the ruleset (or parts of it) were created by iptables-nft and one should use that to manage it.

Nftables has a native set type (and also maps), but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets by the iptables compatibility layer. Instead the compatibility layer uses this 'xt match' magic that the nft command can only imperfectly tell you about. To nft's credit, it prints a warning comment (which I've left out) that the rules are being managed by iptables-nft and you shouldn't touch them. Here, all of the 'xt match "set"' bits in the nft output are basically saying "opaque stuff happens here".

This still makes me a little bit sad because it makes it that bit harder to bootstrap my nftables knowledge from what iptables rules convert into. If I wanted to switch to nftables rules and nftables sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative scratch instead of getting to clean up what the various translation tools would produce or report.

(As a side effect it makes it less likely that I'll convert various iptables things to being natively nft/nftables based, because I can't do a fully mechanical conversion. If they still work with iptables-nft, I'm better off leaving them as is. Probably this also means that iptables-nft support is likely to have a long, long life.)
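
For reference, my rough guess at what a hand-written native version would look like is the following; the set contents here are made up (the real ones come from the existing ipsets) and I haven't tested it:

table inet filter {
  set nfsports {
    type inet_service
    elements = { 111, 2049 }      # placeholder ports
  }
  set nfsclients {
    type ipv4_addr
    flags interval
    elements = { 192.0.2.0/24 }   # placeholder network
  }
  chain input {
    type filter hook input priority 0;
    tcp dport @nfsports ip saddr @nfsclients counter accept
  }
}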

Servers will apparently run for a while even when quite hot

By: cks

This past Saturday (yesterday as I write this), a university machine room had an AC failure of some kind:

It's always fun times to see a machine room temperature of 54C and slowly climbing. It's not our machine room but we have switches there, and I have a suspicion that some of them will be ex-switches by the time this is over.

This machine room and its AC has what you could call a history; in 2011 it flooded partly due to an AC failure, then in 2016 it had another AC issue, and another in 2024 (and those are just the ones I remember and can find entries for).

Most of this machine room is a bunch of servers from another department, and my assumption is that they are what created all of the heat when the AC failed. Both we and the other department have switches in the room, but networking equipment is usually relatively low-heat compared to active servers. So I found it interesting that the temperature graph rises in a smooth arc to its maximum temperature (and then drops abruptly, presumably as the AC starts to get fixed). To me this suggests that many of the servers in the room kept running, despite the ambient temperature hitting 54C (and their internal temperatures undoubtedly being much higher). If some servers powered off from the heat, it wasn't enough to stabilize the heat level of the room; it was still increasing right up to when it started dropping rapidly.

(Servers may well have started thermally throttling various things, and it's possible that some of them crashed without powering off and thus potentially without reducing the heat load. I have second hand information that some UPS units reported battery overheating.)

It's one thing to be fairly confident that server thermal limits are set unrealistically high. It's another thing to see servers (probably) keep operating at 54C, rather than fall over with various sorts of failures. For example, I wouldn't have been surprised if power supplies overheated and shut down (or died entirely).

(I think desktop PSUs are often rated as '0C to 50C', but I suspect that neither end of that rating is actually serious, and this was over 50C anyway.)

I rather suspect that running at 50+C for a while has increased the odds of future failures and shortened the lifetime of everything in this machine room (our switches included). But it still amazes me a bit that things didn't fall over and fail, even above 50C.

(When I started writing this entry I thought I could make some fairly confident predictions about the servers keeping running purely from the temperature graph. But the more I think about it, the less I'm sure of that. There are a lot of things that could be going on, including server failures that leave them hung or locked up but still with PSUs running and pumping out heat.)

My policy of semi-transience and why I have to do it

By: cks

Some time back I read Simon Tatham's Policy of transience (via) and recognized both points of similarity and points of drastic departure between Tatham and me. Both Tatham and I use transient shell history, transient terminal and application windows (sort of for me), and don't save our (X) session state, and in general I am a 'disposable' usage pattern person. However, I depart from Tatham in that I have a permanently running browser and I normally keep my login sessions running until I reboot my desktops. But broadly I'm a 'transient' or 'disposable' person, where I mostly don't keep inactive terminal windows or programs around in case I might want them again, or even immediately re-purpose them from one use to another.

(I do have some permanently running terminal windows, much like I have permanently present other windows on my desktop, but that's because they're 'in use', running some program. And I have one inactive terminal window but that's because exiting that shell ends my entire X session.)

The big way that I depart from Tatham is already visible in my old desktop tour, in the form of a collection of iconified browser windows (in carefully arranged spots so I can in theory keep track of them). These aren't web pages I use regularly, because I have a different collection of schemes for those. Instead they're a collection of URLs that I'm keeping around to read later or in general to do something with. This is anathema to Tatham, who keeps track of URLs to read in other ways, but I've found that it's absolutely necessary for me.

Over and over again I've discovered that if something isn't visible to me, shoved in front of my nose, it's extremely likely to drop completely out of my mind. If I file email into a 'to be dealt with' or 'to be read later' or whatever folder, or if I write down URLs to visit later and explanations of them, or any number of other things, I almost might as well throw those things away. Having a web page in an iconified Firefox window in no way guarantees that I'll ever read it, but writing its URL down in a list guarantees that I won't. So I keep an optimistic collection of iconified Firefox windows around (and every so often I look at some of them and give up on them).

It would be nice if I didn't need to do this and could de-clutter various bits of my electronic life. But by now I've made enough attempts over a long enough period of time to be confident that my mind doesn't work that way and is unlikely to ever change its ways. I need active, ongoing reminders for things to stick, and one of the best forms is to have those reminders right on my desktop.

(And because the reminders need to be active and ongoing, they also need to be non-intrusive. Mailing myself every morning with 'here are the latest N URLs you've saved to read later' wouldn't work, for example.)

PS: I also have various permanently running utility programs and their windows, so my desktop is definitely not minimalistic. A lot of this is from being a system administrator and working with a bunch of systems, where I want various sorts of convenient fast access and passive monitoring of them.

The problem of Python's version dependent paths for packages

By: cks

A somewhat famous thing about Python is that more or less all of the official ways to install packages put them somewhere on the filesystem that contains the Python series version (which is things like '3.13' but not '3.13.5'). This is true for site packages, for 'pip install --user' (to the extent that it still works), and for virtual environments, however you manage them. And this is a problem because it means that any time you change to a new release, such as going from 3.12 to 3.13, all of your installed packages disappear (unless you keep around the old Python version and keep your virtual environments and so on using it).
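
As a concrete illustration of the shape of these paths, every virtual environment bakes the series version into its package directory (the '3.13' here is simply whatever your python3 currently is):

$ python3 -m venv /tmp/demo-venv
$ ls -d /tmp/demo-venv/lib/python3.*/site-packages
/tmp/demo-venv/lib/python3.13/site-packages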

In general, a lot of people would like to update to new Python releases. Linux distributions want to ship the latest Python (and usually do), various direct users of Python would like the new features, and so on. But these version dependent paths and their consequences make version upgrades more painful and so to some extent cause them to be done less often.

In the beginning, Python had at least two reasons to use these version dependent paths. Python doesn't promise that either its bytecode (and thus the .pyc files it generates from .py files) or its C ABI (which is depended on by any compiled packages, in .so form on Linux) are stable from version to version. Python's standard installation and bytecode processing used to put both bytecode files and compiled files alongside the .py files rather than separating them out. Since pure Python packages can depend on compiled packages, putting the two together has a certain sort of logic; if a compiled package no longer loads because it's for a different Python release, your pure Python packages may no longer work.

(Python bytecode files aren't so tightly connected so some time ago Python moved them into a '__pycache__' subdirectory and gave them a Python version suffix, eg '<whatever>.cpython-312.pyc'. Since they're in a subdirectory, they'll get automatically removed if you remove the package itself.)

An additional issue is that even pure Python packages may not be completely compatible with a new version of Python (and often definitely not with a sufficiently old version). So updating to a new Python version may call for a package update as well, not just using the same version you currently have.

Although I don't like the current situation, I don't know what Python could do to make it significantly better. Putting .py files (ie, pure Python packages) into a version independent directory structure would work some of the time (perhaps a lot of the time if you only went forward in Python versions, never backward) but blow up at other times, sometimes in obvious ways (when a compiled package couldn't be imported) and sometimes in subtle ones (if a package wasn't compatible with the new version of Python).

(It would probably also not be backward compatible to existing tools.)

Abuse systems should handle email reports that use MIME message/rfc822 parts

By: cks

Today I had reason to report spam to Mailchimp (some of you are laughing already, I know). As I usually do, I forwarded the spam message we'd received to them as a message/rfc822 MIME part, with a prequel plain text part saying that it was spam. Forwarding email as a MIME message/rfc822 part is unambiguously the correct way to do so. It's in the MIME RFCs, if done properly (by the client) it automatically includes all headers, and because it's a proper MIME part, tools can recognize the forwarded email message, scan over just it, and so on.
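
Schematically, such a report is a two part MIME message, something like this (the boundary string and the wording are made up):

Content-Type: multipart/mixed; boundary="report-sep"

--report-sep
Content-Type: text/plain; charset=us-ascii

The attached message is spam that was sent through your service.

--report-sep
Content-Type: message/rfc822

[the original spam message goes here verbatim, headers and all]
--report-sep--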

So of course Mailchimp sent me back an autoreply to the effect that they couldn't find any spam mail message in my report. They're not the only people who've replied this way, although sometimes the reply says "we couldn't handle this .eml attachment". So I had to re-forward the spam message in what I called literal plaintext format. This time around either some human or some piece of software found the information and maybe correctly interpreted it.

I think it's perfectly fine and maybe even praiseworthy when email abuse handling systems (and people) are willing to accept these literal plaintext format forwarded spam messages. The more formats you accept abuse reports in, the better. But every abuse handling system should accept MIME message/rfc822 format messages too, as a minimum thing. Not just because it's a standard, but also because it's what a certain number of mail clients will produce by default if you ask them to forward a message. If you refuse to accept these messages, you're reducing the number of abuse reports you'll accept, for arbitrary (but of course ostensibly convenient for you) reasons.

I know, I'm tilting at windmills. Mailchimp and all of the other big places doing this don't care one bit what I want and may or may not even do anything when I send them reports.

(I suspect that many places see reducing the number of 'valid' abuse reports they receive as a good thing, so the more hoops they can get away with and the more reports they can reject, the better. In theory this is self-defeating in the long run, but in practice that hasn't worked with the big offenders so far.)

Responsibility for university physical infrastructure can be complicated

By: cks

One of the perfectly sensible reactions to my entry on realizing that we needed two sorts of temperature alerts is to suggest that we directly monitor the air conditioners in our machine rooms, so that we don't have to try to assess how healthy they are from second-hand, indirect sources like the temperature of the rooms. There are some practical problems, but a broader problem is that by and large they're not 'our' air conditioners. By this I mean that while the air conditioners and the entire building belong to the university, neither 'belongs' to my department and we can't really do stuff to them.

There are probably many companies who have some split between who's responsible for maintaining a building (and infrastructure things inside it) and who is currently occupying (parts of) the building, but my sense is that universities (or at least mine) take this to a more extreme level than usual. There's an entire (administrative) department that looks after buildings and other physical infrastructure, and they 'own' much of the insides of buildings, including the air conditioning units in our machine rooms (including the really old one). Because those air conditioners belong to the building and the people responsible for it, we can't go ahead and connect monitoring up to the AC units or tap into any native monitoring they might have.

(Since these aren't our AC units, we haven't even asked. Most of the AC units are old enough that they probably don't have any digital monitoring, and for the new units the manufacturer probably considers that an extra cost option. Nor can we particularly monitor their power consumption; these are industrial units, with dedicated high-power circuits that we're not even going to get near. Only university electricians are supposed to touch that sort of stuff.)

I believe that some parts of the university have a multi-level division of responsibility for things. One organization may 'own' the building, another 'owns' the network wiring in the walls and is responsible for fixing it if something goes wrong, and a third 'owns' the space (ie, gets to use it) and has responsibility for everything inside the rooms. Certainly there's a lot of wiring within buildings that is owned by specific departments or organizations; they paid to put it in (although possibly through shared conduits), and now they're the people who control what it can be used for.

(We have run a certain amount of our own fiber between building floors, for example. I believe that things can get complicated when it comes to renovating space for something, but this is fortunately not one of the areas we have to deal with; other people in the department look after that level of stuff.)

I've been inside the university for long enough that all of this feels completely normal to me, and it even feels like it makes sense. Within a university, who is using space is something that changes over time, not just within an academic department but also between departments. New buildings are built, old buildings are renovated, and people move around, so separating maintaining the buildings from who occupies them right now feels natural.

(In general, space is a constant struggle at universities.)

My approach to testing new versions of Exim for our mail servers

By: cks

When I wrote about how Exim's ${run ...} string expansion operator changed how it did quoting, I (sort of) mentioned that I found this when I tested a new version of Exim. Some people would do testing like this in a thorough, automated manner, but I don't go that far. Instead I have a written down test plan, with some resources set up for it in advance. Well, it's more accurate to say that I have test plans, plural; each of our important mail servers gets its own, because they have different features and so need different things tested.

In the beginning I simply tested all of the important features of a particular mail server by hand and from memory when I rebuilt it on a new version of Ubuntu. Eventually I got tired of having to reinvent my test process from scratch (or from vague notes) every time around (for each mail server), so I started writing it down. In the process of writing my test process down the natural set of things happened; I made it more thorough and systematic, and I set up various resources (like saved copies of the EICAR test file) to make testing more cut and paste. Having an organized, written down test plan, even as basic as ours is, has made it easier to test new builds of our Exim servers and made that testing more comprehensive.

I test most of our mail servers primarily by using swaks to send various bits of test email to them and then watching what happens (both in the swaks SMTP session and in the Exim logs). So a lot of the test plan is 'run this swaks command and ...', with various combinations of sending and receiving addresses, starting with the very most basic test of 'can it deliver from a valid dummy address to a valid dummy address'. To do some sorts of testing, such as DNS blocklist tests, I take advantage of the fact that all of the IP-based DNS blocklists we use include 127.0.0.2, so that part of the test plan is 'use swaks on the mail machine itself to connect from 127.0.0.2'.

(Some of our mail servers can apply different filtering rules to different local addresses, so I have various pre-configured test addresses set up to make it easy to test that per-address filtering is working.)

The actual test plans are mostly a long list of 'run more or less this swaks command, pointing it at your test server, to test this thing, and you should see the following result'. This is pretty close to cut and paste, which makes it relatively easy and fast for me to run through.
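
(The cut and paste nature means the test plan is close to something a small script could drive if I ever wanted to go further. Here's a hedged Python sketch of that idea, with a hypothetical test server name and hypothetical addresses; it assumes swaks exits non-zero when the SMTP transaction is rejected, which it normally does.)

import subprocess

TEST_SERVER = "exim-test.example.com"    # hypothetical test build of a mail server

# (description, extra swaks arguments, should the delivery be accepted?)
CHECKS = [
    ("basic delivery", ["--from", "ok@example.com", "--to", "ok@example.com"], True),
    ("per-address filtering", ["--from", "ok@example.com", "--to", "filtered-test@example.com"], False),
]

for name, args, expect_ok in CHECKS:
    proc = subprocess.run(["swaks", "--server", TEST_SERVER] + args,
                          capture_output=True, text=True)
    ok = proc.returncode == 0
    verdict = "PASS" if ok == expect_ok else "FAIL"
    print(f"{verdict}: {name} (swaks exit status {proc.returncode})")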

One qualification is that these test plans aren't attempting to be an exhaustive check of everything we do in our Exim configurations. Instead, they're mostly about making sure that the basics work, like delivering straightforward email, and that Exim can interact properly with the outside world, such as talking to ClamAV and rspamd or running external programs (which also tests that the programs themselves work on the new Ubuntu version). Testing every corner of our configurations would be exhausting and my feeling is that it would generally be pointless. Exim is stable software and mostly doesn't change or break things from version to version.

(Part of this is pragmatic experience with Exim and knowledge of what our configuration does conditionally and what it checks all of the time. If Exim does a check all of the time and basic mail delivery works, we know we haven't run into, say, an issue with tainted data.)

The unusual way I end my X desktop sessions

By: cks

I use an eccentric X 'desktop' that isn't really a desktop in the usual sense but is instead a window manager and various programs that I run (as a sysadmin, there's a lot of terminal windows). One of the ways that my desktop is unusual is in how I exit from my X session. First, I don't use xdm or any other graphical login manager; instead I run my session through xinit. When you use an xinit-based session, you give xinit a program or a script to run, and when the program exits, xinit terminates the X server and your session.

(If you gave xinit a shell script, whatever foreground program the script ended with was your keystone program.)

Traditionally, this keystone program for your X session was your window manager. At one level this makes a lot of sense; your window manager is basically the core of your X session anyway, so you might as well make quitting from it end the session. However, for a very long time I've used a do-nothing iconified xterm running a shell as my keystone program.

(If you look at FvwmIconMan's strip of terminal windows in my (2011) desktop tour, this is the iconified 'console-ex' window.)

The minor advantage to having an otherwise unused xterm as my session keystone program is that I can start my window manager basically at the start of my (rather complex) session startup, so that I can immediately have it manage all of the other things I start (technically I run a number of commands to set up X settings before I start fvwm, but it's the first program I start that will actually show anything on the screen). The big advantage is that using something else as my keystone program means that I can kill and restart my window manager if something goes badly wrong, and more generally that I don't have to worry about restarting it. This doesn't happen very often, but when it does happen I'm very glad that I can recover my session instead of having to abruptly terminate everything. And should I have to terminate fvwm, this 'console' xterm is a convenient idle xterm in which to restart it (or in general, any other program of my session that needs restarting).

(The 'console' xterm is deliberately placed up at the top of the screen, in an area that I don't normally put non-fvwm windows in, so that if fvwm exits and everything de-iconifies, it's highly likely that this xterm will be visible so I can type into it. If I put it in an ordinary place, it might wind up covered up by a browser window or another xterm or whatever.)

I don't particularly have to use an (iconified) xterm with a shell in it; I could easily have written a little Tk program that displayed a button saying 'click me to exit'. However, the problem with such a program (and the advantage of my 'console' xterm) is that it would be all too easy to accidentally click the button (and force-end my session). With the iconified xterm, I need to do a bunch of steps to exit; I have to deiconify that xterm, focus the window, and Ctrl-D the shell to make it exit (causing the xterm to exit). This is enough out of the way that I don't think I've ever done it by accident.

PS: I believe modern desktop environments like GNOME, KDE, and Cinnamon have moved away from making their window manager be the keystone program and now use a dedicated session manager program that things talk to. One reason for this may be that modern desktop shells seem to be rather more prone to crashing for various reasons, which would be very inconvenient if that ended your session. This isn't all bad, at least if there's a standard D-Bus protocol for ending a session so that you can write an 'exit the session' thing that will work across environments.

Understanding reading all available things from a Go channel (with a timeout)

By: cks

Recently I saw this example Go code (via), and I had to stare at it a while in order to understand what it was doing and how it worked (and why it had to be that way). The goal of waitReadAll() is to either receive (read) all currently available items from a channel (possibly a buffered one) or to time out if nothing shows up in time. This requires two nested selects, with the inner one in a for loop.

The outer select has this form:

select {
  case v, ok := <- c:
    if !ok {
      return ...
    }
    [... inner code ...]

  case <- time.After(dur): // wants go 1.23+
    return ...
}

This is doing three things. First (and last in the code), it's timing out if the duration expires before anything is received on the channel. Second, it's returning right away if the channel is closed and empty; in this case the channel receive from c will succeed, but ok will be false. And finally, in the code I haven't put in, it has received the first real value from the channel and now it has to read the rest of them.

The job of the inner code is to receive any (additional) currently ready items from the channel but to give up if the channel is closed or when there are no more items. It has the following form (trimmed of the actual code to properly accumulate things and so on, see the playground for the full version):

.. setup elided ..
for {
  select {
    case v, ok := <- c:
      if ok {
        // accumulate values
      } else {
        // channel closed and empty
        return ...
      }
    default:
      // out of items
      return ...
  }
}

There's no timeout in this inner code because the 'default' case means that we never wait for the channel to be ready; either the channel is ready with another item (or it's been closed), or we give up.

One of the reasons this Go code initially confused me is that I started out misreading it as receiving as much as it could from a channel until it reached a timeout. Code that did that would do a lot of the same things (obviously it needs a timeout and a select that has that as one of the cases), and you could structure it somewhat similarly to this code (although I think it's more clearly written without a nested loop).

(This is one of those entries that I write partly to better understand something myself. I had to read this code carefully to really grasp it and I found it easy to mis-read on first impression.)

Starting scripts with '#!/usr/bin/env <whatever>' is rarely useful

By: cks

In my entry on getting decent error reports in Bash for 'set -e', I said that even if you were on a system where /bin/sh was Bash and so my entry worked if you started your script with '#!/bin/sh', you should use '#!/bin/bash' instead for various reasons. A commentator took issue with this direct invocation of Bash and suggested '#!/usr/bin/env bash' instead. It's my view that using env this way, especially for Bash, is rarely useful and thus is almost always unnecessary and pointless (and sometimes dangerous).

The only reason to start your script with '#!/usr/bin/env <whatever>' is if you expect your script to run on a system where Bash or whatever else isn't where you expect (or when it has to run on systems that have '<whatever>' in different places, which is probably most common for third party packages). Broadly speaking this only happens if your script is portable and will run on many different sorts of systems. If your script is specific to your systems (and your systems are uniform), this is pointless; you know where Bash is and your systems aren't going to change it, not if they're sane. The same is true if you're targeting a specific Linux distribution, such as 'this is intrinsically an Ubuntu script'.

(In my case, the script I was doing this to is intrinsically specific to Ubuntu and our environment. It will never run on anything else.)

It's also worth noting that '#!/usr/bin/env <whatever>' only works if (the right version of) <whatever> can be found on your $PATH, and in fact the $PATH of every context where you will run the script (including, for example, from cron). If the system's default $PATH doesn't include the necessary directories, this will likely fail some of the time. This makes using 'env' especially dangerous in an environment where people may install their own version of interpreters like Python, because your script's use of 'env' may find their Python on their $PATH instead of the version that you expect.

(These days, one of the dangers with Python specifically is that people will have a $PATH that (currently) points to a virtual environment with some random selection of Python packages installed and not installed, instead of the system set of packages.)

As a practical matter, pretty much every mainstream Linux distribution has a /bin/bash (assuming that you install Bash, and I'm sorry, Nix and so on aren't mainstream). If you're targeting Linux in general, assuming /bin/bash exists is entirely reasonable. If a Linux distribution relocates Bash, in my view the resulting problems are on them. A lot of the time, similar things apply for other interpreters, such as Python, Perl, Ruby, and so on. '#!/usr/bin/python3' on Linux is much more likely to get you a predictable Python environment than '#!/usr/bin/env python3', and if it fails it will be a clean and obvious failure that's easy to diagnose.

Another issue is that even if your script is fixed to use 'env' to run Bash, it may or may not work in such an alternate environment because other things you expect to find in $PATH may not be there. Unless you're actually testing on alternate environments (such as Nix or FreeBSD), using 'env' may suggest more portability than you're actually able to deliver.

My personal view is that for most people, '#!/usr/bin/env' is a reflexive carry-over that they inherited from a past era of multi-architecture Unix environments, when much less was shipped with the system and so was in predictable locations. In that past Unix era, using '#!/usr/bin/env python' was a reasonably sensible thing; you could hope that the person who wanted to run your script had Python, but you couldn't predict where. For most people, those days are over, especially for scripts and programs that are purely for your internal use and that you won't be distributing to the world (much less inviting people to run your 'written on X' script on a Y, such as a FreeBSD script being run on Linux).

The XLibre project is explicitly political and you may not like the politics

By: cks

A commentator on my 2024 entry on the uncertain possible futures of Unix graphical desktops brought up the XLibre project. XLibre is ostensibly a fork of the X server that will be developed by a new collection of people, which on the surface sounds unobjectionable and maybe a good thing for people (like me) who want X to keep being viable; as a result it has gotten a certain amount of publicity from credulous sources who don't look behind the curtain. Unfortunately for everyone, XLibre is an explicitly political project, and I don't mean that in the sense of disagreements about technical directions (the sense that you could say that 'forking is a political action', because it's the manifestation of a social disagreement). Instead I mean it in the regular sense of 'political', which is that the people involved in XLibre (especially its leader) have certain social values and policies that they espouse, and the XLibre project is explicitly manifesting some of them.

(Plus, a project cannot be divorced from the people involved in it.)

I am not going to summarize here; instead, you should read the Register article and its links, and also the relevant sections of Ariadne Conill's announcement of Wayback and their links. However, even if you "don't care" about politics, you should see this correction to earlier XLibre changes where the person making the earlier changes didn't understand what '2^16' did in C (I would say that the people who reviewed the changes also missed it, but there didn't seem to be anyone doing so, which ought to raise your eyebrows when it comes to the X server).

Using XLibre, shipping it as part of a distribution, or advocating for it is not a neutral choice. To do so is to align yourself, knowingly or unknowingly, with the politics of XLibre and with the politics of its leadership and the people its leadership will attract to the project. This is always true to some degree with any project, but it's especially true when the project is explicitly manifesting some of its leadership's values, out in the open. You can't detach XLibre from its leader.

My personal view is that I don't want to have anything to do with XLibre and I will think less of any Unix or Linux distribution that includes it, especially ones that intend to make it their primary X server. At a minimum, I feel those distributions haven't done their due diligence.

In general, my personal guess is that a new (forked) standalone X server is also the wrong approach to maintaining a working X server environment over the long term. Wayback combined with XWayland seems like a much more stable base because each of them has more support in various ways (eg, there are a lot of people who are going to want old X programs to keep working for years or decades to come and so lots of demand for most of XWayland's features).

(This elaborates on my comment on XLibre in this entry. I also think that a viable X based environment is far more likely to stop working due to important programs becoming Wayland-only than because you can no longer get a working X server.)

Some practical challenges of access management in 'IAM' systems

By: cks

Suppose that you have a shiny new IAM system, and you take the 'access management' part of it seriously. Global access management is (or should be) simple; if you disable or suspend someone in your IAM system, they should wind up disabled everywhere. Well, they will wind up unable to authenticate. If they have existing credentials that are used without checking with your IAM system (including things like 'an existing SSH login'), you'll need some system to propagate the information that someone has been disabled in your IAM to consumers and arrange that existing sessions, credentials, and so on get shut down and revoked.

(This system will involve both IAM software features and features in the software that uses the IAM to determine identity.)

However, this only covers global access management. You probably have some things that only certain people should have access to, or that treat certain people differently. This is where our experiences with a non-IAM environment suggest to me that things start getting complex. For pure access, the simplest thing probably is if every separate client system or application has a separate ID and directly talks to the IAM, and the IAM can tell it 'this person cannot authenticate (to you)' or 'this person is disabled (for you)'. This starts to go wrong if you ever put two or more services or applications behind the same IAM client ID, for example if you set up a web server for one application (with an ID) and then host another application on the same web server because of convenience (your web server is already there and already set up to talk to the IAM and so on).

This gets worse if there is a layer of indirection involved, so that systems and applications don't talk directly to your IAM but instead talk to, say, an LDAP server or a Radius server or whatever that's fed from your IAM (or is the party that talks to your IAM). I suspect that this is one reason why IAM software has a tendency to directly support a lot of protocols for identity and authentication.

(One thing that's sort of an extra layer of indirection is what people are trying to do, since they may have access permission for some things but not others.)

Another approach is for your IAM to only manage what 'groups' people are in and provide that information to clients, leaving it up to clients to make access decisions based on group membership. On the one hand, this is somewhat more straightforward; on the other hand, your IAM system is no longer directly managing access. It has to count on clients doing the right thing with the group information it hands them. At a minimum this gives you much less central visibility into what your access management rules are.
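
(In the group-based model, the client-side code ends up looking something like this hedged Python sketch; the group names and the lookup function are made up for illustration. The important part is that the access decision itself lives in the client, not in the IAM.)

def get_groups_from_iam(user: str) -> set[str]:
    # A stand-in for however you actually query your IAM (LDAP, OIDC claims, etc).
    fake_directory = {
        "alice": {"app-users"},
        "bob": {"app-users", "app-banned"},
    }
    return fake_directory.get(user, set())

def can_use_application(user: str) -> bool:
    groups = get_groups_from_iam(user)
    if "app-banned" in groups:      # hypothetical deny group
        return False
    return "app-users" in groups    # hypothetical allow group

print(can_use_application("alice"), can_use_application("bob"), can_use_application("mallory"))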

People not infrequently want complicated access control conditions for individual applications (including things like privilege levels). In any sort of access management system, you need to be able to express these conditions in rules. There's no uniform approach or language for expressing access control conditions, so your IAM will use one, your Unix systems will use one (or more) that you probably get to craft by hand using PAM tricks, your web applications will use one or more depending on what they're written in, and so on and so forth. One of the reasons that these languages differ is that the capabilities and concepts of each system will differ; a mesh VPN has different access control concerns than a web application. Of course these differences make it challenging to handle all of their access management in one single spot in an IAM system, leaving you with a choice: either you can't do quite everything you want but it's all in the IAM, or your access management is partially distributed.

A change in how Exim's ${run ...} string expansion operator does quoting

By: cks

The Exim mail server has, among other features, a string expansion language with quite a number of expansion operators. One of those expansion operators is '${run}', which 'expands' by running a command and substituting in its output. As is commonly the case, ${run} is given the command to run and all of its command line arguments as a single string, without any explicit splitting into separate arguments:

${run {/some/command -a -b foo -c ...} [...]}

Any time a program does this, a very important question to ask is how this string is split up into separate arguments in order to be exec()'d. In Exim's case, the traditional answer is that it was rather complicated and not well documented, in a way that required you to explicitly quote many arguments that came from variables. In my entry on this I called Exim's then current behavior dangerous and wrong but also said it was probably too late to change it. Fortunately, the Exim developers did not heed my pessimism.

In Exim 4.96, this behavior of ${run} changed. To quote from the changelog:

The ${run} expansion item now expands its command string elements after splitting. Previously it was before; the new ordering makes handling zero-length arguments simpler. The old ordering can be obtained by appending a new option "preexpand", after a comma, to the "run".

(The new way is more or less the right way to do it, although it can create problems with some sorts of command string expansions.)
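
To make the general issue concrete outside of Exim, here's a small Python illustration of the two orderings (this is not Exim's actual mechanism, just the same idea with a hypothetical command template and value): if you substitute values into the command string first and split afterward, a value with spaces in it falls apart into extra arguments, while splitting first and substituting into each piece keeps it as a single argument.

import shlex

subject = "hello world --and-some-flag"          # an untrusted header value
template = "/some/command --subject $subject"

# Expand first, then split (the old-style ordering): the spaces in the
# subject turn into argument boundaries.
expanded_first = shlex.split(template.replace("$subject", subject))
print(expanded_first)
# ['/some/command', '--subject', 'hello', 'world', '--and-some-flag']

# Split first, then expand each element (the new-style ordering): the
# subject stays as one argument, spaces and all.
split_first = [word.replace("$subject", subject) for word in template.split()]
print(split_first)
# ['/some/command', '--subject', 'hello world --and-some-flag']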

This is an important change because this change is not backward compatible if you used deliberate quoting in your ${run} command string. For example, if you ever expanded a potentially dangerous Exim variable in a ${run} command (for example, one that might have a space in it), you previously had to wrap it in ${quote}:

${run {/some/command \
         --subject ${quote:$header_subject:} ...

(As seen in my entry on our attachment type logging with Exim.)

In Exim 4.96 and later, this same ${run} string expansion will add spurious quote marks around the email message's Subject: header as your program sees it. This is because ${quote:...} will add them, since you asked it to generate a quoted version of its argument, and then ${run} won't strip them out as part of splitting the command string apart into arguments because the command string has already been split before the ${quote:} was done. What this shows is that you probably don't need explicit quoting in ${run} command strings any more, unless you're doing tricky expansions with string expressions (in which case you'll have to switch back to the old way of doing it).

To be clear, I'm all for this change. It makes straightforward and innocent use of ${run} much safer and more reliable (and it plays better with Exim's new rules about 'tainted' strings from the outside world, such as the subject header). Having to remove my use of ${quote:...} is a minor price to pay, and learning this sort of stuff in advance is why I build test servers and have test plans.

(This elaborates on a Fediverse post of mine.)

My system administrator's view of IAM so far (from the outside)

By: cks

Over on the Fediverse I said something about IAM:

My IAM choices appear to be "bespoke giant monolith" or "DIY from a multitude of OSS pieces", and the natural way of life appears to be that you start with the latter because you don't think you need IAM and then you discover maybe you have to blow up the world to move to the first.

At work we are the latter: /etc/passwd to LDAP to a SAML/OIDC server depending on what generation of software and what needs. With no unified IM or AM, partly because no rules system for expressing it.

Identity and Access Management (IAM) isn't the same thing as (single sign on) authentication, although I believe it's connected to authorization if you take the 'Access' part seriously, and also a bunch of IAM systems will also do some or all of authentication too so everything is in one place. However, all of these things can be separated, and in complex environments they are (for example, the university's overall IAM environment, also).

(If you have an IAM system you're presumably going to want to feed information from it to your authentication system, so that it knows who is (still) valid to authenticate and perhaps how.)

I believe that one thing that makes IAM systems complicated is interfacing with what could be called 'legacy systems', which in this context includes garden variety Unix systems. If you take your IAM system seriously, everything that knows about 'logins' or 'users' needs to somehow be drawing data from the IAM system, and the IAM system has to know how to provide each with the information it needs. Or alternately your legacy systems need to somehow merge local identity information (Unix home directories, UIDs, GIDs, etc) with the IAM information. Since people would like their IAM system to do it all, I think this is one driver of IAM system complexity and those bespoke giant monoliths that want to own everything in your environment.

(The reason to want your IAM system to do it all is that if it doesn't, you're building a bunch of local tools and then your IAM information is fragmented. What UID is this person on your Unix systems? Only your Unix systems know, not your central IAM database. For bonus points, the person might have different UIDs on different Unix systems, depending.)

If you start out with a green field new system, you can probably build in this central IAM from the start (assuming that you can find and operate IAM software that does what you want and doesn't make you back away in terror). But my impression is that central IAM systems are quite hard to set up, so the natural alternative is that you start without an IAM system and then are possibly faced with trying to pull all of your /etc/passwd, Apache authentication data, LDAP data, and so on into a new IAM system that is somehow going to take over the world. I have no idea how you'd pull off this transition, although presumably people have.

(In our case, we started our Unix systems well before IAM systems existed. There are accounts here that have existed since the 1980s, partly because professors and retired professors tend to stick around for a long time.)

The difficulty of moving our environment to anything like an IAM system leaves me looking at the whole thing from the outside. If we had to add an 'IAM system', it would likely be because something else we wanted to do needed to be fed data from some IAM system using some IAM protocol. The IAM system would probably not become the center of identity and access management, but just another thing that we pushed information into and updated information in.

Another thing V7 Unix gave us is environment variables

By: cks

Simon Tatham recently wondered "Why is PATH called PATH?". This made me wonder the closely related question of when environment variables appeared in Unix, and the answer is that the environment and environment variables appeared in V7 Unix as another of the things that made it so important to Unix history (also).

Up through V6, the exec system call and family of system calls took two arguments, the path and the argument list; we can see this in both the V6 exec(2) manual page and the implementation of the system call in the kernel. As bonus trivia, it appears that the V6 exec() limited you to 510 characters of arguments (and probably V1 through V5 had a similarly low limit, but I haven't looked at their kernel code).

In V7, the exec(2) manual page now documents a possible third argument, and the kernel implementation is much more complex, plus there's an environ(5) manual page about it. Based on h/param.h, V7 also had a much higher limit on the combined size of arguments and environment variables, which isn't all that surprising given the addition of the environment. Commands like login.c were updated to put some things into the new environment; login sets a default $PATH and a $HOME, for example, and environ(5) documents various other uses (which I haven't checked in the source code).
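
(The three-argument exec() shape that V7 introduced, a path, an argument vector, and an environment, is still what we use today. Here it is expressed through Python's os.execve as a Unix-only sketch, with an environment a bit like what V7's login set up; the specific values are only for illustration.)

import os

# A small environment much like V7 login(1) established: a default
# command search path and the user's home directory.
env = {"PATH": "/bin:/usr/bin", "HOME": "/tmp"}

pid = os.fork()
if pid == 0:
    # The child replaces itself with /bin/sh, passing the argument
    # vector and the environment as two separate things.
    os.execve("/bin/sh", ["sh", "-c", "echo my PATH is $PATH"], env)
else:
    os.waitpid(pid, 0)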

This implies that the V7 shell is where $PATH first appeared in Unix, where the manual page describes it as 'the search path for commands'. This might make you wonder how the V6 shell handled locating commands, and where it looked for them. The details are helpfully documented in the V6 shell manual page, and I'll just quote what it has to say:

If the first argument is the name of an executable file, it is invoked; otherwise the string `/bin/' is prepended to the argument. (In this way most standard commands, which reside in `/bin', are found.) If no such command is found, the string `/usr' is further prepended (to give `/usr/bin/command') and another attempt is made to execute the resulting file. (Certain lesser-used commands live in `/usr/bin'.)

('Invoked' here is carrying some extra freight, since this may not involve a direct kernel exec of the file. An executable file that the kernel didn't like would be directly run by the shell.)

I suspect that '$PATH' was given such a short name (instead of a longer, more explicit one) simply as a matter of Unix style at the time. Pretty much everything in V7 was terse and short in this style for various reasons, and verbose environment variable names would also have eaten into that limited exec argument space.

Python argparse and the minor problem of a variable valid argument count

By: cks

Argparse is the standard Python module for handling arguments to command line programs, and because, for small programs, Python makes using things outside the standard library quite annoying, it's the one I use in my Python-based utility programs. Recently I found myself dealing with a little problem where argparse doesn't have a good answer, partly because you can't nest argument groups.

Suppose, not hypothetically, that you have a program that can properly take zero, two, or three command line arguments (which are separate from options), and the command line arguments are of different types (the first is a string and the second two are numbers). Argparse makes it easy to handle having either two or three arguments, no more and no less; the first two arguments have no nargs set, and then the third sets 'nargs="?"'. However, as far as I can see argparse has no direct support for handling the zero-argument case, or rather for forbidding the one-argument one.

(If the first two arguments were of the same type we could easily gather them together into a two-element list with 'nargs=2', but they aren't, so we'd have to tell argparse that both are strings and then try the 'string to int' conversion of the second argument ourselves, losing argparse's handling of it.)

If you set all three arguments to 'nargs="?"' and give them usable default values, you can accept zero, two, or three arguments, and things will work if you supply only one argument (because the second argument will have a usable default). This is the solution I've adopted for my particular program because I'm not stubborn enough to try to roll my own validation on top of argparse, not for a little personal tool.
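
Here's roughly what that looks like (the argument names, types, and defaults are made up for illustration). Zero, two, or three arguments all do something sensible; the price is that a single argument is also quietly accepted, with the second argument falling back to its default.

import argparse

parser = argparse.ArgumentParser()
# All three are optional positional arguments with usable defaults.
parser.add_argument("host", nargs="?", default="localhost")
parser.add_argument("low", nargs="?", type=int, default=1)
parser.add_argument("high", nargs="?", type=int, default=100)
opts = parser.parse_args()

print(opts.host, opts.low, opts.high)
# 'prog'         -> localhost 1 100
# 'prog a 10 20' -> a 10 20
# 'prog a 10'    -> a 10 100
# 'prog a'       -> a 1 100   (the case you can't easily forbid)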

If argparse supported nested groups for arguments, you could potentially make a mutually exclusive argument group that contained two sub-groups, one with nothing in it and one that handled the two and three argument case. This would require argparse not only to support nested groups but to support empty nested groups (and not ignore them), which is at least a little bit tricky.

Alternately, argparse could support a global specification of what numbers of arguments are valid. Or it could support a 'validation' callback that is called with information about what argparse detected and which could signal errors to argparse that argparse handled in its standard way, giving you uniform argument validation and error text and so on.

Unix had good reasons to evolve since V7 (and had to)

By: cks

There's a certain sort of person who feels that the platonic ideal of Unix is somewhere around Research Unix V7 and it's almost all been downhill since then (perhaps with the exception of further Research Unixes and then Plan 9, although very few people got their hands on any of them). For all that I like Unix and started using it long ago when it was simpler (although not as far back as V7), I reject this view and think it's completely mistaken.

V7 Unix was simple but it was also limited, both in its implementation (which often took shortcuts (also, also, also)) and in its overall features (such as short filenames). Obviously V7 didn't have networking, but it also lacked things that most people now think of as perfectly reasonable and good Unix features, like '#!' support for shell scripts in the kernel and processes being in multiple groups at once. That V7 was a simple and limited system meant that its choices were to grow to meet people's quite reasonable needs or to fall out of use.

(Some of these needs were for features and some of them were for performance. The original V7 filesystem was quite simple but also suffered from performance issues, ones that often got worse over time.)

I'll agree that the path that the growth of Unix has taken since V7 is not necessarily ideal; we can all point to various things about modern Unixes that we don't like. Any particular flaws came about partly because people don't necessarily make ideal decisions and partly because we haven't necessarily had perfect understandings of the problems when people had to do something, and then once they'd done something they were constrained by backward compatibility.

(In some ways Plan 9 represents 'Unix without the constraint of backward compatibility', and while I think there are a variety of reasons that it failed to catch on in the world, that lack of compatibility is one of them. Even if you had access to Plan 9, you had to be fairly dedicated to do your work in a Plan 9 environment (and that was before the web made it worse).)

PS: It's my view that the people who are pushing various Unixes forward aren't incompetent, stupid, or foolish. They're rational and talented people who are doing their best in the circumstances that they find themselves. If you want to throw stones, don't throw them at the people, throw them at the overall environment that constrains and shapes how everything in this world is pushed to evolve. Unix is far from the only thing shaped in potentially undesirable ways by these forces; consider, for example, C++.

(It's also clear that a lot of people involved in the historical evolution of BSD and other Unixes were really quite smart, even if you don't like, for example, the BSD sockets API.)

Mostly stopping GNU Emacs from de-iconifying itself when it feels like it

By: cks

Over on the Fediverse I had a long standing GNU Emacs gripe:

I would rather like to make it so that GNU Emacs never un-iconifies itself when it completes (Lisp-level) actions. If I have Emacs iconified I want it to stay that way, not suddenly appear under my mouse cursor like an extremely large modal popup. (Modal popups suck, they are a relic of single-tasking windowing environments.)

For those of you who use GNU Emacs and have never been unlucky enough to experience this, if you start some long operation in GNU Emacs and then decide to iconify it to get it out of your face, a lot of the time GNU Emacs will abruptly pop itself back open when it finishes, generally with completely unpredictable timing so that it disrupts whatever else you switched to in the mean time.

(This only happens in some X environments. In others, the desktop or window manager ignores what Emacs is trying to do and leaves it minimized in your taskbar.)

To cut straight to the answer, you can avoid a lot of this with the following snippet of Emacs Lisp:

(add-to-list 'display-buffer-alist '(t nil (inhibit-switch-frame . t)))

I believe that this has some side effects but that these side effects will generally be that Emacs doesn't yank around your mouse focus or suddenly raise windows to be on top of everything.

GNU Emacs doesn't have a specific function that it calls to de-iconify a frame, what Emacs calls a top level window. Instead, the deiconification happens in C code inside C-level functions like raise-frame and make-frame-visible, which also do other things and which are called from many places. For instance, one of make-frame-visible's jobs is actually displaying the frame's X level window if it doesn't already exist on the screen.

(There's an iconify-or-deiconify-frame function but if you look that's a Lisp function that calls make-frame-visible. It's only used a little bit in the Emacs Lisp code base.)

A determined person could probably hook these C-level functions through advice-add to make them do nothing if they were called on an existing, mapped frame that was just iconified. That would be the elegant way to do what I want. The inelegant way is to discover, via use of the Emacs Lisp debugger, that everything I seem to care about is going through 'display-buffer' (eventually calling window--maybe-raise-frame), and that display-buffer's behavior can be customized to not 'switch frames', which will wind up causing things to not call window--maybe-raise-frame and not de-iconify GNU Emacs windows on me.

To understand display-buffer-alist I relied on Demystifying Emacs’s Window Manager. My addition to display-buffer-alist has three elements:

  • the t tells display-buffer to always use this alist entry.
  • the nil tells display-buffer that I don't have any special action functions I want to use here and it should just use its regular ones. I think an empty list might be more proper here, but nil works.
  • the '(inhibit-switch-frame . t)' sets the important customization, which will be merged with any other things set by other (matching) alist entries.

The net effect is that 'display-buffer' will see 'inhibit-switch-frame' set for every buffer it's asked to switch to, and so will not de-iconify, raise, or otherwise monkey around with frame things in the process of displaying buffers. It's possible that this will have undesirable side effects in some circumstances, but as far as I can tell things like 'speedbar' and 'C-x 5 <whatever>' still work for me afterward, so new frames are getting created when I want them to be.

(I could change the initial 't' to something more complex, for example to only apply this to MH-E buffers, which is where I mostly encounter the problem. See Demystifying Emacs’s Window Manager for a discussion of how to do this based on the major mode of the buffer.)

To see if you're affected by this, you can run the following Emacs Lisp in the scratch buffer and then immediately minimize or iconify the window.

(progn
  (sleep-for 5)
  (display-buffer "*scratch*"))

If you're affected, the Emacs window will pop back open in a few seconds (five or less, depending on how fast you minimized the window). If the Emacs window stays minimized or iconified, your desktop environment is probably overriding whatever Emacs is trying to do.

For me this generally happens any time some piece of Emacs Lisp code is taking a long time to get a buffer ready for display and then calls 'display-buffer' at the end to show the buffer. One trigger for this is if the buffer to be displayed contains a bunch of unusual Unicode characters (possibly ones that my font doesn't have anything for). The first time the characters are used, Emacs will apparently stall working out how to render them and then de-iconify itself if I've iconified it out of impatience.

(It's quite possible that there's a better way to do this, and if so I'd love to know about it.)

Sending drawing commands to your display server versus sending images

By: cks

One of the differences between X and Wayland is that in the classical version of X you send drawing commands to the server while in Wayland you send images; this can be called server side rendering versus client side rendering. Client side rendering doesn't preclude a 'network transparent' display protocol, but it does mean that you're shipping around images instead of drawing commands. Is this less efficient? In thinking about it recently, I realized that the answer is that it depends on a number of things.

Let's start out by assuming that the display server and the display clients are equally powerful and capable as far as rendering the graphics goes, so the only question is where the rendering happens (and what makes it better to do it in one place instead of another). The factors that I can think of are:

  • How many different active client (machines) there are; if there are enough, the active client machines have more aggregate rendering capacity than the server does. But probably you don't usually have all that many different clients all doing rendering at once (that would be a very busy display).

  • The number of drawing commands as compared to the size of the rendered result. In an extreme case in favor of client side rendering, a client executes a whole bunch of drawing commands in order to render a relatively small image (or window, or etc). In an extreme case the other way, a client can send only a few drawing commands to render a large image area.
  • The amount of input data the drawing commands need compared to the output size of the rendered result. An extreme case in favour of client side rendering is if the client is compositing together a (large) stack of things to produce a single rendered result.
  • How efficiently you can encode (and decode) the rendered result or the drawing commands (and their inputs). There's a tradeoff between space used and encoding and decoding time, where you may not be able to afford aggressive encoding because it gets in the way of fast updates.

    What these add up to is the aggregate size of the drawing commands and all of the inputs that they need relative to the rendered result, possibly cleverly encoded on both sides.

  • How much changes from frame to frame and how easily you can encode that in some compact form. Encoding changes in images is a well studied thing (we call it 'video'), but a drawing command model might be able to send only a few commands to change a little bit of what it sent previously for an even bigger saving.

    (This is affected by how a server side rendering server holds the information from clients. Does it execute their draw commands then only retain the final result, as X does, or does it hold their draw commands and re-execute them whenever it needs to re-render things? Let's assume it holds the rendered result, so you can draw over it with new drawing commands rather than having to send a new full set of 'draw this from now onward' commands.)

    A pragmatic advantage of client side rendering is that encoding image to image changes can be implemented generically after any style of rendering; all you need is to retain a copy of the previous frame (or perhaps more frames than that, depending). In a server rendering model, the client needs specific support for determining a set of drawing operations to 'patch' the previous result, and this doesn't necessarily cooperate with an immediate mode approach where the client regenerates the entire set of draw commands from scratch any time it needs to re-render a frame.

I was going to say that the network speed is important too, but while it matters, what I think it does is magnify or shrink the effect of the relative size of drawing commands compared to the final result. The faster and lower latency your network is, the less it matters if you ship more data in aggregate. On a slow network, it's much more important.

There's probably other things I'm missing, but even with just these I've wound up feeling that the tradeoffs are not as simple and obvious as I believed before I started thinking about it.

(This was sparked by an offhand Fediverse remark and joke.)

Getting decent error reports in Bash when you're using 'set -e'

By: cks

Suppose that you have a shell script that's not necessarily complex but is at least long. For reliability, you use 'set -e' so that the script will immediately stop on any unexpected errors from commands, and sometimes this happens. Since this isn't supposed to happen, it would be nice to print some useful information about what went wrong, such as where it happened, what the failing command's exit status was, and what the command was. The good news is that if you're willing to make your script specifically a Bash script, you can do this quite easily.

The Bash trick you need is:

trap 'echo "Exit status $? at line $LINENO from: $BASH_COMMAND"' ERR

This uses three Bash features: the special '$LINENO' and '$BASH_COMMAND' variables (which hold the line number and the command being executed when the trap fires), and the special 'ERR' Bash 'trap' condition that causes your 'trap' statement to be invoked right when 'set -e' is causing your script to fail and exit.

Using 'ERR' instead of 'EXIT' (or '0' if you're a traditionalist like me) is necessary in order to get the correct line number in Bash. If you switch this to 'trap ... EXIT', the line number that Bash will report is the line that the 'trap' was defined on, not the line that the failing command is on (although the command being executed remains the same). This makes a certain amount of sense from the right angle; the shell is currently on that line as it's exiting.

As far as I know, no other version of the Bourne shell can do all of this. The OpenBSD version of /bin/sh has a '$LINENO' variable and 'trap ... 0' preserves its value (instead of resetting it to the line of the 'trap'), but it has no access to the current command. The FreeBSD version of /bin/sh resets '$LINENO' to the line of your 'trap ... 0', so the best you can do is report the exit status. Dash, the Ubuntu 24.04 default /bin/sh, doesn't have '$LINENO', effectively putting you in the same situation as FreeBSD.

(On Fedora, /bin/sh is Bash, and the Fedora version of Bash supports all of 'trap .. ERR', $LINENO, and $BASH_COMMAND even when invoked as '#!/bin/sh' by your script. You probably shouldn't count on this; if you want Bash, use '#!/bin/bash'.)

NFS v4 delegations on a Linux NFS server can act as mandatory locks

By: cks

Over on the Fediverse, I shared an unhappy learning experience:

Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.

NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.

(After all, the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)

Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting and as part of that will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what the timeout is, but whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace', also probably 90 seconds, or perhaps the two added together).

(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)

As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.

This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
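
As a rough sketch of that second option (to be run as root on the fileserver), something like the following finds the stuck client by searching the info files and then writes 'expire' to its ctl file. The client address here is hypothetical and the exact contents of the info files may vary between kernel versions.

import glob
import os

STUCK_CLIENT = "192.0.2.10"     # hypothetical address of the unresponsive NFS client

for info in glob.glob("/proc/fs/nfsd/clients/*/info"):
    with open(info) as f:
        if STUCK_CLIENT in f.read():
            ctl = os.path.join(os.path.dirname(info), "ctl")
            with open(ctl, "w") as c:
                c.write("expire\n")
            print("expired NFS v4 client state via", ctl)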

Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.

Projects can't be divorced from the people involved in them

By: cks

Among computer geeks, myself included, there's a long running optimistic belief that projects can be considered in isolation and 'evaluated on their own merits', divorced from the specific people or organizations that are involved with them and the culture that they have created. At best, this view imagines that we can treat everyone involved in the development of something as a reliable Vulcan, driven entirely by cold logic with no human sentiment involved. This is demonstrably false (ask anyone about the sharp edge of Linus Torvalds' emails), but convenient, at least for people with privilege.

(A related thing is considering projects in isolation from the organizations that create and run them, for example ignoring that something is from 'killed by' Google.)

Over time, I have come to understand and know that this is false, much like other things I used to accept. The people involved with a project bring with them attitudes and social views, and they create a culture through their actions, their expressed views, and even their presence. Their mere presence matters because it affects other people, and how other people will or won't interact with the project.

(To put it one way, the odds that I will want to be involved in a project run by someone who openly expresses their view that bicyclists are the scum of the earth and should be violently run off the road are rather low, regardless of how they behave within the confines of the project. I'm not a Vulcan myself and so I am not going to be able to divorce my interactions with this person from my knowledge that they would like to see me and my bike club friends injured or dead.)

You can't divorce a project from its culture or its people (partly because the people create and sustain that culture); the culture and the specific people are entwined into how 'the project' (which is to say, the crowd of people involved in it) behaves, and who it attracts and repels. And once established, the culture of a project, like the culture of anything, is very hard to change, partly because it acts as a filter for who becomes involved in the project. The people who create a project gather like-minded people who see nothing wrong with the culture and often act to perpetuate it, unless the project becomes so big and so important that other people force their way in (usually because a corporation is paying them to put up with the initial culture).

(There is culture everywhere. C++ has a culture (or several), for example, as does Rust. Are they good cultures? People have various opinions that I will let you read about yourself.)

Realizing we needed two sorts of alerts for our temperature monitoring

By: cks

We have a long standing system to monitor the temperatures of our machine rooms and alert us if there are problems. A recent discussion about the state of the temperature in one of them made me realize that we want to monitor and alert for two different problems, and because they're different we need two different sorts of alerts in our monitoring system.

The first, obvious problem is a machine room AC failure, where the AC shuts off or becomes almost completely ineffective. In our machine rooms, an AC failure causes a rapid and sustained rise in temperature to well above its normal maximum level (which is typically reached just before the AC starts its next cooling cycle). AC failures are high priority issues that we want to alert about rapidly, because we don't have much time before machines start to cook themselves (and they probably won't shut themselves down before the damage has been done).

The second problem is an AC unit that can't keep up with the room's heat load; perhaps its filters are (too) clogged, or it's not getting enough cooling from the roof chillers, or various other mysterious AC reasons. The AC hasn't failed and it is still able to cool things to some degree and keep the temperature from racing up, but over time the room's temperature steadily drifts upward. Often the AC will still be cycling on and off to some degree and we'll see the room temperature vary up and down as a result; at other times the room temperature will basically reach a level and more or less stay there, presumably with the AC running continuously.

One issue we ran into is that a fast triggering alert that was implicitly written for the AC failure case can wind up flapping up and down if insufficient AC has caused the room to slowly drift close to its triggering temperature level. As the AC works (and perhaps cycles on and off), the room temperature will shift above and then back below the trigger level, and the alert flaps.

We can't detect both situations with a single alert, so we need at least two. Currently, the 'AC is not keeping up' alert looks for sustained elevated temperatures with the temperature always at or above a certain level over (much) more time than the AC should take to bring it down, even if the AC has to avoid starting for a bit of time to not cycle too fast. The 'AC may have failed' alert looks for high temperatures over a relatively short period of time, although we may want to make this an average over a short period of time.

(The advantage of an average is that if the temperature is shooting up, it may trigger faster than a 'the temperature is above X for Y minutes' alert. The drawback is that an average can flap more readily than a 'must be above X for Y time' alert.)
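
As an illustrative sketch only (assuming a Prometheus-style alerting setup, with a made-up metric name and made-up thresholds, not our real configuration):

groups:
  - name: machineroom-temperature
    rules:
      - alert: MachineRoomACMayHaveFailed
        # high temperature sustained over a short window; an avg_over_time()
        # expression could trigger faster but will flap more readily
        expr: machineroom_temp_celsius > 30
        for: 5m
      - alert: MachineRoomACNotKeepingUp
        # the temperature never dropped below an elevated level for far longer
        # than a normal AC cooling cycle should take to bring it down
        expr: min_over_time(machineroom_temp_celsius[45m]) >= 26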

Checklists are hard (but still a good thing)

By: cks

We recently had a big downtime at work where part of the work was me doing a relatively complex and touchy thing. Naturally I made a checklist, but also naturally my checklist turned out to be incomplete, with some things I'd forgotten and some steps that weren't quite right or complete. This is a good illustration that checklists are hard to create.

Checklists are hard partly because they require us to try to remember, reconstruct, and understand everything in what's often a relatively complex system that is too big for us to hold in our mind. If your understanding is incomplete you can overlook something and so leave out a step or a part of a step, and even if you write down a step you may not fully remember (and record) why the step has to be there. My view is that this is especially likely in system administration where we may have any number of things that have been quietly sitting in the corner for some time, working away without problems, and so they've slipped out of our minds.

(For example, one of the issues that we ran into in this downtime was not remembering all of the hosts that ran crontab jobs that used one particular filesystem. Of course we thought we did know, so we didn't try to systematically look for such crontab jobs.)

To get a really solid checklist you have to be able to test it, much like all documentation needs testing. Unfortunately, a lot of the checklists I write (or don't write) are for one-off things that we can't really test in advance for various reasons, for example because they involve a large scale change to our live systems (that requires a downtime). If you're lucky you'll realize that you don't know something or aren't confident in something while writing the checklist, so you can investigate it and hopefully get it right, but some of the time you'll be confident you understand the problem but you're wrong.

Despite any imperfections, checklists are still a good thing. An imperfect written down checklist is better than relying on your memory and mind on the fly almost all of the time (the rare exceptions are when you wouldn't even dare do the operation without a checklist but an imperfect checklist tempts you into doing it and fumbling).

(You can try to improve the situation by keeping notes on what was missed in the checklist and then saving or publishing these notes somewhere. You can review these after the fact notes on what was missed in this specific checklist if you have to do the thing again, or look for specific types of things you tend to overlook and should specifically check for the next time you're making a checklist that touches on some area.)

A logic to Apache accepting query parameters for static files

By: cks

One of my little web twitches is the lax handling of unknown query parameters. As part of this twitch I've long been a bit irritated that Apache accepts query parameters even on static files, when they definitely have no meaning at all. You could say that this is merely Apache being accepting in general, but recently I noticed a combination of Apache features that can provide an additional reason for Apache to do this.

Apache has various features to redirect from old URLs on your site to new URLs, such as Redirect and RewriteRule. As covered in the relevant documentation for each of them, these rewrites preserve query parameters (although for RewriteRule you can turn that off with the QSD flag). This behavior makes sense in a lot of cases; if you've moved an application from one URL to another (or from one host to another) and it uses query parameters, you almost certainly want the query parameters to carry over with the HTTP redirection that people using old URLs will get.

(Here by 'an application' I mean anything that accepts and acts on query parameters. It might be a CGI, a PHP page or set of pages, a reverse proxy to something else, a Django application implemented with mod_wsgi, or various other things.)
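
For illustration, a sketch of both behaviors with made-up URLs (the QSD flag needs Apache 2.4 or later):

# Redirect carries any query string over to the new URL by default
Redirect permanent /old-app https://apps.example.org/new-app

# RewriteRule does too, unless you discard it with the QSD flag
RewriteRule ^/retired-app$ /retired.html [R=302,QSD,L]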

A lot of the time if you use a redirect in Apache on URLs for an application, you'll be sending people to the new location of that application or its replacement. However, some of the time you'll be redirecting from an application to a static page, for example a page that says "this application has gone away". At least by default, your redirection from the application to the static page will carry query parameters along with it, and it would be a bad experience (for the people visiting and you) if the default result was that Apache served some sort of error page because it received query parameters on a static file.

(A closely related change is replacing a single-URL application, such as a basic CGI, with a static web page. Maybe the whole thing is no longer supported, or maybe everything now has a single useful response regardless of query parameters. Here again you can legitimately receive query parameters on a static file.)

Realizing this made me more sympathetic to Apache's behavior of accepting query parameters on static files. It's a relatively reasonable pragmatic choice even if (like me) you're not one of the people who feel unknown query parameters should always be ignored (which is the de facto requirement on the modern web, so my feelings about it are irrelevant).

Why Ubuntu 24.04's ls can show a puzzling error message on NFS filesystems

By: cks

Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted from a Linux NFS fileserver, and at some point you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then be confused and angered:

; /bin/ls -ld ckstst
/bin/ls: ckstst: Permission denied
drwx------ 64 ckstst [...] 131 Jul 17 12:06 ckstst

(There are situations where this doesn't happen or doesn't repeat, which I don't understand but which I'm assuming are NFS caching in action.)

If you apply strace to the problem, you'll find that the failing system call is listxattr(2), which is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes from Coreutils, and Coreutils apparently started using listxattr() in version 9.4.
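
If you want to see the underlying behavior yourself, here's a minimal sketch using listxattr(); point it at something you can't read on an NFS v4 mount and then at a similar thing on a local filesystem (the default path is just an example):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
   const char *path = (argc > 1) ? argv[1] : "ckstst";
   /* ask only for the size of the xattr name list, roughly what ls
      winds up doing when it checks for extended attributes */
   ssize_t n = listxattr(path, NULL, 0);
   if (n < 0)
      printf("listxattr(%s) failed: %s\n", path, strerror(errno));
   else
      printf("listxattr(%s): %zd bytes of xattr names\n", path, n);
   return 0;
}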

The Linux NFS v4 code supports extended attributes (xattrs), which come from RFC 8276; they've been supported in both the client and the server since mid-2020, if I'm reading git logs correctly. Both the normal Ubuntu 22.04 LTS and 24.04 LTS server kernels are recent enough to include this support on both the server and the clients, and I don't believe there's any way to turn off just xattr support in the kernel NFS server (although if you disable NFS v4.2, xattrs may disappear too).

However, the NFS v4 server doesn't treat listxattr() operations the way the kernel normally does. Normally, the kernel will let you do listxattr() on an object (a directory, a file, etc) that you don't have read permissions on, just as it will let you do stat() on it. However, the NFS v4 server code specifically requires that you have read access to the object. If you don't, you get EACCES (no second S).

(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c, specifically in the fh_verify() call that uses NFSD_MAY_READ instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)

In January of this year, Coreutils applied a workaround to this problem, which appeared in Coreutils 9.6 (and is mentioned in the release notes).

Normally we'd have found this last year, but we've been slow to roll out Ubuntu 24.04 LTS machines and apparently until now no one ever did a 'ls -l' of unreadable things on one of them (well, on a NFS mounted filesystem).

(This elaborates on a Fediverse post. Our patch is somewhat different than the official one.)

Two tools I've been using to look into my web traffic volume

By: cks

These days, there's an unusually large plague of web crawlers, many of them attributed to LLM activities and most of them acting anonymously, with forged user agents and sometimes widely distributed source IPs. Recently I've been using two tools more and more to try to identify and assess suspicious traffic sources.

The first tool is Anarcat's asncounter. Asncounter takes IP addresses, for example from your web server logs, and maps them to ASNs (roughly who owns an IP address) and to CIDR netblocks that belong to those ASNs (a single ASN can have a lot of netblocks). This gives you information like:

count   percent ASN     AS
1460    7.55    24940   HETZNER-AS, DE
[...]
count   percent prefix  ASN     AS
1095    5.66    66.249.64.0/20  15169   GOOGLE, US
[...]
85      0.44    49.13.0.0/16    24940   HETZNER-AS, DE
85      0.44    65.21.0.0/16    24940   HETZNER-AS, DE
82      0.42    138.201.0.0/16  24940   HETZNER-AS, DE
71      0.37    135.181.0.0/16  24940   HETZNER-AS, DE
68      0.35    65.108.0.0/16   24940   HETZNER-AS, DE
[...]

While Hetzner is my biggest traffic source by ASN, it's not my biggest source by 'prefix' (a CIDR netblock), because this Hetzner traffic is split up across a bunch of their networks. Since most software operates by CIDR netblocks, not by ASNs, this difference can be important (and unfortunate if you want to block all traffic from a particular ASN).

The second tool is grepcidr. Grepcidr will let you search through a log file, such as your web server logs, for traffic from any particular netblock (or a group of netblocks), such as Google's '66.249.64.0/20'. This lets me find out what sort of requests came from a potentially suspicious network block, for example 'grepcidr 49.13.0.0/16 /var/log/...'. If what I see looks suspicious and has little or no legitimate traffic, I can consider taking steps against that netblock.
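
For example (the log path here is a stand-in for wherever your web server logs actually live, and I believe asncounter reads IP addresses on standard input, but check its documentation for the exact invocation):

# what did a suspicious netblock actually ask for?
grepcidr 49.13.0.0/16 /var/log/apache2/access.log | less

# map a log's client IPs to ASNs and prefixes
awk '{print $1}' /var/log/apache2/access.log | asncounter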

Asncounter is probably not (yet) packaged in your Linux distribution. Grepcidr may be, but if it's not it's a C program and simple to compile.

(It wouldn't be too hard to put together an 'asngrep' that would cut out the middleman, but I've so far not attempted to do this.)

PS: Both asncounter and grepcidr can be applied to other sorts of logs with IP addresses, for example sources of SSH brute force password scans. But my web logs are all that I've used them for so far.

People want someone to be responsible for software that fails

By: cks

There are various things in the open source tech news these days, like bureaucratic cybersecurity risk assessment requests to open source projects, maintainers rejecting the current problematic approach to security issues (also), and 'software supply chain security'. One of my recent thoughts as a result of all of this is that the current situation is fundamentally unsustainable, and one part of it is that people are increasingly going to require someone to be held responsible for software that fails and does damage ('damage' in an abstract sense; people know it when they see it).

This isn't anything unique or special to software. People feel the same way about buildings, bridges, vehicles, food, and anything else that actually matters to leading a regular life, and eventually they've managed to turn that feeling into concrete results for most of those things. Software has so far had a long period of not being held to account, but then once upon a time so did food and food safety. Food has always been very important to people, while software spent a long time not being visibly a big deal (or, if you prefer, not being as visibly slipshod as it is today, when a lot more people are directly exposed to a lot more software and thus to its failings).

The bottom line is that people don't consider it (morally) acceptable when no one is held responsible for either negligence or worse, deliberate choices that cause harm. A field can only duck and evade their outrage for so long; sooner or later it stops being able to shrug and walk away. Software is now systematically important in the world, which means that its failings can do real harm, and people have noticed.

(Which is to say that an increasing number of people have been harmed by software and didn't like it, and the number and frequency is only going to go up.)

There are a lot of ways that this could happen, with the EU CRA being only one of them; as various drafts of the EU CRA have shown, there are a lot of ways that things could go badly in the process. And it could also be that the forces of unbridled pure-profit capitalism will manage to fight this off no matter how much people want it, as they're busy doing with other things in the world (see, for example, the LLM crawler plague). But if companies do fight this off I don't think we're going to enjoy that world very much for multiple reasons, and people's desire for this is still going to very much be there. The days of people's indifference are over and one way or another we're going to have to deal with that. Both our software and our profession will be shaped by how we do.

Doing web things with CGIs is mostly no longer a good idea

By: cks

Recently I saw Serving 200 million requests per day with a cgi-bin (via, and there's a follow-up), which talks about how fast modern CGIs can be in compiled languages like Rust and Go (Rust more so than Go, because Go has a runtime that it has to start every time a Go program is executed). I'm a long standing fan of CGIs (and Wandering Thoughts, this blog, runs as a CGI some of the time), but while I admire these articles, I think that you mostly shouldn't consider trying to actually write a CGI these days.

Where CGI programs shine is in their simple deployment and development model. You write a little program, you put the little program somewhere, and it just works (and it's not going to be particularly slow these days). The programs run only when they get used, and if you're using Apache, you can also make these little programs run as the user who owns that web area instead of as the web server user.
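
To illustrate how small such a program can be, here's a minimal compiled-language CGI sketch in C (obviously not one of the Rust or Go versions from those articles):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   /* CGI hands the request to us through environment variables */
   const char *qs = getenv("QUERY_STRING");

   /* headers first, then a blank line, then the body */
   printf("Content-Type: text/plain\r\n\r\n");
   printf("Hello from a small CGI program.\n");
   if (qs != NULL && qs[0] != '\0')
      printf("Query string: %s\n", qs);
   return 0;
}

Compile it, drop the binary into a directory where your Apache configuration allows CGI execution, and that's the whole deployment.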

Where CGI programs fall down today is that they're unpopular, no longer well supported in various programming environments and frameworks, and they don't integrate with various other tools because these days the tools expect to operate as HTTP (reverse) proxies in front of your HTTP service (for example, Anubis for anti-crawler protections). It's easy to write, for example, a Go HTTP based web service; you can find lots of examples of how to do it (and the pieces are part of Go's standard library). If you want to write a Go CGI, you're actually in luck because Go put that in the standard library, but you're not going to find anywhere near as many examples and of course you won't get that integration with other HTTP reverse proxy tools. Other languages are not necessarily going to be as friendly as Go (including Python, which has removed the 'cgi' standard library package in 3.13).

(Similarly, many modern web servers are less friendly to CGIs than Apache is and will make you assemble more pieces to run them, reducing a number of the deployment advantages of CGIs.)

Only running these 'backend' HTTP server programs when they're needed is not easy today (although it's possible with systemd), so if you have a lot of little things that you can't bundle together into one server program, CGIs may still make sense despite what is generally the extra hassle of developing and running them. But otherwise, a HTTP based service that you run behind your general purpose web server is what modern web development is steering you toward and it's almost certainly going to be the easiest path.

(There's also a lot of large scale software support for deploying things that are HTTP services, with things like load balancers and smart routing frontends and so on and so forth, never mind containers and orchestration environments. If you want to use CGIs in this environment you basically get to add in a little web server as the way the outside world invokes them.)

Improving my GNU Emacs 'which-key' experience with a Ctrl-h binding

By: cks

One of the GNU Emacs packages that I use is which-key (which is now in GNU Emacs v30). I use it because I don't always remember specific extended key bindings (ones that start with a prefix) that are available to me, especially in MH-E, which effectively has its own set of key bindings for a large collection of commands. Which-key gives me a useful but minimal popup prompt about what's available and I can usually use that to navigate to what I want. Recently I read Omar Antolín's The case against which-key: a polemic, which didn't convince me but did teach me two useful tricks in one.

The first thing I learned from it, which I could have already known if I was paying attention, was the Ctrl-h keybinding that's available in extended key bindings to get some additional help. In stock GNU Emacs this Ctrl-h information is what I consider only somewhat helpful, and with which-key turned on it's basically not helpful at all from what I can see (which may partly explain why I didn't pay it any attention before).

The second thing is a way to make this Ctrl-h key binding useful in combination with which-key, using Embark along with, I believe, a number of my other minibuffer completion things. That is, you point the prefix help command at an Embark function designed for this:

(setq prefix-help-command #'embark-prefix-help-command)

As illustrated in the article, using Ctrl-h with this binding effectively switches your multi-key entry over to minibuffer completion, complete with all of the completion add-ons you have configured. If you've got things like Vertico, Marginalia, and Orderless configured, this is an excellent way to pick through the available bindings to figure out what you want; with my configuration I get the key itself, the ELisp function name, and the ELisp help summary (and then I can cursor down and up through the list).

The Embark version is too much information if I just need a little reminder of what's possible; that's what the basic which-key display is great for. But if the basic information isn't enough, the Embark binding for Ctrl-h is a great supplement, and it even reaches through multi-key sequences (which is something that which-key doesn't do, at least in my setup, and I have some of them in MH-E).

People still use our old-fashioned Unix login servers

By: cks

Every so often I think about random things, and today's random thing was how our environment might look if it was rebuilt from scratch as a modern style greenfield development. One of the obvious assumptions is that it'd involve a lot of use of containers, which led me to wondering how you handle traditional Unix style login servers. This is a relevant issue for us because we have such traditional login servers and somewhat to our surprise, they still see plenty of use.

We have two sorts of login servers. There's effectively one general purpose login server that people aren't supposed to do heavy duty computation on (and which uses per-user CPU and RAM limits to help with that), and four 'compute' login servers where they can go wild and use up all of the CPUs and memory they can get their hands on (with no guarantees that there will be any, those machines are basically first come, first served; for guaranteed CPUs and RAM people need to use our SLURM cluster). Usage of these servers has declined over time, but they still see a reasonable amount of use, including by people who have only recently joined the department (as graduate students or otherwise).

What people log in to our compute servers to do probably hasn't changed much, at least in one sense; people probably don't log in to a compute server to read their mail with their favorite text mode mail reader (yes, we have Alpine and Mutt users). What people use the general purpose 'application' login server for likely has changed a fair bit over time. It used to be that people logged in to run editors, mail readers, and other text and terminal based programs. However, now a lot of logins seem to be done either to SSH to other machines that aren't accessible from the outside world or to run the back-ends of various development environments like VSCode. Some people still use the general purpose login server for traditional Unix login things (me included), but I think it's rarer these days.

(Another use of both sorts of servers is to run cron jobs; various people have various cron jobs on one or the other of our login servers. We have to carefully preserve them when we reinstall these machines as part of upgrading Ubuntu releases.)

PS: I believe the reason people run IDE backends on our login servers is because they have their code on our fileservers, in their (NFS-mounted) home directories. And in turn I suspect people put the code there partly because they're going to run the code on either or both of our SLURM cluster or the general compute servers. But in general we're not well informed about what people are using our login servers for due to our support model.

The development version of OpenZFS is sometimes dangerous, illustrated

By: cks

I've used OpenZFS on my office and home desktops (on Linux) for what is a long time now, and over that time I've consistently used the development version of OpenZFS, updating to the latest git tip on a regular basis (cf). There have been occasional issues but I've said, and continue to say, that the code that goes into the development version is generally well tested and I usually don't worry too much about it. But I do worry somewhat, and I do things like read every commit message for the development version and I sometimes hold off on updating my version if a particular significant change has recently landed.

But, well, sometimes things go wrong in a development version. As covered in Rob Norris's An (almost) catastrophic OpenZFS bug and the humans that made it (and Rust is here too) (via), there was a recently discovered bug in the development version of OpenZFS that could or would have corrupted RAIDZ vdevs. When I saw the fix commit go by in the development version, I felt extremely lucky that I use mirror vdevs, not raidz, and so avoided being affected by this.

(While I might have detected this at the first scrub after some data was corrupted, the data would have been gone and at a minimum I'd have had to restore it from backups. Which I don't currently have on my home desktop.)

In general this is a pointed reminder that the development version of OpenZFS isn't perfect, no matter how long I and other people have been lucky with it. You might want to think twice before running the development version in order to, for example, get support for the very latest kernels that are used by distributions like Fedora. Perhaps you're better off delaying your kernel upgrades a bit longer and sticking to released branches.

I don't know if this is going to change my practices around running the development version of OpenZFS on my desktops. It may make me more reluctant to update to the very latest version on my home desktop; it would be straightforward to have that run only time-delayed versions of what I've already run through at least one scrub cycle on my office desktop (where I have backups). And I probably won't switch to the next release version when it comes out, partly because of kernel support issues.

What OSes we use here (as of July 2025)

By: cks

About five years ago I wrote an entry on what OSes we were using at the time. Five years is both a short time and a long time here, and in that time some things have changed.

Our primary OS is still Ubuntu LTS; it's our default and we use it on almost everything. On the one hand, these days 'almost everything' covers somewhat more ground than it did in 2020, as some machines have moved from OpenBSD to Ubuntu. On the other hand, as time goes by I'm less and less confident that we'll still be using Ubuntu in five years, because I expect Canonical to start making (more) unfortunate and unacceptable changes any day now. Our most likely replacement Linux is Debian.

CentOS is dead here, killed by a combination of our desire to not have two Linux variants to deal with and CentOS Stream. We got rid of the last of our CentOS machines last year. Conveniently, our previous commercial anti-spam system vendor effectively got out of the business so we didn't have to find a new Unix that they supported.

We're still using OpenBSD, but it's increasingly looking like a legacy OS that's going to be replaced by FreeBSD as we rebuild the various machines that currently run OpenBSD. Our primary interests are better firewall performance and painless mirrored root disks, but if we're going to run some FreeBSD machines and it can do everything OpenBSD can, we'd like to run fewer Unixes so we'll probably replace all of the OpenBSD machines with FreeBSD ones over time. This is a shift in progress and we'll see how far it goes, but I don't expect the number of OpenBSD machines we run to go up any more; instead it's a question of how far down the number goes.

(Our opinions about not using Linux for firewalls haven't changed. We like PF, it's just we like FreeBSD as a host for it more than OpenBSD.)

We continue to not use containers so we don't have to think about a separate, minimal Linux for container images.

There are a lot of research groups here and they run a lot of machines, so research group machines are most likely running a wide assortment of Linuxes and Unixes. We know that Ubuntu (both LTS and non-LTS) is reasonably popular among research groups, but I'm sure there are people with other distributions and probably some use of FreeBSD, OpenBSD, and so on. I believe there may be a few people still using Solaris machines.

(My office desktop continues to run Fedora, but I wouldn't run it on any production server due to the frequent distribution version updates. We don't want to be upgrading distribution versions every six months.)

Overall I'd say we've become a bit more of an Ubuntu LTS monoculture than we were before, but it's not a big change, partly because we were already mostly Ubuntu. Given our views on things like firewalls, we're probably never going to be all-Ubuntu or all-Linux.

(Maybe) understanding how to use systemd-socket-proxyd

By: cks

I recently read systemd has been a complete, utter, unmitigated success (via among other places), where I found a mention of an interesting systemd piece that I'd previously been unaware of, systemd-socket-proxyd. As covered in the article, the major purpose of systemd-socket-proxyd is to bridge between systemd's dynamic socket activation and a conventional program that listens on some socket, so that you can dynamically activate the program when a connection comes in. Unfortunately the systemd-socket-proxyd manual page is a little bit opaque about how it works for this purpose (and what the limitations are). Even though I'm familiar with systemd stuff, I had to think about it for a bit before things clicked.

A systemd socket unit activates the corresponding service unit when a connection comes in on the socket. For simple services that are activated separately for each connection (with 'Accept=yes'), this is actually a templated unit, but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be a single .service unit. When systemd activates this unit, it will pass the socket to it either through systemd's native mechanism or an inetd-compatible mechanism using standard input. If your listening program supports either mechanism, you don't need systemd-socket-proxyd and your life is simple. But plenty of interesting programs don't; they expect to start up and bind to their listening socket themselves. To work with these programs, systemd-socket-proxyd accepts a socket (or several) from systemd and then proxies connections on that socket to the socket your program is actually listening to (which will not be the official socket, such as port 80 or 443).

All of this is perfectly fine and straightforward, but the question is, how do we get our real program to be automatically started when a connection comes in and triggers systemd's socket activation? The answer, which isn't explicitly described in the manual page but which appears in the examples, is that we make the socket's .service unit (which will run systemd-socket-proxyd) also depend on the .service unit for our real service with a 'Requires=' and an 'After='. When a connection comes in on the main socket that systemd is doing socket activation for, call it 'fred.socket', systemd will try to activate the corresponding .service unit, 'fred.service'. As it does this, it sees that fred.service depends on 'realthing.service' and must be started after it, so it will start 'realthing.service' first. Your real program will then start, bind to its local socket, and then have systemd-socket-proxyd proxy the first connection to it.

To automatically stop everything when things are idle, you set systemd-socket-proxyd's --exit-idle-time option and also set StopWhenUnneeded=true on your program's real service unit ('realthing.service' here). Then when systemd-socket-proxyd is idle for long enough, it will exit, systemd will notice that the 'fred.service' unit is no longer active, see that there's nothing that needs your real service unit any more, and shut that unit down too, causing your real program to exit.
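
Putting that together, here's a sketch of the three units using the 'fred' and 'realthing' names from above (the port numbers, the path to systemd-socket-proxyd, and the real program's command line are all illustrative and vary by distribution and program):

# fred.socket: systemd listens on the public port
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# fred.service: runs systemd-socket-proxyd and pulls in the real service first
[Unit]
Requires=realthing.service
After=realthing.service

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=1min 127.0.0.1:8080

# realthing.service: the real program, which binds 127.0.0.1:8080 itself
[Unit]
StopWhenUnneeded=true

[Service]
ExecStart=/usr/local/sbin/realthing --listen 127.0.0.1:8080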

The obvious limitation of using systemd-socket-proxyd is that your real program no longer knows the actual source of the connection. If you use systemd-socket-proxyd to relay HTTP connections on port 80 to an nginx instance that's activated on demand (as shown in the examples in the systemd-socket-proxyd manual page), that nginx sees and will log all of the connections as local ones. There are usage patterns where this information will be added by something else (for example, a frontend server that is a reverse proxy to a bunch of activated on demand backend servers), but otherwise you're out of luck as far as I know.

Another potential issue is that systemd's idea of when the .service unit for your real program has 'started' and thus it can start running systemd-socket-proxyd may not match when your real program actually gets around to setting up its socket. I don't know if systemd-socket-proxyd will wait and try a bit to cope with the situation where it gets started a bit faster than your real program can get its socket ready.

(Systemd has ways that your real program can signal readiness, but if your program can use these ways it may well also support being passed sockets from systemd as a direct socket activated thing.)

Linux 'exportfs -r' stops on errors (well, problems)

By: cks

Linux's NFS export handling system has a very convenient option where you don't have to put all of your exports into one file, /etc/exports, but can instead write them into a bunch of separate files in /etc/exports.d. This is very convenient for allowing you to manage filesystem exports separately from each other and to add, remove, or modify only a single filesystem's exports. Also, one of the things that exportfs(8) can do is 'reexport' all current exports, synchronizing the system state to what is in /etc/exports and /etc/exports.d; this is 'exportfs -r', and is a handy thing to do after you've done various manipulations of files in /etc/exports.d.

Although it's not documented and not explicit in 'exportfs -v -r' (which will claim to be 'exporting ...' for various things), I have an important safety tip which I discovered today: exportfs does nothing on a re-export if you have any problems in your exports. In particular, if any single file in /etc/exports.d has a problem, no files from /etc/exports.d get processed and no exports are updated.

One potential problem with such files is syntax errors, which is fair enough as a 'problem'. But another problem is that they refer to directories that don't exist, for example because you have lingering exports for a ZFS pool that you've temporarily exported (which deletes the directories that the pool's filesystems may have previously been mounted on). A missing directory is an error even if the exportfs options include 'mountpoint', which only does the export if the directory is a mount point.
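
As an illustration, a hypothetical /etc/exports.d/tank-fs1.exports (the filesystem and client names are made up):

# export only if /tank/fs1 is actually a mount point
/tank/fs1  nfsclient1.example.org(rw,mountpoint,sec=sys)  nfsclient2.example.org(rw,mountpoint,sec=sys)

If /tank/fs1 doesn't exist at all (as opposed to merely not being mounted), this one file is enough to make 'exportfs -r' quietly do nothing for all of your exports.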

When I stubbed my toe on this I was surprised. What I'd vaguely expected was that the error would cause only the particular file in /etc/exports.d to not be processed, and that it wouldn't be a fatal error for the entire process. Exportfs itself prints no notices about this being a fatal problem, and it will happily continue to process other files in /etc/exports.d (as you can see with 'exportfs -v -r' with the right ordering of where the problem file is) and claim to be exporting them.

Oh well, now I know and hopefully it will stick.

Systemd user units, user sessions, and environment variables

By: cks

A variety of things in typical graphical desktop sessions communicate through the use of environment variables; for example, X's $DISPLAY environment variable. Somewhat famously, modern desktops run a lot of things as systemd user units, and it might be nice to do that yourself (cf). When you put these two facts together, you wind up with a question, namely how the environment works in systemd user units and what problems you're going to run into.

The simplest case is using systemd-run to run a user scope unit ('systemd-run --user --scope --'), for example to run a CPU heavy thing with low priority. In this situation, the new scope will inherit your entire current environment and nothing else. As far as I know, there's no way to do this with other sorts of things that systemd-run will start.

Non-scope user units by default inherit their environment from your user "systemd manager". I believe that there is always only a single user manager for all sessions of a particular user, regardless of how you've logged in. When starting things via 'systemd-run', you can selectively pass environment variables from your current environment with 'systemd-run --user -E <var> -E <var> -E ...'. If the variable is unset in your environment but set in the user systemd manager, this will unset it for the new systemd-run started unit. As you can tell, this will get very tedious if you want to pass a lot of variables from your current environment into the new unit.

You can manipulate your user "systemd manager environment block", as systemctl describes it in Environment Commands. In particular, you can export current environment settings to it with 'systemctl --user import-environment VAR VAR2 ...'. If you look at this with 'systemctl --user show-environment', you'll see that your desktop environment has pushed a lot of environment variables into the systemd manager environment block, including things like $DISPLAY (if you're on X). All of these environment variables for X, Wayland, DBus, and so on are probably part of how the assorted user units that are part of your desktop session talk to the display and so on.
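
To make this concrete, here are the commands involved (the program names are just placeholders):

# a user scope: inherits your full current environment
systemd-run --user --scope -- nice -n 19 some-big-compute-job

# a detached user unit: inherits the user manager's environment, plus
# whatever you explicitly pass along (or unset) with -E
systemd-run --user -E DISPLAY -E XAUTHORITY -- some-x-program

# push session variables into the user manager's environment block
systemctl --user import-environment DISPLAY XAUTHORITY
systemctl --user show-environment | grep -e DISPLAY -e XAUTHORITY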

You may now see a little problem. What happens if you're logged in with a desktop X session, and then you go elsewhere and SSH in to your machine (maybe with X forwarding) and try to start a graphical program as a systemd user unit? Since you only have a single systemd manager regardless of how many sessions you have, the systemd user unit you started from your SSH session will inherit all of the environment variables that your desktop session set and it will think it has graphics and open up a window on your desktop (which is hopefully locked, and in any case it's not useful to you over SSH). If you import the SSH session's $DISPLAY (or whatever) into the systemd manager's environment, you'll damage your desktop session.

For specific environment variables, you can override or remove them with 'systemd-run --user -E ...' (for example, to override or remove $DISPLAY). But hunting down all of the session environment variables that may trigger undesired effects is up to you, making systemd-run's user scope units by far the easiest way to deal with this.

(I don't know if there's something extra-special about scope units that enables them and only them to be passed your entire environment, or if this is simply a limitation in systemd-run that it doesn't try to implement this for anything else.)

The reason I find all of this regrettable is that it makes putting applications and other session processes into their own units much harder than it should be. Systemd-run's scope units inherit your session environment but can't be detached, so at a minimum you have extra systemd-run processes sticking around (and putting everything into scopes when some of them might be services is unaesthetic). Other units can be detached but don't inherit your environment, requiring assorted contortions to make things work.

PS: Possibly I'm missing something obvious about how to do this correctly, or perhaps there's an existing helper that can be used generically for this purpose.

The easiest way to interact with programs is to run them in terminals

By: cks

I recently wrote about a new little script of mine, which I use to start programs in terminals in a way that I can interact with them (to simplify it). Much of what I start with this tool doesn't need to run in a terminal window at all; the actual program will talk directly to the X server or arrange to talk to my Firefox or the like. I could in theory start them directly from my X session startup script, as I do with other things.

The reason I haven't put these things in my X session startup is that running things in shell sessions in terminal windows is the easiest way to interact with them in all sorts of ways. It's trivial to stop the program or restart it, to look at its output, to rerun it with slightly different arguments if I need to, it automatically inherits various aspects of my current X environment, and so on. You can do all of these things with programs in ways other than using shell sessions in terminals, but it's generally going to be more awkward.

(For instance, on systemd based Linuxes, I could make some of these programs into systemd user services, but I'd still have to use systemd commands to manipulate them. If I run them as standalone programs started from my X session script, it's even more work to stop them, start them again, and so on.)

For well established programs that I expect to never restart or want to look at output from, I'll run them from my X session startup script. But for new programs, like these, they get to spend a while in terminal windows because that's the easiest way. And some will be permanent terminal window occupants because they sometimes produce (text) output.

On the one hand, using terminal windows for this is simple and effective, and I could probably make it better by using a multi-tabbed terminal program, with one tab for each program (or the equivalent in a regular terminal program with screen or tmux). On the other hand, it feels a bit sad that in 2025, our best approach for flexible interaction with a program and monitoring its output is 'put it in a terminal'.

(It's also irritating that with some programs, the easiest and best way to make sure that they really exit when you want them to shut down, rather than "helpfully" lingering on in various ways, is to run them from a terminal and then Ctrl-C them when you're done with them. I have to use a certain video conferencing application that is quite eager to stay running if you tell it to 'quit', and this is my solution to it. Someday I may have to figure out how to put it in a systemd user unit so that it can't stage some sort of great escape into the background.)

Filesystems and the problems of exposing their internal features

By: cks

Modern filesystems often have a variety of sophisticated features that go well beyond standard POSIX style IO, such as transactional journals of (all) changes and storing data in compressed form. For certain usage cases, it could be nice to get direct access to those features; for example, so your web server could potentially directly serve static files in their compressed form, without having the kernel uncompress them and then the web server re-compress them (let's assume we can make all of the details work out in this sort of situation, which isn't a given). But filesystems only very rarely expose this sort of thing to programs, even through private interfaces that don't have to be standardized by the operating system.

One of the reasons for filesystems to not do this is that they don't want to turn what are currently internal filesystem details into an API (it's not quite right to call them only an 'implementation' detail, because often the filesystem has to support the resulting on-disk structures more or less forever). Another issue is that the implementation inside the kernel is often not even written so that the necessary information could be provided to a user-level program, especially efficiently.

Even when exposing a feature doesn't necessarily require providing programs with internal information from the filesystem, filesystems may not want to make promises to user space about what they do and when they do it. One place this comes up is the periodic request that filesystems like ZFS expose some sort of 'transaction' feature, where the filesystem promises that either all of a certain set of operations are visible or none of them are. Supporting such a feature doesn't just require ZFS or some other filesystem to promise to tell you when all of the things are durably on disk; it also requires the filesystem to not make any of them visible early, despite things like memory pressure or the filesystem's other natural activities.

Sidebar: Filesystem compression versus program compression

When you start looking, how ZFS does compression (and probably how other filesystems do it) is quite different from how programs want to handle compressed data. A program such as a web server needs a compressed stream of data that the recipient can uncompress as a single (streaming) thing, but this is probably not what the filesystem does. To use ZFS as an example of filesystem behavior, ZFS compresses blocks independently and separately (typically in 128 Kbyte blocks), may use different compression schemes for different blocks, and may not compress a block at all. Since ZFS reads and writes blocks independently and has metadata for each of them, this is perfectly fine for it but obviously is somewhat messy for a program to deal with.

Operating system kernels could return multiple values from system calls

By: cks

In yesterday's entry, I talked about how Unix's errno is so limited partly because of how the early Unix kernels didn't return multiple values from system calls. It's worth noting that this isn't a limitation in operating system kernels and typical system call interfaces; instead, it's a limitation imposed by C. If anything, it's natural to return multiple values from system calls.

Typically, system call interfaces use CPU registers because it's much easier (and usually faster) for the kernel to access (user) CPU register values than it is to read or write things from and to user process memory. If you can pass system call arguments in registers, you do so, and similarly for returning results. Most CPU architectures have more than one register that you could put system call results into, so it's generally not particularly hard to say that your OS returns results in the following N CPU registers (quite possibly the registers that are also used for passing arguments).

Using multiple CPU registers for system call return values was even used by Research Unix on the PDP-11, for certain system calls. This is most visible in versions that are old enough to document the PDP-11 assembly versions of system calls; see, for example, the V4 pipe(2) system call, which returns the two ends of the pipe in r0 and r1. Early Unix put errno error codes and non-error results in the same place not because it had no choice but because it was easier that way.

(Because I looked it up, V7 returned a second value in r1 in pipe(), getuid(), getgid(), getpid(), and wait(). All of the other system calls seem to have only used r0; if r1 was unused by a particular call, the generic trap handling code preserved it over the system call.)

I don't know if there's any common operating system today with a system call ABI that routinely returns multiple values, but I suspect not. I also suspect that if you were designing an OS and a system call ABI today and were targeting it for a modern language that directly supported multiple return values, you would probably put multiple return values in your system call ABI. Ideally, including one for an error code, to avoid anything like errno's limitations; in fact it would probably be the first return value, to cope with any system calls that had no ordinary return value and simply returned success or some failure.

What is going on in Unix with errno's limited nature

By: cks

If you read manual pages, such as Linux's errno(3), you'll soon discover an important and peculiar seeming limitation of looking at errno. To quote the Linux version:

The value in errno is significant only when the return value of the call indicated an error (i.e., -1 from most system calls; -1 or NULL from most library functions); a function that succeeds is allowed to change errno. The value of errno is never set to zero by any system call or library function.

This is also more or less what POSIX says in errno, although in standards language that's less clear. All of this is a sign of what has traditionally been going on behind the scenes in Unix.

The classical Unix approach to kernel system calls doesn't return multiple values, for example the regular return value and errno. Instead, Unix kernels have traditionally returned either a success value or the errno value along with an indication of failure, telling them apart in various ways (such as the PDP-11 return method). At the C library level, the simple approach taken in early Unix was that system call wrappers only bothered to set the C level errno if the kernel signaled an error. See, for example, the V7 libc/crt/cerror.s combined with libc/sys/dup.s, where the dup() wrapper only jumps to cerror and sets errno if the kernel signals an error. The system call wrappers could all have explicitly set errno to 0 on success, but they didn't.

The next issue is that various C library calls may make a number of system calls themselves, some of which may fail without the library call itself failing. The classical case is stdio checking to see whether stdout is connected to a terminal and so should be line buffered, which was traditionally implemented by trying to do a terminal-only ioctl() to the file descriptors, which would fail with ENOTTY on non-terminal file descriptors. Even if stdio did a successful write() rather than only buffering your output, the write() system call wrapper wouldn't change the existing ENOTTY errno value from the failed ioctl(). So you can have a fwrite() (or printf() or puts() or other stdio call) that succeeds while 'setting' errno to some value such as ENOTTY.
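
You can see the general principle with a small sketch; run it with standard output redirected to a file and isatty() will typically leave ENOTTY in errno, even though the write() that follows succeeds:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
   errno = 0;
   /* on a non-terminal this fails internally and may leave errno set */
   (void) isatty(STDOUT_FILENO);
   /* a successful system call is allowed to leave errno untouched */
   if (write(STDOUT_FILENO, "hello\n", 6) == 6)
      fprintf(stderr, "write() succeeded but errno is %d (%s)\n",
              errno, errno ? strerror(errno) : "still 0");
   return 0;
}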

When ANSI C and POSIX came along, they inherited this existing situation and there wasn't much they could do about it (POSIX was mostly documenting existing practice). I believe that they also wanted to allow a situation where POSIX functions were implemented on top of whatever oddball system calls you wanted to have your library code do, even if they set errno. So the only thing POSIX could really require was the traditional Unix behavior that if something failed and it was documented to set errno on failure, you could then look at errno and have it be meaningful.

(This was what existing Unixes were already mostly doing and specifying it put minimal constraints on any new POSIX environments, including POSIX environments on top of other operating systems.)

(This elaborates on a Fediverse post of mine, and you can run into this in non-C languages that have true multi-value returns under the right circumstances.)

On sysadmins (not) changing (OpenSSL) cipher suite strings

By: cks

Recently I read Apps shouldn’t let users enter OpenSSL cipher-suite strings by Frank Denis (via), which advocates for providing at most a high level interface to people that lets them express intentions like 'forward secrecy is required' or 'I have to comply with FIPS 140-3'. As a system administrator, I've certainly been guilty of not keeping OpenSSL cipher suite strings up to date, so I have a good deal of sympathy for the general view of trusting the clients and the libraries (and also possibly the servers). But at the same time, I think that this approach has some issues. In particular, if you're only going to set generic intents, you have to trust that the programs and libraries have good defaults. Unfortunately, historically the times when system administrators have most reached for setting specific OpenSSL cipher suite strings were when something came up all of a sudden and they didn't trust the library or program defaults to be up to date.

The obvious conclusion is that an application or library that wants people to only set high level options needs to commit to agility and fast updates so that it always has good defaults. This needs more than just the upstream developers making prompt updates when issues come up, because in practice a lot of people will get the program or library through their distribution or other packaging mechanism. A library that really wants people to trust it here needs to work with distributions to make sure that this sort of update can rapidly flow through, even for older distribution versions with older versions of the library and so on.

(For obvious reasons, people are generally pretty reluctant to touch TLS libraries and would like to do it as little as possible, leaving it to specialists and even then as much as possible to the upstream. Bad things can and have happened here.)

If I was doing this for a library, I would be tempted to give the library two sets of configuration files. One set, the official public set, would be the high level configuration that system administrators were supposed to use to express high level intents, as covered by Frank Denis. The other set would be internal configuration that expressed all of those low level details about cipher suite preferences, what cipher suites to use when, and so on, and was for use by the library developers and people packaging and distributing the library. The goal is to make it so that emergency cipher changes can be shipped as relatively low risk and easily backported internal configuration file changes, rather than higher risk (and thus slower to update) code changes. In an environment with reproducible binary builds, it'd be ideal if you could rebuild the library package with only the configuration files changed and get library shared objects and so on that were binary identical to the previous versions, so distributions could have quite high confidence in newly-built updates.

(System administrators who opted to edit this second set of files themselves would be on their own. In packaging systems like RPM and Debian .debs, I wouldn't even have these files marked as 'configuration files'.)

How you can wind up trying to allocate zero bytes in C

By: cks

One of the reactions I saw to my entry on malloc(0) being allowed to return NULL is to wonder why you'd ever do a zero-size allocation in the first place. Unfortunately, it's relatively easy to stumble into this situation with simple code in certain sorts of not particularly uncommon situations. The most obvious one is if you're allocating memory for a variable size object, such as a Python tuple or a JSON array. In a simple C implementation these will typically have a fixed struct that contains a pointer to a C memory block with either the actual elements or an array of pointers to them. The natural way to set this up is to write code that winds up calling 'malloc(nelems * sizeof(...))' or something like that, like this:

array_header *
alloc_array(unsigned int nelems)
{
   array_header *h;
   h = malloc(sizeof(array_header));
   if (h == NULL) return NULL;

   /* get space for the element pointers except oops */
   h->data = malloc(nelems * sizeof(void *));
   if (h->data == NULL) {
      free(h);
      return NULL;
   }

   h->nelems = nelems;
   /* maybe some other initialization */

   return h;
}

(As a disclaimer, I haven't tried to compile this C code because I'm lazy, so it may contain mistakes.)

Then someone asks your code to create an empty tuple or JSON array and on some systems, things will explode because nelems will be 0 and you will wind up doing 'malloc(0)' and that malloc() will return NULL, as it's allowed to, and your code will think it's out of memory. You can obviously prevent this from happening, but it requires more code and thus requires you to have thought of the possibility.
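
For illustration, one minimal way to guard against this is a tiny wrapper that never asks malloc() for zero bytes (a sketch; another option is to special-case 'nelems == 0' and leave h->data as NULL):

#include <stdlib.h>

/* never ask the allocator for zero bytes */
static void *
xmalloc_nonzero(size_t nbytes)
{
   return malloc(nbytes > 0 ? nbytes : 1);
}

Calling 'h->data = xmalloc_nonzero(nelems * sizeof(void *))' in the function above sidesteps the issue, at the cost of actually allocating a tiny bit of memory for empty arrays.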

(Allocating C strings doesn't have this problem because you always need one byte for the terminating 0 byte, but it can come up with other forms of strings where you track the length explicitly.)

One tricky bit about this code is that it will only ever go wrong in an obvious way on some uncommon systems. On most systems today, 'malloc(0)' returns a non-NULL result, usually because the allocator rounds up the amount of memory you asked for to some minimum size. So you can write this code and have it pass all of your tests on common platforms and then some day someone reports that it fails on, for example, AIX.

(It's possible that modern C linters and checkers will catch this; I'm out of touch with the state of the art there.)

As a side note, if malloc(0) returns anything other than NULL, I believe that each call is required to return a unique pointer (see eg the POSIX description of malloc()). I believe that these unique pointers don't have to point to actual allocated memory; they could point to some large reserved arena and be simply allocated in sequence, with a free() of them effectively doing nothing. But it's probably simpler to have your allocator round the size up and return real allocated memory, since then you don't have to handle things like the reserved arena running out of space.
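As a purely illustrative sketch of mine (not how any real allocator I know of works), such an arena scheme might look like this:

#include <stddef.h>

/* hand out unique non-NULL pointers for zero-sized allocations
   without allocating any real memory beyond a fixed arena;
   free() would have to recognize pointers into zero_arena and
   quietly ignore them */
static char zero_arena[65536];
static size_t zero_used;

void *
alloc_zero_sized(void)
{
   if (zero_used >= sizeof(zero_arena))
      return NULL;   /* the arena itself can run out of space */
   return &zero_arena[zero_used++];
}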

The "personal computer" model scales better than the "terminal" model

By: cks

In an aside in a recent entry, I said that one reason that X terminals faded away is that what I called the "personal computer" model of computing had some pragmatic advantages over the "terminal" model. One of them is that broadly, the personal computer model scales better, even though sometimes it may be more expensive or less capable at any given point in time. But first, let me define my terms. What I mean by the "personal computer" model is one where computing resources are distributed, where everyone is given a computer of some sort and is expected to do much of their work with that computer. What I mean by the "terminal" model is where most computing is done on shared machines, and the objects people have are simply used to access those shared machines.

The terminal model has the advantage that the devices you give each individual person can be cheaper, since they don't need to do as much. It has the potential disadvantage that you need some number of big shared machines for everyone to do their work on, and those machines are often expensive. However, historically, some of the time those big shared servers (plus their terminals) have been less expensive than getting everyone their own computer that was capable enough. So at any fixed point in time and capacity needs, the "terminal" model may win.

The problem with the terminal model is those big shared resources, which become an expensive choke point. If you want to add some more terminals, you need to also budget for more server capacity. If some of your people turn out to need more power than you initially expected, you're going to need more server capacity. And so on. The problem is that your server capacity generally has to be bought in big, expensive units and increments, a problem that has come up before.

The personal computer model is potentially more expensive up front but it's much easier to scale it, because you buy computer capacity in much smaller units. If you get more people, you get each of them a personal computer. If some of your people need more power, you get them (and just them) more capable, more expensive personal computers. If you're a bit short of budget for hardware updates, you can have some people use their current personal computers for longer. In general, you're free to vary things on a very fine grained level, at the level of individual people.

(Of course you may still have some shared resources, like backups and perhaps shared disk space, but there are relatively fine grained solutions for that too.)

PS: I don't know if big compute is cheaper than a bunch of small compute today, given that we've run into various limits in scaling up things like CPU performance, along with power and heat constraints, and so on. There are "cloud desktop" offerings from various providers, but I'm not sure these are winners based on the hardware economics alone, plus today you'd need something to be the "terminal" as well and that thing is likely to be a capable computer itself, not the modern equivalent of an X terminal.

How history works in the version of the rc shell that I use

By: cks

Broadly, there have been three approaches to command history in Unix shells. In the beginning there was none, which was certainly simple but which led people to be unhappy. Then csh gave us in-memory command history, which could be recalled and edited with shell builtins like '!!' but which lasted only as long as that shell process did. Finally, people started putting 'readline style' interactive command editing into shells, which included some history of past commands that you could get back with cursor-up, and picked up the GNU Readline feature of a $HISTORY file. Broadly speaking, the shell would save the in-memory (readline) history to $HISTORY when it exited and load the in-memory (readline) history from $HISTORY when it started.

I use a reimplementation of rc, the shell created by Tom Duff, and my version of the shell started out with a rather different and more minimal mechanism for history. In the initial release of this rc, all the shell itself did was write every command executed to $history (if that variable was set). Inspecting and reusing commands from a $history file was left up to you, although rc provided a helper program that could be used in a variety of ways. For example, in a terminal window I commonly used '-p' to print the last command and then either copied and pasted it with the mouse or used an rc function I wrote to repeat it directly.

(You didn't have to set $history to the same file in every instance of rc. I arranged to have a per-shell history file that was removed when the shell exited, because I was only interested in short term 'repeat a previous command' usage of history.)

Later, the version of rc that I use got support for GNU Readline and other line editing environments (and I started using it). GNU Readline maintains its own in-memory command history, which is used for things like cursor-up to the previous line. In rc, this in-memory command history is distinct from the $history file history, and things can get confusing if you mix the two (for example, cursor-up to an invocation of your 'repeat the last command' function won't necessarily repeat the command you expect).

It turns out that at least for GNU Readline, the current implementation in rc does the obvious thing; if $history is set when rc starts, the commands from it are read into GNU Readline's in-memory history. This is one half of the traditional $HISTORY behavior. Rc's current GNU Readline code doesn't attempt to save its in-memory history back to $history on exit, because if $history is set the regular rc code has already been recording all of your commands there. Rc otherwise has no shell builtins to manipulate GNU Readline's command history, because GNU Readline and other line editing alternatives are just optional extra features that have relatively minimal hooks into the core of rc.

(In theory this allows thenshell to inject a synthetic command history into rc on startup, but it requires thenshell to know exactly how I handle my per-shell history file.)

Sidebar: How I create per-shell history in this version of rc

The version of rc that I use doesn't have an 'initialization' shell function that runs when the shell is started, but it does support a 'prompt' function that's run just before the prompt is printed. So my prompt function keeps track of the 'expected shell PID' in a variable and compares it to the actual PID. If there's a mismatch (including the variable being unset), the prompt function goes through a per-shell initialization, including setting up my per-shell $history value.
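As a sketch of the idea (an approximation of mine, not my actual setup; in this rc, $pid is the shell's process ID and the prompt function runs before every prompt):

fn prompt {
	if (! ~ $_histpid $pid) {
		# first prompt in this shell: do per-shell setup
		_histpid = $pid
		history = /tmp/rc.history.$pid
		# clean the history file up when this shell exits
		fn sigexit { rm -f /tmp/rc.history.$pid }
	}
}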

A new little shell script to improve my desktop environment

By: cks

Recently on the Fediverse I posted a puzzle about a little shell script:

A silly little Unix shell thing that I've vaguely wanted for ages but only put together today. See if you can guess what it's for:

#!/bin/sh
trap 'exec $SHELL' 2
"$@"
exec $SHELL

(The use of this is pretty obscure and is due to my eccentric X environment.)

The actual version I now use wound up slightly more complicated, and I call it 'thenshell'. What it does (as suggested by the name) is run something and then, after the thing either exits or is Ctrl-C'd, run a shell. This is pointless in normal circumstances but becomes very relevant if you use this as the command for a terminal window to run instead of your shell, as in 'xterm -e thenshell <something>'.

Over time, I've accumulated a number of things I want to run in my eccentric desktop environment, such as my system for opening URLs from remote machines and my alert monitoring. But some of the time I want to stop and restart these (or I need to restart them), and in general I want to notice if they produce some output, so I've been running them in terminal windows. Up until now I've had to manually start a terminal and run these programs each time I restart my desktop environment, which is annoying and sometimes I forget to do it for something. My new 'thenshell' shell script handles this; it runs whatever and then if it's interrupted or exits, starts a shell so I can see things, restart the program, or whatever.
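For example, somewhere in my X session startup I can now do something like this (the program names here are just placeholders for my actual tools):

xterm -T 'remote URL opener' -e thenshell open-urls-listener &
xterm -T 'alert monitor' -e thenshell alert-monitor &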

Thenshell isn't quite a perfect duplicate of the manual version. One obvious limitation is that it doesn't put the command into the shell's command history, so I can't just cursor-up and hit return to restart it. But this is a small thing compared to having all of these things automatically started for me.

(Actually, I think I might be able to get this into a version of thenshell that knows exactly how my shell and my environment handle history, but it would be more than a bit of a hack. I may still try it, partly because it would be nifty.)

Current cups-browsed seems to be bad for central CUPS print servers

By: cks

Suppose, not hypothetically, that you have a central CUPS print server, and that people also have Linux desktops or laptops that they point at your print server to print to your printers. As of at least Ubuntu 24.04, if you're doing this you probably want to get people to turn off and disable cups-browsed on their machines. If you don't, your central print server may see a constant flood of connections from client machines running cups-browsed. Their machines are probably running it, as I believe that cups-browsed is installed and activated by default these days in most desktop Linux environments.

(We didn't really notice this in prior Ubuntu versions, although it's possible cups-browsed was always doing something like this and what's changed in the Ubuntu 24.04 version is that it's doing it more and faster.)
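Getting people to turn cups-browsed off on a systemd based distribution is typically a matter of something like:

sudo systemctl disable --now cups-browsed.service

('systemctl mask cups-browsed.service' is the stronger option if you want to be sure nothing starts it again.)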

I'm not entirely sure why this happens, and I'm also not sure what the CUPS requests typically involve, but one pattern that we see is that such clients will make a lot of requests to the CUPS server's /admin/ URL. I'm not sure what's in these requests, because CUPS immediately rejects them as unauthenticated. Another thing we've seen is frequent attempts to get printer attributes for printers that don't exist and that have name patterns that look like local printers. One of the reasons that the clients are hitting the /admin/ endpoint may be to somehow add these printers to our CUPS server, which is definitely not going to work.

(We've also seen signs that some Ubuntu 24.04 applications can repeatedly spam the CUPS server, probably with status requests for printers or print jobs. This may be something enabled or encouraged by cups-browsed.)

My impression is that modern Linux desktop software, things like cups-browsed included, is not really spending much time thinking about larger scale, managed Unix environments where there are a bunch of printers (or at least print queues), the 'print server' is not on your local machine and not run by you, anything random you pick up through broadcast on the local network is suspect, and so on. I broadly sympathize with this, because such environments are a small minority now, but it would be nice if client side CUPS software didn't cause problems in them.

(I suspect that cups-browsed and its friends are okay in an environment where either the 'print server' is local or it's operated by you and doesn't require authentication, there's only a few printers, everyone on the local network is friendly and if you see a printer it's definitely okay to use it, and so on. This describes a lot of Linux desktop environments, including my home desktop.)

Tape drives (and robots) versus hard disk drives, and volume

By: cks

In a conversation on the Fediverse, I had some feelings on tapes versus disks:

I wish tape drives and tape robots were cheaper. At work economics made us switch to backups on HDDs, and apart from other issues it quietly bugs me that every one of them bundles in a complex read-write mechanism on top of the (magnetic) storage that's what we really want.

(But those complex read/write mechanisms are remarkably inexpensive due to massive volume, while the corresponding tape drive+robot read/write mechanism is ... not.)

(I've written up our backup system, also.)

As you can read in many places, hard drives are mechanical marvels, made to extremely fine tolerances, high performance targets, and startlingly long lifetimes (all things considered). And you can get them for really quite low prices.

At a conceptual level, an LTO tape system is the storage medium (the LTO tape) separated from the read/write head and the motors (the tape drive). When you compare this to hard drives, you get to build and buy the 'tape drive' portion only once, instead of including a copy in each instance of the storage medium (the tapes). In theory this should make the whole collection a lot cheaper. In practice it only does so once you have a quite large number of tapes, because the cost of tape drives (and tape robots to move tapes in and out of the drives) is really quite high (and has been for a relatively long time).

There are probably technology challenges and complexities that come with the tape drive operating in an unsealed and less well controlled environment than hard disk mechanisms. But it's hard to avoid the assumption that a lot of the price difference has to do with the vast difference in volume. We make hard drives and thus all of their components in high volume, and have for decades, so there's been a lot of effort spent on making them inexpensively and in bulk. Tape drives are a specialty item with far lower production volumes and are sold to much less price sensitive buyers (as compared to consumer level hard drives, which have a lot of parts in common with 'enterprise' HDDs).

I understand all of this but it still bugs me a bit. It's perfectly understandable but inelegant.

Some notes on X terminals in their heyday

By: cks

I recently wrote about how the X Window System didn't immediately have (thin client) X terminals. X terminals are now a relatively obscure part of history and it may not be obvious to people today why they were a relatively significant deal at the time. So today I'm going to add some additional notes about X terminals in their heyday, from their introduction around 1989 through the mid 1990s.

One of the reactions to my entry that I've seen is to wonder if there was much point to X terminals, since it seems like they should have cost close to what much more functional normal computers did, with all you'd save being perhaps storage. Practically this wasn't the case in 1989 when they were introduced; NCD's initial models cost substantially less than, say, a Sparcstation 1 (also introduced in 1989), apparently less than half the cost of even a diskless Sparcstation 1. I believe that one reason for this is that memory was comparatively more expensive in those days and X terminals could get away with much, much less of it, since they didn't need to run a Unix kernel and enough of a Unix user space to boot up the X server (and I believe that some or all of the software was run directly from ROM instead of being loaded into precious RAM).

(The NCD16 apparently started at 1 MByte of RAM and the NCD19 at 2 MBytes, for example. You could apparently get a Sparcstation 1 with that little memory but you probably didn't want to use it for much.)

In one sense, early PCs were competition for X terminals in that they put computation on people's desks, but in another sense they weren't, because you couldn't use them as an inexpensive way to get Unix on people's desks. There eventually was at least one piece of software for this, DESQview/X, but it appeared later and you'd have needed to also buy the PC to run it on, as well as a 'high resolution' black and white display card and monitor. Of course, eventually the march of PCs made all of that cheap, which was part of the diminishing interest in X terminals in the later part of the 1990s and onward.

(I suspect that one reason that X terminals had lower hardware costs was that they probably had what today we would call a 'unified memory system', where the framebuffer's RAM was regular RAM instead of having to be separate because it came on a separate physical card.)

You might wonder how well X terminals worked over the 10 MBit Ethernet that was all you had at the time. With the right programs it could work pretty well, because the original approach of X was that you sent drawing commands to the X server, not rendered bitmaps. If you were using things that could send simple, compact rendering commands to your X terminal, such as xterm, 10M Ethernet could be perfectly okay. Anything that required shipping bitmapped graphics could be not as impressive, or even not something you'd want to touch, but for what you typically used monochrome X for between 1989 and 1995 or so, this was generally okay.

(Today many things on X want to ship bitmaps around, even for things like displaying text. But back in the day text was shipped as, well, text, and it was the X server that rendered the fonts.)

When looking at the servers you'd need for a given number of diskless Unix workstations or X terminals, the X terminals required less server side disk space but potentially more server side memory and CPU capacity, and were easier to administer. As noted by some commentators here, you might also save on commercial software licensing costs if you could license it only for your few servers instead of your lots of Unix workstations. I don't know how the system administration load actually compared to a similar number of PCs or Macs, but in my Unix circles we thought we scaled much better and could much more easily support many seats (and many potential users if you had, for example, many more students than lab desktops).

My perception is that what killed off X terminals as particularly attractive, even for Unix places, was that on the one hand the extra hardware capabilities PCs needed over X terminals kept getting cheaper and cheaper and on the other hand people started demanding more features and performance, like decent colour displays. That brought the X terminal 'advantage' more or less down to easier administration, and in the end that wasn't enough (although some X terminals and X 'thin client' setups clung on quite late, eg the SunRay, which we had some of in the 2000s).

Of course that's a Unix centric view. In a larger view, Unix was displaced on the desktop by PCs, which naturally limited the demand for both X terminals and dedicated Unix workstations (which were significantly marketed toward the basic end of Unix performance, and see also). By no later than the end of the 1990s, PCs were better basic Unix workstations than the other options and you could use them to run other software too if you wanted to, so they mostly ran over everything else even in the remaining holdouts.

(We ran what were effectively X terminals quite late, but the last few generations were basic PCs running LTSP not dedicated hardware. All our Sun Rays got retired well before the LTSP machines.)

(I think that the 'personal computer' model has or at least had some significant pragmatic advantages over the 'terminal' model, but that's something for another entry.)

Some bits on malloc(0) in C being allowed to return NULL

By: cks

One of the little traps in standard C and POSIX is that malloc(0) is allowed to return NULL instead of a pointer. This makes people unhappy for various reasons. Today I wound up reading 017. malloc(0) & realloc(…, 0) ≠ 0, which runs through a whole collection of Unix malloc() versions and finds that almost none of them return NULL on malloc(0), except for some Unix System V releases that ship with an optional 'fast' malloc library that does return NULL on zero-sized allocations. Then AT&T wrote the System V Interface Definition, which requires this 'fast malloc' behavior, except that actual System V releases (probably) didn't behave this way unless you explicitly used the fast malloc instead of the standard one.

(Apparently AIX may behave this way, eg, and it's old enough to have influenced POSIX and C. But I suspect that AIX got this behavior by making the System V fast malloc their only malloc, possibly when the SVID nominally required this behavior. AIX may have wound up weird but IBM didn't write it from scratch.)

When I read all of this today and considered what POSIX had done, one of my thoughts was about non-Unix C compilers (partly because I'd recently heard about the historical Whitesmiths C compiler source code being released). C was standardized at a time when C was being increasingly heavily used on various personal computers, including in environments that were somewhat hostile to it, and also other non-Unix environments. These C implementations used their own standard libraries, including malloc(), so maybe they had adopted the NULL return behavior.

As far as I can tell, Whitesmiths' malloc() doesn't have this behavior (also). However, I did find this in the MS-DOS version of Manx Aztec C, or at least it's in version 5.2a; the two earlier versions also available have a simpler malloc() that always rounds up, like the Whitesmiths malloc(). My memory is that you could get the Manx Aztec C compiler for the Amiga with library source, but I'm not particularly good at poking around the Amiga image available so I was unable to spot it if it's included in that version, and I haven't looked at the other Aztec C versions.

(I wouldn't be surprised if a number of 1980s non-Unix C compilers had this behavior, but I don't know where to find good information on this. If someone has written a comprehensive history page on malloc(0) that covers non-Unix C compilers, I haven't found it.)

On systems with small amounts of memory, one reason to specifically make your malloc() return NULL for 0-sized allocations is to reduce memory usage if someone makes a number of such allocations through some general code path that deals with variable-sized objects. Otherwise you'd have to consume some minimum amount of memory even for these useless allocations.

PS: Minix version 1 also rounds up the size of malloc(0).

(Yes, I got nerd-sniped by this and my own curiosity.)

Compute GPUs can have odd failures under Linux (still)

By: cks

Back in the early days of GPU computation, the hardware, drivers, and software were untrustworthy enough that our early GPU machines had to be specifically reserved by people, and that reservation gave them the ability to remotely power cycle the machine to recover it (this was in the days before our SLURM cluster). Things have gotten much better since then, with hardware and driver changes so that buggy programs can't hard-lock the GPU hardware. But every so often we run into odd failures where something funny is going on that we don't understand.

We have one particular SLURM GPU node that has been flaky for a while, with the specific issue being that every so often the NVIDIA GPU would throw up its hands and drop off the PCIe bus until we rebooted the system. This didn't happen every time it was used, or with any consistent pattern, although some people's jobs seemed to regularly trigger this behavior. Recently I dug up a simple to use GPU stress test program, and when this machine's GPU did its disappearing act this Saturday, I grabbed the machine, rebooted it, ran the stress test program, and promptly had the GPU disappear again. Success, I thought, and since it was Saturday, I stopped there, planning to repeat this process today (Monday) at work, while doing various monitoring things.

Since I'm writing a Wandering Thoughts entry about it, you can probably guess the punchline. Nothing has changed on this machine since Saturday, but all today the GPU stress test program could not make the GPU disappear. Not with the same basic usage I'd used Saturday, and not with a different usage that took the GPU to full power draw and a reported temperature of 80C (which was a higher temperature and power draw than the GPU had been at when it disappeared, based on our Prometheus metrics). If I'd been unable to reproduce the failure at all with the GPU stress program, that would have been one thing, but reproducing it once and then not again is just irritating.

(The machine is an assembled-from-parts one, with an RTX 4090 and a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is probably not even vaguely running the latest BIOS, seeing as the base hardware was built many years ago, although the GPU has been swapped around since then. Everything is in a pretty roomy 4U case, but if the failure was consistent we'd have assumed cooling issues.)

I don't really have any theories for what could be going on, but I suppose I should try to find a GPU stress test program that exercises every last corner of the GPU's capabilities at full power rather than using only one or two parts at a time. On CPUs, different loads light up different functional units, and I assume the same is true on GPUs, so perhaps the problem is in one specific functional unit or a combination of them.

(Although this doesn't explain why the GPU stress test program was able to cause the problem on Saturday but not today, unless a full reboot didn't completely clear out the GPU's state. Possibly we should physically power this machine off entirely for long enough to dissipate any lingering things.)

The X Window System didn't immediately have X terminals

By: cks

For a while, X terminals were a reasonably popular way to give people comparatively inexpensive X desktops. These X terminals relied on X's network transparency so that only the X server had to run on the X terminal itself, with all of your terminal windows and other programs running on a server somewhere and just displaying on the X terminal. For a long time, using a big server and a lab full of X terminals was significantly cheaper than setting up a lab full of actual workstations (until inexpensive and capable PCs showed up). Given that X started with network transparency and X terminals are so obvious, you might be surprised to find out that X didn't start with them.

In the early days, X ran on workstations. Some of them were diskless workstations, and on some of them (especially the diskless ones), you would log in to a server somewhere to do a lot of your more heavy duty work. But they were full workstations, with a full local Unix environment and you expected to run your window manager and other programs locally even if you did your real work on servers. Although probably some people who had underpowered workstations sitting around experimented with only running the X server locally, with everything else done remotely (except perhaps the window manager).

The first X terminals arrived only once X was reasonably well established as the successful cross-vendor Unix windowing system. NCD, who I suspect were among the first people to make an X terminal, was founded only in 1987 and of course didn't immediately ship a product (it may have shipped its first product in 1989). One indication of the delay in X terminals is that XDM was only released with X11R3, in October of 1988. You technically didn't need XDM to have an X terminal, but it made life much easier, so its late arrival is a sign that X terminals didn't arrive much before then.

(It's quite possible that the possibility for an 'X terminal' was on people's minds even in the early days of X. The Bell Labs Blit was a 'graphical terminal' that had papers written and published about it sometime in 1983 or 1984, and the Blit was definitely known in various universities and so on. Bell Labs even gave people a few of them, which is part of how I wound up using one for a while. Sadly I'm not sure what happened to it in the end, although by now it would probably be a historical artifact.)

(This entry was prompted by a comment on a recent entry of mine.)

PS: A number of people seem to have introduced X terminals in 1989; I didn't spot any in 1988 or earlier.

Sidebar: Using an X terminal without XDM

If you didn't have XDM available or didn't want to have to rely on it, you could give your X terminal the ability to open up a local terminal window that ran a telnet client. To start up an X environment, people would telnet into their local server, set $DISPLAY (or have it automatically set by the site's login scripts), and start at least their window manager by hand. This required your X terminal to not use any access control (at least when you were doing the telnet thing), but strong access control wasn't exactly an X terminal feature in the first place.
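Very roughly, the manual startup would look something like this once you'd telnetted in (the host and window manager here are just examples):

DISPLAY=myxterminal:0; export DISPLAY
xterm &
twm

(Running the window manager in the foreground gives you a convenient way to end the session: quit the window manager and you're back at a shell prompt in the X terminal's telnet window.)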

My pragmatic view on virtual screens versus window groups

By: cks

I recently read z3bra's 2014 Avoid workspaces (via) which starts out with the tag "Virtual desktops considered harmful". At one level I don't disagree with z3bra's conclusion that you probably want flexible groupings of windows, and I also (mostly) don't use single-purpose virtual screens. But I do it another way, which I think is easier than z3bra's (2014) approach.

I've written about how I use virtual screens in my desktop environment, although a bit of that is now out of date. The short summary is that I mostly have a main virtual screen and then 'overflow' virtual screens where I move to if I need to do something else without cleaning up the main virtual screen (as a system administrator, I can be quite interrupt-driven or working on more than one thing at once). This sounds a lot like window groups, and I'm sure I could do it with them in another window manager. The advantage to me of fvwm's virtual screens is that it's very easy to move windows from one to another.

If I start a window in one virtual screen, for what I think is going to be one purpose, and it turns out that I need it for another purpose too, on another virtual screen, I don't have to fiddle around with, say, adding or changing its tags. Instead I can simply grab it and move it to the new virtual screen (or, for terminal windows and some others, iconify them on one screen, switch screens, and deiconify them). This makes it fast, fluid, and convenient to shuffle things around, especially for windows where I can do this by iconifying and deiconifying them.

This is somewhat specific to (fvwm's idea of) virtual screens, where the screens have a spatial relationship to each other and you can grab windows and move them around to change their virtual screen (either directly or through FvwmPager). In particular, I don't have to switch between virtual screens to drag a window on to my current one; I can grab it in a couple of ways and yank it to where I am now.

In other words, it's the direct manipulation of window grouping that makes this work so nicely. Unfortunately I'm not sure how to get direct manipulation of currently not visible windows without something like virtual screens or virtual desktops. You could have a 'show all windows' feature, but that still requires bouncing between that all-windows view (to tag in new windows) and your regular view. Maybe that would work fluidly enough, especially with today's fast graphics.
