
New old systems in the age of hardware shortages

By: cks
29 March 2026 at 02:52

Recently I asked something on the Fediverse:

Lazyweb, if you were going to put together new DDR4-based desktop (because you already have the RAM and disks), what CPU would you use? Integrated graphics would probably be ideal because my needs are modest and that saves wrangling a GPU.

(Also I'm interested in your motherboard opinions, but the motherboard needs 2x M.2 and 2x to 4x SATA, which makes life harder. And maybe 4K@60Hz DisplayPort output, for integrated graphics)

If I were thinking of building a new desktop under normal circumstances, I would use all modern components (which is to say, current generation CPU, motherboard, RAM, and so on). But RAM is absurdly expensive these days, so building a new DDR5-based system with the same 64 GBytes of RAM that I currently have would cost over a thousand dollars Canadian just for the RAM. The only particularly feasible way to replace such an existing system today is to reuse as many components as possible, which means reusing my DDR4 RAM. In turn, this means that a lot of the rest of the system will be 'old'. By this I don't necessarily mean that it will have been manufactured a while ago (although it may have) but that its features and capabilities will be from a while back.

If you want an AMD CPU for your DDR4-based system, it will have to be an AM4 CPU and motherboard. I'm not sure how old the good AM4 CPUs are, but the one you want may be as old as a 2022 CPU (Ryzen 5 5600; other more recent options don't seem to be as well regarded). Intel's 14th generation CPUs ("Raptor Lake") from late 2023 still support DDR4 with compatible motherboards, but at this point you're still looking at things launched two years or more ago, which at one point was an eternity in CPUs.

(It's still somewhat of an eternity in CPUs, especially AMD, because AMD has introduced support for various useful instructions since then. For instance, Go's latest garbage collector would like you to have AVX-512 support. Intel desktop CPUs appear to have no AVX-512 at all, though.)
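To make the feature-diffusion point concrete, here's a minimal Go sketch of checking at runtime whether the CPU you're running on offers AVX-512, using the golang.org/x/sys/cpu package; this is only an illustration of how a program can ask, not anything the Go garbage collector itself does.

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// These fields are only meaningful on x86/amd64; elsewhere they
	// simply report false.
	fmt.Println("AVX2:     ", cpu.X86.HasAVX2)
	fmt.Println("AVX-512F: ", cpu.X86.HasAVX512F)
	fmt.Println("AVX-512VL:", cpu.X86.HasAVX512VL)
}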

Beyond CPU performance, older CPUs and older motherboards also often mean that you have older PCIe standards, fewer PCIe lanes, fewer high speed USB ports, and so on. You're not going to get the latest PCIe from an older CPU and chipset. Then you may step down in other components as well (like GPUs and NVMe drives), depending on how long you expect to keep them, or opt to keep your current components if those are good enough.

My impression is that such 'new old systems' have usually been a relatively unusual thing in the PC market, and that historically people have upgraded to the current generation. This led to a steady increase in baseline capabilities over time, as you could assume that desktop hardware would age out on a somewhat consistent basis. If people are buying new old systems and keeping old systems outright, that may significantly affect not just the progress of performance but also the diffusion of new features (such as AVX-512 support) into the CPU population.

The other aspect of this is, well, why bother upgrading to a new old system at all, instead of keeping your existing old old system? If your old system works, you may not get much from upgrading to a new old system. If your old system doesn't have enough performance or features, spending money on a new old system may not get you enough of an improvement to make your problems go away (although it may mitigate them a bit). New old systems are effectively a temporary bridge, and there's a limit to how much people are willing to spend on temporary bridges unless they have to. This also seems likely to slow down both the diffusion of nice new CPU features and the steady increase in general performance that you could once assume.

(At work, the current situation has definitely caused us to start retaining machines that we would have discarded in the past, and in fact were planning to discard until quite recently.)

PS: One potentially useful thing you can get out of a new old system like this is access to newer features like PCIe bifurcation or decent UEFI firmware that your current system doesn't support or have.

Mass production's effects on the cheapest way to get some things

By: cks
23 March 2026 at 02:00

We have a bunch of networks in a number of buildings, and as part of looking after them, we want to monitor whether or not they're actually working. For reasons beyond the scope of this entry we don't do things like collect information from our switches through SNMP, so our best approach is 'ping something on the network in the relevant location'. This requires something to ping. We want that thing to be stable and always on the network, which typically rules out machines and devices run by other people, and we want it to run from standard wall power for various reasons.

You can imagine a bunch of solutions to this for both wired and wireless networks. There are lots of cheap little computers these days that can run Linux, so you could build some yourself or expect to find someone selling them pre-made. However, these are unlikely to be a mass produced volume product, and it turns out that the flipside of things only being cheap when there is volume is that if there is volume, unexpected things can be the cheapest option.

The cheapest wall-powered device you can put on your wireless network to ping these days turns out to be a remote controlled power plug intended for home automation (as a bonus it will report uptime information for you if you set it up right, so you can tell if it lost power recently). They can fail after a few years, but they're inexpensive so we consider them consumables. And if you have another device that turns out to be flaky and has to be power cycled every so often, you can reuse a 'wifi reachability sensor' for its actual remote power control capabilities.

Similarly, as far as we've found, the cheapest wall powered device that plugs into a wired Ethernet and can be given an IP address so it can be pinged is a basic five port managed switch. You give it a 'management IP', plug one port into the network, and optionally plug up its other four ports so no one uses it for connectivity (because it's a cheap switch and you don't necessarily trust it). You might even be able to find one that supports SNMP so you can get some additional information from it (although our current ones don't, as far as I can tell).
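Whatever form the sentinel device takes, the monitoring side can be very simple. Here's a minimal sketch that just pings a list of sentinel addresses and reports the ones that don't answer; it shells out to the Linux ping(8) rather than doing raw ICMP itself, and the addresses are made-up examples, not our real ones.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	sentinels := []string{"10.1.1.10", "10.2.1.10", "10.3.1.10"}
	for _, ip := range sentinels {
		// '-c 1' sends a single probe, '-W 2' waits up to two seconds
		// for an answer (Linux iputils ping options).
		if err := exec.Command("ping", "-c", "1", "-W", "2", ip).Run(); err != nil {
			fmt.Printf("DOWN: %s (%v)\n", ip, err)
		} else {
			fmt.Printf("up:   %s\n", ip)
		}
	}
}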

In both cases it's clear that these are cheap because of mass production. People are making lots of wireless remote controlled power plugs and five port managed switches, so right now you can get the switches for about $30 Canadian each and the power plugs for $10 Canadian. In both cases what we get is overkill for what we want, and you could do a simpler version that has a smaller, cheaper bill of materials (BOM). But that smaller version wouldn't have the volume so it would cost much more for us to get it or an approximation.

(Even if we designed and built our own, we probably can't beat the price of the wireless remote controlled power plugs. We might be able to get a cheaper BOM for a single-Ethernet simple computer with case and wall plug power supply, but that ignores staff time to design, program, and assemble the thing.)

At one level this makes me sad. We're wasting the reasonably decent capabilities of both devices, and it feels like there should be a more frugal and minimal option. But it's hard to see what it would be and how it could be so cheap and readily available.

Power glitches can leave computer hardware in weird states

By: cks
10 March 2026 at 03:58

Late Friday night, the university's downtown campus experienced some sort of power glitch or power event. A few machines rebooted, a number of machines dropped out of contact for a bit (which probably indicates some switches restarting), and most significantly, some of our switches wound up in a weird, non-working state despite being powered on. This morning we cured the situation by fully power cycling all of them.

This isn't the first time we've seen brief power glitches leave things in unusual states. In the past we've seen it with servers, with BMCs (IPMIs), and with switches. It's usually not every machine, either; some machines won't notice and some will. When we were having semi-regular power glitches, there were definitely some models of server that were more prone to problems than others, but even among those models it usually wasn't universal.

It's fun to speculate about reasons why some particular servers of a susceptible model would survive and others not, but that's somewhat beside today's point, which is that power glitches can get your hardware into weird states (and your hardware isn't broken when and because this happens; it can happen to hardware that's in perfectly good order). We'd like to think that the computers around us are binary, either shut off entirely or working properly, but that clearly isn't the case. A power glitch like this peels back the comforting illusion to show us the unhappy analog truth underneath. Modern computers do a lot of work to protect themselves from such analog problems, but obviously it doesn't always work completely.

(My wild speculation is that the power glitch has shifted at least part of the overall system into a state that's normally impossible, and either this can't be recovered from or the rest of the system doesn't realize that it has to take steps to recover, for example forcing a full restart. See also flea power, where a powered off system still retains some power, and sometimes this matters.)

PS: We've also had a few cases where power cycling the hardware wasn't enough, which is almost certainly flea power at work.

PPS: My steadily increasing awareness of the fundamentally analog nature of a lot of what I take as comfortably digital has come in part from exposure on the Fediverse to people who deal with fun things like differential signaling for copper Ethernet, USB, and PCIe, and the spooky world of DDR training, where very early on your system goes to some effort to work out the signal characteristics of your particular motherboard, RAM, and so on so that it can run the RAM as fast as possible (cf).

(Never mind all of the CPU errata about unusual situations that aren't quite handled properly.)

With disk caches, you want to be able to attribute hits and misses

By: cks
26 February 2026 at 03:06

Suppose that you have a disk or filesystem cache in memory (which you do, since pretty much everything has one these days). Most disk caches will give you simple hit and miss information as part of their basic information, but if you're interested in the performance of your disk cache (or in improving it), you want more information. The problem with disk caches is that there are a lot of different sources and types of disk IO, and you can have hit rates that are drastically different between them. Your hit rate for reading data from files may be modest, while your hit rate on certain sorts of metadata may be extremely high. Knowing this is important because it means that your current good performance on things involving that metadata is critically dependent on that hit rate.

(Well, it may be, depending on what storage media you're using and what its access speeds are like. A lot of my exposure to this dates from the days of slow HDDs.)

This potentially vast difference is why you want more detailed information in both cache metrics and IO traces. The more narrowly you can attribute IO and the more you know about it, the more useful things you can potentially tell about the performance of your system and what matters to it. This is not merely 'data' versus 'metadata', or synchronous versus asynchronous; ideally you want to know the sort of metadata read being done, whether the file data being read is synchronous or not, and whether this is a prefetching read or a 'demand' read that really needs the data.
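As a toy illustration of what I mean by attribution, here's a sketch of a cache that tags every lookup with a category and keeps per-category hit and miss counts; the categories here are invented for the example and don't come from any real filesystem.

package main

import "fmt"

// Category describes where a cache lookup came from and what kind it is.
type Category struct {
	Kind   string // "data", "dir-metadata", "inode-metadata", ...
	Demand bool   // demand read versus prefetch
	Sync   bool   // synchronous versus asynchronous
}

type Stats struct{ Hits, Misses uint64 }

type Cache struct {
	blocks map[string][]byte
	stats  map[Category]*Stats
}

func NewCache() *Cache {
	return &Cache{blocks: map[string][]byte{}, stats: map[Category]*Stats{}}
}

// Get looks up a key and records the hit or miss under the caller's category.
func (c *Cache) Get(key string, cat Category) ([]byte, bool) {
	s := c.stats[cat]
	if s == nil {
		s = &Stats{}
		c.stats[cat] = s
	}
	b, ok := c.blocks[key]
	if ok {
		s.Hits++
	} else {
		s.Misses++
	}
	return b, ok
}

func main() {
	c := NewCache()
	c.Get("inode:42", Category{Kind: "inode-metadata", Demand: true, Sync: true})
	c.Get("data:42:0", Category{Kind: "data", Demand: false, Sync: false})
	for cat, s := range c.stats {
		fmt.Printf("%+v: %d hits, %d misses\n", cat, s.Hits, s.Misses)
	}
}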

A lot of the time, operating systems are not set up to pass this information down through all of the layers of IO from the high level filesystem code that knows what it's asking for to the disk driver code that's actually issuing the IOs. Part of the reason for this is that it's a lot of work to pass all of this data along, which means extra CPU and memory on what is an increasingly hot path (especially with modern NVMe based storage). These days you may get some of these fine grained details in metrics and perhaps IO traces (eg, for (Open)ZFS), but probably not all the way to types of metadata.

Of course, disk and filesystem caches (and IO) aren't the only place that this can come up. Any time you have a cache that stores different types of things that are potentially queried quite differently, you can have significant divergence in the types of activity and the activity rates (and cache hit rates) that you're experiencing. Depending on the cache, you may be able to get detailed information from it or you may need to put more detailed instrumentation into the code that queries your somewhat generic cache.

Modern general observability features in operating systems can sometimes let you gather some of this detailed attribution yourself (if the OS doesn't already provide it). However, it's not a certain thing and there are limits; for example, you may have trouble tracing and tracking IO once it gets dispatched asynchronously inside the OS (and most OSes turn IO into asynchronous operations before too long).

Two challenges of incremental backups

By: cks
18 February 2026 at 04:25

Roughly speaking, there are two sorts of backups that you can make, full backups and incremental backups. At the abstract level, full backups are pretty simple; you save everything that you find. Incremental backups are more complicated because they save only the things that changed since whatever they're relative to. People want incremental backups despite the extra complexity because they save a lot of space compared to backing up everything all the time.

There are two general challenges that make incremental backups more complicated than full backups. The first challenge is reliably finding everything that's changed, in the face of all of the stuff that can change in filesystems (or other sources of data). Full backups only need to be able to traverse all of the filesystem (or part of it), or in general the data source, and this is almost always a reliable thing because all sorts of things and people use it. Finding everything that has changed has historically been more challenging because it's not something that people do often outside of incremental backups.

(And when people do it they may not notice if they're missing some things, the way they absolutely will notice if a general traversal skips some files.)

The second challenge is handling things that have gone away. Once you have a way to find everything that's changed it's not too difficult to build a backup system that will faithfully reproduce everything that definitely was there as of the incremental. All you need to do is save every changed file and then unpack the sequence of full and incremental backups on top of each other, with the latest version of any particular file overwriting any previous one. But people often want their incremental restore to reflect the state of directories and so on as of the incremental, which means removing things that have been deleted (both files and perhaps entire directory trees). This means that your incrementals need some way to pass on information about things that were there in earlier backups but aren't there now, so that the restore process can either not restore them or remove them as it restores the sequence of full and incremental backups.
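As a sketch of one simple way to handle this second challenge, you can keep a manifest of the paths that were present at the previous backup, walk the tree now, and record anything that's in the old manifest but no longer on disk as a deletion. The manifest file name, its one-path-per-line format, and the root path below are all invented for the example.

package main

import (
	"bufio"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := "/srv/data" // hypothetical backup root

	// Load the manifest of paths that existed at the last backup.
	previous := map[string]bool{}
	if f, err := os.Open("last-manifest.txt"); err == nil {
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			previous[sc.Text()] = true
		}
		f.Close()
	}

	// Walk the tree as it is now; anything we see is still present.
	current := map[string]bool{}
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err == nil {
			current[path] = true
		}
		return nil
	})

	// Paths in the old manifest but not in the tree must be recorded as
	// deletions in this incremental, so a restore can remove them.
	for path := range previous {
		if !current[path] {
			fmt.Println("deleted:", path)
		}
	}
}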

While there are a variety of ways to tackle the first challenge, backup systems that want to run quickly are often constrained by what features operating systems offer (and also what features your backup system thinks it can trust, which isn't always the same thing). You can checksum everything all the time and keep a checksum database, but that's usually not going to be the fastest thing. The second challenge is much less constrained by what the operating system provides, which means that in practice it's much more on you (the backup system) to come up with a good solution. Your choice of solution may interact with how you solve the first challenge, and there are tradeoffs in various approaches you can pick (for example, do you represent deletions explicitly in the backup format or are they implicit in various ways).

There is no single right answer to these challenges. I'll go as far as to say that the answer depends partly on what sort of data and changes you expect to see in the backups and partly where you want to put the costs between creating backups and handling restores.

The (very) old "repaint mode" GUI approach

By: cks
14 February 2026 at 04:34

Today I ran across another article that talked in passing about "retained mode" versus "immediate mode" GUI toolkits (this one, via), and gave some code samples. As usual when I read about immediate mode GUIs and see source code, I had a pause of confusion because the code didn't feel right. That's because I keep confusing "immediate mode" as used here with a much older approach, which I will call repaint mode for lack of a better description.

A modern immediate mode system generally uses double buffering; one buffer is displayed while the entire window is re-drawn into the second buffer, and then the two buffers are flipped. I believe that modern retained mode systems also tend to use double buffering to avoid screen tearing and other issues (and I don't know if they can do partial updates or have to re-render the entire new buffer). In the old days, the idea of having two buffers for your program's window was a decided luxury. You might not even have one buffer and instead be drawing directly onto screen memory. I'll call this repaint mode, because you directly repainted some or all of your window any time you needed to change anything in it.

You could do an immediate mode GUI without double buffering, in this repaint mode, but it would typically be slow and look bad. So instead people devoted a significant amount of effort to not repainting everything but instead identifying what they were changing and repainting only it, along with any pixels from other elements of your window that had been 'damaged' from prior activity. If you did do a broader repaint, you (or the OS) typically set clipping regions so that you wouldn't actually touch pixels that didn't need to be changed.

(The OS's display system typically needed to support clipping regions in any situation where windows partially overlapped yours, because it couldn't let you write into their pixels.)

One reason that old display systems worked this way is that it required as little memory as possible, which was an important consideration back in the day (which was more or less the 1980s to the early to mid 1990s). People could optimize their repaint code to be efficient and do as little work as possible, but they couldn't materialize RAM that wasn't there. Today, RAM is relatively plentiful and we care a lot more about non-tearing, coherent updates.

The typical code style for a repaint mode system was that many UI elements would normally only issue drawing commands to update or repaint themselves when they were altered. If you had a slider or a text field and its value was updated as a result of input, the code would typically immediately call its repaint function, which could lead to a relatively tight coupling of input handling to the rendering code (a coupling that I believe Model-view-controller was designed to break). Your system had to be capable of a full window repaint, but if you wanted to look good, it wasn't a common operation. A corollary of this is that your code might spend a significant amount of effort working out what was the minimal amount of repainting you needed to do in order to correctly get between two states (and this code could be quite complicated).

(Some of the time this was hidden from you in widget and toolkit internals, although they didn't necessarily give you minimal repaints as you changed widget organization. Also, just because a drawing operation was issued right away didn't mean that it took effect right away. In X, server side drawing operations might be batched up to be sent to the X server only when your program was about to wait for more X events.)

Because I'm used to this repaint mode style, modern immediate mode code often looks weird to me. There are no event handler connections, no repaint triggers, and so on, but there is an explicit display step. Alternately, you aren't merely configuring widgets and then camping out in the toolkit's main loop, letting it handle events and repaints for you (the widgets approach is the classical style for X applications, including PyTk applications such as pyhosts).
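Here's a deliberately tiny sketch of the difference as I see it; both "toolkits" are just stubs that print what they would draw, so this is an illustration of the two coding styles rather than real GUI code.

package main

import "fmt"

// Repaint-mode style: the widget repaints itself the moment it changes,
// coupling input handling directly to drawing.
type Slider struct{ value int }

func (s *Slider) SetValue(v int) {
	s.value = v
	s.repaint()
}

func (s *Slider) repaint() { fmt.Println("draw slider at", s.value) }

// Immediate-mode style: nothing is retained; each frame re-describes and
// re-draws the whole UI from current program state.
func drawFrame(value int) {
	fmt.Println("begin frame")
	fmt.Println("draw slider at", value)
	fmt.Println("end frame")
}

func main() {
	s := &Slider{}
	s.SetValue(3) // repaint mode: draws immediately, once

	for frame := 0; frame < 2; frame++ { // immediate mode: explicit display step every frame
		drawFrame(s.value)
	}
}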

These days, I suspect that any modern toolkit that still looks like a repaint mode system is probably doing double buffering behind the scenes (unless you deliberately turn that off). Drawing directly to what's visible right now on screen is decidedly out of fashion because of issues like screen tearing, and it's not how modern display systems like Wayland want to operate. I don't know if toolkits implement this with a full repaint on the new buffer, or if they try to copy the old buffer to the new one and selectively repaint parts of it, but I suspect that the former works better with modern graphics hardware.

PS: My view is that even the widget toolkit version of repaint mode isn't a variation of retained mode because the philosophy was different. The widget toolkit might batch up operations and defer redoing layout and repainting things until you either returned to its event loop or asked it to update the display, but you expected a more or less direct coupling between your widget operations and repaints. But you can see it as a continuum that leads to retained mode when you decouple and abstract things enough.

(Now that I've written this down, perhaps I'll stop having that weird 'it's wrong somehow' reaction when I see immediate mode GUI code.)

The consoles of UEFI, serial and otherwise, and their discontents

By: cks
3 February 2026 at 03:07

UEFI is the modern firmware standard for x86 PCs and other systems; sometimes the actual implementation is called a UEFI BIOS, but the whole area is a bit confusing. I recently wrote about getting FreeBSD to use a serial console on a UEFI system and mentioned that some UEFI BIOSes could echo console output to a serial port, which caused Greg A. Woods to ask a good question in a comment:

So, how does one get a typical UEFI-supporting system to use a serial console right from the firmware?

The mechanical answer is that you go into your UEFI BIOS settings and see if it has any options for what is usually called 'console redirection'. If you have it, you can turn it on and at that point the UEFI console will include the serial device you picked, theoretically allowing both output and input from the serial device. This is very similar to the 'console redirection' option in 'legacy' pre-UEFI BIOSes, although it's implemented rather differently. An important note here is that UEFI BIOS console redirection only applies to things using the UEFI console. Your UEFI BIOS definitely uses the UEFI console, and your UEFI operating system boot loader hopefully does. Your operating system almost certainly doesn't.

A UEFI BIOS doesn't need to have such an option and typical desktop ones probably don't. The UEFI standard provides a standard set of ways to implement console redirection (and alternate console devices in general), but UEFI doesn't require it; it's perfectly standard compliant for a UEFI BIOS to only support the video console. Even if your UEFI BIOS provides console redirection, your actual experience of trying to use it may vary. Watching boot output is likely to be fine, but trying to interact with the BIOS from your serial port may be annoying.

How all of this works is that UEFI has a notion of an EFI console, which is (to quote the documentation) "used to handle input and output of text-based information intended for the system user during the operation of code in the boot services environment". The EFI console is an abstract thing, and it's also some globally defined variables that include ConIn and ConOut, the device paths of the console input and output device or devices. Device paths can include multiple sub-devices (in generic device path structures), and one of the examples specifically mentioned is:

[...] An example of this would be the ConsoleOut environment variable that consists of both a VGA console and serial output console. This variable would describe a console output stream that is sent to both VGA and serial concurrently and thus has a Device Path that contains two complete Device Paths. [...]

(Sometimes this is 'ConsoleIn' and 'ConsoleOut', eg, and sometimes 'ConIn' and 'ConOut'. Don't ask me why.)

In theory, a UEFI BIOS can hook a wide variety of things up to ConIn, ConOut, or both, as it decides (and implements), possibly including things like IPv4 connections. In practice it's up to the UEFI BIOS to decide what it will bother to support. Server UEFI BIOSes will typically support serial console redirection, which is to say connecting some serial port to ConIn and ConOut in addition to the VGA console. Desktop motherboard UEFI BIOSes probably won't. I don't know if there are very many server UEFI BIOSes that will use only the serial console and exclude the VGA console from ConIn and ConOut.

(Also in theory I believe a UEFI BIOS could wire up ConOut to include a serial port but not connect it to ConIn. In practice I don't know of any that do.)
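If you're curious about what your own firmware has put in ConOut, on a Linux machine with efivarfs mounted you can often just read the variable directly. Here's a small sketch that dumps its raw contents; it doesn't try to decode the device path structures, and it assumes the standard efivarfs location (the GUID is the EFI global variable GUID, and efivarfs prefixes the data with four bytes of variable attributes).

package main

import (
	"fmt"
	"os"
)

func main() {
	// 8be4df61-93ca-11d2-aa0d-00e098032b8c is the EFI global variable GUID.
	path := "/sys/firmware/efi/efivars/ConOut-8be4df61-93ca-11d2-aa0d-00e098032b8c"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("couldn't read ConOut:", err)
		return
	}
	if len(data) < 4 {
		fmt.Println("ConOut is unexpectedly short")
		return
	}
	// efivarfs prefixes the variable's data with four bytes of attributes.
	fmt.Printf("attributes: %x\n", data[:4])
	fmt.Printf("device path bytes: % x\n", data[4:])
}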

EFI also defines a protocol (a set of function calls) for console input and output. For input, what people (including the UEFI BIOS itself) get back is either or both of an EFI scan code or a Unicode character. The 'EFI scan code' is used to determine what special key you typed, for example F11 to go into some UEFI BIOS setup mode. The UEFI standard also has an appendix with examples of mapping various sorts of input to these EFI scan codes, which is very relevant for entering anything special over a serial console.

If you look at this appendix B, you'll note that it has entries for both 'ANSI X3.64 / DEC VT200-500 (8-bit mode)' and 'VT100+ (7-bit mode)'. Now you have two UEFI BIOS questions. First, does your UEFI BIOS even implement this, or does it either ignore the whole issue (leaving you with no way to enter special characters) or come up with its own answers? And second, does your BIOS restrict what it recognizes over the serial port to just whatever type it's set the serial port to, or will it recognize either sequence for something like F11? The latter question is very relevant because your terminal emulator environment may or may not generate what your UEFI BIOS wants for special keys like F11 (or it may even intercept some keys, like F11; ideally you can turn this off).

(Another question is what your UEFI BIOS may call the option that controls what serial port key mapping it's using. One machine I've tested on calls the setting "Putty KeyPad" and the correct value for the "ANSI X3.64" version is "XTERMR6", for example, which corresponds to what xterm, Gnome-Terminal and probably other modern terminal programs send.)

Another practical issue is that if you do anything fancy with a UEFI serial console, such as go into the BIOS configuration screens, your UEFI BIOS may generate output that assumes a very specific and unusual terminal resolution. For instance, the Supermicro server I've been using for my FreeBSD testing appears to require a 100x30 terminal in its BIOS configuration screens; if you have any other resolution you get various sorts of jumbled results. Many of our Dell servers take a different approach, where the moment you turn on serial console redirection they choke their BIOS configuration screens down to an ASCII 80x24 environment. OS boot environments may be more forgiving in various ways.

The good news is that your operating system's bootloader will probably limit itself to regular characters, and in practice what you care about a lot of the time is interacting with the bootloader (for example, for alternate boot and disaster recovery), not your UEFI BIOS.

As FreeBSD discusses in loader.efi(8), it's not necessarily straightforward for an operating system boot loader to decode what the UEFI ConIn and ConOut are connected to in order to pass the information to the operating system (which normally won't be using UEFI to talk to its console(s)). This means that the UEFI BIOS console(s) may not wind up being what the OS console(s) are, and you may have to configure them separately.

PS: As you may be able to tell from what I've written here, if you care significantly about UEFI BIOS access from the serial port, you should expect to do a bunch of experimentation with your specific hardware. Remember to re-check your results with new server generations and new UEFI BIOS firmware versions.

The two subtypes of one sort of package managers, the "program manager"

By: cks
28 January 2026 at 02:08

I've written before that one of the complications of talking about package managers and package management is that there are two common types of package managers, program managers (which manage installed programs on a system level) and module managers (which manage package dependencies for your project within a language ecosystem or maybe a broader ecosystem). Today I realized that there is a further important division within program managers. I will call this division application (package) managers and system (package) managers.

A system package manager is what almost all Linux distributions have (in the form of Debian's dpkg and its set of higher level tools, Fedora's RPM and its set of higher level tools, Arch's pacman, and so on). It manages everything installed by the distribution on the system, from the kernel all the way up to the programs that people run to get work done, but certainly including what we think of as system components like the core C library, basic POSIX utilities, and so on. In modern usage, all updates to the system are done by shipping new package versions, rather than by trying to ship 'patches' that consist of only a few changed files or programs.

(Some Linux distributions are moving some high level programs like Chrome to an application package manager.)

An application package manager doesn't manage the base operating system; instead it only installs, manages, and updates additional (and optional) software components. Sometimes these are actual applications, but at other times, especially historically, these were things like the extra-cost C compiler from your commercial Unix vendor. On Unix, files from these application packages were almost always installed outside of the core system areas like /usr/bin; instead they might go into /opt/<something> or /usr/local or various other things.

(Sometimes vendor software comes with its own internal application package manager, because the vendor wants to ship it in pieces and let you install only some of them while managing the result. And if you want to stretch things a bit, browsers have their own internal 'application package management' for addons.)

A system package manager can also be used for 'applications' and routinely is; many Linux systems provide undeniable applications like Firefox and LibreOffice through the system package manager (not all of them, though). This can include third party packages that put themselves in non-system places like /opt (on Unix) if they want to. I think this is most common on Linux systems, where there's no common dedicated application package manager that's widely used, so third parties wind up building their own packages for the system package manager (which is sure to be there).

For relatively obvious reasons, it's very hard to have multiple system package managers in use on the same system at once; they wind up fighting over who owns what and who changes what in the operating system. It's relatively straightforward to have multiple application package managers in use at once, provided that they keep to their own area so that they aren't overwriting each other.

For the most part, the *BSDs have taken a base system plus application manager approach, with things like their 'ports' system being their application manager. Where people use third party program managers, including pkgsrc on multiple Unixes, Homebrew on macOS, and so on, these are almost always application managers that don't try to also take over and manage the core ('base') operating system programs, libraries, and so on.

(As a result, the *BSDs ship system updates as 'patches', not as new packages, cf OpenBSD's syspatch. I've heard some rumblings that FreeBSD may be working to change this.)

I believe that Microsoft Windows has some degree of system package management, in that it has components that you might or might not install and that can be updated or restored independently, but I don't have much exposure to the Windows world. I will let macOS people speak up in the comments about how that system operates (as people using macOS experience it, not as how it's developed; as developed there are a bunch of different parts to macOS, as one can see from the various open source repositories that Apple publishes).

PS: The Linux flatpak movement is mostly or entirely an application manager, and so usually separate from the system package manager (Snap is the same thing but I ignore Canonical's not-invented-here pet projects as much as possible). You can also see containers as an extremely overweight application 'package' delivery model.

PPS: In my view, to count as package management a system needs to have multiple 'packages' and have some idea of what packages are installed. It's common but not absolutely required for the package manager to keep track of what files belong to what package. Generally this goes along with a way to install and remove packages. A system can be divided up into components without having package management, for example if there's no real tracking of what components you've installed and they're shipped as archives that all get unpacked in the same hierarchy with their files jumbled together.

Printing things in colour is not simple

By: cks
25 January 2026 at 03:47

Recently, Verisimilitude left a comment on my entry on X11's DirectColor visual type, where they mentioned that L Peter Deutsch, the author of Ghostscript, lamented using twenty-four bit colour for Ghostscript rather than a more flexible approach, which you may need in printing things with colour. As it happens, I know a bit about this area for two or three reasons, which come at it from different angles. A long time ago I was peripherally involved in desktop publishing software, which obviously cares about printing colour, and then later I became a hobby photographer and at one point had some exposure to people who care about printing photographs (both colour and black and white).

(The actual PDF format supports much more complex colour models than basic 24-bit sRGB or sGray colour, but apparently Ghostscript turns all of that into 24-bit colour internally. See eg, which suggests that modern Ghostscript has evolved into a more complex internal colour model.)

On the surface, printing colour things out in physical media may seem simple. You convert RGB colour to CMYK colour and then send the result off to the printer, where your inkjet or laser printer uses its CMYK ink or toner to put the result on the paper. Photographic printers provide the first and lesser complication in this model, because serious photographic printers have many more colours of ink than CMYK and they put these inks on various different types of fine art paper that have different effects on how the resulting colours come out.
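The "seems simple" version really is simple; the textbook conversion from RGB to process CMYK is only a few lines, which is part of why it's tempting to stop there. This sketch is that naive formula and nothing more; real printing pipelines use ICC profiles and per-paper, per-ink measurements instead.

package main

import (
	"fmt"
	"math"
)

// rgbToCMYK is the naive textbook conversion; r, g, b are in the 0..1
// range. It knows nothing about real inks, papers, or gamuts.
func rgbToCMYK(r, g, b float64) (c, m, y, k float64) {
	k = 1 - math.Max(r, math.Max(g, b))
	if k >= 1 { // pure black; avoid dividing by zero
		return 0, 0, 0, 1
	}
	c = (1 - r - k) / (1 - k)
	m = (1 - g - k) / (1 - k)
	y = (1 - b - k) / (1 - k)
	return c, m, y, k
}

func main() {
	c, m, y, k := rgbToCMYK(0.8, 0.3, 0.1) // an orange-ish colour
	fmt.Printf("C=%.2f M=%.2f Y=%.2f K=%.2f\n", c, m, y, k)
}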

Photographic printers have so many ink colours because this results in more accurate and faithful colours or, for black and white photographs (where a set of grey inks may be used), in more accurate and faithful greys. Photographers who care about this will carefully profile their printer using its inks on the particular fine art paper they're going to use in order to determine how RGB colours can be most faithfully reproduced. Then as part of the printing process, the photographic print software and the printer driver will cooperate to take the RGB photograph and map its colours to what combination of inks and ink intensity can best do the job.

(Photographers use different fine art papers because the papers have different characteristics; one of the high level ones is matte versus glossy papers. But the rabbit hole of detailed paper differences goes quite deep. So does the issue of how many inks a photo printer should have and what they should be. Naturally photographers who make prints have lots of opinions on this whole area.)

Where this stops being just a print driver issue is that people editing photographs often want to see roughly how they'll look when printed out without actually making a print (which is generally moderately expensive). This requires the print subsystem to be capable of feeding colour mapping results back to the editing layer, so you can see that certain things need to be different at the RGB colour level so that they come out well in the printed photograph. This is of course all an approximation, but at the very least photo editing software like darktable wants to be able to warn you when you're creating an 'out of gamut' colour that can't be accurately printed.

(I don't have any current numbers for the cost of making prints on photographic printers, but it's not trivial, especially if you're making large prints; you'll use a decent amount of ink and the fine art paper isn't cheap either. You don't want to make more test prints than you really have to.)

All of this is still in the realm of RGB colour, though (although colour space and display profiling and management complicate the picture). To go beyond this we need to venture into the twin worlds of printing advertising, including product boxes, and fine art printing. Printed product ads and especially boxes for products not infrequently use spot colours, where part of the box will be printed with a pure ink colour rather than approximated with process colours (CMYK or other). You don't really want to manage spot colours by saying that they're a specific RGB value and then everything with that RGB value will be printed with that spot colour; ideally you want to manage them as a specific spot colour layer for each spot colour you're using. An additional complication is that product boxes for mass products aren't necessarily printed with CMYK inks at all; like photographic prints, they may use a custom ink set that's designed to do a good job with the limited colour gamut that appears on the product box.

(This leads to a fun little game you can play at home.)

Desktop publishing software that wants to do a good job with this needs a bunch of features. I believe that generally you want to handle spot colours as separate editing layers even if they're represented in RGB. You probably also want features to limit the colour space and colours that the product designer can use, because the company that will print your boxes may have told you it has certain standard ink sets and to please keep your box colours to things they handle well as much as possible. Or you may want to use only pure spot colours from your set of them and not have a product designer accidentally set something to another colour.

Printing art books of fine art has similar issues. The artwork that you're trying to reproduce in the art book may use paint colours that don't reproduce well in standard CMYK colours, or in any colour set without special inks (one case is metallic colours, which are readily available for fine art paints and which some artists love). The artist whose work you're trying to print may have strong opinions about you doing a good job of it, while the more inks you use (and the more special inks) the more expensive the book will be. Some compromise is inevitable but you have to figure out where and what things will be the most mangled by various ink set options. This means your software should be able to map from something roughly like RGB scans or photographs into ink sets and let you know about where things are going to go badly.

For fine art books, my memory is that there are a variety of tricks that you can play to increase the number of inks you can use. For example, sometimes you can print different sections of the book with different inks. This requires careful grouping of the pages (and artwork) that will be printed on a single large sheet of paper with a single set of inks at the printing plant. It also means that your publishing software needs to track ink sets separately for groups of pages and understand how the printing process will group pages together, so it can warn you if you're putting an artwork onto a page that clashes with the ink set it needs.

(Not all art books run into these issues. I believe that a lot of art books for Japanese anime have relatively few problems here because the art they're reproducing was already made for an environment with a restricted colour gamut. No one animates with true metallic colours for all sorts of reasons.)

To come back to PDFs and colour representation, we can see why you might regret picking a single 24-bit RGB colour representation for everything in a program that handles things that will eventually be printed. I'm not sure there's any reasonable general format that will cover everything you need when doing colour printing, but you certainly might want to include explicit provisions for spot colours (which are very common in product boxes, ads, and so on), and apparently Ghostscript eventually gained support for them (as well as various other colour related things).

People cannot "just pay attention" to (boring, routine) things

By: cks
18 January 2026 at 02:04

Sometimes, people in technology believe that we can solve problems by getting people to pay attention. This comes up in security, anti-virus efforts, anti-phish efforts, monitoring and alert handling, warning messages emitted by programs, warning messages emitted by compilers and interpreters, and many other specific contexts. We are basically always wrong.

One of the core, foundational results from human factors research, research into human vision, the psychology of perceptions, and other related fields, is that human brains are a mess of heuristics and have far more limited capabilities than we think (and they lie to us all the time). Anyone who takes up photography as a hobby has probably experienced this (I certainly did); you can take plenty of photographs where you literally didn't notice some element in the picture at the time but only saw it after the fact while reviewing the photograph.

(In general photography is a great education on how much our visual system lies to us. For example, daytime shadows are blue, not black.)

One of the things we have a great deal of evidence about from both experiments and practical experience is that people (which is to say, human brains) are extremely bad at noticing changes in boring, routine things. If something we see all the time quietly disappears or is a bit different, the odds are extremely high that people will literally not notice. Our minds have long since registered whatever it is as 'routine' and tuned it out in favour of paying attention to more important things. You cannot get people to pay attention to these routine things that are almost always basically the same by asking them to (or yelling at them to do so, or blaming them when they don't), because our minds don't work that way.

We also have a tendency to see what we expect to see and not see what we don't expect to see, unless what we don't expect shoves itself into our awareness with unusual forcefulness. There is a famous invisible gorilla experiment that shows one aspect of this, but there are many others. This is why practical warning, alerts, and so on cannot be unobtrusive. Fire alarms are blaringly loud and obtrusive so that you cannot possibly miss them despite not expecting to hear them. A fire alarm that was "pay attention to this light if it starts blinking and makes a pleasant ringing tone" would get people killed.

There are hacks to get people to pay attention anyway, such as checklists, but these hacks are what we could call "not scalable" for many of the situations that people in technology care about. We cannot get people to go through a "should you trust this" checklist every time they receive an email message, especially when phish spammers deliberately craft their messages to create a sense of urgency and short-cut people's judgment. And even checklists are subject to seeing what you expect and not paying attention, especially if you do them over and over again on a routine basis.

(I've written a lot about this in various narrower areas before, eg 1, 2, 3, 4, 5. And in general, everything comes down to people, also.)

TCP and UDP and implicit "standard" elements of things

By: cks
16 January 2026 at 02:34

Recently, Verisimilitude left a comment on this entry of mine about binding TCP and UDP ports to a specific address. That got me thinking about features that have become standard elements of things despite not being officially specified and required.

TCP and UDP are more or less officially specified in various RFCs and are implicitly specified by what happens on the wire. As far as I know, nowhere in these standards (or wire behavior) does anything require that a multi-address host machine allow you to listen for incoming TCP or UDP traffic on a specific port on only a restricted subset of those addresses. People talking to your host have to use a specific IP, obviously, and established TCP connections have specific IP addresses associated with them that can't be changed, but that's it. Hosts could have an API where you simply listened to a specific TCP or UDP port and then they provided you with the local IP when you received inbound traffic; it would be up to your program to do any filtering to reject addresses that you didn't want used.

However, I don't think anyone has such an API, and anything that did would likely be considered very odd and 'non-standard'. It's become an implicit standard feature of TCP and UDP that you can opt to listen on only one or a few IP addresses of a multi-address host, including listening only on localhost, and connections to your (TCP) port on other addresses are rejected without the TCP three-way handshake completing. This has leaked through into the behavior that TCP clients expect in practice; if a port is not available on an IP address, clients expect to get a TCP layer 'connection refused', not a successful connection and then an immediate disconnection. If a host had the latter behavior, clients would probably not report it as 'connection refused' and some of them would consider it a sign of a problem on the host.

This particular (API) feature comes from a deliberately designed element of the BSD sockets API, the bind() system call. Allowing you to bind() local addresses to your sockets means that you can set the outgoing IP address for TCP connection attempts and UDP packets, which is important in some situations, but BSD could have provided a different API for that. BSD's bind() API does allow you maximum freedom with only a single system call; you can nail down either or both of the local IP and the local port. Binding the local port (but not necessarily the local IP) was important in BSD Unix because it was part of a security mechanism.

(This created an implicit API requirement for other OSes. If you wanted your OS to have an rlogin client, you had to be able to force the use of a low local port when making TCP connections, because the BSD rlogind.c simply rejected connections from ports that were 1024 and above even in situations where it would ask you for a password anyway.)

A number of people copied the BSD sockets API rather than design their own. Even when people designed their own API for handling networking (or IPv4 and later IPv6), my impression is that they copied the features and general ideas of the BSD sockets API rather than starting completely from scratch and deviating significantly from the BSD API. My usual example of a relatively divergent API is Go, which is significantly influenced by a quite different networking history inside Bell Labs and AT&T, but Go's net package still allows you to listen selectively on an IP address.

(Of course Go has to work with the underlying BSD sockets API on many of the systems it runs on; what it can offer is mostly constrained by that, and people will expect it to offer more or less all of the 'standard' BSD socket API features in some form.)
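As a small illustration in Go of this implicit standard feature, here's a sketch of listening on just localhost versus on every address the machine has; the port numbers are arbitrary.

package main

import (
	"fmt"
	"net"
)

func main() {
	// Only reachable via 127.0.0.1; connections to the same port on the
	// machine's other addresses get 'connection refused'.
	local, err := net.Listen("tcp", "127.0.0.1:9090")
	if err != nil {
		panic(err)
	}
	defer local.Close()

	// An empty host means 'listen on every address this machine has'.
	all, err := net.Listen("tcp", ":9091")
	if err != nil {
		panic(err)
	}
	defer all.Close()

	fmt.Println("listening on", local.Addr(), "and", all.Addr())
}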

PS: The BSD TCP API doesn't allow a listening program to make a decision about whether to allow or reject an incoming connection attempt, but this has turned out to be a pretty sensible design. As we found out with SYN flood attacks, TCP's design means that you want to force the initiator of a connection attempt to prove that they're present before the listening ('server') side spends significant resources on the potential connection.

A little bit of complex design in phone "Level" applications

By: cks
1 January 2026 at 02:24

Modern smartphones have a lot of sensors; for example, they often have sensors that will report the phone's orientation and when it changes (which is used for things like 'wake up the screen when you pick up the phone'). One of the uses for these sensors is for little convenience applications, such as a "Level" app that uses the available sensors to report when the phone is level so you can use it as a level, sometimes for trivial purposes.

For years, this application seemed pretty trivial and obvious to me, with the only somewhat complex bit being figuring out how the person is holding the phone to determine which sort of level they wanted and then adjusting the display to clearly reflect that (while keeping it readable, something at which Apple's current efforts partially fail). Then I had a realization:

Today's random thought: Your phone, like mine, probably has a "Level" app, which is most naturally used with the phone on its side for better accuracy, including resting on top of (or below) things. Your phone (also like mine) probably has buttons on the sides that make its sides not 100% straight and level end to end (because the buttons make bumps). So, how does the Level app deal with that? Does it have a range of 'close enough to level', or some specific compensation, or button detection?

(By 'on its side' I meant with the long side of the phone, as opposed to the top or the bottom, which are often flat and button-less. You can also use the phone as a level horizontally, on top of a flat surface, where you have the bump of the camera lenses to worry about.)

My current phone has a noticeable camera bump, and the app I use to get relatively raw sensor data suggests that there's a detectable, roughly 1.5 degree difference in tilt between resting all of the phone on a surface and just having the phone case edge around the camera bump on the surface (which should make the phone as 'level' as possible). However, once it's reached a horizontal '0 degrees' level, the "Level" app will treat both of them as equivalent (I can tilt the phone back and forth without disturbing the green level marking). This isn't just the Level app being deliberately imprecise; before I achieve a horizontal 0 degrees level, the "Level" app does respond to tilting the phone back and forth, typically changing its tilt reading by a degree.

(Experimentation suggests that the side buttons create less tilt, probably under a degree, and also that the Level app probably ignores that tilt when it's reached 0 degrees of tilt. It may ignore such small changes in tilt in general, and there's certainly some noise in the sensor readings.)
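One plausible way to get exactly this behaviour is a bit of hysteresis around 'level': snap the display to zero once the tilt is close enough, and only unsnap when it clearly exceeds a larger threshold. The following sketch is pure speculation about how any real Level app works, with invented thresholds.

package main

import "fmt"

// Level tracks whether the display is currently snapped to 'level'.
type Level struct {
	snapped bool
}

// Displayed returns the tilt to show for a raw sensor reading in degrees.
func (l *Level) Displayed(tilt float64) float64 {
	abs := tilt
	if abs < 0 {
		abs = -abs
	}
	switch {
	case !l.snapped && abs <= 1.0: // close enough: snap to level
		l.snapped = true
	case l.snapped && abs > 2.0: // clearly tilted again: unsnap
		l.snapped = false
	}
	if l.snapped {
		return 0
	}
	return tilt
}

func main() {
	l := &Level{}
	for _, t := range []float64{3.0, 1.4, 0.6, 1.5, 2.5} {
		fmt.Printf("raw %+.1f degrees -> shows %+.1f degrees\n", t, l.Displayed(t))
	}
}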

As a system administrator and someone who peers into technology for fun, I'm theoretically well aware that often there's more behind the scenes than is obvious. But still, it can surprise me when I notice an aspect of something I've been using for years without thinking about it. There's a lot of magic that goes into making things work the way we expect them to (for example, digital microwaves doing what you want with time; this Level app behavior also sort of falls under the category of 'good UI').

Expiry times are dangerous, on "The dangers of SSL certificates"

By: cks
29 December 2025 at 03:48

Recently I read Lorin Hochstein's The dangers of SSL certificates (via, among others), which talks about a Bazel build workflow outage caused by an expired TLS certificate. I had some direct reactions to this but after thinking about it I want to step back and say that in general, it's clear that expiry times are dangerous, often more or less regardless of where they appear. TLS certificate expiry times are an obvious and commonly encountered instance of expiry times in cryptography, but TLS certificates aren't the only case; in 2019, Mozilla had an incident where the signing key for Firefox addons expired (I believe the system used certificates, but not web PKI TLS certificates). Another thing that expires is DNS data (not just DNSSEC keys) and there have been incidents where expiring DNS data caused problems. Does a system have caches with expiry times? Someone has probably had an incident where things expired by surprise.

One of the problems with expiry times in general is that they're usually implemented as an abrupt cliff. On one side of the expiry time everything is fine and works perfectly, and one second later on the other side of the expiry time everything is broken. There's no slow degradation, no expiry equivalent of 'overload', and so on, which means that there's nothing indirect to notice and detect in advance. You must directly check and monitor the expiry time, and if you forget, things explode. We're fallible humans so we forget every so often.
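Since the only real defence is to check the expiry time directly, the checking itself is at least easy to automate. Here's a minimal Go sketch that connects to a TLS server and complains if its certificate expires within 30 days; the host name and the warning window are arbitrary choices.

package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

func main() {
	conn, err := tls.Dial("tcp", "example.com:443", nil)
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	defer conn.Close()

	// The leaf certificate is the first one in the peer's chain.
	cert := conn.ConnectionState().PeerCertificates[0]
	left := time.Until(cert.NotAfter)
	fmt.Printf("expires %s (%.1f days left)\n", cert.NotAfter.Format(time.RFC3339), left.Hours()/24)
	if left < 30*24*time.Hour {
		fmt.Println("WARNING: certificate expires soon")
	}
}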

This abrupt cliff of failure is a technology choice. In theory we could begin degrading service some time before the expiry time, or we could allow some amount of success for a (short) time after the expiry time, but instead we've chosen to make things be a boolean choice (which has made time synchronization across the Internet increasingly important; your local system can no longer be all that much out of step with Internet time if things are to work well). This is especially striking because expiry times are most often a heuristic, not a hard requirement. We add expiry times to limit hypothetical damage, such as silent key compromise, or to constrain how long out of date DNS data is given to people, or similar things, but we don't usually have particular knowledge that the key or data cannot and must not be used after a specific time (for example, because the data will definitely have changed at that point).

(Of course the mechanics of degrading the service around the expiry time are tricky, especially in a way that the service operator would notice or get reports about.)

Another problem, related to the abrupt cliff, is that generally expiry times are invisible or almost invisible. Most APIs and user interfaces don't really surface the expiry time until you fall over the cliff; generally you don't even get warnings logged that an expiry time is approaching (either in clients or in servers and services). We implicitly assume that expiry times will never get reached because something will handle the situation before then. Invisible expiry times are fine if they're never reached, but if they're hit as an abrupt cliff you have the worst of two worlds. Again, this isn't a simple problem with an obvious solution; for example, you might need things to know or advertise what is a dangerously close expiry time (if you report the expiry time all of the time, it becomes noise that is ignored; that's already effectively the situation with TLS certificates, where tools will give you all the notAfter dates you could ask for and no one bothers looking).

Some protocols do without expiry times entirely; SSH keypairs are one example (unless you use SSH certificates, but even then the key that signs certificates has no expiry). This has problems and risks that make it not suitable for all environments. If you're working in an environment that has and requires expiry times, another option is to simply set them as far in the future as possible. If you don't expect the thing to ever expire and have no process for replacing it, don't set its expiry time to ten years. But not everything can work this way; your DNS entries will change sooner or later, and often in much less than ten years.

Our problem with finding good 10G-T Ethernet switches (in 2025)

By: cks
21 December 2025 at 02:35

We have essentially standardized our 10G Ethernet networking on 10G-T, which runs over relatively conventional copper network cables. The pragmatic advantage of 10G-T is that it provides for easy interoperability between 1G and 10G-T equipment. You can make all of your new in-wall cabling 10G-T rated and then plug 1G equipment and switches into it because those offices or rooms or whatever don't need 10G (yet), you can ship servers with 10G-T ports and not worry about people who are still at 1G, and so on. It's quite flexible and enables slow, piece by piece upgrades to 10G (which can be an important thing). However, we've run into a problem with our 10G-T environment, and that is finding good 10G-T switches that don't have a gigantic number of ports.

Our preference in Ethernet switches is for ones that have around 24 ports. In our current network implementation, we try to make as many switches as possible be 'dumb' switches that carry only a single (internal) network, and we also put switches into each machine room rack. All of this means that 24 ports per switch is about right for most switches; we rarely want to connect up more than that many things on one network to a single switch in a single place. We can live with 16-port or 10-port switches, but that starts to get expensive because we have to buy (a lot) more switches.

Unfortunately, 24-port 10G-T switches appear to be an increasingly unpopular thing, as far as we can tell. At one point there were a reasonable number of inexpensive sources for good ones, but recently many of those seem to have gotten out of the business (and a few of the remaining ones have products with thermals that don't work for us). You can probably get 24-port 10G-T switches from the 'enterprise' switch vendors, but you'll pay 'enterprise' prices for them. There's a reasonable number of sources for 48-port 10G-T switches, which are too big for us, and a certain number of smaller 10G-T switches, but the middle seems to have gone mostly missing.

My suspicion is that this has to do with the shifts in the server market from plenty of relatively low (rack) density on-premise sales to an increasing amount of large cloud or high-density datacenter sales. A fully populated rack likely needs more than 24 ports of local connections, and you're buying the whole rack's worth at once, making incremental upgrades much less compelling. And 10G-T itself has drawbacks in high-density situations; the cables are physically bulkier than fiber, the ports (still) draw more power and run hotter, SFP+ ports offer more flexibility, and increasingly people want datacenter networking that runs faster than 10G, even for individual machines.

At the same time, a 24-port 10G-T switch is awkwardly large for a lot of other situations. Most people don't have a use for that many 10G ports at home or in smaller offices, and on top of that 10G-T ports use enough power and are hot enough that the switch will need decent fans, which will make it noisy (and so not something you want to have out in the open). At most you might put such a 24-port switch in a local wiring closet, assuming that the wiring closet has enough air flow that a relatively hot switch doesn't cook itself.

(It's possible that there are good 24 port 10G-T switches out there that we haven't found. We know of TP-Link's offerings, but for local reasons we prefer to avoid them. Similarly, I believe that 16 or 24 port SFP+ switches with 10G-T SFP+ modules are likely to be decidedly too expensive for us, once we buy all the SFP+ modules needed.)

Password fields should usually have an option to show the text

By: cks
2 December 2025 at 03:46

I recently had to abruptly replace my smartphone, and because of how it happened I couldn't directly transfer data from the old phone to the new one; instead, I had to have the new phone restore itself from a cloud backup of the old phone (made on an OS version several years older than the new phone's OS). In the process, a number of passwords and other secrets fell off and I had to re-enter them. As I mentioned on the Fediverse, this didn't always go well:

I did get our work L2TP VPN to work with my new phone. Apparently the problem was a typo in one bit of one password secret, which is hard to see because of course there's no 'show the whole thing' option and you have to enter things character by character on a virtual phone keyboard I find slow and error-prone.

(Phone natives are probably laughing at my typing.)

(Some of the issue was that these passwords were generally not good ones for software keyboards.)

There are reasonable security reasons not to show passwords when you're entering them. In the old days, the traditional reason was shoulder surfing; today, we have to worry about various things that might capture the screen with a password visible. But at the same time, entering passwords and other secrets blindly is error prone, and especially these days the diagnostics of a failed password may be obscure and you might only get so many tries before bad things start happening.

(The smartphone approach of temporarily showing the last character you entered is a help but not a complete cure, especially if you're going back and forth three ways between the form field, the on-screen keyboard, and your saved or looked up copy of the password or secret.)

Partly as a result of my recent experiences, I've definitely come around to viewing those 'reveal the plain text of the password' options that some applications have as a good thing. I think a lot of applications should at least consider whether and how to do this, and how to make password entry less error prone in general. This especially applies if your application (and overall environment) doesn't allow pasting into the field (either from a memorized passwords system or by the person involved simply copying and pasting it from elsewhere, such as support site instructions).

In some cases, you might want to not even treat a 'password' field as a password (with hidden text) by default. Often things like wireless network 'passwords' or L2TP pre-shared keys are broadly known and perhaps don't need to be carefully guarded during input the way genuine account passwords do. If possible I'd still offer an option to hide the input text in whatever way is usual on your platform, but you could reasonably start the field out as not hidden.

Unfortunately, as of December 2025 I think there's no general way to do this in HTML forms in pure CSS, without JavaScript (there may be some browser-specific CSS attributes). I believe support for this is on the CSS roadmap somewhere, but that probably means at least several years before it starts being common.

(The good news is that a pure CSS system will presumably degrade harmlessly if the CSS isn't supported; the password will just stay hidden, which is no worse than today's situation with a basic form.)

Discovering that my smartphone had infiltrated my life

By: cks
30 November 2025 at 02:45

While I have a smartphone, I think of myself as not particularly using it all that much. I got a smartphone quite late, it spends a lot of its life merely sitting there (not even necessarily in the same room as me, especially at home), and while I installed various apps (such as a SSH client) I rarely use them; they're mostly for weird emergencies. Then I suddenly couldn't use my current smartphone any more and all sorts of things came out of the woodwork, both things I sort of knew about but hadn't realized how much they'd affect me and things that I didn't even think about until I had a dead phone.

The really obvious and somewhat nerve wracking thing I expected from the start is that plenty of things want to send you text messages (both for SMS authentication codes and to tell you what steps to do to, for example, get your new replacement smartphone). With no operating smartphone I couldn't receive them. I found myself on tenterhooks all through the replacement process, hoping very much that my bank wouldn't decide it needed to authenticate my credit card usage through either its smartphone app or a text message (and I was lucky that I could authenticate some things through another device). Had I been without a smartphone for a more extended time, I could see a number of things where I'd probably have had to make in-person visits to a bank branch.

(Another obvious thing I knew about is that my bike computer wants to talk to a smartphone app (also). At a different time of year this would have been a real issue, but fortunately my bike club's recreational riding season is over so all it did was delay me uploading one commute ride.)

In less obvious things, I use my smartphone as my alarm clock. With my smartphone unavailable I discovered that I had no good alternative (although I had some not so good ones that are too quiet). I've also become used to using my phone for a quick check of the weather on the way out the door, and to check the arrival time of TTC buses, neither of which were available. Nor could I check email (or text messages) on the way to pick up my new phone because with no smartphone I had no data coverage. I was lucky enough to have another wifi-enabled device available that I took with me, which turned out to be critical for the pickup process.

(It also felt weird and wrong to walk out of the door without the weight of my phone in my pocket, as if I was forgetting my keys or something equally important. And there were times on the trip to get the replacement phone when I found myself realizing that if I'd had an operating smartphone, I'd have taken it out for a quick look at this or that or whatever.)

On the level of mere inconveniences, over time I've gotten pulled into using my smartphone's payment setup for things like grocery purchases. I could still do that in several other ways even without a smartphone, but none of them would have been as nice an experience. There would also have been paper cuts in things like checking the balance on my public transit fare card and topping it up.

Having gone through this experience with my smartphone, I'm now wondering what other bits of technology have quietly infiltrated both my personal life and things at work without me noticing their actual importance. I suspect that there are some more and I'll only realize it when they break.

PS: The smartphone I had to replace is the same one I got back in late 2016, so I got a bit over nine years of usage out of it. This is pretty good by smartphone standards (although for the past few years I was carefully ignoring that it had questionable support for security bugs; there were some updates, but also some known issues that weren't being fixed).

We can't fund our way out of the free and open source maintenance problem

By: cks
28 November 2025 at 04:18

It's in the tech news a lot these days that there are 'problems' with free and open source maintenance. I put 'problems' in quotes because the issue is mostly that FOSS maintenance isn't happening as fast or as much as the people who've come to depend on it would like, and the people who maintain FOSS are increasingly saying 'no' when corporations turn up (cf, also). But even with all the corporate presence, there are still a reasonable number of people who use non-corporate FOSS operating systems like Debian Linux, FreeBSD, and so on, and they too suffer when parts of the FOSS software stack struggle with maintenance. Every so often, people will suggest that the problem would be solved if only corporations would properly fund this maintenance work. However, I don't believe this can actually work even in a world where corporations are willing to properly fund such things (in this world, they're very clearly not).

One big problem with 'funding' as a solution to the FOSS maintenance problems is that for many FOSS maintainers, there isn't enough work available to support them. Many FOSS people write and support only a small number of things that don't necessarily need much active development and bug fixing (people have done studies on this), and so can't feasibly provide full time employment (especially at something equivalent to a competitive salary). Certainly, there's plenty of large projects that are underfunded and could support one or more people working on them full time, but there's also a long tail of smaller, less obvious dependencies that are also important and need various sorts of maintenance.

(In a way, the lack of funding pushes people toward small projects. With no funding, you have to do your projects in your spare time and the easiest way to make that work is to choose some small area or modest project that simply doesn't need that much time to develop or maintain.)

There are models where people who work on FOSS can be funded to do a bit of work on a lot of projects. But that's not the same as having funding to work full time on your own little project (or set of little projects). It's much more like regular work, in that you're being paid to do development work on other people's stuff (and I suspect that it will be much more time consuming than one might expect, since anyone doing this will have to come up to speed on a whole bunch of projects).

(I'm assuming the FOSS funding equivalent of a perfectly spherical frictionless object from physics examples, so we can wave away all other issues except that there is not enough work on individual projects. In the real world there are a huge host of additional problems with funding people for FOSS work that create significant extra friction (eg, potential liabilities).)

PS: Even though we can't solve the whole problem with funding, companies absolutely should be trying to use funding to solve as much of it as possible. That they manifestly aren't is one of many things that is probably going to bring everything down as pressure builds to do something.

(I'm sure I'm far from the first person to write about this issue with funding FOSS work. I just feel like writing it down myself, partly as elaboration on some parts of past Fediverse posts.)

Sidebar: It's full time work that matters

If someone is already working a regular full time job, their spare time is a limited resource and there are many claims on it. For various reasons, not everyone will take money to spend (potentially) most of their spare time maintaining their FOSS work. Many people will only be willing to spend a limited amount of their spare time on FOSS stuff, even if you could fund them at reasonable rates for all of their spare time. The only way to really get 'enough' time is to fund people to work full time, so their FOSS work replaces their regular full time job.

One of the reasons I suspect some people won't take money for their extra time is that they already have one job and they don't want to effectively get a second one. They do FOSS work deliberately because it's a break from 'job' style work.

(This points to another, bigger issue; there are plenty of people doing all sorts of hobbies, such as photography, who have no desire to 'go pro' in their hobby no matter how avid and good they are. I suspect there are people writing and maintaining important FOSS software who similarly have no desire to 'go pro' with their software maintenance.)

OIDC, Identity Providers, and avoiding some obvious security exposures

By: cks
15 November 2025 at 04:40

OIDC (and OAuth2) has some frustrating elements that make it harder for programs to support arbitrary identity providers (as discussed in my entry on the problems facing MFA-enabled IMAP in early 2025). However, my view is that these elements exist for good reason, and the ultimate reason is that an OIDC-like environment is by default an obvious security exposure (or several of them). I'm not sure there's any easy way around the entire set of problems that push towards these elements or something quite like them.

Let's imagine a platonically ideal OIDC-like identity provider for clients to use, something that's probably much like the original vision of OpenID. In this version, people (with accounts) can authenticate to the identity provider from all over the Internet, and it will provide them with a signed identity token. The first problem is that we've just asked identity providers to set up an Internet-exposed account and password guessing system. Anyone can show up, try it out, and best of all if it works they don't just get current access to something, they get an identity token.

(Within a trusted network, such as an organization's intranet, this exposed authentication endpoint is less of a concern.)

The second problem is the identity token itself: the IdP doesn't actually provide the identity token to the person, it provides the token to something that asked for it. One of the uses of that identity token is to present it to other things to demonstrate that you're acting on the person's behalf; for example, your IMAP client presents it to your IMAP server. If what the identity token is valid for is not restricted in some way, a malicious party could get you to 'sign up with your <X> ID' for their website, take the identity token it got from the IdP, and reuse it with your IMAP server.

To avoid issues, this identity token must have a limited scope (and everything that uses identity tokens needs to check that the token is for them). This implies that you can't just ask for an identity token in general, you have to ask for it for use with something specific. As a further safety measure the identity provider doesn't want to give such a scoped token to anything except the thing that's supposed to get it. You (an attacker) should not be able to tell the identity provider 'please create a token for webserver X, and give it to me, not webserver X' (this is part of the restrictions on OIDC redirect URIs).
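To make the 'check that the token is for them' part concrete, here's a minimal sketch of what a service such as an IMAP server might do when handed an OIDC ID token, using the PyJWT library. The public key, issuer, and client ID values are hypothetical stand-ins; a real deployment would fetch the IdP's signing keys from its published JWKS endpoint rather than hard-coding one.

    import jwt  # PyJWT

    def verify_id_token(token, idp_public_key, expected_client_id, expected_issuer):
        # jwt.decode() verifies the signature and rejects the token unless
        # its 'aud' claim matches our client ID and 'iss' matches the IdP,
        # so a token minted for some other relying party can't be replayed here.
        claims = jwt.decode(
            token,
            idp_public_key,
            algorithms=["RS256"],
            audience=expected_client_id,
            issuer=expected_issuer,
        )
        return claims["sub"]  # the authenticated identity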

In OIDC, what deals with much of these risks is client IDs, optionally client secrets, and redirect URIs. Client IDs are used to limit what an identity token can be used for and where it can be sent to (in combination with redirect URIs), and a client secret can be used by something getting a token to prove that it is the client ID it claims to be. If you don't have the right information, the OIDC IdP won't even talk to you. However, this means that all of this information has to be given to the client, or at least obtained by the client and stored by it.

(These days OIDC has a specification for Dynamic Client Registration and can support 'open' dynamic registration of clients, if desired (although it's apparently not widely implemented). But clients do have to register to get the risk-mitigating information for the main IdP endpoint, and I don't know how this is supposed to handle the IMAP situation if the IMAP server wants to verify that the OIDC token it receives was intended for it, since each dynamic client will have a different client ID.)

My GPS bike computer is less distracting than the non-computer option

By: cks
4 November 2025 at 02:44

I have a GPS bike computer primarily for following pre-planned routes, because it became a better supported option than our old paper cue sheets. One of the benefits of switching from paper cue sheets to a GPS unit was better supported route following, but after I made the switch, I found that it was also less distracting than using paper cue sheets. On the surface this might sound paradoxical, since people often say that computer screens are more distracting. It's true that a GPS bike computer has a lot that you can look at, but for route following, a GPS bike computer also has features that let me not pay attention to it.

When I used paper cue sheets, I always had to pay a certain amount of attention to following the route. I needed to keep track of where we were on the cue sheet's route, and either remember what the next turn was or look at the cue sheet frequently enough that I could be sure I wouldn't miss it. I also needed to devote a certain amount of effort to scanning street signs to recognize the street we'd be turning on to. All of this distracted me from looking around and enjoying the ride; I could never check out completely from route following.

When I follow a route on my GPS bike computer, it's much easier to not pay attention to route following most of the time. My GPS bike computer will beep at me and display a turn alert when we get close to a turn, and I always have it display the distance to the next turn so I can take a quick glance to reassure myself that we're nowhere near the turn. If there's any ambiguity about where to turn, I can look at the route's trace on a map and see that the turn is, for example, two streets ahead, and of course the GPS bike computer is always keeping track of where in the route I am.

Because the GPS bike computer can tell me when I need to pay attention to following the route, I'm free to not pay attention at other times. I can stop thinking about the route at all and look around at the scenery, talk with my fellow club riders, and so on.

(When I look around there are similar situations at work, with some of our systems. Our metrics, monitoring, and alerting system often has the net effect that I don't even look at how things are going because I assume that silence means all is okay. And if I want to do the equivalent of glancing at my GPS bike computer to check the distance to the next turn, I can look at our dashboards.)

I wish SSDs gave you CPU performance style metrics about their activity

By: cks
19 October 2025 at 02:54

Modern CPUs have an impressive collection of performance counters for detailed, low level information on things like cache misses, branch mispredictions, various sorts of stalls, and so on; on Linux you can use 'perf list' to see them all. Modern SSDs (NVMe, SATA, and SAS) are all internally quite complex, and their behavior under load depends on a lot of internal state. It would be nice to have CPU performance counter style metrics to expose some of those details. For a relevant example that's on my mind (cf), it certainly would be interesting to know how often flash writes had to stall while blocks were hastily erased, or the current erase rate.

Having written this, I checked some of our SSDs (the ones I'm most interested in at the moment) and I see that our SATA SSDs do expose some of this information as (vendor specific) SMART attributes, with things like 'block erase count' and 'NAND GB written' to TLC or SLC (as well as the host write volume and so on stuff you'd expect). NVMe does this in a different way that doesn't have the sort of easy flexibility that SMART attributes do, so a random one of ours that I checked doesn't seem to provide this sort of lower level information.
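For example, on Linux you can pull those vendor specific attributes out of smartctl's JSON output with something like the sketch below. The attribute names at the bottom are made-up examples of the sort of thing a vendor might expose; the actual names vary from drive to drive, so check 'smartctl -A' on your own hardware.

    import json
    import subprocess

    def smart_attributes(device):
        # 'smartctl -A -j' prints the ATA SMART attribute table as JSON.
        out = subprocess.run(["smartctl", "-A", "-j", device],
                             capture_output=True, text=True)
        data = json.loads(out.stdout)
        table = data.get("ata_smart_attributes", {}).get("table", [])
        return {row["name"]: row["raw"]["value"] for row in table}

    # Hypothetical attribute names, purely for illustration.
    attrs = smart_attributes("/dev/sda")
    for name in ("Block_Erase_Count", "NAND_Writes_GiB"):
        if name in attrs:
            print(name, attrs[name])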

It's understandable that SSD vendors don't necessarily want to expose this sort of information, but it's quite relevant if you're trying to understand unusual drive performance. For example, for your workload do you need to TRIM your drives more often, or do they have enough pre-erased space available when you need it? Since TRIM has an overhead, you may not want to blindly do it on a frequent basis (and its full effects aren't entirely predictable since they depend on how much the drive decides to actually erase in advance).

(Having looked at SMART 'block erase count' information on one of our servers, it's definitely doing something when the server is under heavy fsync() load, but I need to cross-compare the numbers from it to other systems in order to get a better sense of what's exceptional and what's not.)

I'm currently more focused on write related metrics, but there's probably important information that could be exposed for reads and for other operations. I'd also like it if SSDs provided counters for how many of various sorts of operations they saw, because while your operating system can in theory provide this, it often doesn't (or doesn't provide them at the granularity of, say, how many writes with 'Force Unit Access' or how many 'Flush' operations were done).

(In Linux, I think I'd have to extract this low level operation information in an ad-hoc way with eBPF tracing.)

A (filesystem) journal can be a serialization point for durable writes

By: cks
18 October 2025 at 02:57

Suppose that you have a filesystem that uses some form of a journal to provide durability (as many do these days) and you have a bunch of people (or processes) writing and updating things all over the filesystem that they want to be durable, so these processes are all fsync()'ing their work on a regular basis (or the equivalent system call or synchronous write operation). In a number of filesystem designs, this creates a serialization point on the filesystem's journal.

This is related to the traditional journal fsync() problem, but that one is a bit different. In the traditional problem you have a bunch of changes from a bunch of processes, some of which one process wants to fsync() and most of which it doesn't; this can be handled by only flushing necessary things. Here we have a bunch of processes making a bunch of relatively independent changes but approximately all of the processes want to fsync() their changes.

The simple way to get durability (and possibly integrity) for fsync() is to put everything that gets fsync()'d into the journal (either directly or indirectly) and then force the journal to be durably committed to disk. If the filesystem's journal is a linear log, as is usually the case, this means that multiple processes mostly can't be separately writing and flushing journal entries at the same time. Each durable commit of the journal is a bottleneck for anyone who shows up 'too late' to get their change included in the current commit; they have to wait for the current commit to be flushed to disk before they can start adding more entries to the journal (but then everyone can be bundled into the next commit).
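Here's a toy sketch of this 'everyone serializes on the current journal commit' behaviour, in Python with threads standing in for processes. This isn't how any particular filesystem implements it; it just illustrates the group commit pattern, where one caller flushes the journal on behalf of everything appended so far and latecomers wait for the next commit.

    import os
    import threading

    class ToyJournal:
        # Many writers append records and then want durability; a single
        # fsync() covers everyone whose record made it in before the commit.
        def __init__(self, path):
            self.fh = open(path, "ab")
            self.cond = threading.Condition()
            self.next_seq = 0        # sequence number of the last appended record
            self.committed_seq = 0   # highest sequence durably on disk
            self.committing = False

        def append_and_sync(self, record: bytes) -> None:
            with self.cond:
                self.fh.write(record)
                self.next_seq += 1
                my_seq = self.next_seq
                while self.committed_seq < my_seq:
                    if self.committing:
                        # A commit is already underway; it may have started
                        # before our record, so wait and re-check.
                        self.cond.wait()
                    else:
                        # Become the committer for everything appended so far.
                        self.committing = True
                        commit_upto = self.next_seq
                        self.cond.release()
                        self.fh.flush()
                        os.fsync(self.fh.fileno())   # the serialization point
                        self.cond.acquire()
                        self.committed_seq = commit_upto
                        self.committing = False
                        self.cond.notify_all()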

In some filesystems, processes can readily make durable writes outside of the journal (for example, overwriting something in place); such processes can avoid serializing on a linear journal. Even if they have to put something in the journal, you can perhaps minimize the direct linear journal contents by having them (durably) write things to various blocks independently, then put only compact pointers to those out of line blocks into the linear journal with its serializing, linear commits. The goal is to avoid having someone show up wanting to write megabytes 'to the journal' and forcing everyone to wait for their fsync(); instead people serialize only on writing a small bit of data at the end, and writing the actual data happens in parallel (assuming the disk allows that).

(I may have made this sound simple but the details are likely fiendishly complex.)

If you have a filesystem in this situation, and I believe one of them is ZFS, you may find you care a bunch about the latency of disks flushing writes to media. Of course you need the workload too, but there are certain sorts of workloads that are prone to this (for example, traditional Unix mail spools).

I believe that you can also see this sort of thing with databases, although they may be more heavily optimized for concurrent durable updates.

Sidebar: Disk handling of durable writes can also be a serialization point

Modern disks (such as NVMe SSDs) broadly have two mechanisms to force things to durable storage. You can issue specific writes of specific blocks with 'Force Unit Access' (FUA) set, which causes the disk to write those blocks (and not necessarily any others) to media, or you can issue a general 'Flush' command to the disk and it will write anything it currently has in its write cache to media.

If you issue FUA writes, you don't have to wait for anything else other than your blocks to be written to media. If you issue 'Flush', you get to wait for everyone's blocks to be written out. This means that for speed you want to issue FUA writes when you want things on media, but on the other hand you may have already issued non-FUA writes for some of the blocks before you found out that you wanted them on media (for example, if someone writes a lot of data, so much that you start writeback, and then they issue a fsync()). And in general, the block IO programming model inside your operating system may favour issuing a bunch of regular writes and then inserting a 'force everything before this point to media' fencing operation into the IO stream.

NVMe SSDs and the question of how fast they can flush writes to flash

By: cks
17 October 2025 at 03:17

Over on the Fediverse, I had a question I've been wondering about:

Disk drive people, sysadmins, etc: would you expect NVMe SSDs to be appreciably faster than SATA SSDs for a relatively low bandwidth fsync() workload (eg 40 Mbytes/sec + lots of fsyncs)?

My naive thinking is that AFAIK the slow bit is writing to the flash chips to make things actually durable when you ask, and it's basically the same underlying flash chips, so I'd expect NVMe to not be much faster than SATA SSDs on this narrow workload.

This is probably at least somewhat wrong. This 2025 SSD hierarchy article doesn't explicitly cover forced writes to flash (the fsync() case), but it does cover writing 50 GBytes of data in 30,000 files, which is probably enough to run any reasonable consumer NVMe SSD out of fast write buffer storage (either RAM or fast flash). The write speeds they get on this test from good NVMe drives are well over the maximum SATA data rates, so there's clearly a sustained write advantage to NVMe SSDs over SATA SSDs.

In replies on the Fediverse, several people pointed out that NVMe SSDs are likely using newer controllers than SATA SSDs and these newer controllers may well be better at handling writes. This isn't surprising when I thought about it, especially in light of NVMe perhaps overtaking SATA for SSDs, although apparently 'enterprise' SATA/SAS SSDs are still out there and probably seeing improvements (unlike consumer SATA SSDs where price is the name of the game).

Also, apparently the real bottleneck in writing to the actual flash is finding erased blocks or, if you're unlucky, having to wait for blocks to be erased. Actual writes to the flash chips may be able to go at something close to the PCIe 3.0 (or better) bandwidth, which would help explain the Tom's Hardware large write figures (cf).

(If this is the case, then explicitly telling SSDs about discarded blocks is especially important for any write workload that will be limited by flash write speeds, including fsync() heavy workloads.)

PS: The reason I'm interested in this is that we have a SATA SSD based system that seems to have periodic performance issues when there's enough write IO combined with fsync()s (possibly due to write buffering interactions), and I've been wondering how much moving it to be NVMe based might help. Since this machine uses ZFS, perhaps one thing we should consider is manually doing some ZFS 'TRIM' operations.

Why I have a GPS bike computer

By: cks
8 October 2025 at 03:42

(This is a story about technology. Sort of.)

Many bicyclists with a GPS bike computer probably have it primarily to record their bike rides and then upload them to places like Strava. I'm a bit unusual in that while I do record my rides and make some of them public, and I've come to value this, it's not my primary reason to have a GPS bike computer. Instead, my primary reason is following pre-made routes.

When I started with my recreational bike club, it was well before the era of GPS bike computers. How you followed (or led) our routes back then was through printed cue sheets, which had all of the turns and so on listed in order, often with additional notes. One of the duties of the leader of the ride was printing out a sufficient number of cue sheets in advance and distributing them to interested parties before the start of the ride. If you were seriously into using cue sheets, you'd use a cue sheet holder (nowadays you can only find these as 'map holders', which is basically the same job); otherwise you might clip the cue sheet to a handlebar brake or gear cable or fold it up and stick it in a back jersey pocket.

Printed cue sheets have a number of nice features, such as giving you a lot of information at a glance. One of them is that a well done cue sheet was and is a lot more than just a list of all of the turns and other things worthy of note; it's an organized, well formatted list of these. The cues would be broken up into sensibly chosen sections, with whitespace between them to make it easier to narrow in on the current one, and you'd lay out the page (or pages) so that the cue or section breaks happened at convenient spots to flip the cue sheet around in cue holders or clips. You'd emphasize important turns, cautions, or other things in various ways. And so on. Some cue sheets even had a map of the route printed on the back.

(You needed to periodically flip the cue sheet around and refold it because many routes had too many turns and other cues to fit in a small amount of printed space, especially if you wanted to use a decently large font size for easy readability.)

Starting in the early 2010s, more and more TBN people started using GPS bike computers or smartphones (cf). People began converting our cue sheet routes to computerized GPS routes, with TBN eventually getting official GPS routes. Over time, more and more members got smartphones and GPS units and there was more and more interest in GPS routes and less and less interest in cue sheets. In 2015 I saw the writing on the wall for cue sheets and the club more or less deprecated them, so in August 2016 I gave in and got a GPS unit (which drove me to finally get a smartphone, because my GPS unit assumed you had one). Cue sheet first routes lingered on for some years afterward, but they're all gone by now; everything is GPS route first.

You can still get cue sheets for club routes (the club's GPS routes typically have turn cues and you can export these into something you can print). But what we don't really have any more is the old school kind of well done, organized cue sheets, and it's basically been a decade since ride leaders would turn up with any printed cue sheets at all. These days it's on you to print your own cue sheet if you need it, and also on you to make a good cue sheet from the basic cue sheet (if you care enough to do so). There are some people who still use cue sheets, but they're a decreasing minority and they probably already had the cue sheet holders and so on (which are now increasingly hard to find). A new rider who wanted to use cue sheets would have an uphill struggle and they might never understand why long time members could be so fond of them.

Cue sheets are still a viable option for route following (and they haven't fundamentally changed). They're just not very well supported any more in TBN because they stopped being popular. If you insist on sticking with them, you still can, but it's not going to be a great experience. I didn't move to a GPS unit because I couldn't possibly use cue sheets any more (I still have my cue sheet holder); I moved because I could see the writing on the wall about which one would be the more convenient, more usable option.

Applications to the (computing) technologies of your choice are left as an exercise for the reader.

PS: As a whole I think GPS bike computers are mostly superior to cue sheets for route following, but that's a different discussion (and it depends on what sort of bicycling you're doing). There are points on both sides.

What (I think) you need to do basic UDP NAT traversal

By: cks
6 October 2025 at 03:52

Yesterday I wished for a way to do native "blind" WireGuard relaying, without needing to layer something on top of WireGuard. I wished for this both because it's the simplest approach for getting through NATs and the one you need in general under some circumstances. The classic and excellent work on all of the complexities of NAT traversal is Tailscale's How NAT traversal works, which also winds up covering the situation where you absolutely have to have a relay. But, as I understand things, in a fair number of situations you can sort of do without a relay and have direct UDP NAT traversal, although you need to do some extra work to get it and you need additional pieces.

Following RFC 4787, we can divide NAT into two categories, endpoint-independent mapping (EIM) and endpoint-dependent mapping (EDM). In EIM, the public IP and port of your outgoing NAT'd traffic depend only on your internal IP and port, not on the destination (IP or port); in EDM they (also) depend on the destination. NAT'ing firewalls normally NAT based on what could be called "flows". For TCP, flows are a real thing; you can specifically identify a single TCP connection and it's difficult to fake one. For UDP, a firewall generally has no idea of what is a valid flow, and the best it can do is accept traffic that comes from the destination IP and port, which in theory is replies from the other end.

This leads to the NAT traffic traversal trick that we can do for UDP specifically. If we have two machines that want to talk to each other on each other's UDP port 51820, the first thing they need is to learn the public IP and port being used by the other machine. This requires some sort of central coordination server as well as the ability to send traffic to somewhere on UDP port 51820 (or whatever port you care about). In the case of WireGuard, you might as well make this a server on a public IP running WireGuard and have an actual WireGuard connection to it, and the discount 'coordination server' can then be basically the WireGuard peer information from 'wg' (the 'endpoint' is the public IP and port you need).

Once the two machines know each other's public IP and port, they start sending UDP port 51820 (or whatever) packets to each other, to the public IP and port they learned through the coordination server. When each of them sends their first outgoing packet, this creates a 'flow' on their respective NAT firewall which will allow the other machine's traffic in. Depending on timing, the first few packets from the other machine may arrive before your firewall has set up its state to allow them in and will get dropped, so each side needs to keep sending until it works or until it's clear that at least one side has an EDM (or some other complication).

(For WireGuard, you'd need something that sets the peer's endpoint to your now-known host and port value and then tries to send it some traffic to trigger the outgoing packets.)
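Here's a minimal sketch of the punching step itself, in Python. It assumes the coordination server has already told each side the other's public IP and port, and that both sides run something like this at roughly the same time; a real implementation would also keep the flow alive with periodic traffic afterwards.

    import socket

    def punch(local_port, peer_ip, peer_port, tries=20):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", local_port))
        sock.settimeout(1.0)
        for _ in range(tries):
            # Each outgoing packet (re)creates the NAT 'flow' that lets the
            # peer's packets back in; early packets may simply get dropped.
            sock.sendto(b"punch", (peer_ip, peer_port))
            try:
                data, addr = sock.recvfrom(1500)
                return sock, addr    # traversal worked; keep using this socket
            except socket.timeout:
                continue
        return None, None            # probably an EDM NAT (or worse) in the way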

As covered in Tailscale's article, it's possible to make direct NAT traversal work in some additional circumstances with increasing degrees of effort. You may be lucky and have a local EDM firewall that can be asked to stop doing EDM for your UDP port (via a number of protocols for this), and otherwise it may be possible to feel your way around one EDM firewall.

If you can arrange a natural way to send traffic from your UDP port to your coordination server, the basic NAT setup can be done without needing the deep cooperation of the software using the port; all you need is a way to switch what remote IP and port it uses for a particular peer. Your coordination server may need special software to listen to traffic and decode which peer is which, or you may be able to exploit existing features of your software (for example, by making the coordination server a WireGuard peer). Otherwise, I think you need either some cooperation from the software involved or gory hacks.

Wishing for a way to do 'blind' (untrusted) WireGuard relaying

By: cks
5 October 2025 at 02:32

Over on the Fediverse, I sort of had a question:

I wonder if there's any way in standard WireGuard to have a zero-trust network relay, so that two WG peers that are isolated from each other (eg both behind NAT) can talk directly. The standard pure-WG approach has a public WG endpoint that everyone talks to and which acts as a router for the internal WG IPs of everyone, but this involves decrypting and re-encrypting the WG traffic.

By 'talk directly' I mean that each of the peers has the WireGuard keys of the other and the traffic between the two of them stays encrypted with those keys all the way through its travels. The traditional approach to the problem of two NAT'd machines that want to talk to each other with WireGuard is to have a WireGuard router that both of them talk to over WireGuard, but this means that the router sees the unencrypted traffic between them. This is less than ideal if you don't want to trust your router machine, for example because you want to make it a low-trust virtual machine rented from some cloud provider.

Since we love indirection in computer science, you can in theory solve this with another layer of traffic encapsulation (with a lot of caveats). The idea is that all of the 'public' endpoint IPs of WireGuard peers are actually on a private network, and you route the private network through your public router. Getting the private network packets to and from the router requires another level of encapsulation and unless you get very clever, all your traffic will go through the router even if two WireGuard peers could talk directly. Since WireGuard automatically keeps track of the current public IPs of peers, it would be ideal to do this with WireGuard, but I'm not sure that WG-in-WG can have the routing maintained the way we want.

This untrusted relay situation is of course one of the things that 'automatic mesh network on top of WireGuard' systems give you, but it would be nice to be able to do this with native features (and perhaps without an explicit control plane server that machines talk to, although that seems unlikely). As far as I know such systems implement this with their own brand of encapsulation, which I believe requires running their WireGuard stack.

(On Linux you might be able to do something clever with redirecting outgoing WireGuard packets to a 'tun' device connected to a user level program, which then wrapped them up, sent them off, received packets back, and injected the received packets into the system.)

Free and open source software is incompatible with (security) guarantees

By: cks
18 September 2025 at 02:53

If you've been following the tech news, one of the recent things that's happened is that there has been another incident where a bunch of popular and widely used packages on a popular package repository for a popular language were compromised, this time with a self-replicating worm. This is very inconvenient to some people, especially to companies in Europe, for some reason, and so some people have been making the usual noises. On the Fediverse, I had a hot take:

Hot take: free and open source is fundamentally incompatible with strong security *guarantees*, because FOSS is incompatible with strong guarantees about anything. It says so right there on the tin: "without warranty of any kind, either expressed or implied". We guarantee nothing by default, you get the code, the project, everything, as-is, where-is, how-is.

Of course companies find this inconvenient, especially with the EU CRA looming, but that's not FOSS's problem. That's a you problem.

To be clear here: this is not about the security and general quality of FOSS (which is often very good), or the responsiveness of FOSS maintainers. This is about guarantees, firm (and perhaps legally binding) assurances of certain things (which people want for software in general). FOSS can provide strong security in practice but it's inimical to FOSS's very nature to provide a strong guarantee of that or anything else. The thing that makes most of FOSS possible is that you can put out software without that guarantee and without legal liability.

An individual project can solemnly say it guarantees its security, and if it does so it's an open legal question whether that writing trumps the writing in the license. But in general a core and absolutely necessary aspect of free and open source is that warranty disclaimer, and that warranty disclaimer cuts across any strong guarantees about anything, including security and lack of bugs.

Are the compromised packages inconvenient to a lot of companies? They certainly are. But neither the companies nor commentators can say that the compromise violated some general strong security guarantee about packages, because there is and never will be such a guarantee with FOSS (see, for example, Thomas Depierre's I am not a supplier, which puts into words a sentiment a lot of FOSS people have).

(But of course the companies and sympathetic commentators are framing it that way because they are interested in the second vision of "supply chain security", where using FOSS code is supposed to magically absolve companies of the responsibility that people want someone to take.)

The obvious corollary of this is that widespread usage of FOSS packages and software, especially with un-audited upgrades of package versions (however that happens), is incompatible with having any sort of strong security or quality guarantee about the result. The result may have strong security and high quality, but if so, those come without guarantees; you've just been lucky. If you want guarantees, you will have to arrange them yourself and it's very unlikely you can achieve strong guarantees while using the typical ever-changing pile of FOSS code.

(For example, if dependencies auto-update before you can audit them and their changes, or faster than you can keep up, you have nothing in practice.)

We can't expect people to pick 'good' software

By: cks
7 September 2025 at 02:35

One of the things I've come to believe in (although I'm not consistent about it) is that we can't expect people to pick software that is 'good' in a technical sense. People certainly can and do pick software that is good in that it works nicely, has a user interface that works for them, and so on, which is to say all of the parts of 'good' that they can see and assess, but we can't expect people to go beyond that, to dig deeply into the technical aspects to see how good their choice of software is. For example, how efficiently an IMAP client implements various operations at the protocol level is more or less invisible to most people. Even if you know enough to know about potential technical quality aspects, realistically you have to rely on any documentation the software provides (if it provides anything). Very few people are going to set up an IMAP server test environment and point IMAP clients at it to see how they behave, or try to read the source code of open source clients.

(Plus, you have to know a lot to set up a realistic test environment. A lot of modern software varies its behavior in subtle ways depending on the surrounding environment, such as the server (or client) at the other end, what your system is like, and so on. To extend my example, the same IMAP client may behave differently when talking to two different IMAP server implementations.)

Broadly, the best we can do is get software to describe important technical aspects of itself, to document them even if the software doesn't, and to explain to people why various aspects matter and thus what they should look for if they want to pick good software. I think this approach has seen some success in, for example, messaging apps, where 'end to end encrypted' and similar things have become a technical quality measure that's typically relatively legible to people. Other technical quality measures in other software are much less legible to people in general, including in important software like web browsers.

(One useful way to make technical aspects legible is to create some sort of scorecard for them. Although I don't think it was built for this purpose, there's caniuse for browsers and their technical quality for various CSS and HTML5 features.)

To me, one corollary to this is that there's generally no point in yelling at people (in various ways) or otherwise punishing them because they picked software that isn't (technically) good. It's pretty hard for a non-specialist to know what is actually good or who to trust to tell them what's actually good, so it's not really someone's fault if they wind up with not-good software that does undesirable things. This doesn't mean that we should always accept the undesirable things, but it's probably best to either deal with them or reject them as gracefully as possible.

(This definitely doesn't mean that we should blindly follow Postel's Law, because a lot of harm has been done to various ecosystems by doing so. Sometimes you have to draw a line, even if it affects people who simply had bad luck in what software they picked. But ideally there's a difference between drawing a line and yelling at people about them running into the line.)

Could NVMe disks become required for adequate performance?

By: cks
5 September 2025 at 03:34

It's not news that full speed NVMe disks are extremely fast, as well as extremely good at random IO and doing a lot of IO at once. In fact they have performance characteristics that upset general assumptions about how you might want to design systems, at least for reading data from disk (for example, you want to generate a lot of simultaneous outstanding requests, either explicitly in your program or implicitly through the operating system). I'm not sure how much write bandwidth normal NVMe drives can really deliver for sustained write IO, but I believe that they can absorb very high write rates for a short period as you flush out a few hundred megabytes or more. This is a fairly big sea change from even SATA SSDs (and I believe SAS SSDs), never mind HDDs.
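As a rough illustration of 'a lot of simultaneous outstanding requests', here's a sketch that issues many reads at once with a thread pool; on an NVMe SSD, the deep queue this creates is what gets you anywhere near the drive's rated throughput, while a simple sequential read loop often won't. The chunk size and worker count are arbitrary example values, not tuned recommendations.

    import concurrent.futures
    import os

    def read_chunk(path, offset, size=1 << 20):
        # An independent open per read keeps the example simple; os.pread()
        # reads at an offset without moving any shared file position.
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.pread(fd, size, offset)
        finally:
            os.close(fd)

    def read_many(path, offsets, workers=32):
        # Keep many reads in flight at the same time instead of one by one.
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda off: read_chunk(path, off), offsets))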

About a decade ago, I speculated that everyone was going to be forced to migrate to SATA SSDs because developers would build programs that required SATA SSD performance. It's quite common for developers to build programs and systems that run well on their hardware (whether that's laptops, desktops, or servers, cloud or otherwise), and developers often use the latest and best. These days, that's going to have NVMe SSDs, and so it wouldn't be surprising if developers increasingly developed for full NVMe performance. Some of this may be inadvertent, in that the developer doesn't realize what the performance impact of their choices are on systems with less speedy storage. Some of this will likely be deliberate, as developers choose to optimize for NVMe performance or even develop systems that only work well with that level of performance.

This is a potential problem because there are a number of ways to not have that level of NVMe performance. Most obviously, you can simply not have NVMe drives; instead you may be using SATA SSDs (as we mostly are, including in our fileservers), or even HDDs (as we are in our Prometheus metrics server). Less obviously, you may have NVMe drives but be driving them in ways that don't give you the full NVMe bandwidth. For instance, you might have a bunch of NVMe drives behind a 'tri-mode' HBA, or have (some of) your NVMe drives hanging off the chipset with shared PCIe lanes to the CPU, or have to drive some of your NVMe drives with fewer than x4 PCIe because of limits on slots or lanes.

(Dedicated NVMe focused storage servers will be able to support lots of NVMe devices at full speed, but such storage servers are likely to be expensive. People will inevitably build systems with lower end setups, us included, and I believe that basic 1U servers are still mostly SATA/SAS based.)

One possible reason for optimism is that in today's operating systems, it can take careful system design and unusual programming patterns to really push NVMe disks to high performance levels. This may make it less likely that software accidentally winds up being written so it only performs well on NVMe disks; if it happens, it will be deliberate and the project will probably tell you about it. This is somewhat unlike the SSD/HDD situation a decade ago, where the difference in (random) IO operations per second was both massive and easily achieved.

(This entry was sparked in part by reading this article (via), which I'm not taking a position on.)

Connecting M.2 drives to various things (and not doing so)

By: cks
25 August 2025 at 03:06

As a result of discovering that (M.2) NVMe SSDs seem to have become the dominant form of SSDs, I started looking into what you could connect M.2 NVMe SSDs to. Especially I started looking to see if you could turn M.2 NVMe SSDs into SATA SSDs, so you could connect high capacity M.2 NVMe SSDs to, for example, your existing stock of ZFS fileservers (which use SATA SSDs). The short version is that as far as I can tell, there's nothing that does this, and once I started thinking about it I wasn't as surprised as I might be.

What you can readily find is passive adapters from M.2 NVMe or M.2 SATA to various other forms of either NVMe or SATA, depending. For example, there are M.2 NVMe to U.2 cases, and M.2 SATA to SATA cases; these are passive because they're just wiring things through, with no protocol conversion. There are also some non-passive products that go the other way; they're a M.2 'NVMe' 2280 card that has four SATA ports on it (and presumably a PCIe SATA controller). However, the only active M.2 NVMe product (one with protocol conversion) that I can find is M.2 NVMe to USB, generally in the form of external enclosures.

(NVMe drives are PCIe devices, so an 'M.2 NVMe' connector is actually providing some PCIe lanes to the M.2 card. Normally these lanes are connected to an NVMe controller, but I don't believe there's any intrinsic reason that you can't connect them to other PCIe things. So you can have 'PCIe SATA controller on an M.2 PCB' and various other things.)

When I thought about it, I realized the problem with my hypothetical 'obvious' M.2 NVMe to SATA board (and case): since it involves protocol conversion (between NVMe and SATA), someone would have to make the controller chipset for it. You can't make a M.2 NVMe to SATA adapter until someone goes to the expense of designing and fabricating (and probably programming) the underlying chipset, and presumably no one has yet found it commercially worthwhile to do so. Since (M.2) NVMe to USB adapters exist, protocol conversion is certainly possible, and since such adapters are surprisingly inexpensive, presumably there's enough demand to drive down the price of the underlying controller chipsets.

(These chipsets are, for example, the Realtek RTL9210B-CG or the ASMedia ASM3242.)

Designing a chipset is not merely expensive, it's very expensive, which to me explains why there aren't any high-priced options for connecting a NVMe drive up via SATA, the way there are high-priced options for some uncommon things (like connecting multiple NVMe drives to a single PCIe slot without PCIe bifurcation, which can presumably be done with the right existing PCIe bridge chipset).

(Since I checked, there also don't currently seem to be any high capacity M.2 SATA SSDs (which in theory could just be a controller chipset swap from the M.2 NVMe version). If they existed, you could use a passive M.2 SATA to 2.5" SATA adapter to get them into the form factor you want.)

It seems like NVMe SSDs have overtaken SATA SSDs for high capacities

By: cks
24 August 2025 at 02:20

For a long time, NVMe SSDs were the high end option; as the high end option they cost more than SATA SSDs of the same capacity, and SATA SSDs were generally available in higher capacity than NVMe SSDs (at least at prices you wanted to pay). This is why my home desktop wound up with a storage setup with a mirrored pair of 2 TB NVMe SSDs (which felt pretty indulgent) and a mirrored pair of 4 TB SATA SSDs (which felt normal-ish). Today, for reasons outside the boundary of this entry, I wound up casually looking to see how available large SSDs were. What I expected to find was that large-capacity SATA SSDs would now be reasonably available and not too highly priced, while NVMe SSDs would top out at perhaps 4TB and high prices.

This is not what I found, at least at some large online retailers. Instead, SATA SSDs seem to have almost completely stagnated at 4 TB, with capacities larger than that only available from a few specialty vendors at eye-watering prices. By contrast, 8 TB NVMe SSDs seem readily available at somewhat reasonable prices from mainstream drive vendors like WD (they aren't inexpensive but they're not unreasonable given the prices of 4 TB NVMe, which is roughly the price I remember 4 TB SATA SSDs being at). This makes me personally sad, because my current home desktop has more SATA ports than M.2 slots or even PCIe x1 slots.

(You can get PCIe x1 cards that mount a single NVMe SSD, and I think I'd get somewhat better than SATA speeds out of them. I have one to try out in my office desktop, but I haven't gotten around to it yet.)

At one level this makes sense. Modern motherboards have a lot more M.2 slots than they used to, and I speculated several years ago that M.2 NVMe drives would eventually be cheaper to make than 2.5" SSDs. So in theory I'm not surprised that probable consumer (lack of) demand has basically extinguished SATA SSDs above 4 TB. In practice, I am surprised and it feels disconcerting for NVMe SSDs to now look like the 'mainstream' choice.

(This is also potentially inconvenient for work, where we have a bunch of ZFS fileservers that currently use 4 TB 2.5" SATA SSDs (an update from their original 2 TB SATA SSDs). If there are no reasonably priced SATA SSDs above 4 TB, then our options for future storage expansion become more limited. In the long run we may have to move to U.2 to get hotswappable 4+ TB SSDs. On the other hand, apparently there are inexpensive M.2 to U.2 adapters, and we've done worse sins with our fileservers.)

Responsibility for university physical infrastructure can be complicated

By: cks
7 August 2025 at 02:56

One of the perfectly sensible reactions to my entry on realizing that we needed two sorts of temperature alerts is to suggest that we directly monitor the air conditioners in our machine rooms, so that we don't have to try to assess how healthy they are from second hand, indirect sources like the temperature of the rooms. There are some practical problems, but a broader problem is that by and large they're not 'our' air conditioners. By this I mean that while the air conditioners and the entire building belong to the university, neither 'belongs' to my department and we can't really do stuff to them.

There are probably many companies who have some split between who's responsible for maintaining a building (and infrastructure things inside it) and who is currently occupying (parts of) the building, but my sense is that universities (or at least mine) take this to a more extreme level than usual. There's an entire (administrative) department that looks after buildings and other physical infrastructure, and they 'own' much of the insides of buildings, including the air conditioning units in our machine rooms (including the really old one). Because those air conditioners belong to the building and the people responsible for it, we can't go ahead and connect monitoring up to the AC units or tap into any native monitoring they might have.

(Since these aren't our AC units, we haven't even asked. Most of the AC units are old enough that they probably don't have any digital monitoring, and for the new units the manufacturer probably considers that an extra cost option. Nor can we particularly monitor their power consumption; these are industrial units, with dedicated high-power circuits that we're not even going to get near. Only university electricians are supposed to touch that sort of stuff.)

I believe that some parts of the university have a multi-level division of responsibility for things. One organization may 'own' the building, another 'owns' the network wiring in the walls and is responsible for fixing it if something goes wrong, and a third 'owns' the space (ie, gets to use it) and has responsibility for everything inside the rooms. Certainly there's a lot of wiring within buildings that is owned by specific departments or organizations; they paid to put it in (although possibly through shared conduits), and now they're the people who control what it can be used for.

(We have run a certain amount of our own fiber between building floors, for example. I believe that things can get complicated when it comes to renovating space for something, but this is fortunately not one of the areas we have to deal with; other people in the department look after that level of stuff.)

I've been inside the university for long enough that all of this feels completely normal to me, and it even feels like it makes sense. Within a university, who is using space is something that changes over time, not just within an academic department but also between departments. New buildings are built, old buildings are renovated, and people move around, so separating maintaining the buildings from who occupies them right now feels natural.

(In general, space is a constant struggle at universities.)

Jay Blahnik Accused of Creating a Toxic Workplace Culture at Apple

By: Nick Heer
22 August 2025 at 03:51

Jane Mundy, writing at the imaginatively named Lawyers and Settlements in December:

A former Apple executive has filed a California labor complaint against Apple and Jay Blahnik, the company’s vice president of fitness technologies. Mandana Mofidi accuses Apple of retaliation after she reported sexual harassment and raised concerns about receiving less pay than her male colleagues.

The Superior Court of California for the County of Los Angeles wants nearly seventeen of the finest United States dollars for a copy of the complaint alone.

Tripp Mickle, New York Times:

But along the way, [Jay] Blahnik created a toxic work environment, said nine current and former employees who worked with or for Mr. Blahnik and spoke about personnel issues on the condition of anonymity. They said Mr. Blahnik, 57, who leads a roughly 100-person division as vice president for fitness technologies, could be verbally abusive, manipulative and inappropriate. His behavior contributed to decisions by more than 10 workers to seek extended mental health or medical leaves of absence since 2022, about 10 percent of the team, these people said.

The behaviours described in this article are deeply unprofessional, at best. It is difficult to square the testimony of a sizeable portion of Blahnik’s team with an internal investigation finding no wrongdoing, but that is what Apple’s spokesperson expects us to believe.

Sending drawing commands to your display server versus sending images

By: cks
25 July 2025 at 03:18

One of the differences between X and Wayland is that in the classical version of X you send drawing commands to the server while in Wayland you send images; this can be called server side rendering versus client side rendering. Client side rendering doesn't preclude a 'network transparent' display protocol, but it does mean that you're shipping around images instead of drawing commands. Is this less efficient? In thinking about it recently, I realized that the answer is that it depends on a number of things.

Let's start out by assuming that the display server and the display clients are equally powerful and capable as far as rendering the graphics goes, so the only question is where the rendering happens (and what makes it better to do it in one place instead of another). The factors that I can think of are:

  • How many different active client (machines) there are; if there are enough, the active client machines have more aggregate rendering capacity than the server does. But probably you don't usually have all that many different clients all doing rendering at once (that would be a very busy display).

  • The number of drawing commands as compared to the size of the rendered result. In an extreme case in favor of client side rendering, a client executes a whole bunch of drawing commands in order to render a relatively small image (or window, or etc). In an extreme case the other way, a client can send only a few drawing commands to render a large image area.
  • The amount of input data the drawing commands need compared to the output size of the rendered result. An extreme case in favour of client side rendering is if the client is compositing together a (large) stack of things to produce a single rendered result.
  • How efficiently you can encode (and decode) the rendered result or the drawing commands (and their inputs). There's a tradeoff of space used to encoding and decoding time, where you may not be able to afford aggressive encoding because it gets in the way of fast updates.

    What these add up to is the aggregate size of the drawing commands and all of the inputs that they need relative to the rendered result, possibly cleverly encoded on both sides.

  • How much changes from frame to frame and how easily you can encode that in some compact form. Encoding changes in images is a well studied thing (we call it 'video'), but a drawing command model might be able to send only a few commands to change a little bit of what it sent previously for an even bigger saving.

    (This is affected by how a server side rendering server holds the information from clients. Does it execute their draw commands then only retain the final result, as X does, or does it hold their draw commands and re-execute them whenever it needs to re-render things? Let's assume it holds the rendered result, so you can draw over it with new drawing commands rather than having to send a new full set of 'draw this from now onward' commands.)

    A pragmatic advantage of client side rendering is that encoding image to image changes can be implemented generically after any style of rendering; all you need is to retain a copy of the previous frame (or perhaps more frames than that, depending). In a server rendering model, the client needs specific support for determining a set of drawing operations to 'patch' the previous result, and this doesn't necessarily cooperate with an immediate mode approach where the client regenerates the entire set of draw commands from scratch any time it needs to re-render a frame.

I was going to say that the network speed is important too, but while it matters, what I think it does is magnify or shrink the effect of the relative size of drawing commands compared to the final result. The faster and lower latency your network is, the less it matters if you ship more data in aggregate. On a slow network, it's much more important.
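
To put some very rough numbers on the relative sizes involved, here's a little back of the envelope sketch (in Go); the window size and especially the bytes-per-command figure are illustrative assumptions, not measurements of X or Wayland.

    package main

    import "fmt"

    func main() {
        const (
            width, height = 800, 600 // an assumed, modest window
            bytesPerPixel = 4        // RGBA
            bytesPerCmd   = 32       // assumed size of one encoded drawing command
        )
        imageBytes := width * height * bytesPerPixel
        fmt.Printf("full rendered image: %d bytes\n", imageBytes)
        // A single 'fill the window with a colour' style command is tiny by
        // comparison, but a client that needs tens of thousands of small
        // commands to produce the same pixels loses that advantage.
        fmt.Printf("one drawing command: ~%d bytes\n", bytesPerCmd)
        fmt.Printf("break-even at roughly %d commands\n", imageBytes/bytesPerCmd)
    }

For a 4K display the full image is around 33 MBytes, which is part of why compactly encoding frame to frame changes matters so much.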

There are probably other things I'm missing, but even with just these I've wound up feeling that the tradeoffs are not as simple and obvious as I believed before I started thinking about it.

(This was sparked by an offhand Fediverse remark and joke.)

Projects can't be divorced from the people involved in them

By: cks
22 July 2025 at 02:51

Among computer geeks, myself included, there's a long running optimistic belief that projects can be considered in isolation and 'evaluated on their own merits', divorced from the specific people or organizations that are involved with them and the culture that they have created. At best, this view imagines that we can treat everyone involved in the development of something as a reliable Vulcan, driven entirely by cold logic with no human sentiment involved. This is demonstrably false (ask anyone about the sharp edge of Linus Torvalds' emails), but convenient, at least for people with privilege.

(A related thing is considering projects in isolation from the organizations that create and run them, for example ignoring that something is from Google, of 'killed by Google' fame.)

Over time, I have come to understand and know that this is false, much like other things I used to accept. The people involved with a project bring with them attitudes and social views, and they create a culture through their actions, their expressed views, and even their presence. Their mere presence matters because it affects other people, and how other people will or won't interact with the project.

(To put it one way, the odds that I will want to be involved in a project run by someone who openly expresses their view that bicyclists are the scum of the earth and should be violently run off the road are rather low, regardless of how they behave within the confines of the project. I'm not a Vulcan myself and so I am not going to be able to divorce my interactions with this person from my knowledge that they would like to see me and my bike club friends injured or dead.)

You can't divorce a project from its culture or its people (partly because the people create and sustain that culture); the culture and the specific people are entwined into how 'the project' (which is to say, the crowd of people involved in it) behaves, and who it attracts and repels. And once established, the culture of a project, like the culture of anything, is very hard to change, partly because it acts as a filter for who becomes involved in the project. The people who create a project gather like-minded people who see nothing wrong with the culture and often act to perpetuate it, unless the project becomes so big and so important that other people force their way in (usually because a corporation is paying them to put up with the initial culture).

(There is culture everywhere. C++ has a culture (or several), for example, as does Rust. Are they good cultures? People have various opinions that I will let you read about yourself.)

People want someone to be responsible for software that fails

By: cks
16 July 2025 at 03:29

There are various things in the open source tech news these days, like bureaucratic cybersecurity risk assessment requests to open source projects, maintainers rejecting the current problematic approach to security issues (also), and 'software supply chain security'. One of my recent thoughts as a result of all of this is that the current situation is fundamentally unsustainable, partly because people are increasingly going to require someone to be held responsible for software that fails and does damage ('damage' in an abstract sense; people know it when they see it).

This isn't anything unique or special with software. People feel the same way about buildings, bridges, vehicles, food, and anything else that actually matters to leading a regular life, and eventually they've managed to turn that feeling into concrete results for most things. Software has so far had a long period of not being held to account, but then, once upon a time, so did food and food safety; food has always been very important to people, while software spent a long time not being visibly a big deal (or, if you prefer, not being as visibly slipshod as it is today, when a lot more people are directly exposed to a lot more software and thus to its failings).

The bottom line is that people don't consider it (morally) acceptable when no one is held responsible for either negligence or worse, deliberate choices that cause harm. A field can only duck and evade their outrage for so long; sooner or later it stops being able to shrug and walk away. Software is now systematically important in the world, which means that its failings can do real harm, and people have noticed.

(Which is to say that an increasing number of people have been harmed by software and didn't like it, and the number and frequency is only going to go up.)

There are a lot of ways that this could happen, with the EU CRA being only one of them; as various drafts of the EU CRA have shown, there are a lot of ways that things could go badly in the process. And it could also be that the forces of unbridled pure-profit capitalism will manage to fight this off no matter how much people want it, as they're busy doing with other things in the world (see, for example, the LLM crawler plague). But if companies do fight this off I don't think we're going to enjoy that world very much, for multiple reasons, and people's desire for this is still going to very much be there. The days of people's indifference are over and one way or another we're going to have to deal with that. Both our software and our profession will be shaped by how we do.

Filesystems and the problems of exposing their internal features

By: cks
6 July 2025 at 02:21

Modern filesystems often have a variety of sophisticated features that go well beyond standard POSIX style IO, such as transactional journals of (all) changes and storing data in compressed form. For certain usage cases, it could be nice to get direct access to those features; for example, so your web server could potentially directly serve static files in their compressed form, without having the kernel uncompress them and then the web server re-compress them (let's assume we can make all of the details work out in this sort of situation, which isn't a given). But filesystems only very rarely expose this sort of thing to programs, even through private interfaces that don't have to be standardized by the operating system.

One of the reasons for filesystems to not do this is that they don't want to turn what are currently internal filesystem details into an API (it's not quite right to call them only an 'implementation' detail, because often the filesystem has to support the resulting on-disk structures more or less forever). Another issue is that the implementation inside the kernel is often not even written so that the necessary information could be provided to a user-level program, especially efficiently.

Even when exposing a feature doesn't necessarily require providing programs with internal information from the filesystem, filesystems may not want to make promises to user space about what they do and when they do it. One place this comes up is the periodic request that filesystems like ZFS expose some sort of 'transaction' feature, where the filesystem promises that either all of a certain set of operations are visible or none of them are. Supporting such a feature doesn't just require ZFS or some other filesystem to promise to tell you when all of the things are durably on disk; it also requires the filesystem to not make any of them visible early, despite things like memory pressure or the filesystem's other natural activities.

Sidebar: Filesystem compression versus program compression

When you start looking, how ZFS does compression (and probably how other filesystems do it) is quite different from how programs want to handle compressed data. A program such as a web server needs a compressed stream of data that the recipient can uncompress as a single (streaming) thing, but this is probably not what the filesystem does. To use ZFS as an example of filesystem behavior, ZFS compresses blocks independently and separately (typically in 128 Kbyte blocks), may use different compression schemes for different blocks, and may not compress a block at all. Since ZFS reads and writes blocks independently and has metadata for each of them, this is perfectly fine for it but obviously is somewhat messy for a program to deal with.
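
As a concrete (if simplified) illustration of the difference, here's a sketch in Go. The 128 KByte record size matches the ZFS example above, but the use of zlib and all of the names are purely for illustration; real ZFS uses algorithms like LZ4 or zstd and makes per-block decisions.

    package main

    import (
        "bytes"
        "compress/zlib"
        "fmt"
    )

    const recordSize = 128 * 1024

    // asOneStream is the shape a program like a web server wants: one
    // stream with one header that the other end can decompress as a unit.
    func asOneStream(data []byte) []byte {
        var buf bytes.Buffer
        w := zlib.NewWriter(&buf)
        w.Write(data) // error handling omitted for brevity
        w.Close()
        return buf.Bytes()
    }

    // asBlocks is roughly the filesystem's shape: each record compressed
    // on its own (and in real ZFS possibly with different algorithms, or
    // not at all), useful only together with per-block metadata.
    func asBlocks(data []byte) [][]byte {
        var out [][]byte
        for off := 0; off < len(data); off += recordSize {
            end := off + recordSize
            if end > len(data) {
                end = len(data)
            }
            out = append(out, asOneStream(data[off:end]))
        }
        return out
    }

    func main() {
        data := bytes.Repeat([]byte("some quite compressible text "), 20000)
        fmt.Printf("one stream: %d compressed bytes\n", len(asOneStream(data)))
        fmt.Printf("as blocks: %d separately compressed pieces\n", len(asBlocks(data)))
    }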

Operating system kernels could return multiple values from system calls

By: cks
5 July 2025 at 03:13

In yesterday's entry, I talked about how Unix's errno is so limited partly because of how the early Unix kernels didn't return multiple values from system calls. It's worth noting that this isn't a limitation in operating system kernels and typical system call interfaces; instead, it's a limitation imposed by C. If anything, it's natural to return multiple values from system calls.

Typically, system call interfaces use CPU registers because it's much easier (and usually faster) for the kernel to access (user) CPU register values than it is to read or write things from and to user process memory. If you can pass system call arguments in registers, you do so, and similarly for returning results. Most CPU architectures have more than one register that you could put system call results into, so it's generally not particularly hard to say that your OS returns results in the following N CPU registers (quite possibly the registers that are also used for passing arguments).

Using multiple CPU registers for system call return values was even used by Research Unix on the PDP-11, for certain system calls. This is most visible in versions that are old enough to document the PDP-11 assembly versions of system calls; see, for example, the V4 pipe(2) system call, which returns the two ends of the pipe in r0 and r1. Early Unix put errno error codes and non-error results in the same place not because it had no choice but because it was easier that way.

(Because I looked it up, V7 returned a second value in r1 in pipe(), getuid(), getgid(), getpid(), and wait(). All of the other system calls seem to have only used r0; if r1 was unused by a particular call, the generic trap handling code preserved it over the system call.)

I don't know if there's any common operating system today with a system call ABI that routinely returns multiple values, but I suspect not. I also suspect that if you were designing an OS and a system call ABI today and were targeting it for a modern language that directly supported multiple return values, you would probably put multiple return values in your system call ABI. Ideally, including one for an error code, to avoid anything like errno's limitations; in fact it would probably be the first return value, to cope with any system calls that had no ordinary return value and simply returned success or some failure.
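
As a sketch of what the user-visible side of such an ABI could look like, here's a Go-flavoured version where the kernel hands back an error code and two result values in separate registers. Everything here (rawSyscall, the system call number) is a made-up stand-in rather than any real operating system's interface; Go's real low-level syscall.Syscall already returns two result registers plus an errno value, so the idea isn't exotic.

    package sketch

    import "syscall"

    // rawSyscall stands in for the assembly stub of a hypothetical system
    // call ABI that returns three registers: an error code plus two result
    // values. No real kernel is being called here.
    func rawSyscall(num, a1, a2, a3 uintptr) (errno, r1, r2 uintptr) {
        panic("illustrative stand-in for a kernel ABI; not implemented")
    }

    // Pipe shows how a pipe(2)-style call would look with such an ABI: the
    // two file descriptors come back as separate values and the error code
    // has its own slot, instead of errors and results being overloaded
    // onto the same register the way errno historically was.
    func Pipe() (rfd, wfd int, err error) {
        const sysPipe = 42 // hypothetical system call number
        errno, r1, r2 := rawSyscall(sysPipe, 0, 0, 0)
        if errno != 0 {
            return -1, -1, syscall.Errno(errno)
        }
        return int(r1), int(r2), nil
    }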

The "personal computer" model scales better than the "terminal" model

By: cks
1 July 2025 at 02:49

In an aside in a recent entry, I said that one reason that X terminals faded away is that what I called the "personal computer" model of computing had some pragmatic advantages over the "terminal" model. One of them is that broadly, the personal computer model scales better, even though sometimes it may be more expensive or less capable at any given point in time. But first, let me define my terms. What I mean by the "personal computer" model is one where computing resources are distributed, where everyone is given a computer of some sort and is expected to do much of their work with that computer. What I mean by the "terminal" model is where most computing is done on shared machines, and the objects people have are simply used to access those shared machines.

The terminal model has the advantage that the devices you give each individual person can be cheaper, since they don't need to do as much. It has the potential disadvantage that you need some number of big shared machines for everyone to do their work on, and those machines are often expensive. However, historically, some of the time those big shared servers (plus their terminals) have been less expensive than getting everyone their own computer that was capable enough. So the "terminal" model may win at any fixed point in both time and your capacity needs.

The problem with the terminal model is those big shared resources, which become an expensive choke point. If you want to add some more terminals, you need to also budget for more server capacity. If some of your people turn out to need more power than you initially expected, you're going to need more server capacity. And so on. The problem is that your server capacity generally has to be bought in big, expensive units and increments, a problem that has come up before.

The personal computer model is potentially more expensive up front but it's much easier to scale it, because you buy computer capacity in much smaller units. If you get more people, you get each of them a personal computer. If some of your people need more power, you get them (and just them) more capable, more expensive personal computers. If you're a bit short of budget for hardware updates, you can have some people use their current personal computers for longer. In general, you're free to vary things on a very fine grained level, at the level of individual people.

(Of course you may still have some shared resources, like backups and perhaps shared disk space, but there are relatively fine grained solutions for that too.)

PS: I don't know if big compute is cheaper than a bunch of small compute today, given that we've run into various limits in scaling up things like CPU performance, power and heat limits, and so on. There are "cloud desktop" offerings from various providers, but I'm not sure these are winners based on the hardware economics alone, plus today you'd need something to be the "terminal" as well and that thing is likely to be a capable computer itself, not the modern equivalent of an X terminal.

Tape drives (and robots) versus hard disk drives, and volume

By: cks
27 June 2025 at 02:54

In a conversation on the Fediverse, I had some feelings on tapes versus disks:

I wish tape drives and tape robots were cheaper. At work economics made us switch to backups on HDDs, and apart from other issues it quietly bugs me that every one of them bundles in a complex read-write mechanism on top of the (magnetic) storage that's what we really want.

(But those complex read/write mechanisms are remarkably inexpensive due to massive volume, while the corresponding tape drive+robot read/write mechanism is ... not.)

(I've written up our backup system, also.)

As you can read in many places, hard drives are mechanical marvels, made to extremely fine tolerances, high performance targets, and startlingly long lifetimes (all things considered). And you can get them for really quite low prices.

At a conceptual level, an LTO tape system is the storage medium (the LTO tape) separated from the read/write head and the motors (the tape drive). When you compare this to hard drives, you get to build and buy the 'tape drive' portion only once, instead of including a copy in each instance of the storage medium (the tapes). In theory this should make the whole collection a lot cheaper. In practice it only does so once you have a quite large number of tapes, because the cost of tape drives (and tape robots to move tapes in and out of the drives) is really quite high (and has been for a relatively long time).

There are probably technology challenges and complexities that come with the tape drive operating in an unsealed and less well controlled environment than hard disk mechanisms. But it's hard to avoid the assumption that a lot of the price difference has to do with the vast difference in volume. We make hard drives and thus all of their components in high volume, and have for decades, so there's been a lot of effort spent on making them inexpensively and in bulk. Tape drives are a specialty item with far lower production volumes and are sold to much less price sensitive buyers (as compared to consumer level hard drives, which have a lot of parts in common with 'enterprise' HDDs).

I understand all of this but it still bugs me a bit. It's perfectly understandable but inelegant.

I feel open source has turned into two worlds

By: cks
19 June 2025 at 02:51

One piece of open source news of the time interval is that the sole maintainer of libxml2 will no longer be treating security issues any differently than bugs (also, via Fediverse discussions). In my circles, the reaction to this has generally been positive, and it's seen as an early sign of more of this to come, as more open source maintainers revolt. I have various thoughts on this, but in light of what I wrote about open source moral obligation and popularity, one thing this incident has crystallized for me is that I draw an increasingly sharp distinction between corporate use of open source software and people's cooperative use of it.

Obvious general examples of the latter are the Debian Linux distribution and BSD distributions like OpenBSD and FreeBSD. These are independent open source projects that are maintained by volunteers (although some of them are paid to work on the project). Everyone is working together in cooperation and the result is no one's product or owned object. And at the small scale, everyone who incorporates libxml2, some Python module, or whatever into a personal open source thing is part of this cooperative sphere.

(Ubuntu is not, because Ubuntu is Canonical's. Fedora is probably not really, for all that it has volunteers working on it; it lives and dies at Red Hat's whim, and Red Hat has already demonstrated with CentOS that that whim can change drastically.)

Corporate use of open source software is things like corporations deciding to make libxml2 a security sensitive, load bearing part of their products. Yes, the license allows them to do that and allows them to not support libxml2, but I feel that it's qualitatively different from the personal cooperative sphere of open source, and as a result the social rules are different. You might not want to leave Debian (which is fundamentally people) in the lurch over a security issue, but if a corporation shows up with a security issue, well, you tap the sign. They're not in open source as a cooperative venture, they are using it to make money. Corporations are not like people, even if they employ people who make 'people open source' noises.

Existing open source licenses, practices, and culture don't draw this distinction (and it would be hard to for licenses), but I think we're going to see an increasing amount of it in the future. Corporate use of open source under the current regime is an increasingly bad deal for the open source people involved, so I don't think the current situation is sustainable. Even if licenses don't change, everything else can.

(See also 'software supply chain security', especially "I am not a supplier".)

Will (more) powerful discrete GPUs become required in practice in PCs?

By: cks
14 June 2025 at 03:21

One of the slow discussions I'm involved in over on the Fediverse started with someone wondering what modern GPU to get to run Linux on Wayland (the current answer is said to be an Intel Arc B580, if you have a modern distribution version). I'm a bit interested in this question but not very much, because I've traditionally considered big discrete GPU cards to be vast overkill for my needs. I use an old, text-focused type of X environment and I don't play games, so apart from needing to drive big displays at 60Hz (or maybe someday better than that), it's been a long time since I needed to care about how powerful my graphics was. These days I use 'onboard' graphics whenever possible, which is to say the modest GPU that Intel and AMD now integrate on many CPU models.

(My office desktop has more or less the lowest end discrete AMD GPU with suitable dual outputs that we could find at the time because my CPU didn't have onboard graphics. My current home desktop uses what is now rather old onboard Intel graphics.)

However, graphics aren't the only thing you can do with GPUs these days (and they haven't been for some time). Increasingly, people do a lot of GPU computing (and not just for LLMs; darktable can use your GPU for image processing on digital photographs). In the old days, this GPU processing was basically not worth even trying on your typical onboard GPU (darktable basically laughed at my onboard Intel graphics), and my impression is that's still mostly the case if you want to do serious work. If you're serious, you want a lot of GPU memory, a lot of GPU processing units, and so on, and you only really get that on dedicated discrete GPUs.

You'll probably always be able to use a desktop for straightforward basic things with only onboard graphics (if only because of laptop systems that have price, power, and thermal limits that don't allow for powerful, power-hungry, and hot GPUs). But that doesn't necessarily mean that it will be practical to be a programmer or system administrator without a discrete GPU that can do serious computing, or at least that you'll enjoy it very much. I can imagine a future where your choices are to have a desktop with a good discrete GPU so that you can do necessary (GPU) computation, bulk analysis, and so on locally, or to remote off to some GPU-equipped machine to do the compute-intensive side of your work.

(An alternate version of this future is that CPU vendors stuff more and more GPU compute capacity into CPUs and the routine GPU computation keeps itself to within what the onboard GPU compute units can deliver. After all, we're already seeing CPU vendors include dedicated GPU computation capacity that's not intended for graphics.)

Even if discrete GPUs don't become outright required, it's possible that they'll become so useful and beneficial that I'll feel the need to get one; not having one would be workable but clearly limiting. I might feel that about a discrete GPU today if I did certain sorts of things, such as large scale photo or video processing.

I don't know if I believe in this general future, where a lot of important things require (good) GPU computation in order to work decently well. It seems a bit extreme. But I've been quite wrong about technology trends in the past that similarly felt extreme, so nowadays I'm not so sure of my judgment.

Intel versus AMD is currently an emotional decision for me

By: cks
28 May 2025 at 02:40

I recently read Michael Stapelberg's My 2025 high-end Linux PC. One of the decisions Stapelberg made was choosing an Intel (desktop) CPU because of better (ie lower) idle power draw. This is a perfectly rational decision to make, one with good reasoning behind it, and also as I read the article I realized that it was one I wouldn't have made. Not because I don't value idle power draw; like Stapelberg's machine but more so, my desktops spend most of their time essentially idle. Instead, it was because I realized (or confirmed my opinion) that right now, I can't stand to buy Intel CPUs.

I am tired of all sorts of aspects of Intel. I'm tired of their relentless CPU product micro-segmentation across desktops and servers, with things like ECC allowed in some but not all models. I'm tired of their whole dance of P-cores and E-cores, and also of having to carefully read spec sheets to understand the P-core and E-core tradeoffs for a particular model. I'm tired of Intel just generally being behind AMD and repeatedly falling on its face with desperate warmed over CPU refreshes that try to make up for its process node failings. I'm tired of Intel's hardware design failure with their 13th and 14th generation CPUs (see eg here). I'm sure AMD Ryzens have CPU errata too that would horrify me if I knew, but they're not getting rubbed in my face the way the Intel issue is.

At this point Intel has very little going for its desktop CPUs as compared to the current generation AMD Ryzens. Intel CPUs have better idle power levels, and may have better single-core burst performance. In absolute performance I probably won't notice much difference, and unlike Stapelberg I don't do the kind of work where I really care about build speed (and if I do, I have access to much more powerful machines). As far as the idle power goes, I likely will notice the better idle power level (some of the time), but my system is likely to idle at lower power in general than Stapelberg's will, especially at home where I'll try to use the onboard graphics if at all possible (so I won't have the (idle) power price of a GPU card).

(At work I need to drive two 4K displays at 60Hz and I don't think there are many motherboards that will do that with onboard graphics, even if the CPU's built in graphics system is up to it in general.)

But I don't care about the idle power issue. If or when I build a new home desktop, I'll eat the extra 20 watts or so of idle power usage for an AMD CPU (although this may vary in practice, especially with screens blanked). And I'll do it because right now I simply don't want to give Intel my money.

It's not obvious how to verify TLS client certificates issued for domains

By: cks
18 May 2025 at 02:24

TLS server certificate verification has two parts; you first verify that the TLS certificate is a valid, CA-signed certificate, and then you verify that the TLS certificate is for the host you're connecting to. One of the practical issues with TLS 'Client Authentication' certificates for host and domain names (which are on the way out) is that there's no standard meaning for how you do the second part of this verification, and if you even should. In particular, what host name are you validating the TLS client certificate against?

Some existing protocols provide the 'client host name' to the server; for example, SMTP has the EHLO command. However, existing protocols tend not to have explicitly standardized using this name (or any specific approach) for verifying a TLS client certificate if one is presented to the server, and large mail providers vary in what they send as a TLS client certificate in SMTP conversations. For example, Google's use of 'smtp.gmail.com' doesn't match any of the other names available, so its only meaning is 'this connection comes from a machine that has access to private keys for a TLS certificate for smtp.gmail.com', which hopefully means that it belongs to GMail and is supposed to be used for this purpose.

If there is no validation of the TLS client certificate host name, that is all that a validly signed TLS client certificate means; the connecting host has access to the private keys and so can be presumed to be 'part of' that domain or host. This isn't nothing, but it doesn't authenticate what exactly the client host is. If you want to validate the host name, you have to decide what to validate against and there are multiple answers. If you design the protocol you can have the protocol send a client host name and then validate the TLS certificate against the hostname; this is slightly better than using the TLS certificate's hostname as is in the rest of your processing, since the TLS certificate might have a wildcard host name. Otherwise, you might validate the TLS certificate host name against its reverse DNS, which is more complicated than you might expect and which will fail if DNS isn't working. If the TLS client certificate doesn't have a wildcard, you could also try to look up the IP addresses associated with the host names in the TLS certificate and see if any of the IP addresses match, but again you're depending on DNS.

(You can require non-wildcard TLS certificate names in your protocol, but people may not like it for various reasons.)
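
As an illustration of the 'validate against the name the client claimed in the protocol' option, here's roughly what the server side could look like in Go once the TLS handshake has produced a chain-verified client certificate. The function name and the surrounding protocol plumbing (getting the EHLO name, deciding what a failure means) are left out or made up.

    package sketch

    import (
        "crypto/tls"
        "fmt"
    )

    // checkClientName checks a TLS client certificate against the host
    // name the client claimed at the protocol level (for SMTP, the EHLO
    // argument). Certificate chain verification is assumed to have already
    // been done by the TLS library; this is only the name matching step.
    func checkClientName(conn *tls.Conn, claimedName string) error {
        state := conn.ConnectionState()
        if len(state.PeerCertificates) == 0 {
            return fmt.Errorf("no TLS client certificate presented")
        }
        leaf := state.PeerCertificates[0]
        // VerifyHostname matches the certificate's DNS names, including
        // wildcard entries, which is why this is only slightly better than
        // simply trusting whatever name the certificate itself carries.
        return leaf.VerifyHostname(claimedName)
    }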

This dependency on DNS for TLS client certificates is different from the DNS dependency for TLS server certificates. If DNS doesn't work for the server case, you're not connecting at all since you have no target IPs; if you can connect, you have a target hostname to validate against (in the straightforward case of using a hostname instead of an IP address). In the TLS client certificate case, the client can connect but then the TLS server may deny it access for apparently arbitrary reasons.

That your protocol has to specifically decide what verifying TLS client certificates means (and there are multiple possible answers) is, I suspect, one reason that TLS client certificates aren't used more in general Internet protocols. In turn this is a disincentive for servers implementing TLS-based protocols (including SMTP) to tell TLS clients that they can send a TLS client certificate, since it's not clear what you should do with it if one is sent.

Let's Encrypt drops "Client Authentication" from its TLS certificates

By: cks
17 May 2025 at 02:53

The TLS news of the time interval is that Let's Encrypt certificates will no longer be usable to authenticate your client to a TLS server (via a number of people on the Fediverse). This is driven by a change in Chrome's "Root Program", covered in section 3.2, with a further discussion of this in Chrome's charmingly named Moving Forward, Together in the "Understanding dedicated hierarchies" section; apparently only half of the current root Certificate Authorities actually issue TLS server certificates. As far as I know this is not yet a CA/Browser Forum requirement, so this is all driven by Chrome.

In TLS client authentication, a TLS client (the thing connecting to a TLS server) can present its own TLS certificate to the TLS server, just as the TLS server presents its certificate to the client. The server can then authenticate the client certificate however it wants to, although how to do this is not as clear as when you're authenticating a TLS server's certificate. To enable this usage, a TLS certificate and the entire certificate chain must be marked as 'you can use these TLS certificates for client authentication' (and similarly, a TLS certificate that will be used to authenticate a server to clients must be marked as such). That marking is what Let's Encrypt is removing.
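
The marking in question lives in the certificate's Extended Key Usage field. As a small illustration, here's how you could inspect it in Go for a certificate you have on disk; the file name is just an example.

    package main

    import (
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "log"
        "os"
    )

    func main() {
        pemData, err := os.ReadFile("cert.pem") // example path
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(pemData)
        if block == nil {
            log.Fatal("no PEM data found")
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }
        for _, eku := range cert.ExtKeyUsage {
            switch eku {
            case x509.ExtKeyUsageClientAuth:
                fmt.Println("marked for TLS client authentication")
            case x509.ExtKeyUsageServerAuth:
                fmt.Println("marked for TLS server authentication")
            }
        }
    }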

This doesn't affect public web PKI, which basically never used conventional CA-issued host and domain TLS certificates as TLS client certificates (websites that used TLS client certificates used other sorts of TLS certificates). It does potentially affect some non-web public TLS, where domain TLS certificates have seen small usage in adding more authentication to SMTP connections between mail systems. I run some spam trap SMTP servers that advertise that sending mail systems can include a TLS client certificate if the sender wants to, and some senders (including GMail and Outlook) do send proper public TLS certificates (and somewhat more SMTP senders include bad TLS certificates). Most mail servers don't, though, and given that one of the best sources of free TLS certificates has just dropped support for this usage, that's unlikely to change. Let's Encrypt's TLS certificates can still be used by your SMTP server for receiving email, but you'll no longer be able to use them for sending it.

On the one hand, I don't think this is going to have material effects on much public Internet traffic and TLS usage. On the other hand, it does cut off some possibilities in non-web public TLS, at least until someone starts up a free, ACME-enabled Certificate Authority that will issue TLS client certificates. And probably some number of mail servers will keep sending their TLS certificates to people as client certificates even though they're no longer valid for that purpose.

PS: If you're building your own system and you want to, there's nothing stopping you from accepting public TLS server certificates from TLS clients (although you'll have to tell your TLS library to validate them as TLS server certificates, not client certificates, since they won't be marked as valid for TLS client usage). Doing the security analysis is up to you but I don't think it's a fatally flawed idea.

Classical "Single user computers" were a flawed or at least limited idea

By: cks
16 May 2025 at 02:33

Every so often people yearn for a lost (1980s or so) era of 'single user computers', whether these are simple personal computers or high end things like Lisp machines and Smalltalk workstations. It's my view that the whole idea of a 1980s style "single user computer" is not what we actually want and has some significant flaws in practice.

The platonic image of a single user computer in this style was one where everything about the computer (or at least its software) was open to your inspection and modification, from the very lowest level of the 'operating system' (which was more of a runtime environment than an OS as such) to the highest things you interacted with (both Lisp machines and Smalltalk environments often touted this as a significant attraction, and it's often repeated in stories about them). In personal computers this was a simple machine that you had full control over from system boot onward.

The problem is that this unitary, open environment is (or was) complex and often lacked resilience. Famously, in the case of early personal computers, you could crash the entire system with programming mistakes, and if there's one thing people do all the time, it's make mistakes. Most personal computers mitigated this by only doing one thing at once, but even then it was unpleasant, and the Amiga would let you blow multiple processes up at once if you could fit them all into RAM. Even on better protected systems, like Lisp and Smalltalk, you still had the complexity and connectedness of a unitary environment.

One of the things that we've learned from computing over the past N decades is that separation, isolation, and abstraction are good ideas. People can only keep track of so many things in their heads at once, and modularity (in the broad sense) is one large way we keep things within that limit (or at least closer to it). Single user computers were quite personal but usually not very modular. There are reasons that people moved to computers with things like memory protection, multiple processes, and various sorts of privilege separation.

(Let us not forget the great power of just having things in separate objects, where you can move around or manipulate or revert just one object instead of 'your entire world'.)

I think that there is a role for computers that are unapologetically designed to be used by only a single person who is in full control of everything and able to change it if they want to. But I don't think any of the classical "single user computer" designs are how we want to realize a modern version of the idea.

(As a practical matter I think that a usable modern computer system has to be beyond the understanding of any single person. There is just too much complexity involved in anything except very restricted computing, even if you start from complete scratch. This implies that an 'understandable' system really needs strong boundaries between its modules so that you can focus on the bits that are of interest to you without having to learn lots of things about the rest of the system or risk changing things you don't intend to.)

How and why typical (SaaS) pricing is too high for university departments

By: cks
12 May 2025 at 02:48

One thing I've seen repeatedly is that companies that sell SaaS or SaaS like things and offer educational pricing (because they want to sell to universities too) are setting (initial) educational pricing that is in practice much too high. Today I'm going to work through a schematic example to explain what I mean. All of this is based on how it works in Canadian and I believe US universities; other university systems may be somewhat different.

Let's suppose that you're a SaaS vendor and like many vendors, you price your product at $X per person per month; I'll pick $5 (US, because most of the time the prices are in USD). Since you want to sell to universities and other educational institutions and you understand they don't have as much money to spend as regular companies, you offer a generous academic discount; they pay only $3 USD per person per month.

(If these numbers seem low, I'm deliberately stacking the deck in the favour of the SaaS company. Things get worse for your pricing as the numbers go up.)

The research and graduate student side of a large but not enormous university department is considering your software. They have 100 professors 'in' the department, 50 technical and administrative support staff (this is a low ratio), and professors have an average of 10 graduate students, research assistants, postdocs, outside collaborators, undergraduate students helping out with research projects, and so on around them, for a total of 1,000 additional people 'in' the department who will also have to be covered. These 1,150 people will cost the department $3,450 USD a month for your software, a total of $41,400 USD a year, which is a significant saving over what a commercial company would pay for the same number of people.
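
For the record, the arithmetic is simple enough to put in a few lines (in Go, with all of the headcounts and the per-person price as assumptions you can change to match your own situation):

    package main

    import "fmt"

    func main() {
        const (
            pricePerPersonMonth = 3.0 // discounted academic price, in USD
            professors          = 100
            staff               = 50
            othersPerProfessor  = 10 // grad students, postdocs, RAs, and so on
        )
        people := professors + staff + professors*othersPerProfessor
        monthly := float64(people) * pricePerPersonMonth
        fmt.Printf("%d people: $%.0f USD a month, $%.0f USD a year\n",
            people, monthly, monthly*12)
    }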

Unfortunately, unless your software is extremely compelling or absolutely necessary, this cost is likely to be a very tough sell. In many departments, that's enough money to fund (or mostly fund) an additional low-level staff position, and it's certainly enough money to hire more TAs, supplement more graduate student stipends (these are often the same thing, since hiring graduate students as TAs is one of the ways that you support them), or pay for summer projects, all of which are likely to be more useful and meaningful to the department than a year of your service. It's also more than enough money to cause people in the department to ask awkward questions like 'how much technical staff time will it take to put together an inferior but functional enough alternate to this', which may well not be $41,000 worth of time (especially not every year).

(Of course putting together a complete equivalent of your SaaS will cost much more than that, since you have multiple full time programmers working on it and you've invested years in your software at this point. But university departments are already used to not having nice things, and staff time is often considered almost free.)

If you decide to make your pricing nicer by only charging based on the actual number of people who wind up using your stuff, unfortunately you've probably made the situation worse for the university department. One thing that's worse than a large predictable bill is an uncertain but possibly large bill; the department will have to reserve and allocate the money in its budget to cover the full cost, and then figure out what to do with the unused budget at the end of the year (or the end of every month, or whatever). Among other things, this may lead to awkward conversations with higher powers about how the department's initial budget and actual spending don't necessarily match up.

As we can see from the numbers, one big part of the issue is those 1,000 non-professor, non-staff people. These people aren't really "employees" the way they would be in a conventional organization (and mostly don't think of themselves as employees), and the university isn't set up to support their work and spend money on them in the way it is for the people it considers actual employees. The university cares if a staff member or a professor can't get their work done, and having them work faster or better is potentially valuable to the university. This is mostly not true for graduate students and many other additional people around a department (and almost entirely not true if the person is an outside collaborator, an undergraduate doing extra work to prepare for graduate studies elsewhere, and so on).

In practice, most of those 1,000 extra people will and must be supported on a shoestring basis (for everything, not just for your SaaS). The university as a whole and their department in particular will probably only pay a meaningful per-person price for them for things that are either absolutely necessary or extremely compelling. At the same time, often the software that the department is considering is something that those people should be using too, and they may need a substitute if the department can't afford the software for them. And once the department has the substitute, it becomes budgetarily tempting and perhaps politically better if everyone uses the substitute and the department doesn't get your software at all.

(It's probably okay to charge a very low price for such people, as opposed to just throwing them in for free, but it has to be low enough that the department or the university doesn't have to think hard about it. One way to look at it is that regardless of the numbers, the collective group of those extra people is 'less important' to provide services to than the technical staff, the administrative staff, and the professors, and the costs probably should work out accordingly. Certainly the collective group of extra people isn't more important than the other groups, despite having a lot more people in it.)

Incidentally, all of this applies just as much (if not more so) when the 'vendor' is the university's central organizations and they decide to charge (back) people within the university for something on a per-person basis. If this is truly cost recovery and accurately represents the actual costs to provide the service, then it's not going to be something that most graduate students get (unless the university opts to explicitly subsidize it for them).

PS: All of this is much worse if undergraduate students need to be covered too, because there are even more of them. But often the department or the university can get away with not covering them, partly because their interactions with the university are often much narrower than those of graduate students.

LLMs ('AI') are coming for our jobs whether or not they work

By: cks
5 May 2025 at 02:58

Over on the Fediverse, I said something about this:

Hot take: I don't really know what vibe coding is but I can confidently predict that it's 'coming for', if not your job, then definitely the jobs of the people who work in internal development at medium to large non-tech companies. I can predict this because management at such companies has *always* wanted to get rid of programmers, and has consistently seized on every excuse presented by the industry to do so. COBOL, report generators, rule based systems, etc etc etc at length.

(The story I heard is that at one point COBOL's English language basis was at least said to enable non-programmers to understand COBOL programs and maybe even write them, and this was seen as a feature by organizations adopting it.)

The current LLM craze is also coming for the jobs of system administrators for the same reason; we're overhead, just like internal development at (most) non-tech companies. In most non-tech organizations, both internal development and system administration is something similar to janitorial services; you have to have it because otherwise your organization falls over, but you don't like it and you're happy to spend as little on it as possible. And, unfortunately, we have a long history in technology that shows the long term results don't matter for the people making short term decisions about how many people to hire and who.

(Are they eating their seed corn? Well, they probably don't think it matters to them, and anyway that's a collective problem, which 'the market' is generally bad at solving.)

As I sort of suggested by using 'excuse' in my Fediverse post, it doesn't really matter if LLMs truly work, especially if they work over the long run. All they need to do in order to get senior management enthused about 'cutting costs' is appear to work well enough over the short term, and appearing to work is not necessarily a matter of substance. In sort of a flipside of how part of computer security is convincing people, sometimes it's enough to simply convince (senior) people and not have obvious failures.

(I have other thoughts about the LLM craze and 'vibe coding', as I understand it, but they don't fit within the margins of this entry.)

PS: I know it's picky of me to call this an 'LLM craze' instead of an 'AI craze', but I feel I have to both as someone who works in a computer science department that does all sorts of AI research beyond LLMs and as someone who was around for a much, much earlier 'AI' craze (that wasn't all of AI either, cf).

Trying to understand OpenID Connect (OIDC) and its relation to OAuth2

By: cks
27 April 2025 at 02:34

The OIDC specification describes it as "a simple identity layer" on top of OAuth2. As I've been discovering, this is sort of technically true but also misleading. Since I think I've finally sorted this out, here's what I've come to understand about the relationship.

OAuth2 describes an HTTP-based protocol where a client (typically using a web browser) can obtain an access token from an authorization server and then present this token to a resource server to gain access to something. For example, your mail client works with a browser to obtain an access token from an OAuth2 identity provider, which it then presents to your IMAP server. However, the base OAuth2 specification is only concerned with the interaction between clients and the authorization server; it explicitly has nothing to say about issues like how a resource server validates and uses the access tokens. This is right at the start of RFC 6749:

The interaction between the authorization server and resource server is beyond the scope of this specification. [...]

Because it's purely about the client to authorization server flows, the base OAuth2 RFC provides nothing that will let your IMAP server validate the alleged 'OAuth2 access token' your mail client has handed it (or find out from the token who you are). There were customary ways to do this, and then later you had RFC 7662 Token Introspection or perhaps RFC 9068 JWT access tokens, but those are all outside basic OAuth2.

(This has obvious effects on interoperability. You can't write a resource server that works with arbitrary OAuth2 identity providers, or an OAuth2 identity provider of your own that everyone will be able to work with. I suspect that this is one reason why, for example, IMAP mail clients often only support a few big OAuth2 identity providers.)

OIDC takes the OAuth2 specification and augments it in a number of ways. In addition to an OAuth2 access token, an OIDC identity provider can also give clients (you) an ID Token that's a (signed) JSON Web Token (JWT) that has a specific structure and contains at least a minimal set of information about who authenticated. An OIDC IdP also provides an official Userinfo endpoint that will provide information about an access token, although this is different information than what the RFC 7662 Token Introspection endpoint provides.
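
To make the 'specific structure' concrete, here's a sketch in Go of pulling the claims out of an ID Token. This deliberately skips signature verification (a real consumer must check the signature against the IdP's published keys, normally via a JWT library), so it only shows the shape of the thing.

    package sketch

    import (
        "encoding/base64"
        "encoding/json"
        "fmt"
        "strings"
    )

    // idTokenClaims decodes the claims from an OIDC ID Token without
    // verifying its signature; it's only for looking at the structure.
    func idTokenClaims(token string) (map[string]any, error) {
        // A JWT is header.payload.signature, each part base64url encoded.
        parts := strings.Split(token, ".")
        if len(parts) != 3 {
            return nil, fmt.Errorf("not a JWT: %d parts instead of 3", len(parts))
        }
        payload, err := base64.RawURLEncoding.DecodeString(parts[1])
        if err != nil {
            return nil, err
        }
        var claims map[string]any
        if err := json.Unmarshal(payload, &claims); err != nil {
            return nil, err
        }
        // For an OIDC ID Token the claims include at least 'iss', 'sub',
        // 'aud', 'exp', and 'iat'.
        return claims, nil
    }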

Both of these changes make resource servers and by extension OIDC identity providers much more generic. If a client hands a resource server either an OIDC ID Token or an OIDC Access Token, the resource server ('consumer') has standard ways to inspect and verify them. If your resource server isn't too picky (or is sufficiently configurable), I think it can work with either an OIDC Userinfo endpoint or an OAuth2 RFC 7662 Token Introspection endpoint (I believe this is true of Dovecot, cf).

(OIDC is especially convenient in cases like websites, where the client that gets the OIDC ID Token and Access Token is the same thing that uses them.)

An OAuth2 client can talk to an OIDC IdP as if it was an OAuth2 IdP and get back an access token, because the OIDC IdP protocol flow is compatible with the OAuth2 protocol flow. This access token could be described as an 'OAuth2' access token, but this is sort of meaningless to say since OAuth2 gives you nothing you can do with an access token. An OAuth2 resource server (such as an IMAP server) that expects to get 'OAuth2 access tokens' may or may not be able to interact with any particular OIDC IdP to verify those OIDC IdP provided tokens to its satisfaction; it depends on what the resource server supports and requires. For example, if the resource server specifically requires RFC 7662 Token Introspection you may be out of luck, because OIDC IdPs aren't required to support that and not all do.

In practice, I believe that OIDC has been around for long enough and has been popular enough that consumers of 'OAuth2 access tokens', like your IMAP server, will likely have been updated so that they can work with OIDC Access Tokens. Servers can do this either by verifying the access tokens through an OIDC Userinfo endpoint (with suitable server configuration to tell them what to look for) or by letting you tell them that the access token is a JWT and verifying the JWT. OIDC doesn't require the access token to be a JWT but OIDC IdPs can (and do) use JWTs for this, and perhaps you can actually have your client software send the ID Token (which is guaranteed to be a JWT) instead of the OIDC Access Token.
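
To make that concrete, here's a minimal sketch (in Python, with the 'requests' library) of the Userinfo style of check that a resource server might do. The endpoint URL is a placeholder and which claim you treat as the login name depends entirely on your IdP and your configuration, so this is the shape of the idea rather than how any particular IMAP server actually implements it.

    import requests

    # Placeholder; a real server gets this from the IdP's provider metadata
    # or from explicit configuration.
    USERINFO_URL = "https://idp.example.org/userinfo"

    def check_access_token(access_token):
        # Ask the IdP's Userinfo endpoint about the token; a 200 response
        # means the token is (still) valid, and the body holds the claims.
        resp = requests.get(USERINFO_URL,
                            headers={"Authorization": "Bearer " + access_token},
                            timeout=10)
        if resp.status_code != 200:
            return None        # invalid, expired, or revoked token
        claims = resp.json()
        # 'sub' is always present; whether it's a useful login name is up
        # to the IdP.
        return claims.get("sub")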

(It helps that OIDC is obviously better if you want to write 'resource server' side software that works with any IdP without elaborate and perhaps custom configuration or even programming for each separate IdP.)

(I have to thank Henryk PlΓΆtz for helping me understand OAuth2's limited scope.)

(The basic OAuth2 has been extended with multiple additional standards, see eg RFC 8414, and if enough of them are implemented in both your IdP and your resource servers, some of this is fixed. OIDC has also been extended somewhat, see eg OpenID Provider Metadata discovery.)

Looking at OIDC tokens and getting information on them as a 'consumer'

By: cks
26 April 2025 at 01:56

In OIDC, roughly speaking and as I understand it, there are three possible roles: the identity provider ('OP'), a Client or 'Relying Party' (the program, website, or whatever that has you authenticate with the IdP and that may then use the resulting authentication information), and what is sometimes called a 'resource server', which uses the IdP's authentication information that it gets from you (your client, acting as a RP). 'Resource Server' is actually an OAuth2 term, which comes into the picture because OIDC is 'a simple identity layer' on top of OAuth2 (to quote from the core OIDC specification). A website authenticating you with OIDC can be described as acting both as a 'RP' and a 'RS', but in cases like IMAP authentication with OIDC/OAuth2, the two roles are separate; your mail client is a RP, and the IMAP server is a RS. I will broadly call both RPs and RSs 'consumers' of OIDC tokens.

When you talk to an OIDC IdP to authenticate, you can get back either or both of an ID Token and an Access Token. The ID Token is always a JWT with some claims in it, including the 'sub(ject)', the 'issuer', and the 'aud(ience)' (which is what client the token was requested by), although this may not be all of the claims you asked for and are entitled to. In general, to verify an ID Token (as a consumer), you need to extract the issuer, consult the issuer's provider metadata to find how to get their keys, and then fetch the keys so you can check the signature on the ID Token (and then proceed to do a number of additional verifications on the information in the token, as covered in the specification). You may cache the keys to save yourself the network traffic, which allows you to do offline verification of ID Tokens. Quite commonly, you'll only accept ID Tokens from pre-configured issuers, not any random IdP on the Internet (ie, you will verify that the 'iss' claim is what you expect). As far as I know, there's no particular way in OIDC to tell if the IdP still considers the ID Token valid or to go from an ID Token alone to all of the claims you're entitled to.
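
As a sketch of that verification chain (in Python, using the 'requests' and PyJWT libraries; the issuer URL is a placeholder, the signing algorithm is assumed to be RS256, and a real consumer would cache the keys and do all of the further checks the specification calls for):

    import jwt                      # PyJWT
    import requests
    from jwt import PyJWKClient

    EXPECTED_ISSUER = "https://idp.example.org"      # pre-configured issuer (placeholder)

    def verify_id_token(id_token):
        # Find the issuer's keys through its provider metadata, then check
        # the signature (plus the 'iss' and 'exp' claims) with those keys.
        meta = requests.get(EXPECTED_ISSUER + "/.well-known/openid-configuration",
                            timeout=10).json()
        jwks_client = PyJWKClient(meta["jwks_uri"])
        signing_key = jwks_client.get_signing_key_from_jwt(id_token)
        claims = jwt.decode(id_token, signing_key.key,
                            algorithms=["RS256"],
                            issuer=EXPECTED_ISSUER,
                            options={"verify_aud": False})
        # The specification calls for more checks than this (audience,
        # nonce, and so on); this only covers the signature and issuer.
        return claims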

The Access Token officially doesn't have to be anything more than an opaque string. To validate it and get the full set of OIDC claim information, including the token's subject (ie, who it's for), you can use the provider's Userinfo endpoint. However, this doesn't necessarily give you the 'aud' information that will let you verify that this Access Token was created for use with you and not someone else. If you have to know this information, there are two approaches, although an OIDC identity provider doesn't have to support either.

The first is that the Access Token may actually be a RFC 9068 JWT. If it is, you can validate it in the usual OIDC JWT way (as for an ID Token) and then use the information inside, possibly in combination with what you get from the Userinfo endpoint. The second is that your OAuth2 provider may support an RFC 7662 Token Introspection endpoint. This endpoint is not exposed in the issuer's provider metadata and isn't mandatory in OIDC, so your IdP may or may not support it (ours doesn't, although that may change someday).

(There's also an informal 'standard' way of obtaining information about Access Tokens that predates RFC 7662. For all of the usual reasons, this may still be supported by some large, well-established OIDC/OAuth2 identity providers.)
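
For the RFC 7662 case, the call itself is simple if your IdP offers the endpoint. A rough sketch (in Python with 'requests'; the endpoint URL and client credentials are placeholders, and how the endpoint wants you to authenticate to it varies between IdPs):

    import requests

    # Placeholder; RFC 7662 endpoints aren't part of standard OIDC metadata,
    # so you generally have to be told where this is.
    INTROSPECTION_URL = "https://idp.example.org/introspect"

    def introspect(token, client_id, client_secret):
        # POST the token; the reply says whether it's active and, if so,
        # usually includes claims such as 'sub', 'aud', 'scope', and 'exp'.
        resp = requests.post(INTROSPECTION_URL,
                             data={"token": token},
                             auth=(client_id, client_secret),
                             timeout=10)
        resp.raise_for_status()
        info = resp.json()
        return info if info.get("active") else None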

Under some circumstances, the ID Token and the Access Token may be tied together in that the ID Token contains a claim field that you can use to validate that you have the matching Access Token. Otherwise, if you're purely a Resource Server and someone hands you a theoretically matching ID Token and Access Token, all that you can definitely do is use the Access Token with the Userinfo endpoint and verify that the 'sub' matches. If you have a JWT Access Token or a Token Introspection endpoint, you can get more information and do more checks (and maybe the Userinfo endpoint also gives you an 'aud' claim).

If you're a straightforward Relying Party client, you get both the ID Token and the Access Token at the same time and you're supposed to keep track of them together yourself. If you're acting as a 'resource server' as well and need the additional claims that may not be in the ID Token, you're probably going to use the Access Token to talk to the Userinfo endpoint to get them; this is typically how websites acting as OIDC clients behave.

Because the only OIDC standard way to get additional claims is to obtain an Access Token and use it to access the Userinfo endpoint, I think that many OIDC clients that are acting as both a RP and a RS will always request both an ID Token and an Access Token. Unless you know the Access Token is a JWT, you want both; you'll verify the audience in the ID Token, and then use the Access Token to obtain the additional claims. Programs that are only getting things to pass to another server (for example, a mail client that will send OIDC/OAuth2 authentication to the server) may only get an Access Token, or in some protocols only obtain an ID Token.

(If you don't know all of this and you get a mail client testing program to dump the 'token' it obtains from the OIDC IdP, you can get confused because a JWT format Access Token can look just like an ID Token.)

This means that OIDC doesn't necessarily provide a consumer with a completely self-contained single object that both has all of the information about the person who authenticated and that lets you be sure that this object is intended for you. An ID Token by itself doesn't necessarily contain all of the claims, and while you can use any (opaque) Access Token to obtain a full set of claims, I believe that these claims don't have to include the 'aud' claim (although your OIDC IdP may choose to include it).

This is in a sense okay for OIDC. My understanding is that OIDC is not particularly aimed at the 'bearer token' usage case where the RP and the Resource Server are separate systems; instead, it's aimed at the 'website authenticating you' case where the RP is the party that will directly rely on the OIDC information. In this case the RP has (or can have) both the ID Token and the Access Token and all is fine.

(A lot of my understanding on this is due to the generosity of @Denvercoder9 and others after I was confused about this.)

Sidebar: Authorization flow versus implicit flow in OIDC authentication

In the implicit flow, you send people to the OIDC IdP and the OIDC IdP directly returns the ID Token and Access Token you asked for to your redirect URI, or rather has the person's browser do it. In this flow, the ID Token includes a partial hash of the Access Token and you use this to verify that the two are tied together. You need to do this because you don't actually know what happened in the person's browser to send them to your redirect URI, and it's possible things were corrupted by an attacker.

In the authorization flow, you send people to the OIDC IdP and it redirects them back to you with an 'authorization code'. You then use this code to call the OIDC IdP again at another endpoint, which replies with both the ID Token and the Access Token. Because you got both of these at once during the same HTTP conversation directly with the IdP, you automatically know that they go together. As a result, the ID Token doesn't have to contain any partial hash of the Access Token, although it can.

I think the corollary of this is that if you want to be able to hand the ID Token and the Access Token to a Resource Server and allow it to verify that the two are connected, you want to use the implicit flow, because that definitely means that the ID Token has the partial hash the Resource Server will need.
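
As I understand it, the partial hash is the 'at_hash' claim, and checking it looks roughly like this (a Python sketch that assumes the ID Token was signed with RS256, so the hash is SHA-256; other signing algorithms use other hashes):

    import base64
    import hashlib

    def access_token_matches(id_token_claims, access_token):
        # For an RS256-signed ID Token, 'at_hash' is the base64url encoding
        # (without padding) of the left half of the SHA-256 hash of the
        # Access Token.
        digest = hashlib.sha256(access_token.encode("ascii")).digest()
        expected = base64.urlsafe_b64encode(digest[:len(digest) // 2])
        expected = expected.rstrip(b"=").decode("ascii")
        return id_token_claims.get("at_hash") == expected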

(There's also a hybrid flow which I'll let people read about in the standard.)

The many ways of getting access to information ('claims') in OIDC

By: cks
24 April 2025 at 03:41

Any authentication and authorization framework, such as OIDC, needs a way for the identity provider (an 'OIDC OP') to provide information about the person or thing that was just authenticated. In OIDC specifically, what you get are claims that are grouped into scopes. You have to ask for specific scopes, and the IdP may restrict what scopes a particular client has access to. Well, that is not quite the full story, and the full story is complicated (more so than I expected when I started writing this entry).

When you talk to the OIDC identity server (OP) to authenticate, you (the program or website or whatever acting as the client) can get back either or both of an ID Token and an Access Token. I believe that in general your Access Token is an opaque string, although there's a standard for making it a JWT. Your ID Token is ultimately some JSON (okay, it's a JWT) and has certain mandatory claims like 'sub' (the subject) that you don't have to ask for with a scope. It would be nice if all of the claims from all of the scopes that you asked for were automatically included in the ID Token, but the OIDC standard doesn't require this. Apparently many but not all OIDC OPs include all the claims (at least by default); however, our OIDC OP doesn't currently do so, and I believe that Google's OIDC OP also doesn't include some claims.

(Unsurprisingly, I believe that there is a certain amount of OIDC-using software out there that assumes that all OIDC OPs return all claims in the ID Token.)

The standard approved and always available way to obtain the additional claims (which in some cases will be basically all claims) is to present your Access Token (not your ID Token) to the OIDC Userinfo endpoint at your OIDC OP. If your Access Token is (still) valid, what you will get back is either a plain, unsigned JSON listing of those claims (and their values) or perhaps a signed JWT of the same thing (which you can find out from the provider metadata). As far as I can see, you don't necessarily use the ID Token in this additional information flow, although you may want to be cautious and verify that the 'sub' claim is the same in the Userinfo response and the ID Token that is theoretically paired with your Access Token.

(As far as I can tell, the ID Token doesn't include a copy of the Access Token as another pseudo-claim. The two are provided to you at the same time (if you asked the OIDC OP for both), but are independent. The ID Token can't quite be verified offline because you need to get the necessary public key from the OIDC OP to check the signature.)

If I'm understanding things correctly (which may not be the case), in an OAuth2 authentication context, such as using OAUTHBEARER with the Dovecot IMAP server, I believe your local program will send the Access Token to the remote end and not do much with the ID Token, if it even requested one. The remote end then uses the Access Token with a pre-configured Userinfo endpoint to get a bunch of claims, and incidentally to validate that the Access Token is still good. In other protocols, such as the current version of OpenPubkey, your local program sends the ID Token (perhaps wrapped up) and so needs it to already include the full claims, and can't use the Userinfo approach. If what you have is a website that is both receiving the OIDC stuff and processing it, I believe that the website will normally ask for both the ID Token and the Access Token and then augment the ID Token information with additional claims from the Userinfo response (this is what the Apache OIDC module does, as far as I can see).

An OIDC OP may optionally allow clients to specifically request that certain claims be included in the ID Token that they get, through the "claims" request parameter on the initial request. One potential complication here is that you have to ask for specific claims, not simply 'all claims in this scope'; it's up to you to know what potentially non-standard claims you should ask for (and I believe that the claims you get have to be covered by the scopes you asked for and that the OIDC OP allows you to get). I don't know how widely implemented this is, but our OIDC OP supports it.
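
Mechanically, asking for specific claims means adding a JSON-encoded "claims" query parameter to the authorization request. A sketch of what building such a request might look like (the claim names, client values, and endpoint URL here are all placeholders; you need to know what your OP actually supports):

    import json
    import urllib.parse

    # Ask the OP to put these specific claims into the ID Token itself.
    claims_request = {"id_token": {"email": {"essential": True},
                                   "preferred_username": None}}

    params = {
        "response_type": "code",
        "client_id": "my-client-id",                        # placeholder
        "redirect_uri": "https://rp.example.org/callback",  # placeholder
        "scope": "openid email profile",
        "state": "random-state-value",                      # placeholder
        "claims": json.dumps(claims_request),
    }
    auth_url = ("https://idp.example.org/authorize?"        # placeholder endpoint
                + urllib.parse.urlencode(params))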

(An OIDC OP can list all of its available claims in its metadata, but doesn't have to. I believe that most OPs will list their scopes, although technically this is just 'recommended'.)

If you really want a self-contained signed object that has all of the information, I think you have to hope for an OIDC OP that either puts all claims in the ID Token by default or lets you ask for all of the claims you care about to be added for your request. Even if an OIDC OP gives you a signed userinfo response, it may not include all of the ID Token information and it might not be possible to validate various things later. You can always validate an Access Token by making a Userinfo request with it, but I don't know of any equivalent way to check with the OP whether an ID Token is still considered valid.

I feel that DANE is not a good use of DNS

By: cks
21 April 2025 at 03:10

DANE is commonly cited as a "wouldn't it be nice" alternative to the current web TLS ('PKI') system. It's my view that DANE is an example of why global DNS isn't a database and shouldn't be used as one. The usual way to describe DANE is that 'it lets you publish your TLS certificates in DNS'. This is not actually what it does, because DNS does not 'publish' anything in the sense of a database or a global directory. DANE lets some unknown set of entities advertise some unknown set of TLS certificates for your site to an unknown set of people. Or at least you don't know the scope of the entities, the TLS certificates, and the people, apart from you, your TLS certificate, and the people who (maybe) come directly to you without being intercepted.

(This is in a theoretical world where DNSSEC is widely deployed and reaches all the way to programs that are doing DNS resolution. That is not this world, where DNSSEC has failed.)

DNS specifically allows servers (run by people) to make up answers to things they get asked. Obviously this would be bad when the answers are about your TLS certificates, so DANE and other things like it try to paper over the problem by adding a cascading hierarchy of signing. The problem is that this doesn't eliminate the issue, it merely narrows who can insert themselves into the chain of trust, from 'the entire world' to 'anyone already in the DNSSEC path or who can break into it', including the TLD operator for your domain's TLD.

There are a relatively small number of Certificate Authorities in the world and even large ones have had problems, never mind the one that got completely compromised. Our most effective tool against TLS mis-issuance is exactly a replicated, distributed global record of issued certificates. DNS and DANE bypass this, unless you require all DANE-obtained TLS certificates to be in Certificate Transparency logs just like CA-issued TLS certificates (and even then, Certificate Transparency is an after the fact thing; the damage has probably been done once you detect it).

In addition, there's no obvious way to revoke or limit DNSSEC the way there is for a mis-behaving Certificate Authority. If a TLD had its DNSSEC environment completely compromised, does anyone think it would be removed from global DNS, the way DigiNotar was removed from global trust? That's not very likely; the damage would be too severe for most TLDs. One of the reasons that Certificate Authorities can be distrusted is that what they do is limited and you can replace one with another. This isn't true for DNS and TLDs.

DNS is an extremely bad fit for a system where you absolutely want everyone to see the same 'answer' and to have high assurance that you know what that answer is (and that you are the only person who can put it there). It's especially bad if you want to globally limit who is trusted and allow that trust to be removed or limited without severe damage. In general, if security would be significantly compromised should people receive a different answer than the one you set up, DNS is not what you want to use.

(I know, this is how DNS and email mostly work today, but that is historical evolution and backward compatibility. We would not design email to work like that if we were doing it from scratch today.)

(This entry was sparked by ghop's comment mentioning DANE on my first entry.)

The clever tricks of OpenPubkey and OPKSSH

By: cks
19 April 2025 at 02:24

OPKSSH (also) is a clever way of using OpenID Connect (OIDC) to authenticate your OpenSSH sessions (it's not the only way to do this). How it works is sufficiently ingenious and clever that I want to write it up, especially as one underlying part uses a general trick.

OPKSSH itself is built on top of OpenPubkey, which is a trick to associate your keypair with an OIDC token. When you perform OIDC authentication, what you get back (at an abstract level) is a signed set of 'claims' and, crucially, a nonce. The nonce is supplied by the client that initiated the OIDC authentication so that it can know that the ID token it eventually gets back actually comes from this authentication session and wasn't obtained through some other one. The client initiating OIDC authentication doesn't get to ask the OIDC identity provider (OP) to include other fields.

What OpenPubkey does is turn the nonce into a signature for a combination of your public key and a second nonce of its own, by cryptographically hashing these together through a defined process. Because the OIDC IdP is signing a set of claims that include the calculated nonce, it is effectively signing a signature of your public key. If you give people the signed OIDC ID token, your public key, and your second nonce, they can verify this (and you can augment the ID Token you get back to get a PK Token that embeds this additional information).

(As I understand it, calculating the OIDC ID Token nonce this way is safe because it still includes a random value (the inner nonce) and due to the cryptographic hashing, the entire calculated nonce is still effectively a non-repeating random value.)
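
To illustrate the general trick (this is not OpenPubkey's exact encoding or hashing process, just the shape of the idea in Python): hash your public key together with a random value of your own and use the result as the OIDC nonce, then later publish the inputs so others can recompute it.

    import hashlib
    import secrets

    def commitment_nonce(public_key_pem):
        # Hash the public key together with a fresh random value; the result
        # still looks like a random, non-repeating nonce to the OIDC IdP, but
        # once the IdP signs it in the ID Token, it has effectively signed a
        # commitment to the public key.
        inner_nonce = secrets.token_urlsafe(32)
        digest = hashlib.sha256((public_key_pem + inner_nonce).encode("utf-8"))
        return digest.hexdigest(), inner_nonce

    # You use the first value as the OIDC 'nonce' and keep the second; anyone
    # given the ID Token, the public key, and the inner nonce can redo the
    # hash and check it against the signed 'nonce' claim.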

To smuggle this PK Token to the OpenSSH server, OPKSSH embeds it as an additional option field in an OpenSSH certificate (called 'openpubkey-pkt'). The certificate itself is for your generated PK Token private key and is (self) signed with it, but this is all perfectly fine with OpenSSH; SSH clients will send the certificate off to the server as a candidate authentication key and the server will read it in. Normally the server would reject it since it's signed by an unknown SSH certificate authority, but OPKSSH uses a clever trick with OpenSSH's AuthorizedKeysCommand server option to get its hands on the full certificate, which lets it extract the PK Token, verify everything, and tell the SSH server daemon that your public key is the underlying OpenPubkey key (which you have the private key for in your SSH client).

Smuggling information through OpenSSH certificates and then processing them with AuthorizedKeysCommand is a clever trick, but it's specific to OpenSSH. Turning a nonce into a signature is a general trick that was eye-opening to me, especially because you can probably do it repeatedly.

The DNS system isn't a database and shouldn't be used as one

By: cks
16 April 2025 at 02:53

Over on the Fediverse, I said something:

Thesis: DNS is not meaningfully a database, because it's explicitly designed and used today so that it gives different answers to different people. Is it implemented with databases? Sure. But treating it as a database is a mistake. It's a query oracle, and as a query oracle it's not trustworthy in the way that you would normally trust a database to be, for example, consistent between different people querying it.

It would be nice if we had a global, distributed, relatively easily queryable, consistent database system. It would make a lot of things pretty nice, especially if we could wrap some cryptography around it to make sure we were getting honest answers. However, the general DNS system is not such a database and can't be used as one, and as a result should not be pressed into service as one in protocols.

DNS is designed from the ground up to lie to you in unpredictable ways, and parts of the DNS system lie to you every day. We call these lies things like 'outdated cached data' or 'geolocation based DNS' (or 'split horizon DNS'), but they're lies, or at least inconsistent alternate versions of some truth. The same fundamental properties that allow these inconsistent alternate versions also allow for more deliberate and specific lies, and they also mean that no one can know with assurance what version of DNS anyone else is seeing.

(People who want to reduce the chance for active lies as much as possible must do a variety of relatively extreme things, like query DNS from multiple vantage points around the Internet and perhaps through multiple third party DNS servers. No, checking DNSSEC isn't enough, even when it's present (also), because that just changes who can be lying to you.)

Anything that uses the global DNS system should be designed to expect outdated, inconsistent, and varying answers to the questions it asks (and sometimes incorrect answers, for various reasons). Sometimes those answers will be lies (including the lie of 'that name doesn't exist'). If your design can't deal with all of this, you shouldn't be using DNS.

The problem of general OIDC identity provider support in clients

By: cks
10 April 2025 at 02:34

I've written entries criticizing things that support using OIDC (OAuth2) authentication for not supporting it with general OIDC identity providers ('OPs' in OIDC jargon), only with specific (large) ones like Microsoft and Google (and often Github in tech-focused things). For example, there are almost no mail clients that support using your own IdP, and it's much easier to find web-based projects that support the usual few big OIDC providers and not your own OIDC OP. However, at the same time I want to acknowledge the practical problems with supporting arbitrary OIDC OPs in things, especially in things that ordinary people are going to be expected to set up themselves.

The core problem is that there is no way to automatically discover all of the information that you need to know in order to start OIDC authentication. If the person gives you their email address, perhaps you can use WebFinger to discover basic information through OIDC Identity Provider discovery, but that isn't sufficient by itself (and it also requires aligning a number of email addresses). In practice, the OIDC OP will require you to have a 'client identifier' and perhaps a 'client secret', both of which are essentially arbitrary strings. If you're a website, the OIDC standards require your 'redirect URI' to have been pre-registered with it. If you're a client program, hopefully you can supply some sort of 'localhost' redirect URI and have it accepted, but you may need to tell the person setting things up on the OIDC OP side that you need specific strings set.

(The client ID and especially the client secret are not normally supposed to be completely public; there are various issues if you publish them widely and then use them for a bunch of different things, cf.)
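
OIDC provider discovery does get you part of the way, but only part. A sketch of what it provides (in Python with 'requests'); note that nothing in the metadata tells you your client ID, your client secret, which redirect URI the OP will accept from you, or which claim to treat as a login name:

    import requests

    def discover_provider(issuer):
        # Standard OIDC discovery: the issuer's metadata lists its endpoints
        # and advertised capabilities, but none of the per-client values.
        url = issuer.rstrip("/") + "/.well-known/openid-configuration"
        meta = requests.get(url, timeout=10).json()
        return {
            "authorize": meta["authorization_endpoint"],
            "token": meta["token_endpoint"],
            "userinfo": meta.get("userinfo_endpoint"),
            "scopes": meta.get("scopes_supported"),
        }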

If you need specific information, even to know who the authenticated person is, this isn't necessarily straightforward. You may have to ask for exactly the right information, neither too much nor too little, and you can't necessarily assume you know where a user or login name is; you may have to ask the person setting up the custom OIDC IdP where to get this. On the good side, there is at least a specific place for where people's email addresses are (but you can't assume that this is the same as someone's login).

(In OIDC terms, you may need to ask for specific scopes and then use a specific claim to get the user or login name. You can't assume that the always-present 'sub' claim is a login name, although it often is; it can be an opaque identifier that's only meaningful to the identity provider.)

Now imagine that you're the author of a mail client that wants to provide a user friendly experience to people. Today, the best you can do is provide a wall of text fields that people have to enter the right information into, with very little validation possible. If people get things even a little bit wrong, all you and they may see is inscrutable error messages. You're probably going to have to describe what people need to do and the information they need to get in technical OIDC terms that assume people can navigate their specific OIDC IdP (or that someone can navigate this for them). You could create a configuration file format for this where the OIDC IdP operator can write down all of the information, give it to the people using your software, and they import it (much like OpenVPN can provide canned configuration files), but you'll be inventing that format (cue xkcd).

If you have limited time and resources to develop your software and help people using it, it's much simpler to support only a few large, known OIDC identity providers. If things need specific setup on the OIDC IdP side, you can feasibly provide that in your documentation (since there's only a few variations), and you can pre-set everything in your program, complete with knowledge about things like OIDC scopes and claims. It's also going to be fairly easy to test your code and procedures against these identity providers, while if you support custom OIDC IdPs you may need to figure out how to set up one (or several), how to configure it, and so on.

OIDC/OAuth2 as the current all purpose 'authentication hammer'

By: cks
4 April 2025 at 02:50

Today, for reasons, I found myself reflecting that OIDC/OAuth2 seems to have become today's all purpose authentication method, rather than just being a web authentication and Single Sign On system. Obviously you can authenticate websites with OIDC, as well as anything that you can reasonably implement using a website as part of things, but it goes beyond this. You can use OIDC/OAuth2 tokens to authenticate IMAP, POP3, and authenticated SMTP (although substantial restrictions apply), you can (probably) authenticate yourself to various VPN software through OIDC, there are several ways of doing SSH authentication with OIDC, and there are likely others. OIDC/OAuth2 is a supported SASL mechanism, so protocols with SASL support can in theory use OIDC tokens for authentication (although your backend has to support this, as I suppose do your clients). And in general you can pass OAuth2 tokens around somehow to validate yourself over some bespoke protocol.

On the one hand, this is potentially quite useful if you have an OIDC identity server (an 'OP'), perhaps one with some special custom authentication behavior. Once you have your special server, OIDC is your all purpose tool to get its special behavior supported everywhere (as opposed to having to build and hook up your special needs with bespoke behavior in everything, assuming that's even possible). It does have the little drawback that you wind up with OIDC on the brain and see OIDC as the solution to all of your problems, much like hammers.

(Another use of OIDC is to outsource all of your authentication and perhaps even identity handling to some big third party provider (such as Google, Microsoft/Office365, Github, etc). This saves you from having to run your own authentication and identity servers, manage your own Multi-Factor Authentication handling, and so on.)

On the other hand, the OIDC authentication flow is unapologetically web based, and in practice often needs a browser with JavaScript and cookies (cookies may be required in the protocol, I haven't checked). This means that any regular program that wants to use OIDC to authenticate you to something must either call up your browser somehow and then collect the result or it must embed a browser within itself in a little captive browser interface (where it's probably easier to collect the result). This has a variety of limitations and implications, especially if you want to authenticate yourself through OIDC on a server style machine where you don't even have a web browser you can readily run (or a GUI).

(There are awkward tricks around this, cf, or you can outsource part of the authentication to a trusted website that the server program checks in with.)

OIDC isn't the first or the only web authentication protocol; there's also at least SAML, which I believe predates it. But I don't think SAML caught on outside of (some) web authentication. Perhaps it's the XML, which has had what you could call 'some problems' over the years (see also discussions of how SAML requires specific XML handling guarantees that general XML libraries don't necessarily provide).

Public Figures Keep Leaving Their Venmo Accounts Public

By: Nick Heer
27 March 2025 at 04:00

The high-test idiocy of a senior U.S. politician inviting a journalist to an off-the-record chat planning an attack on Yemen, killing over thirty people and continuing a decade of war, seems to have popularized a genre of journalism dedicated to the administration’s poor digital security hygiene. Some of these articles feel less substantial; others suggest greater crimes. One story feels like deja vu.

Dhruv Mehrotra and Tim Marchman, Wired:

The Venmo account under [Mike] Waltz’s name includes a 328-person friend list. Among them are accounts sharing the names of people closely associated with Waltz, such as [Walker] Barrett, formerly Waltz’s deputy chief of staff when Waltz was a member of the House of Representatives, and Micah Thomas Ketchel, former chief of staff to Waltz and currently a senior adviser to Waltz and President Donald Trump.

[…]

One of the most notable appears to belong to [Susie] Wiles, one of Trump’s most trusted political advisers. That account’s 182-person friend list includes accounts sharing the names of influential figures like Pam Bondi, the US attorney general, and Hope Hicks, Trump’s former White House communications director.

In 2021, reporters for Buzzfeed News found Joe Biden’s Venmo account and his contacts. Last summer, the same Wired reporters plus Andrew Couts found J.D. Vance’s and, in February, reporters for the American Prospect found Pete Hegseth’s. It remains a mystery to me why one of the most popular U.S. payment apps is this public.


In universities, sometimes simple questions aren't simple

By: cks
29 March 2025 at 02:13

Over on the Fediverse I shared a recent learning experience:

Me, an innocent: "So, how many professors are there in our university department?"
Admin person with a thousand yard stare: "Well, it depends on what you mean by 'professor', 'in', and 'department'." <unfolds large and complicated chart>

In many companies and other organizations, the status of people is usually straightforward. In a university, things are quite often not so clear, and in my department all three words in my joke are in fact not a joke (although you could argue that two overlap).

For 'professor', there are a whole collection of potential statuses beyond 'tenured or tenure stream'. Professors may be officially retired but still dropping by to some degree ('emeritus'), appointed only for a limited period (but doing research, not just teaching), hired as sessional instructors for teaching, given a 'status-only' appointment, and other possible situations.

(In my university, there's such a thing as teaching stream faculty, who are entirely distinct from sessional instructors. In other universities, all professors are what we here would call 'research stream' professors and do research work as well as teaching.)

For 'in', even once you have a regular full time tenure stream professor, there's a wide range of possibilities for a professor to be cross appointed (also) between departments (or sometimes 'partially appointed' by two departments). These sort of multi-department appointments are done for many reasons, including to enable a professor in one department to supervise graduate students in another one. How much of the professor's salary each department pays varies, as does where the professor actually does their research and what facilities they use in each department.

(Sometimes a multi-department professor will be quite active in both departments because their core research is cross-disciplinary, for example.)

For 'department', this is a local peculiarity in my university. We have three campuses, and professors are normally associated with a specific campus. Depending on how you define 'the department', you might or might not consider Computer Science professors at the satellite campuses to be part of the (main campus) department. Sometimes it depends on what the professors opt to do, for example whether or not they will use our main research computing facilities, or whether they'll be supervising graduate students located at our main campus.

Which answers you want for all of these depends on what you're going to use the resulting number (or numbers) for. There is no singular and correct answer for 'how many professors are there in the department'. The corollary to this is that any time we're asked how many professors are in our department, we have to quiz the people asking about what parts matter to them (or guess, or give complicated and conditional answers, or all of the above).

(Asking 'how many professor FTEs do we have' isn't any better.)

PS: If you think this complicates the life of any computer IAM system that's trying to be a comprehensive source of answers, you would be correct. Locally, my group doesn't even attempt to track these complexities and instead has a much simpler view of things that works well enough for our purposes (mostly managing Unix accounts).

OIDC claim scopes and their interactions with OIDC token authentication

By: cks
17 March 2025 at 02:31

When I wrote about how SAML and OIDC differed in sharing information, where SAML shares every SAML 'attribute' by default and OIDC has 'scopes' for its 'claims', I said that the SAML approach was probably easier within an organization, where you already have trust in the clients. It turns out that there's an important exception to this I didn't realize at the time, and that's when programs (like mail clients) are using tokens to authenticate to servers (like IMAP servers).

In OIDC/OAuth2 (and probably in SAML as well), programs that obtain tokens can open them up and see all of the information that they contain, either inspecting them directly or using a public OIDC endpoint that allows them to 'introspect' the token for additional information (this is the same endpoint that will be used by your IMAP server or whatever). Unless you enjoy making a bespoke collection of (for example) IMAP clients, the information that programs need to obtain tokens is going to be more or less public within your organization and will probably (or even necessarily) leak outside of it.

(For example, you can readily discover all of the OIDC client IDs used by Thunderbird for the various large providers it supports. There's nothing stopping you from using those client IDs and client secrets yourself, although large providers may require your target to have specifically approved using Thunderbird with your target's accounts.)
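
As a concrete illustration of 'opening up' a token: if the token is in JWT form, anything that holds it can read the contents without any keys at all; the signature only prevents tampering, not inspection. A minimal sketch with the PyJWT library:

    import jwt    # PyJWT

    def peek_at_token(token):
        # Decode without verifying the signature; this shows everything the
        # token carries to whoever has a copy of it.
        header = jwt.get_unverified_header(token)
        claims = jwt.decode(token, options={"verify_signature": False})
        return header, claims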

This means that anyone who can persuade your people to authenticate through a program's usual flow can probably extract all of the information available in the token. They can do this either on the person's computer (capturing the token locally) or by persuading people that they need to 'authenticate to this service with IMAP OAuth2' or the like and then extracting the information from the token.

In the SAML world, this will by default be all of the information contained in the token. In the OIDC world, you can restrict the information made available through tokens issued through programs by restricting the scopes that you allow programs to ask for (and possibly different scopes for different programs, although this is a bit fragile; attackers may get to choose which program's client ID and so on they use).

(Realizing this is going to change what scopes we allow in our OIDC IdP for program client registrations. So far I had reflexively been giving them access to everything, just like our internal websites; now I think I'm going to narrow it down to almost nothing.)

Sidebar: How your token-consuming server knows what created them

When your server verifies OAuth2/OIDC tokens presented to it, the minimum thing you want to know is that they come from the expected OIDC identity provider, which is normally achieved automatically because you'll ask that OIDC IdP to verify that the token is good. However, you may also want to know that the token was specifically issued for use with your server, or through a program that's expected to be used for your server. The normal way to do this is through the 'aud' OIDC claim, which has at least the client ID (and in theory your OIDC IdP could add additional entries). If your OIDC IdP can issue tokens through multiple identities (perhaps to multiple parties, such as the major IdPs of, for example, Google and Microsoft), you may also want to verify the 'iss' (issuer) field instead of or in addition to 'aud'.
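
A small sketch of those two checks on a set of claims you've already obtained and verified (in Python; the values are placeholders, and 'aud' can be either a single string or a list, so you have to allow for both):

    def token_is_for_us(claims, our_client_id, expected_issuer):
        # 'aud' may be a single string or a list of audiences.
        aud = claims.get("aud")
        audiences = [aud] if isinstance(aud, str) else (aud or [])
        if our_client_id not in audiences:
            return False
        # Also pin the issuer if you accept tokens from more than one
        # identity provider (or one IdP with multiple identities).
        return claims.get("iss") == expected_issuer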

Some notes on the OpenID Connect (OIDC) 'redirect uri'

By: cks
16 March 2025 at 02:57

The normal authentication process for OIDC is web-based and involves a series of HTTP redirects, interspersed with web pages that you interact with. Something that wants to authenticate you will redirect you to the OIDC identity server's website, which will ask you for your login and password and maybe MFA authentication, check them, and then HTTP redirect you back to a 'callback' or 'redirect' URL that will transfer a magic code from the OIDC server to the OIDC client (generally as a URL query parameter). All of this happens in your browser, which means that the OIDC client and server don't need to be able to directly talk to each other, allowing you to use an external cloud/SaaS OIDC IdP to authenticate to a high-security internal website that isn't reachable from the outside world and maybe isn't allowed to make random outgoing HTTP connections.

(The magic code transferred in the final HTTP redirect is apparently often not the authentication token itself but instead something the client can use for a short time to obtain the real authentication token. This does require the client to be able to make an outgoing HTTP connection, which is usually okay.)

When the OIDC client initiates the HTTP redirection to the OIDC IdP server, one of the parameters it passes along is the 'redirect uri' it wants the OIDC server to use to pass the magic code back to it. A malicious client (or something that's gotten a client's ID and secret) could do some mischief by manipulating this redirect URL, so the standard specifically requires that OIDC IdP have a list of allowed redirect uris for each registered client. The standard also says that in theory, the client's provided redirect uri and the configured redirect uris are compared as literal string values. So, for example, 'https://example.org/callback' doesn't match 'https://example.org/callback/'.

This is straightforward when it comes to websites as OIDC clients, since they should have well defined callback urls that you can configure directly into your OIDC IdP when you set up each of them. It gets more hairy when what you're dealing with is programs as OIDC clients, where they are (for example) trying to get an OIDC token so they can authenticate to your IMAP server with OAuth2, since these programs don't normally have a website. Historically, there are several approaches that people have taken for programs (or seem to have, based on my reading so far).

Very early on in OAuth2's history, people apparently defined the special redirect uri value 'urn:ietf:wg:oauth:2.0:oob' (which is now hard to find or identify documentation on). An OAuth2 IdP that saw this redirect uri (and maybe had it allowed for the client) was supposed to not redirect you but instead show you a HTML page with the magic OIDC code displayed on it, so you could copy and paste the code into your local program. This value is now obsolete but it may still be accepted by some IdPs (you can find it listed for Google in mutt_oauth2.py, and I spotted an OIDC IdP server that handles it).

Another option is that the IdP can provide an actual website that does the same thing; if you get HTTP redirected to it with a valid code, it will show you the code on a HTML page and you can copy and paste it. Based on mutt_oauth2.py again, it appears that Microsoft may have at one point done this, using https://login.microsoftonline.com/common/oauth2/nativeclient as the page. You can do this too with your own IdP (or your own website in general), although it's not recommended for all sorts of reasons.

The final broad approach is to use 'localhost' as the target host for the redirect. There are several ways to make this work, and one of them runs into complications with the IdP's redirect uri handling.

The obvious general approach is for your program to run a little HTTP server that listens on some port on localhost, and capture the code when the (local) browser gets the HTTP redirect to localhost and visits the server. The problem here is that you can't necessarily listen on port 80, so your redirect uri needs to include the port you're listening on (eg 'http://localhost:7000'), and if your OIDC IdP is following the standard it must be configured not just with 'http://localhost' as the allowed redirect uri but the specific port you'll use. Also, because of string matching, if the OIDC IdP lists 'http://localhost:7000', you can't send 'http://localhost:7000/' despite them being the same URL.

(And your program has to use 'localhost', not '127.0.0.1' or the IPv6 loopback address; although the two have the same effect, they're obviously not string-identical.)
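
A minimal sketch of this sort of localhost listener in Python (the port is a placeholder and has to match whatever the IdP has been configured to allow as the redirect uri, eg 'http://localhost:7000'):

    import http.server
    import urllib.parse

    class CodeCatcher(http.server.BaseHTTPRequestHandler):
        # Handles the single redirect back from the IdP and stashes the
        # 'code' query parameter for the rest of the program.
        code = None

        def do_GET(self):
            query = urllib.parse.urlparse(self.path).query
            CodeCatcher.code = urllib.parse.parse_qs(query).get("code", [None])[0]
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"You can close this window now.\n")

    server = http.server.HTTPServer(("localhost", 7000), CodeCatcher)
    server.handle_request()         # serve exactly the one redirect request
    print("authorization code:", CodeCatcher.code)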

Based on experimental evidence from OIDC/OAuth2 client configurations, I strongly suspect that some large IdP providers have non-standard, relaxed handling of 'localhost' redirect uris such that their client configuration lists 'http://localhost' and the IdP will accept some random port glued on in the actual redirect uri (or maybe this behavior has been standardized now). I suspect that the IdPs may also accept the trailing slash case. Honestly, it's hard to see how you get out of this if you want to handle real client programs out in the wild.

(Some OIDC IdP software definitely does the standard compliant string comparison. The one I know of for sure is SimpleSAMLphp's OIDC module. Meanwhile, based on reading the source code, Dex uses a relaxed matching for localhost in its matching function, provided that there are no redirect uris registered for the client. Dex also still accepts the urn:ietf:wg:oauth:2.0:oob redirect uri, so I suspect that there are still uses out there in the field.)

If the program has its own embedded web browser that it's in full control of, it can do what Thunderbird appears to do (based on reading its source code). As far as I can tell, Thunderbird doesn't run a local listening server; instead it intercepts the HTTP redirection to 'http://localhost' itself. When the IdP sends the final HTTP redirect to localhost with the code embedded in the URL, Thunderbird effectively just grabs the code from the redirect URL in the HTTP reply and never actually issues a HTTP request to the redirect target.

The final option is to not run a localhost HTTP server and to tell people running your program that when their browser gives them an 'unable to connect' error at the end of the OIDC authentication process, they need to go to the URL bar and copy the 'code' query parameter into the program (or if you're being friendly, let them copy and paste the entire URL and you extract the code parameter). This allows your program to use a fixed redirect uri, including just 'http://localhost', because it doesn't have to be able to listen on it or on any fixed port.

(This is effectively a more secure but less user friendly version of the old 'copy a code that the website displayed' OAuth2 approach, and that approach wasn't all that user friendly to start with.)

PS: An OIDC redirect uri apparently allows things other than http:// and https:// URLs; there is, for example, the 'openid-credential-offer' scheme. I believe that the OIDC IdP doesn't particularly do anything with those redirect uris other than accept them and issue a HTTP redirect to them with the appropriate code attached. It's up to your local program or system to intercept HTTP requests for those schemes and react appropriately, much like Thunderbird does, but perhaps easier because you can probably register the program as handling all 'whatever-special://' URLs so the redirect is automatically handed off to it.

(I suspect that there are more complexities in the whole OIDC and OAuth2 redirect uri area, since I'm new to the whole thing.)

The commodification of desktop GUI behavior

By: cks
13 March 2025 at 03:08

Over on the Fediverse, I tried out a thesis:

Thesis: most desktop GUIs are not opinionated about how you interact with things, and this is why there are so many GUI toolkits and they make so little difference to programs, and also why the browser is a perfectly good cross-platform GUI (and why cross-platform GUIs in general).

Some GUIs are quite opinionated (eg Plan 9's Acme) but most are basically the same. Which isn't necessarily a bad thing but it creates a sameness.

(Custom GUIs are good for frequent users, bad for occasional ones.)

Desktop GUIs differ in how they look and to some extent in how you do certain things and how you expect 'native' programs to behave; I'm sure the fans of any particular platform can tell you all about little behaviors that they expect from native applications that imported ones lack. But I think we've pretty much converged on a set of fundamental behaviors for how to interact with GUI programs, or at least how to deal with basic ones, so in a lot of cases the question about GUIs is how things look, not how you do things at all.

(Complex programs have for some time been coming up with their own bespoke alternatives to, for example, huge cascades of menus. If these are successful they tend to get more broadly adopted by programs facing the same problems; consider the 'ribbon', which got what could be called a somewhat mixed reaction on its modern introduction.)

On the desktop, changing the GUI toolkit that a program uses (either on the same platform or on a different one) may require changing the structure of your code (in addition to ordinary code changes), but it probably won't change how your program operates. Things will look a bit different, maybe some standard platform features will appear or disappear, but it's not a completely different experience. This often includes moving your application from the desktop into the browser (a popular and useful 'cross-platform' environment in itself).

This is less true on mobile platforms, where my sense is that the two dominant platforms have evolved somewhat different idioms for how you interact with applications. A proper 'native' application behaves differently on the two platforms even if it's using mostly the same code base.

GUIs such as Plan 9's Acme show that this doesn't have to be the case; for that matter, so does GNU Emacs. GNU Emacs has a vague shell of a standard looking GUI but it's a thin layer over a much different and stranger vastness, and I believe that experienced Emacs people do very little interaction with it.

How SAML and OIDC differ in sharing information, and perhaps why

By: cks
9 March 2025 at 04:39

In practice, SAML and OIDC are two ways of doing third party web-based authentication (and thus a Single Sign On (SSO)) system; the web site you want to use sends you off to a SAML or OIDC server to authenticate, and then the server sends authentication information back to the 'client' web site. Both protocols send additional information about you along with the bare fact of an authentication, but they differ in how they do this.

In SAML, the SAML server sends a collection of 'attributes' back to the SAML client. There are some standard SAML attributes that client websites will expect, but the server is free to throw in any other attributes it feels like, and I believe that servers do things like turn every LDAP attribute they get from a LDAP user lookup into a SAML attribute (certainly SimpleSAMLphp does this). As far as I know, any filtering of what SAML attributes are provided by the server to any particular client is a server side feature, and SAML clients don't necessarily have any way of telling the SAML server what attributes they want or don't want.

In OIDC, the equivalent way of returning information is 'claims', which are grouped into 'scopes', along with basic claims that you get without asking for a scope. The expectation in OIDC is that clients that want more than the basic claims will request specific scopes and then get back (only) the claims for those scopes. There are standard scopes with standard claims (not all of which are necessarily returned by any given OIDC server). If you want to add additional information in the form of more claims, I believe that it's generally expected that you'll create one or more custom scopes for those claims and then have your OIDC clients request them (although not all OIDC clients are willing and able to handle custom scopes).

(I think in theory an OIDC server may be free to shove whatever claims it wants to into information for clients regardless of what scopes the client requested, but an OIDC client may ignore any information it didn't request and doesn't understand rather than pass it through to other software.)

The SAML approach is more convenient for server and client administrators who are working within the same organization. The server administrator can add whatever information to SAML responses that's useful and convenient, and SAML clients will generally automatically pick it up and often make it available to other software. The OIDC approach is less convenient, since you need to create one or more additional scopes on the server and define what claims go in them, and then get your OIDC clients to request the new scopes; if an OIDC client doesn't update, it doesn't get the new information. However, the OIDC approach makes it easier for both clients and servers to be more selective and thus potentially for people to control how much information they give to who. An OIDC client can ask for only minimal information by only asking for a basic scope (such as 'email') and then the OIDC server can tell the person exactly what information they're approving being passed to the client, without the OIDC server administrators having to get involved to add client-specific attribute filtering.

(In practice, OIDC probably also encourages giving less information to even trusted clients in general since you have to go through these extra steps, so you're less likely to do things like expose all LDAP information as OIDC claims in some new 'our-ldap' scope or the like.)

My guess is that OIDC was deliberately designed this way partly in order to make it better for use with third party clients. Within an organization, SAML's broad sharing of information may make sense, but it makes much less sense in a cross-organization context, where you may be using OIDC-based 'sign in with <large provider>' on some unrelated website. In that sort of case, you certainly don't want that website to get every scrap of information that the large provider has on you, but instead only ask for (and get) what it needs, and for it to not get much by default.

The OpenID Connect (OIDC) 'sub' claim is surprisingly load-bearing

By: cks
8 March 2025 at 04:24

OIDC (OpenID Connect) is today's better or best regarded standard for (web-based) authentication. When a website (or something) authenticates you through an OpenID (identity) Provider (OP), one of the things it gets back is a bunch of 'claims', which is to say information about the authenticated person. One of the core claims is 'sub', which is vaguely described as a string that is 'subject - identifier for the end-user at the issuer'. As I discovered today, this claim is what I could call 'load bearing' in a surprising way or two.

In theory, 'sub' has no meaning beyond identifying the user in some opaque way. The first way it's load bearing is that some OIDC client software (a 'Relying Party (RP)') will assume that the 'sub' claim has a human useful meaning. For example, the Apache OpenIDC module defaults to putting the 'sub' claim into Apache's REMOTE_USER environment variable. This is fine if your OIDC IdP software puts, say, a login name into it; it is less fine if your OIDC IdP software wants to create 'sub' claims that look like 'YXVzZXIxMi5zb21laWRw'. These claims mean something to your server software but not necessarily to you and the software you want to use on (or behind) OIDC RPs.

The second and more surprising way that the 'sub' claim is load bearing involves how external consumers of your OIDC IdP keep track of your people. In common situations your people will be identified and authorized by their email address (using some additional protocols), which they enter into the outside OIDC RP that's authenticating against your OIDC IdP, and this looks like the identifier that RP uses to keep track of them. However, at least one such OIDC RP assumes that the 'sub' claim for a given email address will never change, and I suspect that there are more people who either quietly use the 'sub' claim as the master key for accounts or who require 'sub' and the email address to be locked together this way.

This second issue makes the details of how your OIDC IdP software generates its 'sub' claim values quite important. You want it to be able to generate those 'sub' values in a clear and documented way that other OIDC IdP software can readily duplicate to create the same 'sub' values, and that won't change if you change some aspect of the OIDC IdP configuration for your current software. Otherwise you're at least stuck with your current OIDC IdP software, and perhaps with its exact current configuration (for authentication sources, internal names of things, and so on).

(If you have to change 'sub' values, for example because you have to migrate to different OIDC IdP software, this could go as far as the outside OIDC RP basically deleting all of their local account data for your people and requiring all of it to be entered back from scratch. But hopefully those outside parties have a better procedure than this.)
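
As an illustration of what a 'clear and documented' scheme could look like, here's a small sketch (my own made-up approach, not what any particular IdP actually does) that derives 'sub' as an HMAC of the login name under a long-lived secret key. Any replacement IdP that's given the same key and the same login names can reproduce the same 'sub' values.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// subFor derives a stable, opaque 'sub' value from a login name and a
// long-lived secret key. As long as the key and the login names survive a
// migration, different IdP software can generate identical 'sub' values.
func subFor(key []byte, login string) string {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(login))
	return base64.RawURLEncoding.EncodeToString(m.Sum(nil))
}

func main() {
	key := []byte("a long-lived secret that outlives any one IdP")
	fmt.Println(subFor(key, "auser12"))
}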

The problem facing MFA-enabled IMAP at the moment (in early 2025)

By: cks
7 March 2025 at 04:32

Suppose that you have an IMAP server and you would like to add MFA (Multi-Factor Authentication) protection to it. I believe that in theory the IMAP protocol supports multi-step 'challenge and response' style authentication, so again in theory you could implement MFA this way, but in practice this is unworkable because people would be constantly facing challenges. Modern IMAP clients (and servers) expect to be able to open and close connections more or less on demand, rather than opening one connection, holding it open, and doing everything over it. To make IMAP MFA practical, you need to do it with some kind of 'Single Sign On' (SSO) system. The current approach for this uses an OIDC identity provider for the SSO part and SASL OAUTHBEARER authentication between the IMAP client and the IMAP server, using information from the OIDC IdP.

So in theory, your IMAP client talks to your OIDC IdP to get a magic bearer token, provides this token to the IMAP server, the IMAP server verifies that it comes from a configured and trusted IdP, and everything is good. You only have to go through authenticating to your OIDC IdP SSO system every so often (based on whatever timeout it's configured with); the rest of the time the aggregate system does any necessary token refreshes behind the scenes. And because OIDC has a discovery process that can more or less start from your email address (as I found out), it looks like IMAP clients like Thunderbird could let you more or less automatically use any OIDC IdP if people had set up the right web server information.
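
For concreteness, most of the IMAP side of this boils down to the client packaging the bearer token into a SASL OAUTHBEARER initial response (RFC 7628). Here's a minimal sketch of building that response in Go; the token value and host names are obviously made up.

package main

import "fmt"

// oauthBearerResponse builds the SASL OAUTHBEARER initial client response
// from RFC 7628: a GS2 header naming the user, then \x01-separated
// key=value pairs carrying the bearer token, ending with \x01\x01.
func oauthBearerResponse(user, host string, port int, token string) string {
	return fmt.Sprintf("n,a=%s,\x01host=%s\x01port=%d\x01auth=Bearer %s\x01\x01",
		user, host, port, token)
}

func main() {
	resp := oauthBearerResponse("user@example.org", "imap.example.org", 993,
		"example-access-token")
	fmt.Printf("%q\n", resp)
}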

If you actually try this right now, you'll find that Thunderbird, apparently along with basically all significant IMAP client programs, will only let you use a few large identity providers; here is Thunderbird's list (via). If you read through that Thunderbird source file, you'll find one reason for this limitation, which is that each provider has one or two magic values (the 'client ID' and usually the 'client secret', which is obviously not so secret here), in addition to URLs that Thunderbird could theoretically autodiscover if everyone supported the current OIDC autodiscovery protocols (my understanding is that not everyone does). In most current OIDC identity provider software, these magic values are either given to the IdP software or generated by it when you set up a given OIDC client program (a 'Relying Party (RP)' in the OIDC jargon).

This means that in order for Thunderbird (or any other IMAP client) to work with your own local OIDC IdP, there would have to be some process where people could load this information into Thunderbird. Alternately, Thunderbird could publish default values for these and anyone who wanted their OIDC IdP to work with Thunderbird would have to add these values to it. To date, creators of IMAP client software have mostly not supported either option and instead hard code a list of big providers who they've arranged more or less explicit OIDC support with.

(Honestly it's not hard to see why IMAP client authors have chosen this approach. Unless you're targeting a very technically inclined audience, walking people through the process of either setting this up in the IMAP client or verifying if a given OIDC IdP supports the client is daunting. I believe some IMAP clients can be configured for OIDC IdPs through 'enterprise policy' systems, but there the people provisioning the policies are supposed to be fairly technical.)

PS: Potential additional references on this mess include David North's article and this FOSDEM 2024 presentation (which I haven't yet watched, I only just stumbled into this mess).

Always sync your log or journal files when you open them

By: cks
1 March 2025 at 03:10

Today I learned of a new way to accidentally lose data 'written' to disk, courtesy of this Fediverse post summarizing a longer article about CouchDB and this issue. Because this is so nifty and startling when I encountered it, yet so simple, I'm going to re-explain the issue in my own words and explain how it leads to the title of this entry.

Suppose that you have a program that makes data it writes to disk durable through some form of journal, write ahead log (WAL), or the like. As we all know, data that you simply write() to the operating system isn't yet on disk; the operating system is likely buffering the data in memory before writing it out at the OS's own convenience. To make the data durable, you must explicitly flush it to disk (well, ask the OS to), for example with fsync(). Your program is a good program, so of course it does this; when it updates the WAL, it write()s then fsync()s.

Now suppose that your program is terminated after the write but before the fsync. At this point you have a theoretically incomplete and improperly written journal or WAL, since it hasn't been fsync'd. However, when your program restarts and goes through its crash recovery process, it has no way to discover this. Since the data was written (into the OS's disk cache), the OS will happily give the data back to you even though it's not yet on disk. Now assume that your program takes further actions (such as updating its main files) based on the belief that the WAL is fully intact, and then the system crashes, losing that buffered and not yet written WAL data. Oops. You (potentially) have a problem.

(These days, programs can get terminated for all sorts of reasons other than a program bug that causes a crash. If you're operating in a modern containerized environment, your management system can decide that your program or its entire container ought to shut down abruptly right now. Or something else might have run the entire system out of memory and now some OOM handler is killing your program.)

To avoid the possibility of this problem, you need to always force a disk flush when you open your journal, WAL, or whatever; on Unix, you'd immediately fsync() it. If there's no unwritten data, this will generally be more or less instant. If there is unwritten data because you're restarting after the program was terminated by surprise, this might take a bit of time but ensures that the on-disk state matches the state that you're about to observe through the OS.
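
In Go terms, a minimal version of this precaution looks something like the following sketch: open (or create) the WAL and immediately Sync() it before trusting anything you read back from it during recovery.

package main

import (
	"log"
	"os"
)

// openWAL opens a journal/WAL file and immediately forces it to disk, so
// that the state we're about to read back during recovery is the on-disk
// state, not data that's still sitting unwritten in the OS's cache.
func openWAL(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := openWAL("journal.wal")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// ... crash recovery can now read the WAL and trust what it sees ...
}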

(CouchDB's article points to another article, Justin Jaffray’s NULL BITMAP Builds a Database #2: Enter the Memtable, which has a somewhat different way for this failure to bite you. I'm not going to try to summarize it here but you might find the article interesting reading.)

Institutions care about their security threats, not your security threats

By: cks
23 February 2025 at 03:45

Recently I was part of a conversation on the Fediverse that sparked an obvious in retrospect realization about computer security and how we look at and talk about security measures. To put it succinctly, your institution cares about threats to it, not about threats to you. It cares about threats to you only so far as they're threats to it through you. Some of the security threats and sensible responses to them overlap between you and your institution, but some of them don't.

One of the areas where I think this especially shows up is in issues around MFA (Multi-Factor Authentication). For example, it's a not infrequently observed thing that if all of your factors live on a single device, such as your phone, then you actually have single factor authentication (this can happen with many of the different ways to do MFA). But for many organizations, this is relatively fine (for them). Their largest risk is that Internet attackers are constantly trying to (remotely) phish their people, often in moderately sophisticated ways that involve some prior research (which is worth it for the attackers because they can target many people with the same research). Ignoring MFA alert fatigue for a moment, even a single factor physical device will cut off all of this, because Internet attackers don't have people's smartphones.

For individual people, of course, this is potentially a problem. If someone can gain access to your phone, they get everything, and probably across all of the online services you use. If you care about security as an individual person, you want attackers to need more than one thing to get all of your accounts. Conversely, for organizations, compromising all of their systems at once is sort of a given, because that's what it means to have a Single Sign On system and global authentication. Only a few organizational systems will be separated from the general SSO (and organizations have to hope that their people cooperate by using different access passwords).

Organizations also have obvious solutions to things like MFA account recovery. They can establish and confirm the identities of people associated with them, and a process to establish MFA in the first place, so if you lose whatever lets you do MFA (perhaps your work phone's battery has gotten spicy), they can just run you through the enrollment process again. Maybe there will be a delay, but if so, the organization has broadly decided to tolerate it.

(And I just recently wrote about the difference between 'internal' accounts and 'external' accounts, where people generally know who is in an organization and so has an account, so allowing this information to leak in your authentication isn't usually a serious problem.)

Another area where I think this difference in the view of threats shows up is in the tradeoffs involved in disk encryption on laptops and desktops used by people. For an organization, choosing non-disclosure over availability on employee devices makes a lot of sense. The biggest threat as the organization sees it isn't data loss on a laptop or desktop (especially if they write policies about backups and where data is supposed to be stored), it's an attacker making off with one and having the data disclosed, which is at least bad publicity and makes the executives unhappy. You may feel differently about your own data, depending on how your backups are.

'Internal' accounts and their difference from 'external' accounts

By: cks
14 February 2025 at 03:22

In the comments on my entry on how you should respond to authentication failures depends on the circumstances, sapphirepaw said something that triggered a belated realization in my mind:

Probably less of a concern for IMAP, but in a web app, one must take care to hide the information completely. I was recently at a site that wouldn't say whether the provided email was valid for password reset, but would reveal it was in use when trying to create a new account.

The realization this sparked is that we can divide accounts and systems into two sorts, which I will call internal and external, and how you want to treat things around these accounts is possibly quite different.

An internal account is one that's held by people within your organization, and generally is pretty universal. If you know that someone is a member of the organization you can predict that they have an account on the system, and not infrequently what the account name is. For example, if you know that someone is a graduate student here it's a fairly good bet that they have an account with us and you may even be able to find and work out their login name. The existence of these accounts and even specifics about who has what login name (mostly) isn't particularly secret or sensitive.

(Internal accounts don't have to be on systems that the organization runs; they could be, for example, 'enterprise' accounts on someone else's SaaS service. Once you know that the organization uses a particular SaaS offering or whatever, you're usually a lot of the way to identifying all of their accounts.)

An external account is one that's potentially held by people from all over, far outside the bounds of a single organization (including the one running the systems the account is used with). A lot of online accounts with websites are like this, because most websites are used by lots of people from all over. Who has such an account may be potentially sensitive information, depending on the website and the feelings of the people involved, and the account identity may be even more sensitive (it's one thing to know that a particular email address has a Fediverse account on mastodon.social, but it may be quite different to know which account that is, depending on various factors).

There's a spectrum of potential secrecy between these two categories. For example, the organization might not want to openly reveal which external SaaS products they use, what entity name the organization uses on them, and the specific names people use for authentication, all in the name of making it harder to break into their environment at the SaaS product. And some purely internal systems might have a very restricted access list that is kept at least somewhat secret so attackers don't know who to target. But I think the broad division between internal and external is useful because it does a lot to point out where any secrecy is.

When I wrote my entry, I was primarily thinking about internal accounts, because internal accounts are what we deal with (and what many internal system administration groups handle). As sapphirepaw noted, the concerns and thus the rules are quite different for external accounts.

(There may be better labels for these two sorts of accounts. I'm not great with naming.)

Why writes to disk generally wind up in your OS's disk read cache

By: cks
4 February 2025 at 03:44

Recently, someone was surprised to find out that ZFS puts disk writes in its version of a disk (read) cache, the ARC ('Adaptive Replacement Cache'). In fact this is quite common, as almost every operating system and filesystem puts ordinary writes to disk into their disk (read) cache. In thinking about the specific issue of the ZFS ARC and write data, I realized that there's a general broad reason for this and then a narrower technical one.

The broad reason that you'll most often hear about is that it's not uncommon for your system to read things back after you've written them to disk. It would be wasteful to have something in RAM, write it to disk, remove it from RAM, and then have to more or less immediately read it back from disk. If you're dealing with spinning HDDs, this is quite bad since HDDs can only do a relatively small amount of IO a second; in this day of high performance, low latency NVMe SSDs, it might not be so terrible any more, but it still costs you something. Of course you have to worry about writes flooding the disk cache and evicting more useful data, but this is also an issue with certain sorts of reads.

The narrower technical reason is dealing with issues that come up once you add write buffering to the picture. In practice a lot of ordinary writes to files aren't synchronously written out to disk on the spot; instead they're buffered in memory for some amount of time. This requires some pool of (OS) memory to hold these pending writes, which might as well be your regular disk (read) cache. Putting not yet written out data in the disk read cache also deals with the issue of coherence, where you want programs that are reading data to see the most recently written data even if it hasn't been flushed out to disk yet. Since reading data from the filesystem already looks in the disk cache, you'll automatically find the pending write data there (and you'll automatically replace an already cached version of the old data). If you put pending writes into a different pool of memory, you have to specifically manage it and tune its size, and you have to add extra code to potentially get data from it on reads.
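
The coherence part is easy to see from user space. In this little Go sketch, data written through one file descriptor is immediately visible to a separate read of the same file, even though nothing has been fsync()'d and the data may not be on disk yet.

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Write some data but deliberately don't fsync it; it's now a pending
	// write sitting in the OS's disk cache.
	w, err := os.Create("demo.txt")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.WriteString("not necessarily on disk yet\n"); err != nil {
		log.Fatal(err)
	}

	// A separate read of the same file still sees the new data, because
	// reads go through the same cache that's holding the pending write.
	data, err := os.ReadFile("demo.txt")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", data)
	w.Close()
}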

(I'm going to skip considering memory mapped IO in this picture because it only makes things even more complicated, and how OSes and filesystems handle it potentially varies a lot. For example, I'm not sure if Linux or ZFS normally directly use pages in the disk cache, or if even shared memory maps get copies of the disk cache pages.)

PS: Before I started thinking about the whole issue as a result of the person's surprise, I would have probably only given you the broad reason off the top of my head. I hadn't thought about the technical issues of not putting writes in the read cache before now.

Languages don't version themselves using semantic versioning

By: cks
25 January 2025 at 03:46

A number of modern languages have effectively a single official compiler or interpreter, and they version this toolchain with what looks like a semantic version (semver). So we have (C)Python 3.12.8, Go 1.23.5, Rust(c) 1.84.0, and so on, which certainly look like a semver major.minor.patchlevel triplet. In practice, this is not how languages think of their version numbers.

In practice, the version number triplets of things like Go, Rust, and CPython have a meaning that's more like '<dialect>.<release>.<patchlevel>'. The first number is the language dialect and it changes extremely infrequently, because it's a very big deal to significantly break backward compatibility or even to make major changes in language semantics that are sort of backward compatible. Python 1, Python 2, and Python 3 are all in effect different but closely related languages.

(Python 2 is much closer to Python 1 than Python 3 is to Python 2, which is part of why you don't read about a painful and protracted transition from Python 1 to Python 2.)

The second number is somewhere between a major and a minor version number. It's typically increased when the language or the toolchain (or both) do something significant, or when enough changes have built up since the last time the second number was increased and people want to get them out in the world. Languages can and do make major additions with only a change in the second number; Go added generics, CPython added and improved an asynchronous processing system, and Rust has stabilized a whole series of features and improvements, all in Go 1.x, CPython 3.x, and Rust 1.x.

The third number is a patchlevel (or if you prefer, a 'point release'). It's increased when a new version of an X.Y release must be made to fix bugs or security problems, and generally contains minimal code changes and no new language features. I think people would look at the language's developers funny if they landed new language features in a patchlevel instead of an actual release, and they'd definitely be unhappy if something was broken or removed in a patchlevel. It's supposed to be basically completely safe to upgrade to a new patchlevel of the language's toolchain.
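
One way to make this reading of the numbers concrete is to parse them as '<dialect>.<release>.<patchlevel>' instead of semver's major.minor.patch, as in this small Go sketch (the field names are mine, not anything official).

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// langVersion reads a toolchain version as dialect.release.patchlevel,
// which is how Go, CPython, and Rust effectively use their numbers.
type langVersion struct {
	Dialect, Release, Patch int
}

func parse(s string) (langVersion, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return langVersion{}, fmt.Errorf("expected X.Y.Z, got %q", s)
	}
	var v langVersion
	var err error
	if v.Dialect, err = strconv.Atoi(parts[0]); err != nil {
		return v, err
	}
	if v.Release, err = strconv.Atoi(parts[1]); err != nil {
		return v, err
	}
	if v.Patch, err = strconv.Atoi(parts[2]); err != nil {
		return v, err
	}
	return v, nil
}

func main() {
	v, _ := parse("1.23.5")
	// A patchlevel upgrade should be safe; a release upgrade may change
	// behavior; the dialect almost never changes.
	fmt.Printf("dialect %d, release %d, patchlevel %d\n", v.Dialect, v.Release, v.Patch)
}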

Both Go and CPython will break, remove, or change things in new 'release' versions. CPython has deprecated a number of things over the course of the 3.x releases so far, and Go has changed how its toolchain behaves and turned off some old behavior (the toolchain's behavior is not covered by Go's language and standard library compatibility guarantee). In this regard these Go and CPython releases are closer to major releases than minor releases.

(Go uses the term 'major release' and 'minor release' for, eg, 'Go 1.23' and 'Go 1.23.3'; see here. Python often calls each '3.x' a 'series', and '3.x.y' a 'maintenance release' within that series, as seen in the Python 3.13.1 release note.)

The corollary of this is that you can't apply semver expectations about stability to language versioning. Languages with this sort of versioning are 'less stable' than they should be by semver standards, since they make significant and not necessarily backward compatible changes in what semver would call a 'minor' release. This isn't a violation of semver because these languages never claimed or promised to be following semver. Language versioning is different (and basically has to be).

(I've used CPython, Go, and Rust here because they're the three languages where I'm most familiar with the release versioning policies. I suspect that many other languages follow similar approaches.)

The problem with combining DNS CNAME records and anything else

By: cks
11 January 2025 at 03:55

A famous issue when setting up DNS records for domains is that you can't combine a CNAME record with any other type, such as an MX record or a SOA (which is required at the top level of a domain). One modern reason that you would want such a CNAME record is that you're hosting your domain's web site at some provider and the provider wants to be able to change what IP addresses it uses for this, so from the provider's perspective they want you to CNAME your 'web site' name to 'something.provider.com'.

The obvious reason for 'no CNAME and anything else' is 'because the RFCs say so', but this is unsatisfying. Recently I wondered why the RFCs couldn't have said that when a CNAME is combined with other records, you return the other records when asked for them but provide the CNAME otherwise (or maybe you return the CNAME only when asked for the IP address if there are other records). But when I thought about it more, I realized the answer, the short version of which is caching resolvers.

If you're the authoritative DNS server for a zone, you know for sure what DNS records are and aren't present. This means that if someone asks you for an MX record and the zone has a CNAME, a SOA, and an MX, you can give them the MX record, and if someone asks for the A record, you can give them the CNAME, and everything works fine. But a DNS server that is a caching resolver doesn't have this full knowledge of the zone; it only knows what's in its cache. If such a DNS server has a CNAME for a domain in its cache (perhaps because someone asked for the A record) and it's now asked for the MX records of that domain, what is it supposed to do? The correct answer could be either the CNAME record the DNS server has or the MX records it would have to query an authoritative server for. At a minimum combining CNAME plus other records this way would require caching resolvers to query the upstream DNS server and then remember that they got a CNAME answer for a specific query.
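
Here's a deliberately over-simplified Go sketch of the caching resolver's dilemma in this hypothetical 'CNAME plus other records' world: given only what's in its cache, the resolver can't tell whether to answer an MX query with the cached CNAME or with authoritative MX records it has never seen, so its only safe move is to go back upstream.

package main

import "fmt"

// cacheEntry is a drastically simplified cache of the DNS answers this
// resolver happens to have seen for one name, keyed by record type.
type cacheEntry map[string][]string

// answerMX shows the problem: with a cached CNAME and no cached MX, the
// resolver can't know whether the authoritative zone has its own MX
// records (the hypothetical combined case) or only the CNAME, so its only
// safe move is to re-query the authoritative servers.
func answerMX(cache cacheEntry) string {
	if mx, ok := cache["MX"]; ok {
		return fmt.Sprintf("answer from cache: %v", mx)
	}
	if _, ok := cache["CNAME"]; ok {
		return "cached CNAME, but real MX records might exist: must ask upstream"
	}
	return "nothing cached: must ask upstream"
}

func main() {
	cache := cacheEntry{"CNAME": {"something.provider.com."}}
	fmt.Println(answerMX(cache))
}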

In theory this could have been written into DNS originally, at the cost of complicating caching DNS servers and causing them to make more queries to upstream DNS servers (which is to say, making their caching less effective). Once DNS existed with the CNAME behavior such that caching DNS resolvers could cache CNAME responses and serve them, the CNAME behavior was fixed.

(This is probably obvious to experienced DNS people, but since I had to work it out in my head I'm going to write it down.)

Sidebar: The pseudo-CNAME behavior offered by some DNS providers

Some DNS providers and DNS servers offer an 'ANAME' or 'ALIAS' record type. This isn't really a DNS record; instead it's a processing instruction to the provider's DNS software that it should look up the A and AAAA records of the target name and insert them into your zone in place of the ANAME/ALIAS record (and redo the lookup every so often in case the target name's IP addresses change). In theory any changes in the A or AAAA records should trigger a change in the zone serial number; in practice I don't know if providers actually do this.

(If your DNS provider doesn't have ANAME/ALIAS 'records' but does have an API, you can build this functionality yourself.)
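
Here's a rough Go sketch of the do-it-yourself version, with the actual provider API call left as a hypothetical placeholder: look up the target's addresses every so often and push them into your zone whenever they change.

package main

import (
	"log"
	"net"
	"reflect"
	"sort"
	"time"
)

// updateProviderA is a stand-in for your DNS provider's real API call; it
// would replace the A/AAAA records for 'name' in your zone (and presumably
// bump the zone serial number).
func updateProviderA(name string, addrs []string) error {
	log.Printf("would set %s -> %v via the provider's API", name, addrs)
	return nil
}

func main() {
	const ourName = "www.example.org"
	const target = "something.provider.com"
	var last []string
	for {
		ips, err := net.LookupIP(target)
		if err != nil {
			log.Printf("lookup %s: %v", target, err)
		} else {
			var addrs []string
			for _, ip := range ips {
				addrs = append(addrs, ip.String())
			}
			sort.Strings(addrs)
			// Only push an update when the target's addresses change.
			if !reflect.DeepEqual(addrs, last) {
				if err := updateProviderA(ourName, addrs); err == nil {
					last = addrs
				}
			}
		}
		time.Sleep(5 * time.Minute)
	}
}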

There are different sorts of WireGuard setups with different difficulties

By: cks
5 January 2025 at 04:37

I've now set up WireGuard in a number of different ways, some of which were easy and some of which weren't. So here are my current views on WireGuard setups, starting with the easiest and going to the most challenging.

The easiest WireGuard setup is where the 'within WireGuard' internal IP address space is completely distinct from the outside space, with no overlap. This makes routing completely straightforward; internal IPs reachable over WireGuard aren't reachable in any other way, and external IPs aren't reachable over WireGuard. You can do this as a mesh or use the WireGuard 'router' pattern (or some mixture). If you allocate all internal IP addresses from the same network range, you can set a single route to your WireGuard interface and let AllowedIPs sort it out.
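
As a sketch of why 'one route plus AllowedIPs' is enough in this easy case: once a packet has been routed to the WireGuard interface, picking the peer is just a question of whose AllowedIPs contain the destination. The peer names and prefixes here are made up, and real WireGuard does a longest-prefix match rather than this simple containment check.

package main

import (
	"fmt"
	"net/netip"
)

// peers maps a made-up peer name to the AllowedIPs it owns. In the easy
// setup, all of these prefixes sit inside one internal range that has a
// single route pointing at the WireGuard interface.
var peers = map[string][]netip.Prefix{
	"laptop": {netip.MustParsePrefix("10.99.1.0/24")},
	"server": {netip.MustParsePrefix("10.99.2.0/24")},
}

// peerFor picks the peer whose AllowedIPs contain the destination, which
// is roughly what WireGuard's cryptokey routing does for outgoing packets.
func peerFor(dst netip.Addr) (string, bool) {
	for name, prefixes := range peers {
		for _, p := range prefixes {
			if p.Contains(dst) {
				return name, true
			}
		}
	}
	return "", false
}

func main() {
	fmt.Println(peerFor(netip.MustParseAddr("10.99.2.7")))
}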

(An extreme version of this would be to configure the inside part of WireGuard with only link local IPv6 addresses, although this would probably be quite inconvenient in practice.)

A slightly more difficult setup is where some WireGuard endpoints are gateways to additional internal networks, networks that aren't otherwise reachable. This setup potentially requires more routing entries but it remains straightforward in that there's no conflict on how to route a given IP address.

The next most difficult setup is using different IP address types inside WireGuard than from outside it, where the inside IP address type isn't otherwise usable for at least one of the ends. For example, you have an IPv4 only machine that you're giving a public IPv6 address through an IPv6 tunnel. This is still not too difficult because the inside IP addresses associated with each WireGuard peer aren't otherwise reachable, so you never have a recursive routing problem.

The most difficult type of WireGuard setup I've had to do so far is a true 'VPN' setup, where some or many of the WireGuard endpoints you're talking to are reachable both outside WireGuard and through WireGuard (or at least there are routes that try to send traffic to those IPs through WireGuard, such as a VPN 'route all traffic through my WireGuard link' default route). Since your system could plausibly recursively route your encrypted WireGuard traffic over WireGuard, you need some sort of additional setup to solve this. On Linux, this will often be done using a fwmark (also) and some policy based routing rules.

One of the reasons I find it useful to explicitly think about these different types of setups is to better know what to expect and what I'll need to do when I'm planning a new WireGuard environment. Either I will be prepared for what I'm going to have to do, or I may rethink my design in order to move it up the hierarchy, for example deciding that we can configure services to talk to special internal IPs (over WireGuard) so that we don't have to set up fwmark-based routing on everything.

(Some services built on top of WireGuard handle this for you, for example Tailscale, although Tailscale can have routing challenges of its own depending on your configuration.)

My screens now have areas that are 'good' and 'bad' for me

By: cks
30 December 2024 at 04:23

Once upon a time, I'm sure that everywhere on my screen (because it would have been a single screen at that time) was equally 'good' for me; all spots were immediately visible, clearly readable, didn't require turning my head, and so on. As the number of screens I use has risen, as the size of the screens has increased (for example when I moved from 24" non-HiDPI 3:2 LCD panels to 27" HiDPI 16:9 panels), and as my eyes have gotten older, this has changed. More and more, there is a 'good' area that I've set up so I'm looking straight at and then increasingly peripheral areas that are not as good.

(This good area is not necessarily the center of the screen; it depends on how I sit relative to the screen, the height of the monitor, and so on. If I adjust these I can change what the good spot is, and I sometimes will do so for particular purposes.)

Calling the peripheral areas 'bad' is a relative term. I can see them, but especially on my office desktop (which has dual 27" 16:9 displays), these days the worst spots can be so far off to the side that I don't really notice things there much of the time. If I want to really look, I have to turn my head, which means I have to have a reason to look over there at whatever I put there. Hopefully it's not too important.

For a long time I didn't really notice this change or think about its implications. As the physical area covered by my 'display surface' expanded, I carried over much the same desktop layout that I had used (in some form) for a long time. It didn't register that some things were effectively being exiled into the outskirts where I would never notice them, or that my actual usage was increasingly concentrated in one specific area of the screen. Now that I have consciously noticed this shift (which is a story for another entry), I may want to rethink some of how I lay things out on my office desktop (and maybe my home one too) and what I put where.

(One thing I've vaguely considered is if I should turn my office displays sideways, so the long axis is vertical, although I don't know if that's feasible with their current stands. I have what is in practice too much horizontal space today, so that would be one way to deal with it. But probably this would give me two screens that each are a bit too narrow to be comfortable for me. And sadly there are no ideal LCD panels these days; I would ideally like a HiDPI 24" or 25" 3:2 panel but vendors don't do those.)

x86 servers, ATX power supply control, and reboots, resets, and power cycles

By: cks
26 December 2024 at 04:15

I mentioned recently a case when power cycling an (x86) server wasn't enough to recover it, although perhaps I should have put quotes around "power cycling". The reason for the scare quotes is that I was doing this through the server's BMC, which means that what was actually happening was not clear because there are a variety of ways the BMC could be doing power control and the BMC may have done something different for what it described as a 'power cycle'. In fact, to make it less clear, this particular server's BMC offers both a "Power Cycle" and a "Power Reset" option.

(According to the BMC's manual, a "power cycle" turns the system off and then back on again, while a "power reset" performs a 'warm restart'. I may have done a 'power reset' instead of a 'power cycle', it's not clear from what logs we have.)

There is a spectrum of ways to restart an x86 server, and they (probably) vary in their effects on peripherals, PCIe devices, and motherboard components. The most straightforward-looking option is to ask the Linux kernel to reboot the system, although in practice I believe that actually getting the hardware to do the reboot is somewhat complex (and in the past Linux sometimes had problems where it couldn't persuade the hardware, so your 'reboot' would hang). Looking at the Linux kernel code suggests that there are multiple ways to invoke a reboot, involving ACPI, UEFI firmware, old fashioned BIOS firmware, a PCIe configuration register, via the keyboard, and so on (for a fun time, look at the 'reboot=' kernel parameter). In general, a reboot can only be initiated by the server's host OS, not by the BMC; if the host OS is hung you can't 'reboot' the server as such.

Your x86 desktop probably has a 'reset' button on the front panel. These days the wire from this is probably tied into the platform chipset (on Intel, the ICH, which came up for desktop motherboard power control) and is interpreted by it. Server platforms probably also have a (conceptual) wire and that wire may well be connected to the BMC, which can then control it to implement, for example, a 'reset' operation. I believe that a server reboot can also trigger the same platform chipset reset handling that the reset button does, although I'm not certain of this. If I'm reading Intel ICH chipset documentation correctly, triggering a reset this way will or may signal PCIe devices and so on that a reset has happened, although I don't think it cuts power to them; in theory anything getting this signal should reset its state.

(The CF9 PCI "Reset Control Register" (also) can be used to initiate a 'soft' or 'hard' CPU reset, or a full reset in which the (Intel) chipset will do various things to signals to peripherals, not just the CPU. I don't believe that Linux directly exposes these options to user space (partly because it may not be rebooting through direct use of PCI CF9 in the first place), although some of them can be controlled through kernel command line parameters. I think this may also control whether the 'reset' button and line do a CPU reset or a full reset. It seems possible that the warm restart of this server's BMC's "power reset" works by triggering the reset line and assuming that CF9 is left in its default state to make this a CPU reset instead of a full reset.)

Finally, the BMC can choose to actually cycle the power off and then back on again. As discussed, 'off' is probably not really off, because standby power and BMC power will remain available, but this should put both the CPU and the platform chipset through a full power-on sequence. However, it likely won't leave power off long enough for various lingering currents to dissipate and capacitors to drain. And nothing you do through the BMC can completely remove power from the system; as long as a server is connected to AC power, it's supplying standby power and BMC power. If you want a total reset, you must either disconnect its power cords or turn its outlet or outlets off in your remote controllable PDU (which may not work great if it's on a UPS). And as we've seen, sometimes a short power cycle isn't good enough and you need to give the server a time out.

(While the server's OS can ask for the server to be powered down instead of rebooted, I don't think it can ask for the server to be power cycled, not unless it talks to the BMC instead of doing a conventional reboot or power down.)

One of the things I've learned from this is that if I want to be really certain I understand what a BMC is doing, I probably shouldn't rely on any option to do a power cycle or power reset. Instead I should explicitly turn power off, wait until that's taken, and then turn power on. Asking a BMC to do a 'power cycle' is a bit optimistic, although it will probably work most of the time.

(If there's another occurrence of our specific 'reset is not enough' hang, I will definitely make sure to use at least the BMC's 'power cycle' and perhaps the full brief off then on approach.)

When power cycling your (x86) server isn't enough to recover it

By: cks
22 December 2024 at 03:43

We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model), where something different has happened once.

Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server ran through the system firmware ('BIOS'), it started printing an apparently endless series of error dumps to its serial console (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:

!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
RIP  - 000000006DABA5A5, CS  - 0000000000000038, RFLAGS - 0000000000010087
RAX  - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX  - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI  - 0000000000000008, RDI - 000000006AB1B1B0
R8   - 000000005DCCF524, R9  - 000000005D29E850, R10 - 000000005D29E8E4
R11  - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14  - 0000000000000028, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!

(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)

Some of the register values varied between reports, others didn't after the first one (for example, from the second onward the RIP appears to have always been 6DAB14D1, which suggests maybe it's an exception handler).

In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, where neither of them have had problems since.

I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there may be a different mechanism in action). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool down period works.

(Since we weren't cutting external power to the entire system, this also left standby power (also) available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)

PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.

Common motherboards are supporting more and more M.2 NVMe drive slots

By: cks
7 December 2024 at 04:27

Back at the start of 2020, I wondered if common (x86 desktop) motherboards would ever have very many M.2 NVMe drive slots, where by 'very many' I meant four or so, which even back then was a common number of SATA ports for desktop motherboards to provide. At the time I thought the answer was probably no. As I recently discovered from investigating a related issue, I was wrong, and it's now fairly straightforward to find x86 desktop motherboards that have as many as four M.2 NVMe slots (although not all four may be able to run at x4 PCIe lanes, especially if you have things like a GPU).

For example, right now it's relatively easy to find a page full of AMD AM5-based motherboards that have four M.2 NVMe slots. Most of these seem to be based on the high end X series AMD chipsets (such as the X670 or the X870), but I found a few that were based on the B650 chipset. On the Intel side, should you still be interested in an Intel CPU in your desktop at this point, there's also a number of them based primarily on the Z790 chipset (and some on the older Z690). There's even a B760 based motherboard with four M.2 NVMe slots (although two of them are only x1 lanes and PCIe 3.0), and an H770 based one that manages to (theoretically) support all four M.2 slots at x4 lanes.

One of the things that I think has happened on the way to this large supply of M.2 slots is that these desktop motherboards have dropped most of their PCIe slots. These days, you seem to commonly get three slots in total on the kind of motherboard that has four M.2 slots. There's always one x16 slot, often two, and sometimes three (although that's physical x16; don't count on getting all 16 PCIe lanes in every slot). It's not uncommon to see the third PCIe slot be physically x4, or a little x1 slot tucked away at the bottom of the motherboard. It also isn't necessarily the case that lower end desktops have more PCIe slots to go with their fewer M.2 slots; they too seem to have mostly gone with two or three PCIe slots, generally with a limited number of lanes even if they're physically x16.

(I appreciate having physical x16 slots even if they're only PCIe x1, because that means you can use any card that doesn't require PCIe bifurcation and it should work, although slowly.)

As noted by commentators on my entry on PCIe bifurcation and its uses for NVMe drives, a certain amount of what we used to need PCIe slots for can now be provided through high speed USB-C and similar things. And of course there are only so many PCIe lanes to go around from the CPU and the chipset, so those USB-C ports and other high-speed motherboard devices consume a certain amount of them; the more onboard devices the motherboard has the fewer PCIe lanes there are left for PCIe slots, whether or not you have any use for those onboard devices and connectors.

(Having four M.2 NVMe slots is useful for me because I use my drives in mirrored pairs, so four M.2 slots means I can run my full old pair in parallel with a full new pair, either in a four way mirror or doing some form of migration from one mirrored pair to the other. Three slots is okay, since that lets me add a new drive to a mirrored pair for gradual migration to a new pair of drives.)

Sorting out 'PCIe bifurcation' and how it interacts with NVMe drives

By: cks
5 December 2024 at 03:01

Suppose, not hypothetically, that you're switching from one mirrored set of M.2 NVMe drives to another mirrored set of M.2 NVMe drives, and so would like to have three or four NVMe drives in your desktop at the same time. Sadly, you already have one of your two NVMe drives on a PCIe card, so you'd like to get a single PCIe card that handles two or more NVMe drives. If you look around today, you'll find two sorts of cards for this; ones that are very expensive, and ones that are relatively inexpensive but require that your system supports a feature that is generally called PCIe bifurcation.

NVMe drives are PCIe devices, so a PCIe card that supports a single NVMe drive is a simple, more or less passive thing that wires four PCIe lanes and some other stuff through to the M.2 slot. I believe that in theory, a card could be built that only required x2 or even x1 PCIe lanes, but in practice I think all such single drive cards are physically PCIe x4 and so require a physical x4 or better PCIe slot, even if you'd be willing to (temporarily) run the drive much slower.

A PCIe card that supports more than one M.2 NVMe drive has two options. The expensive option is to put a PCIe bridge on the card, with the bridge (probably) providing a full set of PCIe lanes to the M.2 NVMe drives locally on one side and doing x4, x8, or x16 PCIe with the motherboard on the other. In theory, such a card will work even at x4 or x2 PCIe lanes, because PCIe cards are supposed to do that if the system says 'actually you only get this many lanes' (although obviously you can't drive four x4 NVMe drives at full speed through a single x4 or x2 PCIe connection).

The cheap option is to require that the system be able to split a single PCIe slot into multiple independent groups of PCIe lanes (I believe these are usually called links); this is PCIe bifurcation. In PCIe bifurcation, the system takes what is physically and PCIe-wise an x16 slot (for example) and splits it into four separate x4 links (I've seen this sometimes labeled as 'x4/x4/x4/x4'). This is cheap for the card because it can basically be four single M.2 NVMe PCIe cards jammed together, with each set of x4 lanes wired through to a single M.2 NVMe slot. A PCIe card for two M.2 NVMe drives will require an x8 PCIe slot bifurcated to two x4 links; if you stick this card in an x16 slot, the upper 8 PCIe lanes just get ignored (which means that you can still set your BIOS to x4/x4/x4/x4).

As covered in, for example, this Synopsys page, PCIe bifurcation isn't something that's negotiated as part of bringing up PCIe connections; a PCIe device can't ask for bifurcation and can't be asked whether or not it supports it. Instead, the decision is made as part of configuring the PCIe root device or bridge, which in practice means it's a firmware ('BIOS') decision. However, I believe that bifurcation may also require hardware support in the 'chipset' and perhaps the physical motherboard.

I put chipset into quotes because for quite some time, some PCIe lanes come directly from the CPU and only some others come through the chipset as such. For example, in desktop motherboards, the x16 GPU slot is almost always driven directly by CPU PCIe lanes, so it's up to the CPU to have support (or not have support) for PCIe bifurcation of that slot. I don't know if common desktop chipsets support bifurcation on the chipset PCIe slots and PCIe lanes, and of course you need chipset-driven PCIe slots that have enough lanes to be bifurcated in the first place. If the PCIe slots driven by the chipset are a mix of x4 and x1 slots, there's no really useful bifurcation that can be done (at least for NVMe drives).

If you have a limited number of PCIe slots that can actually support x16 or x8 and you need a GPU card, you may not be able to use PCIe bifurcation in practice even if it's available for your system. If you have only one PCIe slot your GPU card can go in and it's the only slot that supports bifurcation, you're stuck; you can't have both a bifurcated set of NVMe drives and a GPU (at least not without a bifurcated PCIe riser card that you can use).

(This is where I would start exploring USB NVMe drive enclosures, although on old desktops you'll probably need one that doesn't require USB-C, and I don't know if a NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)

(This is one of the entries I write to get this straight in my head.)

Sidebar: Generic PCIe riser cards and other weird things

The traditional 'riser card' I'm used to is a special proprietary server 'card' (ie, a chunk of PCB with connectors and other bits) that plugs into a likely custom server motherboard connector and makes a right angle turn that lets it provide one or two horizontal PCIe slots (often half-height ones) in a 1U or 2U server case, which aren't tall enough to handle PCIe cards vertically. However, the existence of PCIe bifurcation opens up an exciting world of general, generic PCIe riser cards that bifurcate a single x16 GPU slot to, say, two x8 PCIe slots. These will work (in some sense) in any x16 PCIe slot that supports bifurcation, and of course you don't have to restrict yourself to x16 slots. I believe there are also PCIe riser cards that bifurcate an x8 slot into two x4 slots.

Now, you are perhaps thinking that such a riser card puts those bifurcated PCIe slots at right angles to the slots in your case, and probably leaves any cards inserted into them with at least their tops unsupported. If you have light PCIe cards, maybe this works out. If you don't have light PCIe cards, one option is another terrifying thing, a PCIe ribbon cable with a little PCB that is just a PCIe slot on one end (the other end plugs into your real PCIe slot, such as one of the slots on the riser card). Sometimes these are even called 'riser card extenders' (or perhaps those are a sub-type of the general PCIe extender ribbon cables).

Another PCIe adapter device you can get is an x1 to x16 slot extension adapter, which plugs into an x1 slot on your motherboard and has an x16 slot (with only one PCIe lane wired through, of course). This is less crazy than it sounds; you might only have an x1 slot available, want to plug in an x4, x8, or x16 card that's short enough, and be willing to settle for x1 speeds. In theory PCIe cards are supposed to still work when their lanes are choked down this way.

The general issue of terminal programs and the Alt key

By: cks
23 November 2024 at 23:26

When you're using a terminal program (something that provides a terminal window in a GUI environment, which is now the dominant form of 'terminals'), there's a fairly straightforward answer for what should happen when you hold down the Ctrl key while typing another key. For upper and lower case letters, the terminal program generates ASCII bytes 1 through 26, for Ctrl-[ you get byte 27 (ESC), and there are relatively standard versions of some other characters. For other characters, your specific terminal program may treat them as aliases for some of the ASCII control characters or ignore the Ctrl. All of this behavior is relatively standard from the days of serial terminals, and none of it helps terminal programs decide what should be generated when you hold down the Alt key while typing another key.

(A terminal program can hijack Alt-<key> to control its behavior, but people will generally find this hostile because they want to use Alt-<key> with things running inside the terminal program. In general, terminal programs are restricted to generating things at the character layer, where what they send has to fit in a sequence of bytes and be generally comprehensible to whatever is reading those bytes.)

Historically and even currently there have been three answers. The simplest answer is that Alt sets the 8th bit on what would otherwise be a seven-bit ASCII character. This behavior is basically a relic of the days when things actually were seven bit ASCII (at least in North America) and doing this wouldn't mangle things horribly (provided that the program inside the terminal understood this signal). As a result it's not too popular any more and I think it's basically died out.

The second answer is what I'll call the Emacs answer, which is that Alt plus another key generates ESC (Escape) and then the other key. This matches how Emacs handled its Meta key binding modifier (written 'M-...' in Emacs terminology) in the days of serial terminals; if an Emacs keybinding was M-a, you typed 'ESC a' to invoke it. Even today when we have real Alt keys and some programs could see a real Meta modifier (cf), basically every Emacs or Emacs-compatible system will accept ESC as the Meta prefix even if they're not running in a terminal.

(I started with Emacs sufficiently long ago that ESC-<key> is an ingrained reflex that I still sometimes use even though Alt is right there on my keyboard.)

The third answer is that Alt-<key> generates various accented or special characters in the terminal program's current locale (or in UTF-8, because that's increasingly hard-coded). Once upon a time this was the same as the first answer, because accented and special characters were whatever was found in the upper half of the eight-bit single-byte character set (bytes 128 to 255). These days, with people using UTF-8, it's generally different; for example, your Alt-a might generate 'Γ‘', but the actual UTF-8 representation of this single Unicode codepoint is actually two bytes, 0xc3 0xa1.
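
Here's a small Go sketch of the three answers in code form, showing what bytes a terminal program might send for Alt plus a key under each policy; the 'accented' mapping is a stand-in for whatever the real locale or legacy character set would provide.

package main

import "fmt"

// accents is a stand-in for a real locale or legacy character set mapping.
var accents = map[rune]rune{'a': 'Γ‘', 'e': 'Γ©'}

// altBytes returns what a terminal might send for Alt-<key> under each of
// the three historical policies.
func altBytes(policy string, key rune) []byte {
	switch policy {
	case "set8thbit": // only really sensible for 7-bit ASCII keys
		return []byte{byte(key) | 0x80}
	case "escprefix": // the Emacs answer: ESC and then the key itself
		return append([]byte{0x1b}, []byte(string(key))...)
	case "accented": // the UTF-8 encoding of the mapped character
		if r, ok := accents[key]; ok {
			return []byte(string(r))
		}
		return []byte(string(key))
	}
	return nil
}

func main() {
	for _, p := range []string{"set8thbit", "escprefix", "accented"} {
		// For 'a' this prints [225], [27 97], and [195 161] respectively.
		fmt.Printf("%-10s %v\n", p, altBytes(p, 'a'))
	}
}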

Some terminal programs still allow you to switch between the second and the third answers (Unix xterm is one such program and can even be switched on the fly, see the 'Meta sends Escape' option in the menu you get with Ctrl-<mouse button 1>). Others are hard-coded with the second answer, where Alt-<key> sends ESC <key>. My impression is that the second answer is basically the dominant one these days and only a few terminal programs even potentially support the third option.

PS: How xterm behaves can be host specific due to different default X resources settings on different hosts. Fedora makes xterm default to Alt-<key> sending ESC-<key>, while Ubuntu leaves it with the xterm code default of Alt creating accented characters.

A rough guess at how much IPv6 address space we might need

By: cks
10 November 2024 at 03:54

One of the reactions I saw to my entry on why NAT might be inevitable (at least for us) even with IPv6 was to ask if there really was a problem with being generous with IPv6 allocations, since they are (nominally) so large. Today I want to do some rough calculations on this, working backward from what we might reasonably assign to end user devices. There's a lot of hand-waving and assumptions here, and you can question a lot of them.

I'll start with the assumption that the minimum acceptable network size is a /64, for various reasons including SLAAC. As discussed, end devices presenting themselves on our network may need some number of /64s for internal use. Let's assume that we'll allocate sixteen /64s to each device, meaning that we give out /60s to each device on each of our subnets.

I think it's unlikely we'll want to ever have a subnet with more than 2048 devices on it (and even that's generous). That many /60s is a /49. However, some internal groups have more than one IPv4 subnet today, so for future expansion let's say that each group gets eight IPv6 subnets, so we give out /46s to research groups (or we could trim some of these sizes and give out /48s, which seems to be a semi-standard allocation size that various software may be more happy with).

We have a number of IPv4 subnets (and of research groups). If we want to allow for growth, various internal uses, and so on, we want some extra room, so I think we'd want space for at least 128 of these /46 allocations, which gets us to an overall allocation for our department of a /39 (a /38 if we want 256 just to be sure). The University of Toronto currently has a /32, so we actually have some allocation problems. For a start, the university has three campuses and it might reasonably want to split its /32 allocation into four and give one /34 to each campus. At a /34 for the campus, there are only 32 /39s and the university has many more departments and groups than that.

If the university starts with a /32, splits it to /34s for campuses, and wants to have room for 1024 or 2048 allocations within a campus, each department or group can get only a /44 or a /45 and all of our sizes would have to shrink accordingly; we'd need to drop at least five or six bits somewhere (say four subnets per group, eight or even four /64s per device, maybe 1024 devices maximum per subnet, etc).
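
All of this arithmetic is just adding up bits, which is easy to sanity-check with a small Go sketch; the sizes are the assumptions from this entry, not anything official.

package main

import (
	"fmt"
	"math/bits"
)

// bitsFor returns how many prefix bits it takes to hold n allocations,
// rounded up to a power of two.
func bitsFor(n uint) int {
	return bits.Len(n - 1)
}

func main() {
	device := 64 - bitsFor(16)       // 16 /64s per device gives a /60
	subnet := device - bitsFor(2048) // 2048 devices per subnet gives a /49
	group := subnet - bitsFor(8)     // 8 subnets per group gives a /46
	dept := group - bitsFor(128)     // 128 group allocations gives a /39
	fmt.Println(device, subnet, group, dept) // prints "60 49 46 39"
}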

If my understanding of how you're supposed to do IPv6 is correct, what makes all of this more painful in a purist IPv6 model is that you're not supposed to allocate multiple, completely separate IPv6 subnets to someone, unlike in the IPv4 world. Instead, everything is supposed to live under one IPv6 prefix. This means that the IPv6 prefix absolutely has to have enough room for future growth, because otherwise you have to go through a very painful renumbering to move to another prefix.

(For instance, today the department has multiple IPv4 /24s allocated to it, not all of them contiguous. We also work this way with our internal use of RFC 1918 address space, where we just allocate /16s as we need them.)

Being able to allocate multiple subnets of some size (possibly a not that large one) to departments and groups would make it easier to not over-allocate to deal with future growth. We might still have problems with the 'give every device eight /64s' plan, though.

(Of course we could do this multiple subnets allocation internally even if the university gives us only a single IPv6 prefix. Probably everything can deal with IPv6 used this way, and it would certainly reduce the number of bits we need to consume.)

The general problem of losing network based locks

By: cks
6 November 2024 at 03:38

There are many situations and protocols where you want to hold some sort of lock across a network between, generically, a client (who 'owns' the lock) and a server (who manages the locks on behalf of clients and maintains the locking rules). Because a network is involved, one of the broad problems that can happen in such a protocol is that the client can have a lock abruptly taken away from it by the server. This can happen because the server was instructed to break the lock, or the server restarted in some way and notified the clients that they had lost some or all of their locks, or perhaps there was a network partition that led to a lock timeout.

When the locking protocol and the overall environment are specifically designed with this in mind, you can try to require clients to specifically think about the possibility. For example, you can have an API that requires clients to register a callback for 'you lost a lock', or you can have specific error returns to signal this situation, or at the very least you can have an 'is this lock still valid' operation (or 'I'm doing this operation on something that I think I hold a lock for, give me an error if I'm wrong'). People writing clients can still ignore the possibility, just as they can ignore the possibility of other network errors, but at least you tried.
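
As a sketch of what such a deliberately designed API might look like (this is a made-up Go interface, not any particular protocol's API), you can force clients to at least see the possibility of lock loss at acquisition time and on every operation.

package main

import "errors"

// ErrLockLost is returned by operations attempted under a lock that the
// server has revoked, broken, or timed out.
var ErrLockLost = errors.New("network lock lost")

// Lock is a hypothetical client-side handle for a server-managed lock.
type Lock interface {
	// Valid reports whether the server still considers us the holder.
	Valid() bool
	// Do runs op only if the lock is still held; otherwise it returns
	// ErrLockLost without running op.
	Do(op func() error) error
	Release() error
}

// LockService hands out locks. Registering an on-loss callback is part of
// acquiring one, so a client can't easily forget that the lock can vanish.
type LockService interface {
	Acquire(name string, onLost func(Lock)) (Lock, error)
}

func main() {}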

However, network locking is sometimes added to things that weren't originally designed for it. One example is (network) filesystems. The basic 'filesystem API' doesn't really contemplate locking and especially it doesn't consider that you can suddenly have access to a 'file' taken away from you in mid-flight. If you add network locking you don't have a natural answer to handling losing locks and there's no obvious point in the API to add it, especially if you want to pretend that your network filesystem is the same as a local filesystem. This makes it much easier for people writing programs to not even think about the possibility of losing a network lock during operation.

(If you're designing a purely networked filesystem-like API, you have more freedom; for example, you can make locking operations turn a regular 'file descriptor' into a special 'locked file descriptor' that you have to do subsequent IO through and that will generate errors if the lock is lost.)

One of the meta-problems with handling losing a network lock is that there's no single answer for what you should do about it. In some programs, you've violated an invariant and the only safe move for the program is to exit or crash. In some programs, you can pause operations until you can re-acquire the lock. In other programs you need to bail out to some sort of emergency handler that persists things in another way or logs what should have been done if you still held the lock. And when designing your API (or APIs) for losing locks, how likely you think each option is will influence what features you offer (and it will also influence how interested programs are in handling losing locks).

PS: A contributing factor to programmers and programs not being interested in handling losing network locks is that they're generally somewhere between uncommon and rare. If lots of people are writing code to deal with your protocol and lost locks are uncommon enough, some amount of those people will just ignore the possibility, just like some amount of programmers ignore the possibility of IO errors.

I feel that NAT is inevitable even with IPv6

By: cks
3 November 2024 at 02:23

Over on the Fediverse, I said something unpopular about IPv6 and NAT:

Hot take: NAT is good even in IPv6, because otherwise you get into recursive routing and allocation problems that have been made quite thorny by the insistence of so many things that a /64 is the smallest block they will work with (SLAAC, I'm looking at you).

Consider someone's laptop running multiple VMs and/or containers on multiple virtual subnets, maybe playing around with (virtual) IPv6 routers too.

(Partly in re <other Fediverse post>.)

The basic problem is straightforward. Imagine that you're running a general use wired or wireless network, where people connect their devices. One day, someone shows up with a (beefy) laptop that they've got some virtual machines (or container images) with a local (IPv6) network that is 'inside' their laptop. What IPv6 network addresses do these virtual machines get when the laptop is connected to your network and how do you make this work?

In a world where IPv6 devices and software reliably worked on subnet sizes smaller than a /64, this would be sort of straightforward. Your overall subnet might be a /64, and you would give each device connecting to it a /96 via some form of prefix delegation. This would allow a large number of devices on your network and also for each device to sub-divide its own /96 for local needs, with lots of room for multiple internal subnets for virtual machines, containers, or whatever else.

(And if a device didn't signal a need for a prefix delegation, you could give it a single IPv6 address from the /64, which would probably be the common case.)
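
The mechanics of that sub-division are simple if the software cooperates. Here's an illustrative Go sketch using the documentation prefix 2001:db8::/32 (so every address in it is made up) that carves /96s out of a /64 by writing an index into bits 64 through 95:

    package main

    import (
        "fmt"
        "net/netip"
    )

    // nthSub96 returns the i'th /96 inside the given /64 by writing i
    // into bits 64..95 of the address.
    func nthSub96(p netip.Prefix, i uint32) netip.Prefix {
        a := p.Addr().As16()
        a[8] = byte(i >> 24)
        a[9] = byte(i >> 16)
        a[10] = byte(i >> 8)
        a[11] = byte(i)
        return netip.PrefixFrom(netip.AddrFrom16(a), 96)
    }

    func main() {
        device := netip.MustParsePrefix("2001:db8:1234:5678::/64")
        for i := uint32(0); i < 4; i++ {
            // prints 2001:db8:1234:5678::/96, then ...:0:1::/96, and so on
            fmt.Println(nthSub96(device, i))
        }
    }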

In a world where lots of things insist on being on an IPv6 /64, this is extremely not trivial. Hosts will show up that want zero, one, or several /64s delegated to them, and both you and they may need those multiple /64s to fit into the same larger allocation of a /63, a /62, or so on. Worse, if more hosts than you expected show up asking for more delegations than you budgeted for, you'll need to expand the overall allocation to the entire network and everything under it, which at a minimum may be disruptive. Also, the IPv6 address space is large, but if you chop off half of it it's not that large, especially when you need to consume large blocks of it for contiguous delegations and sub-delegations and sub-sub delegations and so on.

I've described this as a laptop but there are other scenarios that are also perfectly reasonable. For example, suppose that you're setting up a subnet for a university research group that currently operates zero containers, virtual machine hosts, and the like (each of which would require at least one /64). Considering that research groups can and do change their mind on what they're running, how many additional /64s should you budget for them eventually needing, and what do you do when it turns out that they want to operate more than that?

IPv6 NAT gets you out of all of this. You assign an IPv6 address on your subnet's /64 to that laptop or server (or it SLAAC's one for itself), and everything else is its problem, not yours. Its containers and virtual machines get IPv6 addresses from some address space that's not your problem, and the laptop (or server) NATs all of their traffic back and forth. You don't have to know or care about how many internal networks the laptop (or server) is hiding, if it's got some sort of internal routing hierarchy, or anything.

I expect this use of IPv6 NAT to primarily be driven by the people with these laptops and servers, not by the people in charge of IPv6 network design. If you're someone with a laptop that has some containers or VMs that you need to work with, and you plug in to a network that isn't already specifically designed to accommodate you (for example it's just a /64), your practical choices are either IPv6 NAT or containers that can't talk to anything. The people running the network are pretty unlikely to redesign it for you (often their answer will be 'that's not supported on this network'), and if they do, the new network design is unlikely to be deployed immediately (or even very soon).

(I don't believe that delegating a single /64 to each machine is a particularly workable solution. It still leaves you with problems if any machine wants multiple internal IPv6 subnets, and it consumes your IPv6 address space at a prodigious rate if you're designing for a reasonable number of machines on each subnet. I'm also not sure how everyone on the subnet is supposed to know how to talk to each other, which is something that people often do on subnets.)

Two visions of 'software supply chain security'

By: cks
21 October 2024 at 03:04

Although the website that is insisting I use MFA if I want to use it to file bug reports doesn't use the words in its messages to me, we all know that the reason it is suddenly demanding I use MFA is what is broadly known as "software supply chain security" and the 'software supply chain' (which is a contentious name for deciding that you're going to rely on other people's open source code). In thinking about this, I feel that you can have (at least) two visions of "software supply chain security".

In one vision, software supply chain security is a collection of well intentioned moves and changes that are intended to make it harder for bad actors to compromise open source projects and their source code. For instance, all of the package repositories and other places where software is distributed try to get everyone to use multi-factor authentication, so people with the ability to publish new versions of packages can't get their (single) password compromised and have that password used by an attacker to publish a compromised version of their package. You might also expect to see people looking into heavily used, security critical projects to see if they have enough resources and then some moves to provide those resources.

In the other vision, software supply chain security is a way for corporations to avoid being blamed when there's a security issue in open source software that they've pulled into their products or their operations (or both). Corporations mostly don't really care about achieving actual security, especially since real security may not be legibly secure, but they are sensitive to blame, especially because it can result in lawsuits, fines, and other consequences. If a corporation can demonstrate that it was following convincing best practices to obtain secure (open source) software, maybe it can deflect the blame. And when doing this, it's useful if the 'best practices' are clearly legible and easy to assess, such as 'where we get open source software from insists on MFA'.

In the second vision, you might expect a big (corporate) push for visible but essentially performative 'security' steps, with relatively little difficult analysis of underlying root causes of various security risks, much less much of an attempt to address deep structural issues like sustainable open source maintenance.

(If you want an extremely crude measuring stick, you can simply ask "would this measure have prevented the XZ Utils backdoor". Generally the answer is 'no'.)

Forced MFA is effectively an annoying, harder to deal with second password

By: cks
20 October 2024 at 02:32

Suppose, not hypothetically, that some random web site you use is forcing you to enable MFA on your account, possibly an account that in practice you use only to do unimportant things like report issues on other people's open source software. I've written before how MFA is both 'simple' and non-trivial work, but that entry half assumed that you might actually care about the extra security benefits of MFA. If some random unimportant (to you) website is forcing you to get MFA, this goes out the window.

What the website is really doing is forcing you to enable a second password for your account, one that you must use in addition to your first password. Instead of using a password saved in your password manager of choice, you must now use the same saved password plus an additional password that is invariably slower and more work to produce. We understand today that websites that prevent you (or your password manager) from pasting in passwords and force you to type them out by hand are doing it wrong; well, that's what MFA is doing, except that often you're going to need a second device to get that password (whether that is a phone or a security key).

(For extra bonus points, losing the second 'password' alone may be enough to permanently lose your account on the website. At the very least, you're going to need to do a number of extra things to avoid this.)

My view is that if something unimportant is forcing MFA on you and you don't feel like giving up on the site entirely, you might as well use the simplest, easiest to use MFA approach that you can. If the website will never let you in with the second factor alone, then it's perfectly okay for it to be relatively or completely insecure, and in any case you don't need to make it any more secure than your existing password management. In fact you might as well put it in your existing password management if possible, although I suspect that there are no current password managers that will both hold your password for a site and (automatically) generate the related TOTP MFA codes to go with it.

(You can get this on the same device, when you log in from your smartphone using its saved passwords and whatever authenticator app you're using. Don't ask how this is actually 'multi-factor', since anyone with your unlocked phone can use both factors; almost everyone in the MFA space is basically ignoring the issue because it would be too inconvenient to take it seriously.)
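
Part of why this is reasonable is that a TOTP code is nothing more than something computed from a stored shared secret, so in principle it can live right next to the password. Here is a minimal Go sketch of the standard RFC 6238 computation (6 digits, 30 second steps); the base32 secret is made up purely for the example.

    package main

    import (
        "crypto/hmac"
        "crypto/sha1"
        "encoding/base32"
        "encoding/binary"
        "fmt"
        "time"
    )

    // totp computes a standard 6-digit, 30-second TOTP code from a
    // base32-encoded shared secret (the usual authenticator app setup).
    func totp(secret string, at time.Time) (string, error) {
        key, err := base32.StdEncoding.WithPadding(base32.NoPadding).DecodeString(secret)
        if err != nil {
            return "", err
        }
        var counter [8]byte
        binary.BigEndian.PutUint64(counter[:], uint64(at.Unix()/30))
        mac := hmac.New(sha1.New, key)
        mac.Write(counter[:])
        sum := mac.Sum(nil)
        offset := sum[len(sum)-1] & 0x0f
        code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff
        return fmt.Sprintf("%06d", code%1000000), nil
    }

    func main() {
        code, err := totp("JBSWY3DPEHPK3PXP", time.Now()) // made-up secret
        if err != nil {
            panic(err)
        }
        fmt.Println(code)
    }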

Will this defeat the website's security goals for forcing MFA down your throat? Yes, absolutely. But that's their problem, not yours. You are under no obligation to take any website (or your presence on it) as seriously as it takes itself. MFA that is not helping anything you care about is an obstacle, not a service.

Of course, sauce for the goose is sauce for the gander, so if you're implementing MFA for your good local security needs, you should be considering if the people who have to use it are going to think of your MFA in this way. Maybe they shouldn't, but remember, people don't actually care about security (and people matter because security is people).

A surprising IC in a LED light chain.

By: cpldcpu
25 November 2024 at 19:23

LED-based festive decorations are a fascinating subject for exploration of ingenuity in low-cost electronics. New products appear every year and often very surprising technology approaches are used to achieve some differentiation while adding minimal cost.

This year, there wasn’t any fancy new controller, but I was surprised how much the cost of simple light strings was reduced. The LED string above includes a small box with batteries and came in a set of ten for less than $2 shipped, so <$0.20 each. While I may have benefitted from promotional pricing, it is also clear that quite some work went into making the product cheap.

The string is constructed in the same way as one I had analyzed earlier: it uses phosphor-converted blue LEDs that are soldered to two insulated wires and covered with an epoxy blob. In contrast to the earlier device, they seem to have switched from copper wire to cheaper steel wires.

The interesting part is in the control box. It comes with three button cells, a small PCB, and a tactile button that turns the string on and cycles through different modes of flashing and constant light.

Curiously, there is nothing on the PCB except the button and a device that looks like an LED. Also, note how some “redundant” joints have simply been left unsoldered.

Closer inspection reveals that the “LED” is actually a very small integrated circuit packaged in an LED package. The four pins are connected to the push button, the cathode of the LED string, and the power supply pins. I didn’t measure the die size exactly, but I estimate that it is smaller than 0.3×0.2 mm² = ~0.06 mm².

What is the purpose of packaging an IC in an LED package? Most likely, the company that made the light string is also packaging their own LEDs, and they saved costs by also packaging the IC themselves – in a package type they had available.

I characterized the current-voltage behavior of IC supply pins with the LED string connected. The LED string started to emit light at around 2.7V, which is consistent with the forward voltage of blue LEDs. The current increased proportionally to the voltage, which suggests that there is no current limit or constant current sink in the IC – it’s simply a switch with some series resistance.

Left: LED string in “constantly on” mode. Right: Flashing

Using an oscilloscope, I found that the string is modulated with an on-off ratio of 3:1 at a frequency of ~1.2 kHz. The image above shows the voltage at the cathode; the anode is connected to the positive supply. This is most likely done to limit the current.

All in all, it is rather surprising to see an ASIC being used when it barely does more than flashing the LED string. It would have been nice to see a constant current source to stabilize the light levels over the lifetime of the battery and maybe more interesting light effects. But I guess that would have increased the cost of the ASIC too much and then using an ultra-low cost microcontroller may have been cheaper. This almost calls for a transplant of an MCU into this device…

Keep the crap going

By: VM
6 December 2024 at 09:16

Have you seen the new ads for Google Gemini?

In one version, just as a young employee is grabbing her fast-food lunch, she notices her snooty boss get on an elevator. So she drops her sandwich, rushes to meet her just as the doors are about to close, and submits her proposal in the form of a thick dossier. The boss asks her for a 500-word summary to consume during her minute-long elevator ride. The employee turns to Google Gemini, which digests the report and spits out the gist, and which the employee regurgitates to the boss’s approval. The end.


Isn’t this unsettling? Google isn’t alone either. In May this year, Apple released a tactless ad for its new iPad Pro. From Variety:

The “Crush!” ad shows various creative and cultural objects – including a TV, record player, piano, trumpet, guitar, cameras, a typewriter, books, paint cans and tubes, and an arcade game machine – getting demolished in an industrial press. At the end of the spot, the new iPad Pro pops out, shiny and new, with a voiceover that says, “The most powerful iPad ever is also the thinnest.”

After the backlash, Apple backtracked and apologised – and then produced two ads in November for its Apple Intelligence product showcasing how it could help thoughtless people continue to be thoughtless.



The second video is additionally weird because it seems to suggest reaching all the way for an AI tool makes more sense than setting a reminder on the calendar that comes in all smartphones these days.

And they are now joined in spirit by Google, because bosses can now expect their subordinates to Geminify their way through what could otherwise have been tedious work or just impossible to do on punishingly short deadlines – without the bosses having to think about whether their attitudes towards what they believe is reasonable to ask of their teammates need to change. (This includes a dossier of details that ultimately won't be read.)

If AI is going to absorb the shock that comes of someone being crappy to you, will we continue to notice that crappiness and demand they change or – as Apple and Google now suggest – will we blame ourselves for not using AI to become crappy ourselves? To quote from a previous post:

When machines make decisions, the opportunity to consider the emotional input goes away. This is a recurring concern I’m hearing about from people working with or responding to AI in some way. … This is Anna Mae Duane, director of the University of Connecticut Humanities Institute, in The Conversation: “I fear how humans will be damaged by the moral vacuum created when their primary social contacts are designed solely to serve the emotional needs of the ‘user’.”

The applications of these AI tools have really blossomed and millions of people around the world are using them for all sorts of tasks. But even if the ads don’t pigeonhole these tools, they reveal how their makers – Apple and Google – are thinking about what the tools bring to the table and what these tech companies believe to be their value. To Google’s credit at least, its other ads in the same series are much better (see here and here for examples), but they do need to actively cut down on supporting or promoting the idea that crappy behaviour is okay.

Two views of what a TLS certificate verifies

By: cks
2 October 2024 at 01:58

One of the things that you could ask about TLS is what a validated TLS certificate means or is verifying. Today there is a clear answer, as specified by the CA/Browser Forum, and that answer is that when you successfully connect to https://microsoft.com/, you are talking to the "real" microsoft.com, not an impostor who is intercepting your traffic in some way. This is known as 'domain control' in the jargon; to get a TLS certificate for a domain, you must demonstrate that you have control over the domain. The CA/Browser Forum standards (and the browsers) don't require anything else.

Historically there has been a second answer, what TLS (then SSL) sort of started with. A TLS certificate was supposed to verify not just the domain but that you were talking to the real "Microsoft" (which is to say the large, worldwide corporation with its headquarters in Redmond WA, not any other "Microsoft" that might exist). More broadly, it was theoretically verifying that you were talking to a legitimate and trustworthy site that you could, for example, give your credit card number to over the Internet, which used to be a scary idea.

This second answer has a whole raft of problems in practice, which is why the CA/Browser Forum has adopted the first answer, but it started out and persists because it's much more useful to actual people. Most people care about talking to (the real) Google, not some domain name, and domain names are treacherous things as far as identity goes (consider IDN homograph attacks, or just 'facebook-auth.com'). We rather want this human version of identity and it would be very convenient if we could have it. But we can't. The history of TLS certificates has convincingly demonstrated that this version of identity has comprehensively failed for a collection of reasons including that it's hard, expensive, difficult or impossible to automate, and (quite) fallible.

(The 'domain control' version of what TLS certificates mean can be automated because it's completely contained within the Internet. The other version is not; in general you can't verify that sort of identity using only automated Internet resources.)

A corollary of this history is that no Internet protocol that's intended for widespread usage can assume a 'legitimate identity' model of participants. This includes any assumption that people can only have one 'identity' within your system; in practice, since Internet identity can only verify that you are something, not that you aren't something, an attacker can have as many identities as they want (including corporate identities).

PS: The history of commercial TLS certificates also demonstrates that you can't use costing money to verify legitimacy. It sounds obvious to say it, but all that charging someone money demonstrates is that they are willing and able to spend some money (perhaps because they have a pet cause), not that they're legitimate.

TLS certificates were (almost) never particularly well verified

By: cks
22 September 2024 at 02:32

Recently there was a little commotion in the TLS world, as discussed in We Spent $20 To Achieve RCE And Accidentally Became The Admins Of .MOBI. As part of this adventure, the authors of the article discovered that some TLS certificate authorities were using WHOIS information to validate who controlled a domain (so if you could take over a WHOIS server for a TLD, you could direct domain validation to wherever you wanted). This then got some people to realize that TLS Certificate Authorities were not actually doing very much to verify who owned and controlled a domain. I'm sure that there were also some people who yearned for a hypothetical old days when Certificate Authorities actually did that, as opposed to the modern days when they don't.

I'm afraid I have bad news for anyone with this yearning. Certificate Authorities have never done a particularly strong job of verifying who was asking for a TLS (then SSL) certificate. I will go further and be more controversial; we don't want them to be thorough about identity verification for TLS certificates.

There are a number of problems with identity verification in theory and in practice, but one of them is that it's expensive, and the more thorough and careful the identity verification, the more expensive it is. No Certificate Authority is in a position to absorb this expense, so a world where TLS certificates are carefully verified is also a world where they are expensive. It's also probably a world where they're difficult or impossible to obtain from a Certificate Authority that's not in your country, because the difficulty of identity verification goes up significantly in that case.

(One reason that thorough and careful verification is expensive is that it takes significant time from experienced, alert humans, and that time is not cheap.)

This isn't the world that we had even before Let's Encrypt created the ACME protocol for automated domain verifications. The pre-LE world might have started out with quite expensive TLS certificates, but it shifted fairly rapidly to ones that cost only $100 US or less, which is a price that doesn't cover very much human verification effort. And in that world, with minimal human involvement, WHOIS information is probably one of the better ways of doing such verification.

(Such a world was also one without a lot of top level domains, and most of the TLDs were country code TLDs. The turnover in WHOIS servers was probably a lot smaller back then.)

PS: The good news is that using WHOIS information for domain verification is probably on the way out, although how soon this will happen is an open question.

Threads, asynchronous IO, and cancellation

By: cks
14 September 2024 at 02:23

Recently I read Asynchronous IO: the next billion-dollar mistake? (via), and had a reaction to one bit of it. Then yesterday on the Fediverse I said something about IO in Go:

I really wish you could (easily) cancel io Reads (and Writes) in Go. I don't think there's any particularly straightforward way to do it today, since the io package was designed way before contexts were a thing.

(The underlying runtime infrastructure can often actually do this because it decouples 'check for IO being possible' from 'perform the IO', but stuff related to this is not actually exposed.)

Today this sparked a belated realization in my mind, which is that a model of threads performing blocking IO in each thread is simply a harder environment to have some sort of cancellation in than an asynchronous or 'event loop' environment. The core problem is that in their natural state, threads are opaque and therefore difficult to interrupt or stop safely (which is part of why Go's goroutines can't be terminated from the outside). This is the natural inverse of how threads handle state for you.

(This is made worse if the thread is blocked in the operating system itself, for example in a 'read()' system call, because now you have to use operating system facilities to either interrupt the system call so the thread can return to user level to notice your user level cancellation, or terminate the thread outright.)

Asynchronous IO generally lets you do better in a relatively clean way. Depending on the operating system facilities you're using, either there is a distinction between the OS telling you that IO is possible and your program doing IO, providing you a chance to not actually do the IO, or in an 'IO submission' environment you generally can tell the OS to cancel a submitted but not yet completed IO request. The latter is racy, but in many situations the IO is unlikely to become possible right as you want to cancel it. Both of these let you implement a relatively clean model of cancelling a conceptual IO operation, especially if you're doing the cancellation as the result of another IO operation.

Or to put it another way, event loops may make you manage state explicitly, but that also means that that state is visible and can be manipulated in relatively natural ways. The implicit state held in threads is easy to write code with but hard to reason about and work with from the outside.
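
For the Go situation I was grumbling about, the usual workaround today is to interrupt a blocked network read from the outside, for example by watching a context and expiring the connection's read deadline. This is only a sketch of that workaround (the package and function names are invented for the example), not the io package growing real cancellation support:

    package netutil

    import (
        "context"
        "net"
        "time"
    )

    // ReadWithContext does a single conn.Read that can be cancelled
    // through ctx by forcing the blocked Read to fail via the
    // connection's read deadline.
    func ReadWithContext(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
        stop := make(chan struct{})
        go func() {
            select {
            case <-ctx.Done():
                // Wake up the blocked Read with an immediate timeout.
                conn.SetReadDeadline(time.Now())
            case <-stop:
            }
        }()
        n, err := conn.Read(buf)
        close(stop)
        if err != nil && ctx.Err() != nil {
            // The read failed because we cancelled it; report that.
            err = ctx.Err()
        }
        // Reset the deadline so later reads aren't affected.
        conn.SetReadDeadline(time.Time{})
        return n, err
    }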

Sidebar: My particular Go case

I have a Go program that at its core involves two goroutines, one reading from standard input and writing to a network connection, one reading from the network connection and writing to standard output. Under some circumstances, the goroutine reading from the network will want to close down the network connection and return to a top level, where another two way connection will be made. In the process, it needs to stop the 'read from stdin, write to the network' goroutine while it is parked in 'read from stdin', without closing stdin (because that will be reused for the next connection).

To deal with this cleanly, I think I would have to split the 'read from standard input, write to the network' goroutine into two that communicated through a channel. Then the 'write to the network' side could be replaced separately from the 'read from stdin' side, allowing me to cleanly substitute a new network connection.

(I could also use global variables to achieve the same substitution, but let's not.)
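
A rough sketch of that split, with all of the names invented for the example: the stdin-reading goroutine runs for the life of the program and only ever feeds a channel, while a per-connection goroutine drains the channel into whatever the current network connection is and can be replaced without touching stdin.

    package relay

    import (
        "net"
        "os"
    )

    // readStdin runs for the whole program, feeding chunks of standard
    // input into a channel that connection writers drain. It closes the
    // channel when stdin reaches EOF or errors.
    func readStdin(out chan<- []byte) {
        buf := make([]byte, 4096)
        for {
            n, err := os.Stdin.Read(buf)
            if n > 0 {
                chunk := make([]byte, n)
                copy(chunk, buf[:n])
                out <- chunk
            }
            if err != nil {
                close(out)
                return
            }
        }
    }

    // writeConn drains the channel into one network connection, returning
    // when the connection fails or when it's told to stop; the stdin
    // reader (and anything still in the channel) stays intact for the
    // next connection.
    func writeConn(conn net.Conn, in <-chan []byte, stop <-chan struct{}) {
        for {
            select {
            case chunk, ok := <-in:
                if !ok {
                    return
                }
                if _, err := conn.Write(chunk); err != nil {
                    return
                }
            case <-stop:
                return
            }
        }
    }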

Ways ATX power supply control could work on server motherboards

By: cks
11 September 2024 at 03:02

Yesterday I talked about how ATX power supply control seems to work on desktop motherboards, which is relatively straightforward; as far as I can tell from various sources, it's handled in the chipset (on modern Intel chipsets, in the PCH), which is powered from standby power by the ATX power supply. How things work on servers is less clear. Here when I say 'server' I mean something with a BMC (Baseboard management controller), because allowing you to control the server's power supply is one of the purposes of a BMC, which means the BMC has to hook into this power management picture.

There appear to be a number of ways that the power control and management could or may be done and the BMC connected to it. People on the Fediverse replying to my initial question gave me a number of possible answers.

I found documentation for some of Intel's older Xeon server chipsets (with provisions for BMCs) and as of that generation, power management was still handled in the PCH and described in basically the same language as for desktops. I couldn't spot a mention of special PCH access for the BMC, so BMC control over server power might have been implemented with the 'BMC controls the power button wire' approach.

I can also imagine hybrid approaches. For example, you could in theory give the BMC control over the 'turn power on' wire to the power supplies, and route the chipset's version of that line to the BMC, in addition to routing the power button wire to the BMC. Then the BMC would be in a position to force a hard power off even if something went wrong in the chipset (or a hard power on, although if the chipset refuses to trigger a power on there might be a good reason for that).

(Server power supplies aren't necessarily 'ATX' power supplies as such, but I suspect that they all have similar standby power, 'turn power on', and 'is the PSU power stable' features as ATX PSUs do. Server PSUs often clearly aren't plain ATX units because they allow the BMC to obtain additional information on things like the PSU's state, temperature, current power draw, and so on.)

Our recent experience with BMCs that wouldn't let their servers power on when they should have suggests that on these servers (both Dell R340s), the BMC has some sort of master control or veto power over the normal 'return to last state' settings in the BIOS. At the same time, the 'what to do after AC power returns' setting is in the BIOS, not in the BMC, so it seems that the BMC is not the sole thing controlling power.

(I tried to take a look at how this was done in OpenBMC, but rapidly got lost in a twisty maze of things. I think at least some of the OpenBMC supported hardware does this through I2C commands, although what I2C device it's talking to is a good question. Some of the other hardware appears to have GPIO signal definitions for power related stuff, including power button definitions.)

How ATX power supply control seems to work on desktop motherboards

By: cks
10 September 2024 at 03:11

Somewhat famously, the power button on x86 PC desktop machines with ATX power supplies is not a 'hard' power switch that interrupts or enables power through the ATX PSU but a 'soft' button that is controlled by the overall system. The actual power delivery is at least somewhat under software control, both by the operating system (which enables modern OSes to actually power off the machine under software control) and by the 'BIOS', broadly defined, which will do things like signal the OS to do an orderly shutdown if you merely tap the power button instead of holding it down for a few seconds. Because they're useful, 'soft' power buttons and the associated things have also spread to laptops and servers, even if their PSUs are not necessarily 'ATX' as such. After recent events, I found myself curious about what actually did handle the chassis power button and associated things. Asking on the Fediverse produced a bunch of fascinating answers, so today I'm starting with plain desktop motherboards, where the answer seems to be relatively straightforward.

(As I looked up once, physically the power button is normally a momentary-contact switch that is open (off) when not pressed. A power button that's stuck 'pressed' can have odd effects.)

At the direct electrical level, ATX PSUs are either on, providing their normal power, or "off", which is not really completely off but has the PSU providing +5V standby power (with a low current limit) on a dedicated pin (pin 9, the ATX cable normally uses a purple wire for this). To switch an ATX PSU from "off" to on, you ground the 'power on' pin and keep it grounded (pin 16; the green wire in normal cables, and ground is black wires). After a bit of stabilization time, the ATX PSU will signal that all is well on another pin (pin 8, the grey wire). The ATX PSU's standby power is used to power the RTC and associated things, to provide the power for features like wake-on-lan (which requires network ports to be powered up at least a bit), and to power whatever handles the chassis power button when the PSU is "off".

On conventional desktop motherboards, the actual power button handling appears to be in the PCH or its equivalent (per @rj's information on the ICH, and also see Whitequark's ICH/PCH documentation links). In the ICH/PCH, this is part of general power management, including things like 'suspend to RAM'. Inside the PCH, there's a setting (or maybe two or three) that controls what happens when external power is restored; the easiest to find one is called AFTERG3_EN, which is a single bit in one of the PCH registers. To preserve this register's settings over loss of external power, it's part of what the documentation calls the "RTC well", which is apparently a chunk of stuff that's kept powered as part of the RTC, either from standby power or from the RTC's battery (depending on whether or not there's external power available). The ICH/PCH appears to have a direct "PWRBTN#" input line, which is presumably eventually connected to the chassis power button, and it directly implements the logic for handling things like the 'press and hold for four seconds to force a power off' feature (which Intel describes as 'transitioning to S5', the "Soft-Off" state).

('G3' is the short Intel name for what Intel calls "Mechanical Off", the condition where there's no external power. This makes the AFTERG3_EN name a bit clearer.)

As far as I can tell there's no obvious and clear support for the modern BIOS setting of 'when external power comes back, go to your last state'. I assume that what actually happens is that the ICH/PCH register involved is carefully updated by something (perhaps ACPI) as the system is powered on and off. When the system is powered on, early in the sequence you'd set the PCH to 'go to S0 after power returns'; when the system is powered off, right at the end you'd set the PCH to 'stay in S5 after power returns'.

(And apparently you can fiddle with this register yourself (via).)

All of the information I've dug up so far is for Intel ICH/PCH, but I suspect that AMD's chipsets work in a similar manner. Something has to do power management for suspend and sleep, and it seems that the chipset is the natural spot for it, and you might as well put the 'power off' handling into the same place. Whether AMD uses the same registers and the same bits is an open question, since I haven't turned up any chipset documentation so far.

Operating system threads are always going to be (more) expensive

By: cks
7 September 2024 at 04:01

Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:

Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]

I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires the operating system to go through non-trivial work in order to be safe from user level attacks.

(The whole speculative execution set of attacks has only made this worse.)

A great deal of the low level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest system call possible still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).

The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just for the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work to get around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO once you peer behind the scenes of the abstractions that programs are seeing.

(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)

(See also thinking about event loops versus threads.)

TLS Server Name Indications can be altered by helpful code

By: cks
4 September 2024 at 03:25

In TLS, the Server Name Indication is how (in the modern TLS world) you tell the TLS server what (server) TLS certificate you're looking for. A TLS server that has multiple TLS certificates available, such as a web server handling multiple websites, will normally use your SNI to decide what server TLS certificate to provide to you. If you provide an SNI that the TLS server doesn't know, or don't provide one at all, the TLS server can do a variety of things, but many will fall back to some default TLS certificate. SNI is pervasive in web PKI but not always used elsewhere; for example, SMTP clients don't always send SNI when establishing TLS with an SMTP server.

The official specification for SNI is section 3 of RFC 6066, and it permits exactly one format of the SNI data, which is, let's quote:

"HostName" contains the fully qualified DNS hostname of the server, as understood by the client. The hostname is represented as a byte string using ASCII encoding without a trailing dot. [...]

Anything other than this is an incorrectly formatted SNI. In particular, sending a SNI using a DNS name with a dot at the end (the customary way of specifying a fully qualified name in the context of DNS) is explicitly not allowed under RFC 6066. RFC 6066 SNI names are always fully qualified and without the trailing dots.

So what happens if you provide a SNI with a trailing dot? That depends. In particular, if you're providing a name with a trailing dot to a client library or a client program that does TLS, the library may helpfully remove the trailing dot for you when it sends the SNI. Go's crypto/tls definitely behaves this way, and it seems that some other TLS libraries may as well. Based on observing behavior on systems I have access to, I believe that OpenSSL does strip the trailing dot but GnuTLS doesn't, and probably Mozilla's NSS doesn't either (since Firefox appears to not do this).

(I don't know what a TLS server sees as the SNI if it uses these libraries, but it appears likely that OpenSSL doesn't strip the trailing dot but instead passes it through literally.)

This dot stripping behavior is generally silent, which can lead to confusion if you're trying to test the behavior of providing a trailing dot in the SNI (which can cause web servers to give you errors). At the same time it's probably sensible behavior for the client side of TLS libraries, since some of the time they will be deriving the SNI hostname from the host name the caller has given them to connect to, and the caller may want to indicate a fully qualified DNS name in the customary way.

PS: Because I looked it up, the Go crypto/tls client code strips a trailing dot while the server code rejects a TLS ClientHelo that includes a SNI with a trailing dot (which will cause the TLS connection to fail).
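
If you want to watch this behaviour yourself, here's a self-contained Go sketch that records what server name actually arrives in the ClientHello, using an in-memory connection; the handshake is deliberately aborted once the name has been captured, and the host name is just an example.

    package main

    import (
        "crypto/tls"
        "errors"
        "fmt"
        "net"
    )

    func main() {
        clientSide, serverSide := net.Pipe()
        sawSNI := make(chan string, 1)

        // Server side: record the SNI from the ClientHello, then abort;
        // we only care about what arrived on the wire.
        go func() {
            srv := tls.Server(serverSide, &tls.Config{
                GetConfigForClient: func(chi *tls.ClientHelloInfo) (*tls.Config, error) {
                    sawSNI <- chi.ServerName
                    return nil, errors.New("done looking")
                },
            })
            srv.Handshake()
            serverSide.Close()
        }()

        // Client side: ask for a ServerName with a trailing dot.
        cli := tls.Client(clientSide, &tls.Config{
            ServerName:         "www.example.org.",
            InsecureSkipVerify: true,
        })
        cli.Handshake() // expected to fail; the server aborts on purpose
        clientSide.Close()

        fmt.Printf("server saw SNI %q\n", <-sawSNI)
    }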

The status of putting a '.' at the end of domain names

By: cks
2 September 2024 at 02:29

A variety of things that interact with DNS interpret the host or domain name 'host.domain.' (with a '.' at the end) as the same as the fully qualified name 'host.domain'; for example this appears in web browsers and web servers. At this point one might wonder whether this is an official thing in DNS or merely a common convention and practice. The answer is somewhat mixed.

In the DNS wire protocol, initially described in RFC 1035, we can read this (in section 3.1):

Domain names in messages are expressed in terms of a sequence of labels. Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. [...]

DNS has a 'root', which all DNS queries (theoretically) start from, and a set of DNS servers, the root nameservers, that answer the initial queries that tell you what the DNS servers are for a top level domain (such as the '.edu' or the '.ca' DNS servers). In the wire format, this root is explicitly represented as a 'null label', with zero length (instead of being implicit). In the DNS wire format, all domain names are fully qualified (and aren't represented as plain text).
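
To make that wire format concrete, here's a small Go sketch (not a real DNS encoder; it ignores label length limits and name compression) that encodes a fully qualified name as length-prefixed labels ending in the root's zero-length null label:

    package main

    import (
        "bytes"
        "fmt"
        "strings"
    )

    // encodeName converts a fully qualified DNS name into the RFC 1035
    // wire form: a length byte plus the label's bytes for each label,
    // terminated by the zero-length null label of the root.
    func encodeName(name string) []byte {
        var buf bytes.Buffer
        name = strings.TrimSuffix(name, ".")
        if name != "" {
            for _, label := range strings.Split(name, ".") {
                buf.WriteByte(byte(len(label)))
                buf.WriteString(label)
            }
        }
        buf.WriteByte(0) // the root's null label
        return buf.Bytes()
    }

    func main() {
        fmt.Printf("% x\n", encodeName("host.domain."))
        // prints: 04 68 6f 73 74 06 64 6f 6d 61 69 6e 00
    }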

RFC 1035 also defines a textual format to represent DNS information, Master files. When processing these files there is usually an 'origin', and textual domain names may be relative to that origin or absolute. The RFC says:

[...] Domain names that end in a dot are called absolute, and are taken as complete. Domain names which do not end in a dot are called relative; the actual domain name is the concatenation of the relative part with an origin specified in a $ORIGIN, $INCLUDE, or as an argument to the master file loading routine. A relative name is an error when no origin is available.

So in textual DNS data that follows RFC 1035's format, 'host.domain.' is how you specify an absolute (fully qualified) DNS name, as opposed to one that is under the current origin. Bind uses this format (or something derived from it, here in 2024 I don't know if it's strictly RFC 1035 compliant any more), and in hand-maintained Bind format zone files you can find lots of use of both relative and absolute domain names.

DNS data doesn't have to be represented in text in RFC 1035 form (and doing so has some traps), either for use by DNS servers or for use by programs who do things like look up domain names. However, it's not quite accurate to say that 'host.domain.' is only a convention. A variety of things use a more or less RFC 1035 format, and in those things a terminal '.' means an absolute name because that's how RFC 1035 says to interpret and represent it.

Since RFC 1035 uses a '.' at the end of a domain name to mean a fully qualified domain name, it's become customary for code to accept one even if the code already only deals with fully qualified names (for example, DNS lookup libraries). Every program that accepts or reports this format creates more pressure on other programs to accept it.

(It's also useful as a widely understood signal that the textual domain name returned through some API is fully qualified. This may be part of why Go's net package consistently returns results from various sorts of name resolutions with a terminating '.', including in things like looking up the name(s) of IP addresses.)
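
For example, this little Go program shows the trailing dots directly (the IP address is just a convenient public example; whatever names come back, they end in a '.'):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Reverse resolution; the returned names are fully qualified and
        // come back with a trailing dot, e.g. "dns.google.".
        names, err := net.LookupAddr("8.8.8.8")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        for _, n := range names {
            fmt.Println(n)
        }
    }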

At the same time, this syntax for fully qualified domain names is explicitly not accepted in certain contexts that have their own requirements. One example is in email addresses, where 'user@some.domain.' is almost invariably going to be rejected by mail systems as a syntax error.

In practice, abstractions hide their underlying details

By: cks
1 September 2024 at 01:58

Very broadly, there are two conflicting views of abstractions in computing. One camp says that abstractions simplify the underlying complexity but people still should know about what is behind the curtain, because all abstractions are leaky. The other camp says that abstractions should hide the underlying complexity entirely and do their best not to leak the details through, and that people using the abstraction should not need to know those underlying details. I don't particularly have a side, but I do have a pragmatic view, which is that many people using abstractions don't know the underlying details.

People can debate back and forth about whether people should know the underlying details and whether they are incorrect to not know them, but the well established pragmatic reality is that a lot of people writing a lot of code and building a lot of systems don't know more than a few of the details behind the abstractions that they use. For example, I believe that a lot of people in web development don't know that host and domain names can often have a dot at the end. And people who have opinions about programming probably have a favorite list of leaky abstractions that people don't know as much about as they should.

(One area a lot of programming abstractions 'leak' is around performance issues. For example, the (C)Python interpreter is often much faster if you make things local variables inside a function than if you use global variables because of things inside the abstraction it presents to you.)

That this happens should not be surprising. People have a limited amount of time and a limited amount of things that they can learn, remember, and keep track of. When presented with an abstraction, it's very attractive to not sweat the details, especially because no one can keep track of all of them. Computing is simply too complicated to see behind all of the abstractions all of the way down. Almost all of the time, your effort is better focused on learning and mastering your layer of the abstraction stack rather than trying to know 'enough' about every layer (especially when it's not clear how much is enough).

(Another reason to not dig too deeply into the details behind abstractions is that those details can change, especially if one reason the abstraction exists is to allow the details to change. We call some of these abstractions 'APIs' and discourage people investigating and using the specific details behind the current implementations.)

One corollary of this is that safety and security related abstractions need to be designed with the assumption that people using them won't know or remember all of the underlying details. If forgetting one of those details will leave people using the abstraction with security problems, the abstraction has a design flaw that will inevitably lead to a security issue sooner or later. This security issue is not the fault of the people using the abstraction, except in a mathematical security way.

My (current) view on open source moral obligations and software popularity

By: cks
24 August 2024 at 02:59

A while back I said something pretty strong in a comment on my entry on the Linux kernel CVE story:

(I feel quite strongly that the importance of a project cannot create substantial extra obligations on the part of the people working on the project. We do not get to insist that other people take on more work just because their project got popular. In my view, this is a core fallacy at the heart of a lot of "software supply chain security" stuff, and I think things like the Linux kernel CVE handling are the tip of an iceberg of open source reactions to it.)

After writing that, I thought about it more and I think I have a somewhat more complicated view on moral obligations (theoretically) attached to open source software. To try to boil it down, I feel that other people's decisions should not create a moral obligation on your part.

If you write a project to scratch your itch and a bunch of other people decide to use it too, that is on them, not on you. You have no moral obligation to them that accrues because they started using your software, however convenient it might be for them if you did or however much time might be saved if you did something instead of many or all of them doing something. Of course you may be a nice person, and you may also be the kind of person who is extremely conscious of how many people are relying on your software and what might happen to them if you did or didn't do various things, but that is your decision. You don't have a positive moral obligation to them.

(It's my view that this lack of obligations is a core part of what makes free software and open source software work at all. If releasing open source software came with firm moral or legal obligations, we would see far less of it.)

However, in a bit of a difference from what I implied in my comment, I also feel that while other people's actions don't create a moral obligation on you, your own actions may. If you go out and actively promote your software, try to get it widely used, put forward that you're responsive and stand ready to fix problems, and so on, then the moral waters are at least muddy. If you explicitly acted to put yourself and your software forward, other people sort of do have the (moral) right to assume that you're going to live up to your promises (whether they're explicit or implicit). However, there has to be a line somewhere; you shouldn't acquire an unlimited, open-ended obligation to do work for other people using your software just because you promoted your software a bit.

(The issue of community norms is another thing entirely. I'm sure there are some software communities where merely releasing something into the community comes with the social expectation that you'll support it.)

Staged rollouts of things still have limitations

By: cks
6 August 2024 at 02:45

One of the commonly suggested remedies for deploying things that can go wrong is to do staged rollouts, where you deploy to only a subset of the things at a time and look for problems before proceeding. Staged rollouts are in general a good idea, but it's important to understand that there are limits on how much they can improve the situation, especially if the staged rollouts are going out to outside people ('customers') instead of internally, within your organization in environments that you control.

The first limitation is that staged rollouts only help to the extent that you can actually detect problems before continuing with the rollout. Often what problems you can detect (and how soon) are limited by the telemetry you have available and the degree to which you can inspect and monitor the systems that you're rolling out to. If you're rolling out internally, this can possibly be quite high, but if you're rolling out to customers, you may have limited telemetry (partly because customers will object to your software constantly reporting things back to you, especially if you want to report lots of details) and no ability to reach out and inspect systems. A related issue is that when you build rollout telemetry and monitoring, you're probably basing the telemetry on what problems you expect. If your rollout triggers a problem that you didn't foresee, you may have no telemetry that would tell you about it.

(For a topical example, consider the telemetry you'd need to detect that your application has made your customer's machines crash and be unable to boot. Since the machines aren't booting, you can't send any telemetry from them to actually report the problem; instead you'd need some telemetry signal that your application was running fine and then monitor this signal for a rapid decrease in your staging group. Would you think to both build and monitor this telemetry signal in advance?)
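
A sketch of the sort of check you'd need, with all numbers and names invented for the example: compare the staging cohort's current rate of 'I'm alive' reports against its pre-rollout baseline and treat a sharp drop as a signal, since the affected machines can't report the problem themselves.

    package main

    import "fmt"

    // cohortLooksDead reports whether a staging cohort's check-in rate
    // has fallen by more than maxDrop (a fraction, e.g. 0.25) compared
    // to its baseline rate from before the rollout.
    func cohortLooksDead(baseline, current, maxDrop float64) bool {
        if baseline <= 0 {
            return false // no baseline, no signal
        }
        return (baseline-current)/baseline > maxDrop
    }

    func main() {
        // Hypothetical: 1000 machines checked in per interval before the
        // rollout, only 400 are checking in now.
        fmt.Println(cohortLooksDead(1000, 400, 0.25)) // true: pause the rollout and investigate
    }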

The second limitation is that if your staged rollout detects problems, you've (still) inflicted problems on some people, just not as many of them as without a staged rollout. Again, this is more of a problem with external staged rollouts than with internal ones. When your staged rollout is internal, you're inflicting problems on yourself; when your staged rollout is external, you're inflicting problems on other people and they're going to be unhappy with you. Staged external rollouts don't eliminate problems, they merely reduce them.

(For instance, Ubuntu has a system of 'phased updates' for non-security updates of some packages, such as OpenSSH, but if an update is bad and detected in this phased update process, and you happen to be one of the people who got the update early, you get to sort out whatever mess it's made of your system.)

In addition, staged rollouts are in conflict with rapid updates. The slower and more carefully you do a staged rollout, the longer (on average) it takes for your update to reach people and become functional. This isn't vital for some updates, but we know update speed matters for some things. As an extreme example, if you're pushing out an update to deal with a security problem that's being actively exploited, most people are going to want it right now and the slower your staged rollout runs, the more people will wind up being exploited.

This doesn't excuse doing a non-staged rollout that blows up. Or even a staged rollout that only blows up some people. It's your job to only roll out good changes, and as part of that to test your changes (and your systems) before throwing them into the field. Staged rollouts are an emergency backup in case an error slipped through your other precautions, especially external staged rollouts, where you can't easily fix any problems that you caused.

(The corollary is that if staged rollouts are regularly saving you, you have additional problems and should probably fix them first.)

PS: There are probably situations where it's sensible to make internal staged rollouts your main defense against bad updates. But otherwise it's my view that staged rollouts should be your emergency backup to all of the other testing and validation you're doing.

Part of (computer) security is convincing people that it works

By: cks
20 July 2024 at 02:17

One of the ways that security is people, not math is that as part of security being ultimately about people, part of the work of computer security is convincing people that your security measures actually work. I don't mean this in a narrow technical sense of specific technical features working as designed; I mean this in the broader sense that they achieve the general goals of security, which is really about safety. People want to know that their data and what they do on the computer is safe, in the full sense of the confidentiality, integrity, and availability triad.

Often, convincing people that your security works requires making it legible to them, in the "Seeing Like a State" sense. One way to describe this situation is that partly due to the sorry history of computer security and people not doing effective computer security, many people and organizations have adopted a view that they assume computer security measures don't work or aren't effective until proven otherwise. If you can't convince them that your security measure works, in the process making it legible to them, they assume it doesn't. Historically they were often right.

One complication is that the people you're trying to make your security measures convincing and legible to are almost always people who don't have specialist knowledge in computer security. Often they have little to no knowledge in the field at all (just like you don't have expert-level knowledge in their fields). This means that you generally can't convince them by explaining the technical details, because they don't have the knowledge and experience they'd require to evaluate those details. Handling this has no straightforward solution, but it will often require some degree of building their trust in your skill and honesty, coupled with some degree of using things that other independent and trusted (by the people you're trying to convince) parties have already called secure. This is part of what it means to have legibility in your security measures; you're making something that other people can understand and assess, even if it's not what you'd make for yourself.

Some system administrators and other computer people can wind up feeling upset about this, because from their perspective the organization is preferring inferior outside solutions (that have the social approval of the crowd) to the superior home grown work. However, all of us inclined to see things from this angle really should turn around and look at it from the organization's perspective. For the organization, it's not a choice between inferior but generally approved security and home grown 'real security', it's a choice between known (although maybe flawed) security and an unknown state where they may be more secure, as secure, less secure, or completely exposed. It's perfectly sensible for the organization to choose a known state over a risky unknown one.

(It's taken me a long time to come around to this perspective over the course of my career, because of course in the beginning I was solidly in the 'this is obviously better security, because of ...' camp. Even today I'm in the camp of 'real security'; it's just that I've come to appreciate that part of my job is convincing the organization that what we're offering is not a 'risky unknown' state.)

My self-inflicted UPS and computer conundrum

By: cks
17 July 2024 at 04:44

Today the area where I live experienced what turned out to be an extended power outage, and I wound up going in to the office. In the process, I shot myself in the foot as far as being able to tell from the office whether power had returned at home, oddly enough because I have a UPS at home.

The easy way to use a UPS is just to let it run down until it powers off. But this is kind of abrupt on the computer, even if you've more or less halted it already, and it also means that the computer's load reduces the run time for everything else. In my case this matters a bit because after a power loss, my phone line is typically slow to get DSL signal and sync up, so that I can start doing PPPoE and bring up my Internet connection. So if it looks like the UPS's power is running low, my reflex is to power off the computer and hope that power will come back before the UPS bottoms out and the DSL modem turns off (and thus loses line sync).

The first problem is that this only really works if I'm going to stick around to turn the computer on should the power outage end early enough (before the UPS loses power). That turned out not to be a problem this time; the power outage lasted more than long enough to run the UPS out of power, even with only the minor load of the DSL modem, a five-port switch, and a few other small things. The bigger problem is that because of how I have my computer set up right now for hardware reasons, if I want the computer to be drawing no power (as opposed to being 'off' in some sense), I have to turn the computer off using the hard power switch on the PSU. Once I've flipped this switch, the computer is off until I flip it back, and if I flip it back with (UPS) power available, the computer will power back up again and start drawing power and all that.

(My BIOS is set to 'always power up when AC power is restored', and apparently one side effect of this is that the chassis fans and so on keep spinning even when the system is 'powered off' from Linux.)

The magic UPS feature I would like in order to fix this is a one-shot push button switch for every outlet that temporarily switches the outlet to 'wait until AC power returns before giving this outlet any power'. With this, I could run 'poweroff' on my computer, then push the button to cut power and have it come back when Toronto Hydro restored service. I believe it might be possible to do this with commands to the UPS, but that mostly doesn't help me, since the host that would issue those commands is the one I'm running 'poweroff' on.

(The better solution would be a BIOS and hardware that turns everything off after 'poweroff' even when set to always power up after AC comes back. Possibly this is fixed in a later BIOS revision than I have.)

People at universities travel widely and unpredictably

By: cks
16 July 2024 at 02:11

Every so often, people make the entirely reasonable suggestion that if one day you see a particular person log in locally and then a few days later they're logging in from halfway around the world, perhaps you should investigate. This may work for many organizations, but unfortunately it is one of the ways in which universities are peculiar places. At universities, a lot of people travel, they do it a fair bit (and unpredictably), and they go to all sorts of strange places, where they will connect back to the university to continue doing work (for professors and graduate students, at least).

There are all sorts of causes for this travel. Professors, postdocs, and graduate students go to conferences in various locations. Professors go on sabbatical, or go visit another university for a month or two, or even go hang out at a company for a while (perhaps as a visiting researcher). Graduate students also go home to visit their family, which can put them pretty much anywhere in the world, and they can also visit places for other reasons.

(Graduate students are often strongly encouraged to keep working all the time, including on holiday visits to their family. Even professors can feel similar pressures in the modern academic environment.)

Professors, postdocs, and graduate students will not tell you all of this information ahead of time, and even if you forced them to share their travel plans, it would not necessarily be useful because they may well have no idea how they will be connecting to the Internet at their destination (and what IP address ranges that would involve). Plus, geolocation of Internet IP addresses is not particularly exact or accurate, especially if you need to do it for free.

One corollary of this is that at a university, you often can't safely do broad 'geographic' blocks of logins (or VPN connections, or whatever) from IP address ranges, because there's no guarantee that one of your people isn't going to pop up there. The more populous the geographic area, the more likely that some of your people are going to be there sooner or later.

(An additional complication is people who move elsewhere (or are already elsewhere) but maintain a relationship with your part of the university, and as part of that may visit in person every so often. These people travel too, and are even less likely to tell you their travel plans, since now you're a third party to them.)

Network switches aren't simple devices (not even basic switches)

By: cks
13 July 2024 at 03:13

Recently over on the Fediverse I said something about switches:

"Network switches are simple devices" oh I am so sorry. Hubs were simple devices. Switches are alarmingly smart devices even if they don't handle VLANs or support STP (and almost everyone wants them to support Spanning Tree Protocol, to stop loops). Your switch has onboard packet buffering, understands Ethernet addresses, often generates its own traffic and responds to network traffic (see STP), and is actually a (layer 2) high speed router with a fallback to being a hub.

(And I didn't even remember about multicast, plus I omitted various things. The trigger for my post was seeing a quote from Making a Linux-managed network switch, which is speaking (I believe) somewhat tongue in cheek and anyway is a fun and interesting article.)

Back in the old days, a network hub could simply repeat incoming packets out each port, with some hand waving about having to be aware of packet boundaries (see the Wikipedia page for more details). This is not the case with switches. Even a very basic switch must extract source and destination Ethernet addresses out of packets, maintain a mapping table between ports and Ethernet addresses, and route incoming packets to the appropriate port (or send them to all ports if they're to an unknown Ethernet address). This generally needs to happen at line speed while handling simultaneous packets on multiple ports at once.
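To make the learning and forwarding decision concrete, here's a rough sketch in Go of what a basic switch does per frame. This is purely illustrative (made-up types and port numbers, no aging of stale entries, no broadcast or multicast handling); real switches do all of this in hardware at line rate.

    package main

    import "fmt"

    // MAC is an Ethernet address.
    type MAC [6]byte

    // Switch holds the learned mapping from Ethernet addresses to ports.
    type Switch struct {
        macTable map[MAC]int // learned MAC address -> port number
        numPorts int
    }

    func NewSwitch(ports int) *Switch {
        return &Switch{macTable: make(map[MAC]int), numPorts: ports}
    }

    // Forward decides where a frame arriving on inPort should go and
    // returns the list of output ports.
    func (s *Switch) Forward(inPort int, src, dst MAC) []int {
        // Learn (or refresh) which port the source address lives on.
        s.macTable[src] = inPort

        // Known destination: send only to its port (or nowhere if that
        // is the port the frame came in on).
        if outPort, ok := s.macTable[dst]; ok {
            if outPort == inPort {
                return nil
            }
            return []int{outPort}
        }

        // Unknown destination: flood to every port except the ingress one.
        var out []int
        for p := 0; p < s.numPorts; p++ {
            if p != inPort {
                out = append(out, p)
            }
        }
        return out
    }

    func main() {
        sw := NewSwitch(4)
        a := MAC{0x02, 0, 0, 0, 0, 0xaa}
        b := MAC{0x02, 0, 0, 0, 0, 0xbb}
        fmt.Println(sw.Forward(0, a, b)) // b unknown: flood to [1 2 3]
        fmt.Println(sw.Forward(2, b, a)) // a was learned on port 0: [0]
    }

Even this toy version has to track state per address and make a per-frame decision, which is already well beyond what a hub ever did.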

Switches must have some degree of internal packet buffering, although how much buffering switches have can vary (and can matter). Switches need buffering to deal with both a high speed port sending to a low speed one and several ports all sending traffic to the same destination port at the same time. Buffering implies that packet reception and packet transmission can be decoupled from each other, although ideally there is no buffering delay if the receive to transmit path for a packet is clear (people like low latency in switches).

A basic switch will generally be expected to both send and receive special packets itself, not just pass through network traffic. Lots of people want switches to implement STP (Spanning Tree Protocol) to avoid network loops (which requires the switch to send, receive, and process packets itself), and probably Ethernet flow control as well. If the switch is going to send out its own packets in addition to incoming traffic, it needs the intelligence to schedule this packet transmission somehow and deal with how it interacts with regular traffic.

If the switch supports VLANs, several things get more complicated (although VLAN support generally requires a 'managed switch', since you have to be able to configure the VLAN setup). In common configurations the switch will need to modify packets passing through to add or remove VLAN tags (as packets move between tagged and untagged ports). People will also want the switch to filter incoming packets, for example to drop a VLAN-tagged packet if the VLAN in question is not configured on that port. And they will expect all of this to still run at line speed with low latency. In addition, the switch will generally want to segment its Ethernet mapping table by VLAN, because bad things can happen if it doesn't.
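To sketch just the table-segmentation part (continuing the Go sketch above and reusing its MAC type; tag insertion and removal on the wire is left out entirely), the learned-address table becomes keyed by (VLAN, MAC) instead of plain MAC, and flooding is restricted to ports that carry the frame's VLAN:

    // Hypothetical sketch of per-VLAN MAC learning and forwarding.
    type vlanMAC struct {
        vlan uint16
        mac  MAC
    }

    type VLANSwitch struct {
        macTable map[vlanMAC]int  // (VLAN, MAC) -> port
        members  map[uint16][]int // VLAN -> ports carrying that VLAN
    }

    func NewVLANSwitch() *VLANSwitch {
        return &VLANSwitch{
            macTable: make(map[vlanMAC]int),
            members:  make(map[uint16][]int),
        }
    }

    func (s *VLANSwitch) Forward(inPort int, vlan uint16, src, dst MAC) []int {
        // Learning is per-VLAN; the same MAC on two VLANs gets two entries.
        s.macTable[vlanMAC{vlan, src}] = inPort

        if outPort, ok := s.macTable[vlanMAC{vlan, dst}]; ok {
            if outPort == inPort {
                return nil
            }
            return []int{outPort}
        }
        // Unknown destination: flood, but only to this VLAN's member ports.
        var out []int
        for _, p := range s.members[vlan] {
            if p != inPort {
                out = append(out, p)
            }
        }
        return out
    }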

(Port isolation, also known as "private VLANs", adds more complexity but now you're well up in managed switch territory.)

PS: Modern small network switches are 'simple' in the sense that all of this is typically implemented in a single chip or close to it; the Making a Linux-managed network switch article discusses a couple of them. But what is happening inside that IC is a marvel.

Using WireGuard as a router to get around reachability issues

By: cks
9 July 2024 at 03:33

Suppose that you have a machine, or a set of machines, that can't be readily reached from the outside world with random traffic (for example, your home LAN setup), and you also have a roaming machine that you want to use to reach those machines (for example, your phone). If you only had one of these problems, you could set up a straightforward WireGuard tunnel, where your roaming phone talked to the WireGuard machines on your home LAN. But on the surface, having both of them sounds like you need some sort of complex inbound NAT gateway on a fixed and reachable address in the cloud (your phone talks to the gateway with WireGuard, the gateway NATs the traffic and passes it over WireGuard to the home LAN, and so on). However, you don't need this; instead, with some tricks you can use WireGuard on the fixed cloud machine as a router instead of a NAT gateway.

(As someone who deals with non-WireGuard networking regularly, my reflex is that if two machines can't talk to each other with plain IP, we're going to need some kind of NAT or port forwarding somewhere. This leads to a situation where if two potential WireGuard peers can't talk to each other, my thoughts immediately jump to 'clearly we're going to need a NAT'.)

The basic idea is that you set up the fixed public machine as a router, although only for WireGuard connections, and then you arrange to route appropriate IP addresses and IP address ranges over the various WireGuard connections. The simplest approach is to give each WireGuard client an 'inside' IP address on the WireGuard interface on some subnet, and then have each client route the entire (rest of the) subnet to the WireGuard router machine. The router machine's routing table then sends the appropriate IP address (or address range) down the appropriate WireGuard connection. More complex setups are possible if you have existing IP address ranges that need to be reached over these WireGuard-based links, but the more distinct IPs or IP ranges you want to reach over WireGuard, the more routing entries each WireGuard client needs (the router's routing table also gets more complicated, but it was already a central point of complexity).
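To make that concrete, here's a hedged sketch of what the fixed router's configuration might look like in a wg-quick style setup. Everything here is made up for illustration: the keys, the 10.10.10.0/24 'inside' subnet, and the 192.168.1.0/24 home LAN range. You also have to turn on IP forwarding on the router (net.ipv4.ip_forward on Linux) or nothing gets routed anywhere.

    # Hypothetical /etc/wireguard/wg0.conf on the fixed, reachable router
    [Interface]
    Address = 10.10.10.1/24
    ListenPort = 51820
    PrivateKey = <router private key>

    [Peer]
    # the roaming phone: only its own inside IP is routed to it
    PublicKey = <phone public key>
    AllowedIPs = 10.10.10.2/32

    [Peer]
    # the home LAN WireGuard gateway: its inside IP plus the home LAN range
    PublicKey = <home gateway public key>
    AllowedIPs = 10.10.10.3/32, 192.168.1.0/24

On the phone's side, its [Peer] stanza for the router would list 'AllowedIPs = 10.10.10.0/24, 192.168.1.0/24', so traffic for the rest of the inside subnet and for the home LAN goes over the tunnel. The nice property of WireGuard here is that AllowedIPs does double duty: on the router it acts as the per-peer routing decision, and on the clients (with wg-quick) it also creates the ordinary routing table entries.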

(This isn't a new pattern; it used to appear in, for example, PPP servers. But those have been generally out of fashion for a while and not something people deal with. VPN servers also behave this way but often their VPN software handles this all for you without explicit routing table entries or you having to think about it. They may also automatically NAT traffic for you.)

Routing an existing home LAN IP address range or the like to the WireGuard machines is potentially a bit more complex. Unless you can use your existing home gateway as a WireGuard peer, you'll need to either NAT the WireGuard 'inside' IP addresses when they talk to your home LAN or establish a special route on your home LAN that sends traffic for those IPs to your WireGuard gateway. If you can set up WireGuard on your home gateway (by which I mean whatever machine is the default route for things on your LAN), life is simpler because the return traffic is already flowing through the machine; you just need to send it off to the WireGuard router instead of to the Internet. Another option is to assign unused home LAN IP addresses to your remote WireGuard machines, and then have your home LAN WireGuard gateway do 'proxy ARP' or IPv6 NDP for those IPs.

(In theory this is one of the situations where IPv6 may make your life easier, because if necessary you can create your own Unique local address space, carve it up between your home LAN and other areas, and route it around.)

Unix's fsync(), write ahead logs, and durability versus integrity

By: cks
3 July 2024 at 02:41

I recently read Phil Eaton's A write-ahead log is not a universal part of durability (via), which is about what it says it's about. In the process it discusses using Unix's fsync() to achieve durability, which woke up a little twitch I have about this general area, which is the difference between durability and integrity (which I'm sure Phil Eaton is fully aware of; their article was only about the durability side).

The core integrity issue of simple uses of fsync() is that while fsync() forces the filesystem to make things durable on disk, the filesystem doesn't promise to not write anything to disk until you do that fsync(). Once you write() something to the filesystem, it may write it to disk without warning at any time, and even during an fsync() the filesystem makes no promises about what order data will be written in. If you start an fsync() and the system crashes part way through, some of your data will be on disk and some won't be and you have no control over which part is which.

This means that if you overwrite data in place and use fsync(), the only time you are guaranteed that your data has both durability and integrity is in the time after fsync() completes and before you write any more data. Once you start (over)writing data again, that data could be partially written to disk even before you call fsync(), and your integrity could be gone. To retain integrity, you can't overwrite more than a tiny bit of data in place. Instead, you need to write data to a new place, fsync() it, and then overwrite one tiny piece of existing data to activate your new data (and fsync() that write too).
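One common concrete instance of this pattern (not the only one) is the 'write a new file, fsync() it, then rename it over the old one' dance, where the 'tiny piece of existing data' you overwrite is effectively the directory entry. Here's a minimal sketch in Go, with made-up names:

    package atomicwrite

    import (
        "os"
        "path/filepath"
    )

    // writeAtomically writes data to a temporary file in the same
    // directory, fsync()s it, and only then renames it over 'path'. The
    // directory is fsync()'d afterwards so that the rename (the
    // activation step) is itself durable. Until the rename happens, the
    // old contents stay intact on disk.
    func writeAtomically(path string, data []byte) error {
        dir := filepath.Dir(path)
        tmp, err := os.CreateTemp(dir, ".new-*")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

        if _, err := tmp.Write(data); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Sync(); err != nil { // the new data must be durable first
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        if err := os.Rename(tmp.Name(), path); err != nil {
            return err
        }
        dirf, err := os.Open(dir)
        if err != nil {
            return err
        }
        defer dirf.Close()
        return dirf.Sync() // make the rename itself durable
    }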

(Filesystems can use similar two-stage approaches to make and then activate changes, such as ZFS's slight variation on this. ZFS does not quite overwrite anything in place, but it does require multiple disk flushes, possibly more than two.)

The simplest version of this condenses things down to one fsync() (or its equivalent) at the cost of having an append-only data structure, which we usually call a log. Logs need their own internal integrity protection, so that they can tell whether or not a segment of the log had all of its data flushed to disk and so is fully valid. Once your single fsync() of a log append finishes, all of the data is on disk and that segment is valid; before the fsync finishes, it's not necessarily so. Only some of the data might have been written, and it might have been written out of order (so that the last block made it to disk but an earlier block did not).
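As an illustration of what that internal integrity protection can look like, here's a hedged sketch in Go of appending one record framed with a length and a CRC32, followed by a single fsync(). This isn't any particular system's real log format, just the general shape of the idea:

    package walsketch

    import (
        "encoding/binary"
        "hash/crc32"
        "os"
    )

    // appendRecord appends one checksummed record to an append-only log:
    // a 4-byte length, a 4-byte CRC32 of the payload, then the payload,
    // followed by a single fsync(). On recovery, a record that is
    // truncated or whose CRC doesn't match is treated as an incomplete
    // tail that was never committed, along with everything after it.
    func appendRecord(log *os.File, payload []byte) error {
        hdr := make([]byte, 8)
        binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
        binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))

        if _, err := log.Write(hdr); err != nil {
            return err
        }
        if _, err := log.Write(payload); err != nil {
            return err
        }
        // Until Sync() returns, any part of the record (in any order) may
        // or may not be on disk; after it returns, all of it is.
        return log.Sync()
    }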

A write ahead log normally increases the amount of data written to disk; you write data once to the WAL and once to the main database. However, a WAL may well reduce the number of fsync()s (and thus disk flushes) that you have to do in order to have both durability and integrity. On modern solid state storage, synchronous disk flushes can be the slowest operation while (asynchronous) write bandwidth is relatively plentiful, so trading more data written for fewer disk flushes can be a net performance win in practice for plenty of workloads.

(Again, I'm sure Phil Eaton knows all of this; their article was specifically about the durability side of things. I'm using it as a springboard for additional thoughts. I'm not sure I'd realized how a WAL can reduce the number of fsync()s required before now.)
