
US sanctions and your VPN (and certain big US-based cloud providers)

By: cks
28 March 2025 at 02:43

As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which one the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).

Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.

(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)

Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us with an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard the resulting failures were to diagnose. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.

The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.

If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).

PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).

(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)

Three ways I know of to authenticate SSH connections with OIDC tokens

By: cks
27 March 2025 at 02:56

Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.

(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)

The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).

Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting up a SSH certificate authority and having the CA software issue signed, short-lived SSH client certificates in exchange for OIDC authentication from your OP. With client side software, these client certificates will be automatically set up for use by ssh, and on servers all you need is to trust your SSH CA. I believe you could even set this up for personal use on servers you SSH to, since you set up a personally trusted SSH CA. On the positive side, this requires minimal server changes and no extra server software, and preserves your ability to directly authenticate with passwords (and perhaps some MFA challenge). On the negative side, you now have a SSH CA you have to trust.

(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)
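
To give a concrete sense of how little the servers need, trusting a SSH CA is basically one sshd_config directive. This is only a sketch, and the file path here is an arbitrary example:

# Trust user certificates signed by our SSH CA (the path is just an example)
TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub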

The third option, which I've only recently become aware of, is Cloudflare's recently open-sourced 'opkssh' (via, Github). OPKSSH builds on something called OpenPubkey, which uses a clever trick to embed a public key you provide in (signed) OIDC tokens from your OP (for details see here). OPKSSH uses this to put a basically regular SSH public key into such an augmented OIDC token, then smuggles it from the client to the server by embedding the entire token in a SSH (client) certificate; on the server, it uses an AuthorizedKeysCommand to verify the token, extract the public key, and tell the SSH server to use the public key for verification (see How it works for more details). If you want, as far as I can see OPKSSH still supports using regular SSH public keys and also passwords (possibly plus an MFA challenge).
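
The server side hookup for this style of scheme is the stock OpenSSH AuthorizedKeysCommand mechanism. As a sketch only (the verifier program, its arguments, and the user name here are placeholders, not OPKSSH's actual command line; see its documentation for that):

# %u is the login name, %t the key type, %k the base64-encoded key;
# run the verifier as a dedicated unprivileged user
AuthorizedKeysCommand /usr/local/bin/opkssh-verify %u %t %k
AuthorizedKeysCommandUser opkssh-verifier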

(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)

It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.

(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)

Some notes on configuring Dovecot to authenticate via OIDC/OAuth2

By: cks
15 March 2025 at 03:01

Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that the token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.

If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
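
As a concrete sketch, with 'idp.example.org' standing in for your actual IdP, this is just:

curl -s https://idp.example.org/.well-known/openid-configuration | jq .
# or to pull out only the piece Dovecot needs:
curl -s https://idp.example.org/.well-known/openid-configuration | \
    jq -r .userinfo_endpoint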

You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.

(Since I've just peered deep into the Dovecot source code, it appears that setting 'introspection_mode = post' actually performs an (unauthenticated) token introspection request. The 'get' mode seems to be the same as setting tokeninfo_url. I think that if you set the 'post' mode, you also want to set active_attribute and perhaps active_value, but I don't know what to set them to, because otherwise you aren't necessarily fully validating that the token is still active. Does my head hurt? Yes. The moral here is that you should use an OIDC IdP that supports OpenID Connect UserInfo.)

If your IdP serves different groups and provides different 'issuer' ('iss') values to them, you may want to set the Dovecot 'issuers =' to the specific issuer that applies to you. You'll also want to set 'username_attribute' to whatever OIDC claim is where your IdP puts what you consider the Dovecot username, which might be the email address or something else.

It would be nice if Dovecot could discover all of this for itself when you set openid_configuration_url, but in the current Dovecot, all this does is put that URL in the JSON of the error response that's sent to IMAP clients when they fail OAUTHBEARER authentication. IMAP clients may or may not do anything useful with it.

As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes, it doesn't verify the presence of specific OIDC claims.

So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:

# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>

# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}

# optionally:
openid_configuration_url = <stick in the URL>

# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt

The OIDC scopes that IMAP clients should request when getting tokens should include a scope that provides the username_attribute claim (which is the 'email' scope if the claim is 'email'), and apparently the requested scopes should also include the offline_access scope.

If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.

(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)

How I got my nose rubbed in my screens having 'bad' areas for me

By: cks
10 March 2025 at 02:50

I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.

In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.

(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)

When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.

(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)

In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.

(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)

Sidebar: Why xload turns green under high load

Xload draws a horizontal tick line for every integer step of load average that it needs in order to display the maximum load average in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.

This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.

A Prometheus gotcha with alerts based on counting things

By: cks
6 March 2025 at 04:39

Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:

count(
  node_filesystem_size_bytes{
         host = "backupserv",
         mountpoint =~ "/dumps/tapes/slot.*" }
) != <some number>

This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.

Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:

count( ... ) by (host)
  or vector(0, "host", "bserv1")
  or vector(0, "host", "bserv2")
  ....

(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
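
For completeness, here is a sketch of the workable single-host version, with the comparison number a placeholder as before; the 'or vector(0)' supplies a 0 when the count() result would otherwise be empty:

(
  count(
    node_filesystem_size_bytes{
           host = "backupserv",
           mountpoint =~ "/dumps/tapes/slot.*" }
  ) or vector(0)
) != <some number>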

In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:

count(
  node_filesystem_size_bytes{
         host =~ "bserv1|bserv2|...",
         fstype =~ "ext.*" }
) by (host) != <other number>

In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.

(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)

PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.

What SimpleSAMLphp's core:AttributeAlter does with creating new attributes

By: cks
5 March 2025 at 03:41

SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:

 // the 65 is where this is ordered
 65 => [
    'class' => 'core:AttributeAlter',
    'subject' => 'gecos',
    'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
    'target' => 'mfaid',
    'replacement' => '\\1',
 ],

If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.

(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)

The short version of what happens is that when the target is a different attribute and the pattern doesn't match, the target will wind up set but empty. Any previous value is lost. How this happens (and what happens) starts with the fact that 'attributes' here are actually arrays of values under the covers (this is '$attributes'). When core:AttributeAlter has a different target attribute than the source attribute, it takes all of the source attribute's values, passes each of them through a regular expression search and replace (using your replacement), and then gathers up anything that changed and sets the target attribute to this gathered collection. If the pattern doesn't match any values of the attribute (in the normal case, a single value), the array of changed things is empty and your target attribute is set to an empty PHP array.

(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)

My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).

For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.

(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)
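
As a rough, untested sketch of what that core:PHP filter could look like (reusing the 'mfaid' target and the GECOS field position from the earlier example; treat this as illustration, not working configuration):

 65 => [
    'class' => 'core:PHP',
    'code' => '
        if (!empty($attributes["gecos"][0])) {
            $fields = explode(",", $attributes["gecos"][0]);
            // Only set mfaid if the fifth comma-separated field is there.
            if (isset($fields[4]) && $fields[4] !== "") {
                $attributes["mfaid"] = [$fields[4]];
            }
        }
    ',
 ],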

If you get the chance, always run more extra network fiber cabling

By: cks
4 March 2025 at 04:22

Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is an unreasonable amount of fiber to run, but mostly it comes up when you'd have to put in giant fiber patch panels.

The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.

The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.

(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)

Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).

PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.

(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)

MFA's "push notification" authentication method can be easier to integrate

By: cks
26 February 2025 at 03:59

For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.

A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.
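
As an illustration of the shape of this, a synchronous check function might look something like the following sketch, where the MFA client object and its method names are made-up stand-ins for whatever your vendor's SDK or HTTP API actually provides:

import time

def check_push_mfa(mfa_client, username, timeout=60, poll_interval=2):
    # Hypothetical synchronous push-MFA check: start a push for the user,
    # poll the provider for the outcome, and return True or False.
    # 'mfa_client' and its method names are placeholders, not a real API.
    txn = mfa_client.start_push(username)
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = mfa_client.poll_result(txn)
        if status == "approved":
            return True
        if status == "denied":
            return False
        time.sleep(poll_interval)
    # Timing out counts as a failed authentication.
    return False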

(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)

By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.

(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)

JSON has become today's machine-readable output format (on Unix)

By: cks
24 February 2025 at 04:26

Recently, I needed to delete about 1,200 email messages to a particular destination from the mail queue on one of our systems. This turned out to be trivial, because this system was using Postfix and modern versions of Postfix can output mail queue status information in JSON format. So I could dump the mail queue status, select the relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the messages. This experience has left me with the definite view that everything should have the option to output JSON for 'machine-readable' output, rather than some bespoke format. For new programs, I think that you should only bother producing JSON as your machine readable output format.

(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)
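
For the record, the Postfix queue cleanup was roughly the following pipeline, with 'example.com' standing in for the real destination; the jq selection will vary depending on what you need to match:

# Postfix 3.1+ emits one JSON object per queued message with 'postqueue -j'
postqueue -j | \
  jq -r 'select(any(.recipients[]; .address | endswith("@example.com"))) | .queue_id' | \
  postsuper -d -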

This isn't because JSON is the world's best format (JSON is at best the least bad format). Instead it's because JSON has a bunch of pragmatic virtues on a modern Unix system. In general, JSON provides a clear and basically unambiguous way to represent text data and much numeric data, even if it has relatively strange characters in it (ie, JSON has escaping rules that everyone knows and all tools can deal with); it's also generally extensible to add additional data without causing heartburn in tools that are dealing with older versions of a program's output. And on Unix there's an increasingly rich collection of tools to deal with and process JSON, starting with jq itself (and hopefully soon GNU Awk in common configurations). Plus, JSON can generally be transformed to various other formats if you need them.

(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)

There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.

(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)

PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.

(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)

It's good to have offline contact information for your upstream networking

By: cks
21 February 2025 at 03:42

So I said something on the Fediverse:

Current status: it's all fun and games until the building's backbone router disappears.

A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.

(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)

(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)

The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.

(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)

In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.

(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)

Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.

Shared (Unix) hosting and the problem of managing resource limits

By: cks
20 February 2025 at 03:14

Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resource limits to your customers.

(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)

The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use much resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.

(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)

The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).

Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').

It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.

One problem with "shared Unix hosting" was the lack of resource limits

By: cks
19 February 2025 at 04:04

I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.

The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or etc) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.

Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.
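
As an illustration of what those whole-group limits look like today on Linux, you can now do something like this with systemd's cgroup integration (the user name, program path, and the specific numbers here are arbitrary placeholders):

# Run a customer's program as a transient unit with hard caps on
# memory, CPU, and process count (cgroup v2 resource controls).
systemd-run --uid=customer1 -p MemoryMax=512M -p CPUQuota=50% \
    -p TasksMax=100 /path/to/their/program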

(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)

My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.

(How much resources to give each customer is also a problem, but that's another entry.)

How you should respond to authentication failures isn't universal

By: cks
13 February 2025 at 02:55

A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.

The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.

(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)

If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.

(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)

There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.

(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)

Everything should be able to ratelimit sources of authentication failures

By: cks
11 February 2025 at 03:54

One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.

(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)
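
For illustration, an Exim ACL ratelimit condition looks roughly like the following, with the last slash-separated piece being the key; the numbers and key here are arbitrary, and you should check the Exim documentation for the details before relying on this:

deny  ratelimit = 10 / 1h / strict / authfail-$sender_host_address
      message   = too many failures from your IP, please slow down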

I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.

The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs have very little limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.

Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.

(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)

PS: Having rate limits on failed authentications is also reassuring, at least for me.

The practical (Unix) problems with .cache and its friends

By: cks
5 February 2025 at 03:53

Over on the Fediverse, I said:

Dear everyone writing Unix programs that cache things in dot-directories (.cache, .local, etc): please don't. Create a non-dot directory for it. Because all of your giant cache (sub)directories are functionally invisible to many people using your programs, who wind up not understanding where their disk space has gone because almost nothing tells them about .cache, .local, and so on.

A corollary: if you're making a disk space usage tool, it should explicitly show ~/.cache, ~/.local, etc.

If you haven't noticed, there are an ever increasing number of programs that will cache a bunch of data, sometimes a very large amount of it, in various dot-directories in people's home directories. If you're lucky, these programs put their cache somewhere under ~/.cache; if you're semi-lucky, they use ~/.local, and if you're not lucky they invent their own directory, like ~/.cargo (used by Rust's standard build tool because it wants to be special). It's my view that this is a mistake and that everyone should put their big caches in a clearly visible directory or directory hierarchy, one that people can actually find in practice.

I will freely admit that we are in a somewhat unusual environment where we have shared fileservers, a now very atypical general multi-user environment, a compute cluster, and a bunch of people who are doing various sorts of modern GPU-based 'AI' research and learning (both AI datasets and AI software packages can get very big). In our environment, with our graduate students, it's routine for people to wind up with tens or even hundreds of GBytes of disk space used up for caches that they don't even realize are there because they don't show up in conventional ways to look for space usage.

As noted by Haelwenn /элвэн/, a plain 'du' will find such dotfiles. The problem is that plain 'du' is more or less useless for most people; to really take advantage of it, you have to know the right trick (not just the -h argument but feeding it to sort to find things). How I think most people use 'du' to find space hogs is they start in their home directory with 'du -s *' (or maybe 'du -hs *') and then they look at whatever big things show up. This will completely miss things in dot-directories in normal usage. And on Linux desktops, I believe that common GUI file browsers will omit dot-directories by default and may not even have a particularly accessible option to change that (this is certainly the behavior of Cinnamon's 'Files' application and I can't imagine that GNOME is different, considering their attitude).
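
For what it's worth, the version of the trick that actually surfaces these caches is roughly:

# the usual starting point, which misses dot-directories entirely:
du -hs ~/* | sort -rh | head -20
# a version that also sweeps in ~/.cache, ~/.local, ~/.cargo, and so on:
du -hs ~/* ~/.[!.]* 2>/dev/null | sort -rh | head -20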

(I'm not sure what our graduate students use to try to explore their disk usage, but I know that multiple graduate students have been unable to find space being eaten up in dot-directories and surprised that their home directory was using so much.)

Modern languages and bad packaging outcomes at scale

By: cks
1 February 2025 at 03:30

Recently I read Steinar H. Gunderson's Migrating away from bcachefs (via), where one of the mentioned issues was a strong disagreement between the author of bcachefs and the Debian Linux distribution about how to package and distribute some Rust-based tools that are necessary to work with bcachefs. In the technology circles that I follow, there's a certain amount of disdain for the Debian approach, so today I want to write up how I see the general problem from a system administrator's point of view.

(Saying that Debian shouldn't package the bcachefs tools if they can't follow the wishes of upstream is equivalent to saying that Debian shouldn't support bcachefs. Among other things, this isn't viable for something that's intended to be a serious mainstream Linux filesystem.)

If you're serious about building software under controlled circumstances (and Linux distributions certainly are, as are an increasing number of organizations in general), you want the software build to be both isolated and repeatable. You want to be able to recreate the same software (ideally exactly binary identical, a 'reproducible build') on a machine that's completely disconnected from the Internet and the outside world, and if you build the software again later you want to get the same result. This means that the build process can't download things from the Internet, and if you run it three months from now you should get the same result even if things out there on the Internet have changed (such as third party dependencies releasing updated versions).

Unfortunately a lot of the standard build tooling for modern languages is not built to do this. Instead it's optimized for building software on Internet connected machines where you want the latest patchlevel or even entire minor version of your third party dependencies, whatever that happens to be today. You can sometimes lock down specific versions of all third party dependencies, but this isn't necessarily the default and so programs may not be set up this way from the start; you have to patch it in as part of your build customizations.

(Some languages are less optimistic about updating dependencies, but developers tend not to like that. For example, Go is controversial for its approach of 'minimum version selection' instead of 'maximum version selection'.)

The minimum thing that any serious packaging environment needs to do is contain all of the dependencies for any top level artifact, and to force the build process to use these (and only these), without reaching out to the Internet to fetch other things (well, you're going to block all external access from the build environment). How you do this depends on the build system, but it's usually possible; in Go you might 'vendor' all dependencies to give yourself a self-contained source tree artifact. This artifact never changes the dependency versions used in a build even if they change upstream because you've frozen them as part of the artifact creation process.
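
In Go, for instance, this capture-and-freeze step is roughly the following for a module-based project:

# copy the exact versions of all dependencies into ./vendor
go mod vendor
# build using only the vendored copies, with no network access
go build -mod=vendor ./...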

(Even if you're not a distribution but an organization building your own software using third-party dependencies, you do very much want to capture local copies of them. Upstream things go away or get damaged every so often, and it can be rather bad to not be able to build a new release of some important internal tool because an upstream decided to retire to goat farming rather than deal with the EU CRA. For that matter, you might want to have local copies of important but uncommon third party open source tools you use, assuming you can reasonably rebuild them.)

If you're doing this on a small scale for individual programs you care a lot about, you can stop there. If you're doing this on a distribution's scale you have an additional decision to make: do you allow each top level thing to have its own version of dependencies, or do you try to freeze a common version? If you allow each top level thing to have its own version, you get two problems. First, you're using up more disk space for at least your source artifacts. Second and worse, now you're on the hook for maintaining, checking, and patching multiple versions of a given dependency if it turns out to have a security issue (or a serious bug).

Suppose that you have program A using version 1.2.3 of a dependency, program B using 1.2.7, the current version is 1.2.12, and the upstream releases 1.2.13 to fix a security issue. You may have to investigate both 1.2.3 and 1.2.7 to see if they have the bug and then either patch both with backported fixes or force both program A and program B to be built with 1.2.13, even if the version of these programs that you're using weren't tested and validated with this version (and people routinely break things in patchlevel releases).

If you have a lot of such programs it's certainly tempting to put your foot down and say 'every program that uses dependency X will be set to use a single version of it so we only have to worry about that version'. Even if you don't start out this way you may wind up with it after a few security releases from the dependency and the packagers of programs A and B deciding that they will just force the use of 1.2.13 (or 1.2.15 or whatever) so that they can skip the repeated checking and backporting (especially if both programs are packaged by the same person, who has only so much time to deal with all of this). If you do this inside an organization, probably no one in the outside world knows. If you do this as a distribution, people yell at you.

(Within an organization you may also have more flexibility to update program A and program B themselves to versions that might officially support version 1.2.15 of that dependency, even if the program version updates are a little risky and change some behavior. In a distribution that advertises stability and has no way of contacting people using it to warn them or coordinate changes, things aren't so flexible.)

The tradeoffs of having an internal unauthenticated SMTP server

By: cks
31 January 2025 at 04:08

One of the reactions I saw to my story of being hit by an alarmingly well prepared phish spammer was surprise that we had an unauthenticated SMTP server, even if it was only available to our internal networks. Part of the reason we have such a server is historical, but I also feel that the tradeoffs involved are not as clear cut as you might think.

One fundamental problem is that people (actual humans) aren't the only thing that needs to be able to send email. Unless you enjoy building your own notification system for system problems from scratch, a whole lot of things will try to send you email to tell you about problems. Cron jobs will email you output, you may want to get similar email about systemd units, both Linux software RAID and smartd will want to use email to tell you about failures, you may have home-grown management systems, and so on. In addition to these programs on your servers, you may have inconvenient devices like networked multi-function photocopiers that have scan to email functionality (and the people who bought them and need to use them have feelings about being able to do so). In a university environment such as ours, some of the machines involved will be run by research groups, graduate students, and so on, not your core system administrators (and it's a very good idea if these machines can tell their owners about failed disks and the like).

Most of these programs will submit their email through the local mailer facilities (whatever they are), and most local mail systems ('MTAs') can be configured to use authentication when they talk to whatever SMTP gateway you point them at. So in theory you could insist on authenticated SMTP for everything. However, this gives you a different problem, because now you must manage this authentication. Do you give each machine its own authentication identity and password, or have some degree of shared authentication? How do you distribute and update this authentication information? How much manual work are you going to need to do as research groups add and remove machines (and as your servers come and go)? Are you going to try to build a system that restricts where a given authentication identity can be used from, so that someone can't make off with the photocopier's SMTP authorization and reuse it from their desktop?

(If you instead authorize IP addresses without requiring SMTP authentication, you've simply removed the requirement for handling and distributing passwords; you're still going to be updating some form of access list. Also, this has issues if people can use your servers.)
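
To make the authenticated alternative concrete, here is a minimal sketch of what each machine's local MTA would need if you went this way (Postfix syntax, with a hypothetical smarthost name and credentials file; other MTAs have their own equivalents):

relayhost = [smtp.example.org]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt

# /etc/postfix/sasl_passwd, rebuilt with 'postmap' after changes:
# [smtp.example.org]:587    machine-name:machine-password

None of those lines are hard to write; the hard part is generating, distributing, rotating, and revoking the contents of that password file across every machine involved, which is exactly the management overhead discussed above.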

You can solve all of these problems if you want to. But there is no current general, easily deployed solution for them, partly because we don't currently have any general system of secure machine and service identity that programs like MTAs can sit on top of. So we system administrators have to build such things ourselves to let one MTA prove to another MTA who and what it is.

(There are various ways to do this other than SMTP authentication and some of them are generally used in some environments; I understand that mutual TLS is common in some places. And I believe that in theory Kerberos could solve this, if everything used it.)

Every custom piece of software or piece of your environment that you build is an overhead; it has to be developed, maintained, updated, documented, and so on. It's not wrong to look at the amount of work it would require in your environment to have only authenticated SMTP and conclude that the practical risks of having unauthenticated SMTP are low enough that you'll just do that.

PS: requiring explicit authentication or authorization for notifications is itself a risk, because it means that a machine that's in a sufficiently bad or surprising state can't necessarily tell you about it. Your emergency notification system should ideally fail open, not fail closed.

PPS: In general, there are ways to make an unauthenticated SMTP server less risky, depending on what you need it to do. For example, in many environments there's no need to directly send such system notification email to arbitrary addresses outside the organization, so you could restrict what destinations the server accepts, and maybe what sending addresses can be used with it.
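
As a sketch of the 'restrict destinations' idea, again in Postfix terms and with made-up networks and domains (and sitting on top of your normal relay controls):

mynetworks = 127.0.0.0/8, 192.168.0.0/16
smtpd_recipient_restrictions =
    check_recipient_access hash:/etc/postfix/internal_domains,
    reject

# /etc/postfix/internal_domains lists 'example.org  OK' and so on,
# rebuilt with 'postmap' after changes.

With something like this, a compromised internal machine (or a mail loop) can at most spray email at your own domains rather than the entire Internet.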

Sometimes you need to (or have to) run old binaries of programs

By: cks
24 January 2025 at 03:52

Something that is probably not news to system administrators who've been doing this long enough is that sometimes, you need to or have to run old binaries of programs. I don't mean that you need to run old versions of things (although since the program binaries are old, they will be old versions); I mean that you literally need to run old binaries, ones that were built years ago.

The obvious situation where this can happen is if you have commercial software and the vendor either goes out of business or stops providing updates for the software. In some situations this can result in you needing to keep extremely old systems alive simply to run this old software, and there are lots of stories about 'business critical' software in this situation.

(One possibly apocryphal local story is that the central IT people had to keep a SPARC Solaris machine running for more than a decade past its feasible end of life because it was the only environment that ran a very special printer driver that was used to print payroll checks.)

However, you can get into this situation with open source software too. Increasingly, rebuilding complex open source software projects is not for the faint of heart and requires complex build environments. Not infrequently, these build environments are 'fragile', in the sense that in practice they depend on and require specific versions of tools, supporting language interpreters and compilers, and so on. If you're trying to (re)build them on a modern version of the OS, you may find some issues (also). You can try to get and run the version of the tools they need, but this can rapidly send you down a difficult rabbit hole.

(If you go back far enough, you can run into 32-bit versus 64-bit issues. This isn't just compilation problems, where code isn't 64-bit safe; you can also have code that produces different results when built as a 64-bit binary.)

This can create two problems. First, historically, it complicates moving between CPU architectures. For a couple of decades that's been a non-issue for most Unix environments, because x86 was so dominant, but now ARM systems are starting to become more and more available and even attractive, and they generally don't run old x86 binaries very well. Second, there are some operating systems that don't promise long term binary compatibility to older versions of themselves; they will update system ABIs, removing the old version of the ABI after a while, and require you to rebuild software to use the new ABIs if you want to run it on the current version of the OS. If you have to use old binaries you're stuck with old versions of the OS and generally no security updates.

(If you think that this is absurd and no one would possibly do that, I will point you to OpenBSD, which does it regularly to help maintain and improve the security of the system. OpenBSD is neither wrong nor right to take their approach; they're making a different set of tradeoffs than, say, Linux, because they have different priorities.)

Some ways to restrict who can log in via OpenSSH and how they authenticate

By: cks
19 January 2025 at 04:20

In yesterday's entry on allowing password authentication from the Internet for SSH, I mentioned that there were ways to restrict who this was enabled for or who could log in through SSH. Today I want to cover some of them, using settings in /etc/ssh/sshd_config.

The simplest way is to globally restrict logins with AllowUsers, listing only the specific accounts that you want to be accessible over SSH. If there are too many such accounts or they change too often, you can switch to AllowGroups and allow only people in a specific group that you maintain, call it 'sshlogins'.
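
In sshd_config terms this is a single line either way (with whatever account and group names you actually use):

AllowUsers cks
# or, with a group you maintain:
AllowGroups sshlogins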

If you want to allow logins generally but restrict, say, password based authentication to only people that you expect, what you want is a Match block and setting AuthenticationMethods within it. You would set it up something like this:

AuthenticationMethods publickey
Match User cks
  AuthenticationMethods any

If you want to be able to log in using password from your local networks but not remotely, you could extend this with an additional Match directive that looked at the origin IP address:

Match Address 127.0.0.0/8,<your networks here>
  AuthenticationMethods any

In general, Match directives are your tool for doing relatively complex restrictions. You could, for example, arrange that accounts in a certain Unix group can only log in from the local network, never remotely. Or reverse this so that only accounts in some Unix group can log in remotely, and everyone else is only allowed to use SSH within the local network.

However, any time you're doing complex things with Match blocks, you should make sure to test your configuration to make sure it's working the way you want. OpenSSH's sshd_config is a configuration file with some additional capabilities, not a programming language, and there are undoubtedly some subtle interactions and traps you can fall into.

(This is one reason I'm not giving a lot of examples here; I'd have to carefully test them.)

Sidebar: Restricting root logins via OpenSSH

If you permit root logins via OpenSSH at all, one fun thing to do is to restrict where you'll accept them from:

PermitRootLogin no
Match Address 127.0.0.0/8,<your networks here>
  PermitRootLogin prohibit-password
  # or 'yes' for some places

A lot of Internet SSH probers direct most of their effort against the root account. With this setting you're assured that all of them will fail no matter what.

(This has come up before but I feel like repeating it.)

Thoughts on having SSH allow password authentication from the Internet

By: cks
18 January 2025 at 03:42

On the Fediverse, I recently saw a poll about whether people left SSH generally accessible on its normal port or if they moved it; one of the replies was that the person left SSH on the normal port but disallowed password based authentication and only allowed public key authentication. This almost led to me posting a hot take, but then I decided that things were a bit more nuanced than my first reaction.

As everyone with an Internet-exposed SSH daemon knows, attackers are constantly attempting password guesses against various accounts. But if you're using a strong password, the odds of an attacker guessing it are extremely low, since doing 'password cracking via SSH' has an extremely low guesses per second number (enforced by your SSH daemon). In this sense, not accepting passwords over the Internet is at most a tiny practical increase in security (with some potential downsides in unusual situations).

Not accepting passwords from the Internet protects you against three other risks, two relatively obvious and one subtle one. First, it stops an attacker that can steal and then crack your encrypted passwords; this risk should be very low if you use strong passwords. Second, you're not exposed if your SSH server turns out to have a general vulnerability in password authentication that can be remotely exploited before a successful authentication. This might not be an authentication bypass; it might be some sort of corruption that leads to memory leaks, code execution, or the like. In practice, (OpenSSH) password authentication is a complex piece of code that interacts with things like your system's random set of PAM modules.

The third risk is that some piece of software will create a generic account with a predictable login name and known default password. These seem to be not uncommon, based on the fact that attackers probe incessantly for them, checking login names like 'ubuntu', 'debian', 'admin', 'testftp', 'mongodb', 'gitlab', and so on. Of course software shouldn't do this, but if something does, not allowing password authenticated SSH from the Internet will block access to these bad accounts. You can mitigate this risk by only accepting password authentication for specific, known accounts, for example only your own account.

The potential downside of only accepting keypair authentication for access to your account is that you might need to log in to your account in a situation where you don't have your keypair available (or can't use it). This is something that I probably care about more than most people, because as a system administrator I want to be able to log in to my desktop even in quite unusual situations. As long as I can use password authentication, I can use anything trustworthy that has a keyboard. Most people probably will only log in to their desktops (or servers) from other machines that they own and control, like laptops, tablets, or phones.

(You can opt to completely disallow password authentication from all other machines, even local ones. This is an even stronger and potentially more limiting restriction, since now you can't even log in from another one of your machines unless that machine has a suitable keypair set up. As a sysadmin, I'd never do that on my work desktop, since I very much want to be able to log in to my regular account from the console of one of our servers if I need to.)

My bug reports are mostly done for work these days

By: cks
15 January 2025 at 03:33

These days, I almost entirely report bugs in open source software as part of my work. A significant part of this is that most of what I stumble over bugs in are things that work uses (such as Ubuntu or OpenBSD), or at least things that I mostly use as part of work. There are some consequences of this that I feel like noting today.

The first is that I do bug investigation and bug reporting on work time during work hours, and I don't work on "work bugs" outside of that, on evenings, weekends, and holidays. This sometimes meshes awkwardly with the time open source projects have available for dealing with bugs (which is often in people's personal time outside of work hours), so sometimes I will reply to things and do additional followup investigation out of hours to keep a bug report moving along, but I mostly avoid it. Certainly the initial investigation and filing of a work bug is a working hours activity.

(I'm not always successful in keeping it to that because there is always the temptation to spend a few more minutes digging a bit more into the problem. This is especially acute when working from home.)

The second thing is that bug filing work is merely one of the claims on my work time. I have a finite amount of work time and a variety of things to get done with varying urgency, and filing and updating bugs is not always at the top of the list. And just like any other work activity, filing a particular bug has to convince me that it's worth spending some of my limited work time on it. Work does not pay me to file bugs and make open source better; they pay me to make our stuff work. Sometimes filing a bug is a good way to do this but some of the time it's not, for example because the organization in question doesn't respond to most bug reports.

(Even when it's useful in general to file a bug report because it will result in the issue being fixed at some point in the future, we generally need to deal with the problem today, so filing the bug report may take a back seat to things like developing workarounds.)

Another consequence is that it's much easier for me to make informal Fediverse posts about bugs (often as I discover more and more disconcerting things) or write Wandering Thoughts posts about work bugs than it is to make an actual bug report. Writing for Wandering Thoughts is a personal thing that I do outside of work hours, although I write about stuff from work (and I can often use something to write about, so interesting work bugs are good grist).

(There is also that making bug reports is not necessarily pleasant, and making bad bug reports can be bad. This interacts unpleasantly with the open source valorization of public work. To be blunt, I'm more willing to do unpleasant things when work is paying me than when it's not, although often the bug reports that are unpleasant to make are also the ones that aren't very useful to make.)

PS: All of this leads to a surprisingly common pattern where I'll spend much of a work day running down a bug to the point where I feel I understand it reasonably well, come home after work, write the bug up as a Wandering Thoughts entry (often clarifying my understanding of the bug in the process), and then file a bug report at work the next work day.

IMAP clients can vary in their reactions to IMAP errors

By: cks
12 January 2025 at 03:55

For reasons outside of the scope of this entry, we recently modified our IMAP server so that it would only return 20,000 results from an IMAP LIST command (technically 20,001 results). In our environment, an IMAP LIST operation only generates this many results when one of the people who can hit this has run into our IMAP server backward compatibility problem. When we made this change, we had a choice about what would happen when the limit was hit, specifically whether to claim that the IMAP LIST operation had succeeded or had failed. In the end we decided it was better to report that the IMAP LIST operation had failed, which also allowed us to include a text message explaining what had happened (in IMAP these are relatively free form).

(The specifics of the situation are that the IMAP LIST command will report a stream of IMAP folders back to the client and then end the stream after 20,001 entries, with either an 'ok' result or an error result with text. So in the latter case, the IMAP client gets 20,001 folder entries and an error at the end.)
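
In wire terms, a client that hits the limit sees something roughly like this (a sketch; the tag, hierarchy delimiter, flags, and the exact error text will all vary):

* LIST (\HasNoChildren) "/" "archive/folder-20000"
* LIST (\HasNoChildren) "/" "archive/folder-20001"
A42 NO LIST results truncated at 20001 folders

(with the other roughly 20,000 untagged LIST responses having streamed past before these.)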

Unsurprisingly, after deploying this change we've seen that IMAP clients (both mail readers and things like server webmail code) vary in their behavior when this limit is hit. The behavior we'd like to see is that the client considers itself to have a partial result and uses it as much as possible, while also telling the person using it that something went wrong. I'm not sure any IMAP client actually does this. One webmail system that we use reports the entire output from the IMAP LIST command as an 'error' (or tries to); since the error message is the last part of the output, this means it's never visible. One mail client appears to throw away all of the LIST results and not report an error to the person using it, which in practice means that all of your folders disappear (apart from your inbox).

(Other mail clients appear to ignore the error and probably show the partial results they've received.)

Since the IMAP server streams the folder list from IMAP LIST to the client as it traverses the folders (ie, Unix directories), we don't immediately know if there are going to be too many results; we only find that out after we've already reported those 20,000 folders. But in hindsight, what we could have done is reported a final synthetic folder with a prominent explanatory name and then claimed that the command succeeded (and stopped). In practice this seems more likely to show something to the person using the mail client, since actually reporting the error text we provide is apparently not anywhere near as common as we might hope.

Using tcpdump to see only incoming or outgoing traffic

By: cks
9 January 2025 at 03:13

In the normal course of events, implementations of 'tcpdump' report on packets going in both directions, which is to say it reports both packets received and packets sent. Normally this isn't confusing and you can readily tell one from the other, but sometimes situations aren't normal and you want to see only incoming packets or only outgoing packets (this has come up before). Modern versions of tcpdump can do this, but you have to know where to look.

If you're monitoring regular network interfaces on Linux, FreeBSD, or OpenBSD, this behavior is controlled by a tcpdump command line switch. On modern Linux and on FreeBSD, this is '-Q in' or '-Q out', as covered in the Linux manpage and the FreeBSD manpage. On OpenBSD, you use a different command line switch, '-D in' or '-D out', per the OpenBSD manpage.

(The Linux and FreeBSD tcpdump use '-D' to mean 'list all interfaces'.)
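
Putting this together, typical invocations look something like the following (the interface names and the filter expression are just placeholders):

tcpdump -n -i eth0 -Q in  'tcp port 22'     # Linux, FreeBSD: received packets only
tcpdump -n -i eth0 -Q out 'tcp port 22'     # sent packets only
tcpdump -n -i em0  -D in  'tcp port 22'     # the OpenBSD equivalent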

There are network types where the in or out direction can be matched by tcpdump pcap filter rules, but plain Ethernet is not one of them. This implies that you can't write a pcap filter rule that matches some packets only inbound and some packets only outbound at the same time; instead you have to run two tcpdumps.

If you have a (software) bridge interface or bridged collection of interfaces, as far as I know on both OpenBSD and FreeBSD the 'in' and 'out' directions on the underlying physical interfaces work the way you expect. Which is to say, if you have ix0 and ix1 bridged together as bridge0, 'tcpdump -Q in -i ix0' shows packets that ix0 is receiving from the physical network and doesn't include packets forwarded out through ix0 by the bridge interface (which in some sense you could say are 'sent' to ix0 by the bridge).

The PF packet filter system on both OpenBSD and FreeBSD can log packets to a special network interface, normally 'pflog0'. When you tcpdump this interface, both OpenBSD and FreeBSD accept an 'on <interface>' (which these days is a synonym for 'ifname <interface>') clause in pcap filters, which I believe means that the packet was received on the specific interface (per my entry on various filtering options for OpenBSD). Both also have 'inbound' and 'outbound', which I believe match based on whether the particular PF rule that caused them to match was an 'in' or an 'out' rule.

(See the OpenBSD pcap-filter and the FreeBSD pcap-filter manual pages.)

I'm firmly attached to a mouse and (overlapping) windows

By: cks
31 December 2024 at 04:45

In the tech circles I follow, there are a number of people who are firmly in what I could call a 'text mode' camp (eg, also). Over on the Fediverse, I said something in an aside about my personal tastes:

(Having used Unix through serial terminals or modems+emulators thereof back in the days, I am not personally interested in going back to a single text console/window experience, but it is certainly an option for simplicity.)

(Although I didn't put it in my Fediverse post, my experience with this 'single text console' environment extends beyond Unix. Similarly, I've lived without a mouse and now I want one (although I have particular tastes in mice).)

On the surface I might seem like someone who is a good candidate for the single pane of text experience, since I do much of my work in text windows, either terminals or environments (like GNU Emacs) that ape them, and I routinely do odd things like read email from the command line. But under the surface, I'm very much not. I very much like having multiple separate blocks of text around, being able to organize these blocks spatially, having a core area where I mostly work from with peripheral areas for additional things, and being able to overlap these blocks and apply a stacking order to control what is completely visible and what's partly visible.

In one view, you could say that this works partly because I have enough screen space. In another view, it would be better to say that I've organized my computing environment to have this screen space (and the other aspects). I've chosen to use desktop computers instead of portable ones, partly for increased screen space, and I've consistently opted for relatively large screens when I could reasonably get them, steadily moving up in screen size (both physical and resolution wise) over time.

(Over the years I've gone out of my way to have this sort of environment, including using unusual window systems.)

The core reason I reach for windows and a mouse is simple: I find the pure text alternative to be too confining. I can work in it if I have to but I don't like to. Using finer grained graphical windows instead of text based ones (in a text windowing environment, which exist), and being able to use a mouse to manipulate things instead of always having to use keyboard commands, is nicer for me. This extends beyond shell sessions to other things as well; for example, generally I would rather start new (X) windows for additional Emacs or vim activities rather than try to do everything through the text based multi-window features that each has. Similarly, I almost never use screen (or tmux) within my graphical desktop; the only time I reach for either is when I'm doing something critical that I might be disconnected from.

(This doesn't mean that I use a standard Unix desktop environment for my main desktops; I have a quite different desktop environment. I've also written a number of tools to make various aspects of this multi-window environment be easy to use in a work environment that involves routine access to and use of a bunch of different machines.)

If I liked tiling based window environments, it would be easier to switch to a text (console) based environment with text based tiling of 'windows', and I would probably be less strongly attached to the mouse (although it's hard to beat the mouse for selecting text). However, tiling window environments don't appeal to me (also), either in graphical or in text form. I'll use tiling in environments where it's the natural choice (for example, in vim and emacs), but I consider it merely okay.

The TLS certificate multi-file problem (for automatic updates)

By: cks
25 December 2024 at 03:25

In a recent entry on short lived TLS certificates and graceful certificate rollover in web servers, I mentioned that one issue with software automatically reloading TLS certificates was that TLS certificates are almost always stored in multiple files. Typically this is either two files (the TLS certificate's key and a 'fullchain' file with the TLS certificate and intermediate certificates together) or three files (the key, the signed certificate, and a third file with the intermediate chain). The core problem this creates is the same one you have any time information is split across multiple files, namely making 'atomic' changes to the set of files, so that software never sees an inconsistent state with some updated files and some not.

With TLS certificates, a mismatch between the key and the signed certificate will cause the server to be unable to properly prove that it controls the private key for the TLS certificate it presented. Either it will load the new key and the old certificate or the old key and the new certificate, and in either case it won't be able to generate the correct proof (assuming the secure case where your TLS certificate software generates a new key for each TLS certificate renewal, which you want to do since you want to guard against your private key having been compromised).

The potential for a mismatch is obvious if the file with the TLS key and the file with the TLS certificate are updated separately (or a new version is written out and swapped into place separately). At this point your mind might turn to clever tricks like writing all of the new files to a new directory and somehow swapping the whole directory in at once (this is certainly where mine went). Unfortunately, even this isn't good enough because the program has to open the two (or three) files separately, and the time gap between the opens creates an opportunity for a mismatch more or less no matter what we do.

(If the low level TLS software operates by, for example, first loading and parsing the TLS certificate, then loading the private key to verify that it matches, the time window may be bigger than you expect because the parsing may take a bit of time. The minimal time window comes about if you open the two files as close to each other as possible and defer all loading and processing until after both are opened.)

The only completely sure way to get around this is to put everything in one file (and then use an appropriate way to update the file atomically). Short of that, I believe that software could try to compensate by checking that the private key and the TLS certificate match after they're automatically reloaded, and if they don't, it should reload both.
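
If you want to express that match check outside the server software, for example in whatever deploys renewed certificates, one way is to compare the public keys derived from each file (a sketch with hypothetical file names):

# both commands should print the same hash if the key and certificate match
openssl x509 -in fullchain.pem -noout -pubkey | sha256sum
openssl pkey -in privkey.pem -pubout | sha256sum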

(If you control both the software that will use the TLS certificates and the renewal software, you can do other things. For example, you can always update the files in a specific order and then make the server software trigger an automatic reload only when the timestamp changes on the last file to be updated. That way you know the update is 'done' by the time you're loading anything.)

Remembering to make my local changes emit log messages when they act

By: cks
21 December 2024 at 03:48

Over on the Fediverse, I said something:

Current status: respinning an Ubuntu package build (... painfully) because I forgot the golden rule that when I add a hack to something, I should always make it log when my hack was triggered. Even if I can observe the side effects in testing, we'll want to know it happened in production.

(Okay, this isn't applicable to all hacks, but.)

Every so often we change or augment some standard piece of software or standard part of the system to do something special under specific circumstances. A rule I keep forgetting and then either re-learning or reminding myself of is that even if the effects of my change triggering are visible to the person using the system, I want to make it log as well. There are at least two reasons for this.

The first reason is that my change may wind up causing some problem for people, even if we don't think it's going to. Should it cause such problems, it's very useful to have a log message (perhaps shortly before the problem happens) to the effect of 'I did this new thing'. This can save a bunch of troubleshooting, both at the time when we deploy this change and long afterward.

The second reason is that we may turn out to be wrong about how often our change triggers, which is to say how common the specific circumstances are. This can go either way. Our change can trigger a lot more than we expected, which may mean that it's overly aggressive and is affecting people more than we want, and cause us to look for other options. Or this could be because the issue we're trying to deal with could be more significant than we expect and justifies us doing even more. Alternately, our change can trigger a lot less than we expect, which may mean we want to take the change out rather than have to maintain a local modification that doesn't actually do much (one that almost invariably makes the system more complex and harder to understand).

In the log message itself, I want to be clear and specific, although probably not as verbose as I would be for an infrequent error message. Especially for things I expect to trigger relatively infrequently, I should probably put as many details about the special circumstances as possible into the log message, because the log message is what me and my co-workers may have to work from in six months when we've forgotten the details.

PCIe cards we use and have used in our servers

By: cks
8 December 2024 at 03:00

In a comment on my entry on how common (desktop) motherboards are supporting more M.2 NVMe slots but fewer PCIe cards, jmassey was curious about what PCIe cards we needed and used. This is a good and interesting question, especially since some number of our 'servers' are actually built using desktop motherboards for various reasons (for example, a certain number of the GPU nodes in our SLURM cluster, and some of our older compute servers, which we put together ourselves using early generation AMD Threadrippers and desktop motherboards for them).

Today, we have three dominant patterns of PCIe cards. Our SLURM GPU nodes obviously have a GPU card (x16 PCIe lanes) and we've added a single port 10G-T card (which I believe are all PCIe x4) so they can pull data from our fileservers as fast as possible. Most of our firewalls have an extra dual-port 10G card (mostly 10G-T but a few use SFPs). And a number of machines have dual-port 1G cards because they need to be on more networks; our current stock of these cards are physically x4 PCIe, although I haven't looked to see if they use all the lanes.

(We also have single-port 1G cards lying around that sometimes get used in various machines; these are x1 cards. The dual-port 10G cards are probably some mix of x4 and x8, since online checks say they come in both varieties. We have and use a few quad-port 1G cards for semi-exotic situations, but I'm not sure how many PCIe lanes they want, physically or otherwise. In theory they could reasonably be x4, since a single 1G is fine at x1.)

In the past, one generation of our fileserver setup had some machines that needed to use PCIe SAS controllers in order to be able to talk to all of the drives in their chassis, and I believe these cards were PCIe x8; these machines also used a dual 10G-T card. The current generation handles all of their drives through motherboard controllers, but we might need to move back to cards in future hardware configurations (depending on what the available server motherboards handle on the motherboard). The good news, for fileservers, is that modern server motherboards increasingly have at least one onboard 10G port. But in a worst case situation, a large fileserver might need two SAS controller cards and a 10G card.

It's possible that we'll want to add NVMe drives to some servers (parts of our backup system may be limited by SATA write and read speeds today). Since I don't believe any of our current servers support PCIe bifurcation, this would require one or two PCIe x4 cards and slots (two if we want to mirror this fast storage, one if we decide we don't care). Such a server would likely also want 10G; if it didn't have a motherboard 10G port, that would require another x4 card (or possibly a dual-port 10G card at x8).

The good news for us is that servers tend to make all of their available slots be physically large (generally large enough for x8 cards, and maybe even x16 these days), so you can fit in all these cards even if some of them don't get all the PCIe lanes they'd like. And modern server CPUs are also coming with more and more PCIe lanes, so probably we can actually drive many of those slots at their full width.

(I was going to say that modern server motherboards mostly don't design in M.2 slots that reduce the available PCIe lanes, but that seems to depend on what vendor you look at. A random sampling of Supermicro server motherboards suggests that two M.2 slots are not uncommon, while our Dell R350s have none.)

The modern world of server serial ports, BMCs, and IPMI Serial over LAN

By: cks
4 December 2024 at 04:30

Once upon a time, life was relatively simple in the x86 world. Most x86 compatible PCs theoretically had one or two UARTs, which were called COM1 and COM2 by MS-DOS and Windows, ttyS0 and ttyS1 by Linux, 'ttyu0' and 'ttyu1' by FreeBSD, and so on, based on standard x86 IO port addresses for them. Servers had a physical serial port on the back and wired the connector to COM1 (some servers might have two connectors). Then life became more complicated when servers implemented BMCs (Baseboard management controllers) and the IPMI specification added Serial over LAN, to let you talk to your server through what the server believed was a serial port but was actually a connection through the BMC, coming over your management network.

Early BMCs could take very brute force approaches to making this work. The circa 2008 era Sunfire X2200s we used in our first ZFS fileservers wired the motherboard serial port to the BMC and connected the BMC to the physical serial port on the back of the server. When you talked to the serial port after the machine powered on, you were actually talking to the BMC; to get to the server serial port, you had to log in to the BMC and do an arcane sequence to 'connect' to the server serial port. The BMC didn't save or buffer up server serial output from before you connected; such output was just lost.

(Given our long standing console server, we had feelings about having to manually do things to get the real server serial console to show up so we could start logging kernel console output.)

Modern servers and their BMCs are quite intertwined, so I suspect that often both server serial ports are basically implemented by the BMC (cf), or at least are wired to it. The BMC passes one serial port through to the physical connector (if your server has one) and handles the other itself to implement Serial over LAN. There are variants on this design possible; for example, we have one set of Supermicro hardware with no external physical serial connector, just one serial header on the motherboard and a BMC Serial over LAN port. To be unhelpful, the motherboard serial header is ttyS0 and the BMC SOL port is ttyS1.
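
Whichever serial port winds up as the Serial over LAN one is the port you point the server OS's serial console at, and then you talk to it through the BMC. As a sketch (the baud rate and BMC details are whatever your environment uses), on that Supermicro hardware the Linux kernel command line gets something like 'console=tty0 console=ttyS1,115200n8', and from the management network you connect with:

ipmitool -I lanplus -H <bmc address> -U <user> -P <password> sol activate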

When the BMC handles both server serial ports and passes one of them through to the physical serial port, it can decide which one to pass through and which one to use as the Serial over LAN port. Being able to change this in the BMC is convenient if you want to have a common server operating system configuration but use a physical serial port on some machines and use Serial over LAN on others. With the BMC switching which server serial port comes out on the external serial connector, you can tell all of the server OS installs to use 'ttyS0' as their serial console, then connect ttyS0 to either Serial over LAN or the physical serial port as you need.

Some BMCs (I'm looking at you, Dell) go to an extra level of indirection. In these, the BMC has an idea of 'serial device 1' and 'serial device 2', with you controlling which of the server's ttyS0 and ttyS1 maps to which 'serial device', and then it has a separate setting for which 'serial device' is mapped to the physical serial connector on the back. This helpfully requires you to look at two separate settings to know if your ttyS0 will be appearing on the physical connector or as a Serial over LAN console (and gives you two settings that can be wrong).

In theory a BMC could share a single server serial port between the physical serial connector and an IPMI Serial over LAN connection, sending output to both and accepting input from each. In practice I don't think most BMCs do this and there are obvious issues of two people interfering with each other that BMCs may not want to get involved in.

PS: I expect more and more servers to drop external serial ports over time, retaining at most an internal serial header on the motherboard. That might simplify BMC and BIOS settings.

My life has been improved by my quiet Prometheus alert status monitor

By: cks
29 November 2024 at 04:48

I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.

(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)

After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I actually get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).

Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).

My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.

(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)

My new solution for quiet monitoring of our Prometheus alerts

By: cks
23 November 2024 at 03:25

Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't doing so. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.

At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.

(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)

The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.

Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.

To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).

(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)

To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:

ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS

I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
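
Stripped of the filtering of flaky pending alerts and all of the text formatting, the state number side of my check program looks something like this (a sketch; the Prometheus URL is a placeholder and error handling is left out):

import requests

PROMETHEUS = "http://prometheus.example.org:9090"   # placeholder URL
QUERY = ("ALERTS_FOR_STATE * ignoring(alertstate) "
         "group_left(alertstate) ALERTS")
EPOCH = 1_600_000_000    # arbitrary starting epoch, to keep the sum small

def alert_state_number():
    # Each result's value is the alert's start time (ALERTS itself is 1).
    r = requests.get(PROMETHEUS + "/api/v1/query", params={"query": QUERY})
    total = 0
    for res in r.json()["data"]["result"]:
        start = int(float(res["value"][1]))
        if res["metric"].get("alertstate") == "firing":
            start += 1   # a pending alert turning firing changes the state
        total += start - EPOCH
    return total

The real program is somewhat bigger, since it also has to produce the text summary that xlbiff displays.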

PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).

Our Prometheus alerting problem if our central mail server isn't working

By: cks
22 November 2024 at 04:04

Over on the Fediverse, I said something:

Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?

(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)

There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.

The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problems even without receiving any alert notification. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooting, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.

(In that old version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)

One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email to SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access and we'd build some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).

For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).

(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)

IPv6 networks do apparently get probed (and implications for address assignment)

By: cks
16 November 2024 at 03:30

For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.

On my old /64, I used ::1/64 and ::2/64 for static IP addresses, of which the first was in DNS, and the other IPv6 addresses in it were the usual SLAAC assignments. The first thing I discovered in my tcpdump was a surprisingly large number of cloud-based IPv6 addresses that were pinging my ::1 address. Once I excluded that traffic, I was left with enough volume of port probes that I could easily see them in a casual tcpdump.

The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.

(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)

As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.

One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.

PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.

Your options for displaying status over time in Grafana 11

By: cks
15 November 2024 at 03:41

A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.

The generally best panel type you want to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.

You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.

(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)

As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the function of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).

Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.

If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.

Finding a good use for keep_firing_for in our Prometheus alerts

By: cks
13 November 2024 at 04:06

A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.

Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clean out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).

The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.

Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another useful short number to de-bounce our large scale alert trigger conditions.
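As a sketch of the shape of this change (the alert name, metric, and numbers here are invented for illustration, not our actual rules):

groups:
  - name: largescale
    rules:
      - alert: LargeScaleIssue
        # Old approach: latch the trigger condition with a range subquery,
        # which keeps even a brief blip 'true' for the full 20 minutes:
        #   expr: max_over_time((count(probe_success == 0) > 10)[20m:])
        # New approach: de-bounce with 'for:' and linger with keep_firing_for.
        expr: count(probe_success == 0) > 10
        for: 1m
        keep_firing_for: 20m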

(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)

I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).

Prometheus makes it annoyingly difficult to add more information to alerts

By: cks
12 November 2024 at 03:58

Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.

While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.

I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.

A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to get at a label, I think you need to use a more complex template with a 'with' or 'range' action over the query results; on the positive side, this may let you format a message fragment with multiple labels and even the value.
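To make this concrete, here is roughly the shape of an alert rule annotation that queries Prometheus during template expansion (the queries and annotation names are invented; 'query', 'first', 'value', and 'humanize' are standard Prometheus template functions):

annotations:
  summary: 'Large scale issue: {{ $value }} checks failing'
  downcount: '{{ with query "count(probe_success == 0)" }}{{ . | first | value | humanize }}{{ end }}'
  downhosts: '{{ range query "topk(5, probe_success == 0)" }}{{ .Labels.instance }} {{ end }}'

Wrapping the query in 'with' (or iterating over it with 'range') is what avoids the error log complaints when the query comes back empty.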

More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alert notifications between your Prometheus alert rules and Alertmanager. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information back out with Go template functions.

What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.

(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels, and so on.)

Doing general address matching against varying address lists in Exim

By: cks
30 October 2024 at 02:23

In various Exim setups, you sometimes want to match an email address against a file (or in general a list) of addresses and some sort of address patterns; for example, you might have a file of addresses and so on that you will never accept as sender addresses. Exim has two different mechanisms for doing this, address lists and nwildlsearch lookups in files that are performed through the '${lookup}' string expansion item. Generally it's better to use address lists, because they have a wildcard syntax that's specifically focused on email addresses, instead of the less useful nwildlsearch lookup wildcarding.
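For illustration, a file used as an address list might look something like this (the addresses are made up; the '*' and '^' items are standard address list patterns, and '#' lines are comments):

# specific addresses we never want to see as senders
spammer@example.com
noreply@example.com
# everything in a domain
*@example.org
# a regular expression, matched against the whole address
^bounce-[0-9]+@example\.net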

Exim has specific features for matching address lists (including in file form) against certain addresses associated with the email message; for example, both ACLs and routers can match against the envelope sender address (the SMTP MAIL FROM) using 'senders = ...'. If you want to match against message addresses that are not available this way, you must use a generic 'condition =' operation and either '${lookup}' or '${if match_address {..}{...}}', depending on whether you want to use a nwildlsearch lookup or an actual address list (likely in a file). As mentioned, normally you'd prefer to use an actual address list.

Now suppose that your file of addresses is, for example, per-user. In a straight 'senders =' match this is no problem, you can just write 'senders = /some/where/$local_part_data/addrs'. Life is not as easy if you want to match a message address that is not directly supported, for example the email address of the 'From:' header. If you have the user (or whatever other varying thing) in $acl_m0_var, you would like to write:

condition = ${if match_address {${address:$h_from:}} {/a/dir/$acl_m0_var/fromaddrs} }

However, match_address (and its friends) have a deliberate limitation, which is that in common Exim build configurations they don't perform string expansion on their second argument.

The way around this turns out to be to use an explicitly defined and named 'addresslist' that has the string expansion:

addresslist badfromaddrs = /a/dir/$acl_m0_var/fromaddrs
[...]
  condition = ${if match_address {${address:$h_from:}} {+badfromaddrs} }

This looks weird, since at the point we're setting up badfromaddrs the $acl_m0_var is not even vaguely defined, but it works. The important thing that makes this go is a little sentence at the start of the Exim documentation's Expansion of lists:

Each list is expanded as a single string before it is used. [...]

Although the second argument of match_address is not string-expanded when the condition is evaluated, if it specifies a named address list, that address list is itself string-expanded when it's used, and so our $acl_m0_var variable is substituted in and everything works.

Speaking from personal experience, it's easy to miss this sentence and its importance, especially if you normally use address lists (and domain lists and so on) without any string expansion, with fixed arguments.

(Probably the only reason I found it was that I was in the process of writing a question to the Exim mailing list, which of course got me to look really closely at the documentation to make sure I wasn't asking a stupid question.)

Having rate-limits on failed authentication attempts is reassuring

By: cks
23 October 2024 at 03:24

A while back I added rate-limits to failed SMTP authentication attempts. Mostly I did it because I was irritated at seeing all of the failed (SMTP) authentication attempts in logs and activity summaries; I didn't think we were in any actual danger from the usual brute force mass password guessing attacks we see on the Internet. To my surprise, having this rate-limit in place has been quite reassuring, to the point where I no longer even bother looking at the overall rate of SMTP authentication failures or their sources. Attackers are unlikely to make much headway or have much of an impact on the system.

Similarly, we recently updated an OpenBSD machine that has its SSH port open to the Internet from OpenBSD 7.5 to OpenBSD 7.6. One of the things that OpenBSD 7.6 brings with it is the latest version of OpenSSH, 9.8, which has per-source authentication rate limits (although they're not quite described that way and the feature is more general). This was also a reassuring change. Attackers wouldn't be getting into the machine in any case, but I have seen the machine use an awful lot of CPU at times when attackers were pounding away, and now they're not going to be able to do that.

(We've long had firewall rate limits on connections, but they have to be set high for various reasons including that the firewall can't tell connections that fail to authenticate apart from brief ones that did.)

I can wave my hands about why it feels reassuring (and nice) to know that we have rate-limits in place for (some) commonly targeted authentication vectors. I know it doesn't outright eliminate the potential exposure, but I also know that it helps reduce various risks. Overall, I think of it as making things quieter, and in some sense we're no longer getting constantly attacked as much.

(It's also nice to hope that we're frustrating attackers and wasting their time. They do sort of have limits on how much time they have and how many machines they can use and so on, so our rate limits make attacking us more 'costly' and less useful, especially if they trigger our rate limits.)

PS: At the same time, this shows my irrationality, because for a long time I didn't even think about how many SSH or SMTP authentication attempts were being made against us. It was only after I put together some dashboards about this in our metrics system that I started thinking about it (and seeing temporary changes in SSH patterns and interesting SMTP and IMAP patterns). Had I never looked, I would have never thought about it.

Our various different types of Ubuntu installs

By: cks
17 October 2024 at 02:15

In my entry on how we have lots of local customizations I mentioned that the amount of customization we do to any particular Ubuntu server depends on what class or type of machine they are. That's a little abstract, so let's talk about how our various machines are split up by type.

Our general install framework has two pivotal questions that categorize machines. The first question is what degree of NFS mounting the machine will do. The choices are: all of the NFS filesystems from our fileservers (more or less); NFS mounting just our central administrative filesystem, either with our full set of accounts or with just staff accounts; rsync'ing that central administrative filesystem (which implies only staff accounts); or being a completely isolated machine that doesn't have even the central administrative filesystem.

Servers that people will use have to have all of our NFS filesystems mounted, as do things like our Samba and IMAP servers. Our fileservers don't cross-mount NFS filesystems from each other, but they do need a replicated copy of our central administrative filesystem and they have to have our full collection of logins and groups for NFS reasons. Many of our more stand-alone, special purpose servers only need our central administrative filesystem, and will either NFS mount it or rsync it depending on how fast we want updates to propagate. For example, our local DNS resolvers don't particularly need fast updates, but our external mail gateway needs to be up to date on what email addresses exist, which is propagated through our central administrative filesystem.

On machines that have all of our NFS mounts, we have a further type choice; we can install them either as a general login server (called an 'apps' server for historical reasons), as a 'comps' compute server (which includes our SLURM nodes), or only install a smaller 'base' set of packages on them (which is not all that small; we used to try to have a 'core' package set and a larger 'base' package set but over time we found we never installed machines with only the 'core' set). These days the only difference between general login servers and compute servers is some system settings, but in the past they used to have somewhat different package sets.

The general login servers and compute servers are mostly not further customized (there are a few exceptions, and SLURM nodes need a bit of additional setup). Almost all machines that get only the base package set are further customized with additional packages and specific configuration for their purpose, because the base package set by itself doesn't make the machine do anything much or be particularly useful. These further customizations mostly aren't scripted (or otherwise automated) for various reasons. The one big exception is installing our NFS fileservers, which we decided was both large enough and we had enough of that we wanted to script it so that everything came out the same.

As a practical matter, the choice between NFS mounting our central administrative filesystem (with only staff accounts) and rsync'ing it makes almost no difference to the resulting install. We tend to think of the two types of servers it creates as almost equivalent and mostly lump them together. So as far as operating our machines goes, we mostly have 'all NFS mounts' machines and 'only the administrative filesystem' machines, with a few rare machines that don't have anything (and our NFS fileservers, which are special in their own way).

(In the modern Linux world of systemd, much of our customizations aren't Ubuntu specific, or even specific to Debian and derived systems that use apt-get. We could probably switch to Debian relatively easily with only modest changes, and to an RPM based distribution with more work.)

We have lots of local customizations (and how we keep track of them)

By: cks
15 October 2024 at 03:02

In a comment on my entry on forgetting some of our local changes to our Ubuntu installs, pk left an interesting and useful comment on how they manage changes so that the changes are readily visible in one place. This is a very good idea and we do something similar to it, but a general limitation of all such approaches is that it's still hard to remember all of your changes off the top of your head once you've made enough of them. Once you're changing enough things, you generally can't put them all in one directory that you can simply 'ls' to be reminded of everything you change; at best, you're looking at a list of directories where you change things.

Our system for customizing Ubuntu stores the master version of customizations in our central administrative filesystem, although split across several places for convenience. We broadly have one directory hierarchy for Ubuntu release specific files (or at least ones that are potentially version specific; in practice a lot are the same between different Ubuntu releases), a second hierarchy (or two) for files that are generic across Ubuntu versions (or should be), and then a per-machine hierarchy for things specific to a single machine. Each hierarchy mirrors the final filesystem location, so that our systemd unit files will be in, for example, <hierarchy root>/etc/systemd/system.

Our current setup embeds the knowledge of what files will or won't be installed on any particular class of machines into the Ubuntu release specific 'postinstall' script that we run to customize machines, in the form of a whole bunch of shell commands to copy each of the files (or collections of files). This gives us straightforward handling of files that aren't always installed (or that vary between types of machines), at the cost of making it a little unclear whether a particular file in the master hierarchy will actually be installed. We could try to do something more clever, but it would be less obvious than the current straightforward approach, where the postinstall script has a lot of 'cp -a <src>/etc/<file> /etc/<file>' commands and it's easy to see what you need to do to add a file or handle one specially.
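A hypothetical fragment of such a postinstall script might look like the following (all of the paths, file names, and variables here are invented for illustration):

VERS=/adm/ubuntu/24.04
GEN=/adm/ubuntu/generic

# files everyone gets
cp -a $GEN/etc/rsyslog.d/90-local.conf /etc/rsyslog.d/
cp -a $VERS/etc/systemd/system/local-something.service /etc/systemd/system/
systemctl daemon-reload

# only machines with full NFS mounts get this one
if [ "$nfsmounts" = "full" ]; then
    cp -a $VERS/etc/auto.master /etc/auto.master
fi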

(The obvious alternate approach would be to have a master file that listed all of the files to be installed on each type of machine. However, one advantage of the current approach is that it's easy to have various commentary about the files being installed and why, and it's also easy to run commands, install packages, and so on in between installing various files. We don't install them all at once.)

Based on some brute force approximation, it appears that we install around 100 customization files on a typical Ubuntu machine (we install more on some types of machines than on other types, depending on whether the machine will have all of our NFS mounts and whether or not it's a machine regular people will log in to). Specific machines can be significantly customized beyond this; for example, our ZFS fileservers get an additional scripted customization pass.

PS: The reason we have this stuff scripted and stored in a central filesystem is that we have over a hundred servers and a lot of them are basically identical to each other (most obviously, our SLURM nodes). In aggregate, we install and reinstall a fair number of machines and almost all of them have this common core.

Our local changes to standard (Ubuntu) installs are easy to forget

By: cks
14 October 2024 at 03:08

We have been progressively replacing a number of old one-off Linux machines with up to date replacements that run Ubuntu and so are based on our standard Ubuntu install. One of those machines has a special feature where a group of people are allowed to use passworded sudo to gain access to a common holding account. After we deployed the updated machine, these people got in touch with us to report that something had gone wrong with the sudo system. This was weird to me, because I'd made sure to faithfully replicate the old system's sudo customizations to the new one. When I did some testing, things got weirder; I discovered that sudo was demanding the root password instead of my password. This was definitely not how things were supposed to work for this sudo access (especially since the people with sudo access don't know the root password for the machine).

Whether or not sudo does this is controlled by the setting of 'rootpw' in sudoers or one of the files it includes (at least with Ubuntu's standard sudo configuration). The stock Ubuntu sudoers doesn't set 'rootpw', and of course this machine's sudoers customizations didn't set it either. But when I looked around, I discovered that we had long ago set up an /etc/sudoers.d customization file to set 'rootpw' and made it part of our standard Ubuntu install. When I rebuilt this machine based on our standard Ubuntu setup, the standard install process had installed this sudo customization. Since we'd long ago completely forgotten about its existence, I hadn't remembered it while customizing the machine to its new purpose, so it had stayed.
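The forgotten customization amounts to a tiny sudoers drop-in along these lines ('rootpw' is a standard sudoers Defaults flag; the file name here is just an example, not our actual one):

# /etc/sudoers.d/local-rootpw
# Make sudo ask for root's password, not the invoking user's password.
Defaults rootpw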

(We don't normally use passworded sudo, and we definitely want access to root to require someone to know the special root password, not just the password to a sysadmin's account.)

There are probably a lot of things that we've added to our standard install over the years that are like this sudo customization. They exist to make things work (or not work), and as long as they keep quietly doing their jobs it's very easy to forget them and their effects. Then we do something exceptional on a machine and they crop up, whether it's preventing sudo from working like we want it to or almost giving us a recursive syslog server.

(I don't have any particular lesson to draw from this, except that it's surprisingly difficult to de-customize a machine. One might think the answer is to set up the machine from scratch outside our standard install framework, but the reality is that there's a lot from the standard framework that we still want on such machines. Even with issues like this, it's probably easier to install them normally and then fix the issues than do a completely stock Ubuntu server install.)

Some thoughts on why 'inetd activation' didn't catch on

By: cks
13 October 2024 at 02:06

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them; it dates from the era of 4.3 BSD. In theory inetd can act as a service manager of sorts for daemons like the BSD r* commands, saving them from having to implement things like daemonization, and in fact it turns out that one version of this is how these daemons were run in 4.3 BSD. However, running daemons under inetd never really caught on (even in 4.3 BSD some important daemons ran outside of inetd), and these days it's basically dead. You could ask why, and I have some thoughts on that.

The initial version of inetd only officially supported running TCP services in a mode where each connection ran a new instance of the program (call this the CGI model). On the machines of the 1980s and 1990s, this wasn't a particularly attractive way to run anything but relatively small and simple programs (and ones that didn't have to do much work on startup). In theory you could possibly run TCP services in a mode where they were passed the server socket and then accepted new connections themselves for a while; in practice, no one seems to have really written daemons that supported this. Daemons that supported an 'inetd mode' generally meant the 'run a copy of the program for each connection' mode.

(Possibly some of them supported both modes of inetd operation, but system administrators would pretty much assume that if a daemon's documentation said just 'inetd mode' that it meant the CGI model.)
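For illustration, classic per-connection ('nowait') inetd.conf entries look roughly like this (the fields are from inetd.conf(5); the exact daemons and paths varied by system):

# service  socket  proto  wait?   user  program               arguments
ftp        stream  tcp    nowait  root  /usr/libexec/ftpd     ftpd -l
shell      stream  tcp    nowait  root  /usr/libexec/rshd     rshd

A 'wait' stream service is the other mode, where the program is handed the listening socket and accepts connections itself for a while.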

Another issue is that inetd is not a service manager. It will start things for you, but that's it; it won't shut down things for you (although you can get it to stop listening on a port), and it won't tell you what's running (you get to inspect the process list). On Unixes with a System V init system or something like it, running your daemons as standalone things gave you 'start', 'stop', 'restart', 'status', and other service management operations that might even work (depending on the quality of the init.d scripts involved). Since daemons had better usability when run as standalone services, system administrators and others had relatively little reason to push for inetd support, especially in the second mode.

In general, running any important daemon under inetd has many of the same downsides as systemd socket activation of services. As a practical matter, system administrators like to know that important daemons are up and running right away, and that they don't have some hidden issue that will cause them to fail to start just when you want them. The normal CGI-like inetd mode also means that any changes to configuration files and the like take effect right away, which may not be what you want; system administrators tend to like controlling when daemons restart with new configurations.

All of this is likely tied to what we could call 'cultural factors'. I suspect that authors of daemons perceived running standalone as the more serious and prestigious option, the one for serious daemons like named and sendmail, and inetd activation to be at most a secondary feature. If you wrote a daemon that only worked with inetd activation, you'd practically be proclaiming that you saw your program as a low importance thing. This obviously reinforces itself, to the degree that I'm surprised sshd even has an option to run under inetd.

(While some Linuxes are now using systemd socket activation for sshd, they aren't doing it via its '-i' option.)

PS: There are some services that do still generally run under inetd (or xinetd, often the modern replacement, cf). For example, I'm not sure if the Amanda backup system even has an option to run its daemons as standalone things.

Brief notes on making Prometheus's SNMP exporter use additional SNMP MIB(s)

By: cks
30 September 2024 at 03:13

Suppose, not entirely hypothetically, that you have a DSL modem that exposes information about the state of your DSL link through SNMP, and you would like to get that information into Prometheus so that you could track it over time (for reasons). You could scrape this information by 'hand' using scripts, but Prometheus has an officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only"; how you do things with it if its stock configuration doesn't meet your needs is what I would call rather underdocumented.

The first thing you'll need to do is find out what generally known and unknown SNMP attributes ('OIDs') your device exposes. You can do this using tools like snmpwalk, and see also some general information on reading things over SNMP. Once you've found out what OIDs your device supports, you need to find out if there are public MIBs for them. In my case, my DSL modem exposed information about network interfaces in the standard and widely available 'IF-MIB', and ADSL information in the standard but not widely available 'ADSL-LINE-MIB'. For the rest of this entry I'll assume that you've managed to fetch the ADSL-LINE-MIB and everything it depends on and put them in a directory, /tmp/adsl-mibs.

The SNMP exporter effectively has two configuration files (as I wrote about recently); a compiled ('generated') configuration file (or set of them) that lists in exhausting detail all of the SNMP OIDs to be collected, and an input file to a separate tool, the generator, that creates the compiled main file. To collect information from a new MIB, you need to set up a new SNMP exporter 'module' for it, and specify the root OID or OIDs involved to walk. This looks like:

---
modules:
  # The ADSL-LINE-MIB MIB
  adsl_line_mib:
    walk:
      - 1.3.6.1.2.1.10.94
      # or:
      #- adslMIB

Here adsl_line_mib is the name of the new SNMP exporter module, and we give it the starting OID of the MIB. You can't specify the name of the MIB itself as the OID to walk, although this is how 'snmpwalk' will present it. Instead you have to use the MIB's 'MODULE-IDENTITY' line, such as 'adslMIB'. Alternately, perusal of your MIB and snmpwalk results may suggest alternate names to use, such as 'adslLineMib'. Using the top level OID is probably easier.

The name of your new module is arbitrary, but it's conventional to use the name of the MIB in this form. You can do other things in your module; reading the existing generator.yml is probably the most useful documentation. As various existing modules show, you can walk multiple OIDs in one module.

This configuration file leaves out the 'auths:' section from the main generator.yml, because we only need one of them, and what we're doing is generating an additional configuration file for snmp_exporter that we'll use along with the stock snmp.yml. To actually generate our new snmp-adsl.yml, we do:

cd snmp_exporter/generator
go build
make # fetches the MIBs it needs into ./mibs
./generator generate \
   -m ./mibs \
   -m /tmp/adsl-mibs \
   -g generator-adsl.yml \
   -o /tmp/snmp-adsl.yml

We give the generator both its base set of MIBs, which will define various common things, and the directory with our ADSL-LINE-MIB and all of the MIBs it may depend on. Although the input is small, the snmp-adsl.yml will generally be quite big; in my case, over 2,000 lines.
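To actually use the result, you run the exporter with both configuration files (I believe recent snmp_exporter versions accept '--config.file' more than once) and point a Prometheus scrape job at the new module. Something along these lines, with the modem's address made up and 'public_v2' being the stock SNMP v2c 'public' community auth:

./snmp_exporter --config.file=snmp.yml --config.file=snmp-adsl.yml

and on the Prometheus side, following the standard snmp_exporter scrape pattern:

scrape_configs:
  - job_name: 'dsl-modem'
    metrics_path: /snmp
    params:
      auth: [public_v2]
      module: [adsl_line_mib]
    static_configs:
      - targets: ['192.168.1.254']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116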

As I mentioned the other day, you may find that some of the SNMP OIDs actually returned by your device don't conform to the SNMP MIB. When this happens, your scrape results will not be a success but instead a HTTP 500 error with text that says things like:

An error has occurred while serving metrics:
error collecting metric Desc{fqName: "snmp_error", help: "BITS type was not a BISTRING on the wire.", constLabels: {}, variableLabels: {}}: error for metric adslAturCurrStatus with labels [1]: <nil>

This says that the actual OID(s) for adslAturCurrStatus from my device didn't match what the MIB claimed. In this case, my raw snmpwalk output for this OID is:

.1.3.6.1.2.1.10.94.1.1.3.1.6.1 = BITS: 00 00 00 01 31

(I don't understand what this means, since I'm not anywhere near an SNMP expert.)

If the information is sufficiently important, you'll need to figure out how to modify either the MIB or the generated snmp-adsl.yml to get the information without snmp_exporter errors. Doing so is far beyond the scope of this entry. If the information is not that important, the simple way is to exclude it with a generator override:

---
modules:
  adsl_line_mib:
    walk:
      # ADSL-LINE-MIB
      #- 1.3.6.1.2.1.10.94
      - adslMIB
    overrides:
      # My SmartRG SR505N produces values for this metric
      # that make the SNMP exporter unhappy.
      adslAturCurrStatus:
        ignore: true

You can at least get the attribute name you need to ignore from the SNMP exporter's error message. Unfortunately this error message is normally visible only in scrape output, and you'll only see it if you scrape manually with something like 'curl'.

Brief notes on how the Prometheus SNMP exporter's configurations work

By: cks
28 September 2024 at 03:19

A variety of devices (including DSL modems) expose interesting information via SNMP (which is not simple, despite its name). If you have a Prometheus environment, it would be nice to get (some of) this information from your SNMP capable devices into Prometheus. You could do this by hand with scripts and commands like 'snmpget', but there is also the officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only". Understanding how to do things even a bit out of standard with it is, well, a bit tricky. So here are some notes.

The SNMP exporter ships with a 'snmp.yml' configuration file that's what the actual 'snmp_exporter' program uses at runtime (possibly augmented by additional files you provide). As you'll read when you look at the file, this file is machine generated. As far as I can tell, the primary purpose of this file is to tell the exporter what SNMP OIDs it could try to read from devices, what metrics generated from them should be called, and how to interpret the various sorts of values it gets back over SNMP (for instance, network interfaces have a 'ifType' that in raw format is a number, but where the various values correspond to different types of physical network types). These SNMP OIDs are grouped into 'modules', with each module roughly corresponding to a SNMP MIB (the correspondence isn't necessarily exact). When you ask the SNMP exporter to query a SNMP device, you normally tell the exporter what modules to use, which determines what OIDs will be retrieved and what metrics you'll get back.

The generated file is very verbose, which is why it's generated, and its format is pretty underdocumented, which certainly does help contribute to the "no user serviceable parts" feeling. There is very little support for directly writing a new snmp.yml module (which you can at least put in a separate 'snmp-me.yml' file) if you happen to have a few SNMP OIDs that you know directly, don't have a MIB for, and want to scrape and format specifically. Possibly the answer is to try to write a MIB yourself and generate a snmp-me.yml from it, but I haven't had to do this so I have no opinions on which way is better.

The generated file and its modules are created from various known MIBs by a separate program, the generator. The generator has its own configuration file to describe what modules to generate, what OIDs go into each module, and so on. This means that reading generator.yml is the best way to find out what MIBs the SNMP exporter already supports. As far as I know, although generator.yml doesn't necessarily specify OIDs by name, the generator requires MIBs for everything you want to be in the generated snmp.yml file and generate metrics for.

The generator program and its associated data isn't available as part of the pre-built binary SNMP exporter packages. If you need anything beyond the limited selection of MIBs that are compiled into the stock snmp.yml, you need to clone the repository, go to the 'generator' subdirectory, build the generator with 'go build' (currently), run 'make' to fetch and process the MIBs it expects, get (or write) MIBs for your additional metrics, and then write yourself a minimal generator-me.yml of your own to add one or more (new) modules for your new MIBs. You probably don't want to regenerate the main snmp.yml; you might as well build a 'snmp-me.yml' that just has your new modules in it, and run the SNMP exporter with snmp-me.yml as an additional configuration file.

As a practical matter, you may find that your SNMP capable device doesn't necessarily conform to the MIB that theoretically describes it, including OIDs with different data formats (or data) than expected. In the simple case, you can exclude OIDs or named attributes from being fetched so that the non-conformance doesn't cause the SNMP exporter to throw errors:

modules:
  adsl_line_mib:
[...]
    overrides:
      adslAturCurrStatus:
        ignore: true

More complex mis-matches between the MIB and your device will have you reading whatever you can find for the available options for generator.yml or even for snmp.yml itself. Or you can change your mind and scrape through scripts or programs in other languages instead of the SNMP exporter (it's what we do for some of our machine room temperature sensors).

(I guess another option is editing the MIB so that it corresponds to what your device returns, which should make the generator produce a snmp-me.yml that matches what the SNMP exporter sees from the device.)

PS: A peculiarity of the SNMP exporter is that the SNMP metrics it generates are all named after their SNMP MIB names, which produce metric names that are not at all like conventional Prometheus metric names. It's possible to put a common prefix, such as 'snmp_metric_', on all SNMP metrics to make them at least a little bit better. Technically this is a peculiarity of snmp.yml, but changing it is functionally impossible unless you hand-edit your own version.

The impact of the September 2024 CUPS CVEs depends on your size

By: cks
27 September 2024 at 03:16

The recent information security news is that there are a series of potentially serious issues in CUPS (via), but on the other hand a lot of people think that this isn't an exploit with a serious impact because, based on current disclosures, someone has to print something to a maliciously added new 'printer' (for example). My opinion is that how potentially serious this issue is for you depends on the size and scope of your environment.

Based on what we know, the vulnerability requires the CUPS server to also be running 'cups-browsed'. One of the things that cups-browsed does is allow remote printers to register themselves on the CUPS server; you set up your new printer, point it at your local CUPS print server, and everyone can now use it. As part of this registration, the collection of CUPS issues allows a malicious 'printer' to set up server side data (a CUPS PPD) that contains things that will run commands on the print server when a print job is sent to this malicious 'printer'. In order to get anything to happen, an attacker needs to get someone to do this.

In a personal environment or a small organization, this is probably unlikely. Either you know all the printers that are supposed to be there and a new one showing up is alarming, or at the very least you'll probably assume that the new printer is someone's weird experiment or local printer or whatever, and printing to it won't make either you or the owner very happy. You'll take your print jobs off to the printers you know about, and ignore the new one.

(Of course, an attacker with local knowledge could target their new printer name to try to sidestep this; for example, calling it 'Replacement <some existing printer>' or the like.)

In a larger organization, such as ours, people don't normally know all of the printers that are around and don't generally know when new printers show up. In such an environment, it's perfectly reasonable for people to call up a 'what printer do you want to use' dialog, see a new to them printer with an attractive name, and use it (perhaps thinking 'I didn't know they'd put a printer in that room, that's conveniently close'). And since printer names that include locations are perpetually misleading or wrong, most of the time people won't be particularly alarmed if they go to the location where they expect the printer (and their print job) to be and find nothing. They'll shrug, go back, and re-print their job to a regular printer they know.

(There are rare occasions here where people get very concerned when print output can't be found, but in most cases the output isn't sensitive and people don't care if there's an extra printed copy of a technical paper or the like floating around.)

Larger scale environments, possibly with an actual CUPS print server, are also the kind of environment where you might deliberately run cups-browsed. This could be to enable easy addition of new printers to your print server or to allow people's desktops to pick up what printers were available out there without you needing to even have a central print server.

My view is that this set of CVEs shows that you probably can't trust cups-browsed in general and need to stop running it, unless you're very confident that your environment is entirely secure and will never have a malicious attacker able to send packets to cups-browsed.
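On a typical systemd based Linux, checking for and turning off cups-browsed is quick (the unit name is standard, but check what your distribution calls it):

systemctl status cups-browsed
sudo systemctl disable --now cups-browsed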

(I said versions of this on the Fediverse (1, 2), so I might as well elaborate on it here.)

Our broad reasons for and approach to mirroring disks

By: cks
21 September 2024 at 02:51

When I talked about our recent interest in FreeBSD, I mentioned the issue of disk mirroring. One of the questions this raises is what we use disk mirroring for, and how we approach it in general. The simple answer is that we mirror disks for extra redundancy, not for performance, but we don't go too far to get extra redundancy.

The extremely thorough way to do disk mirroring for redundancy is to mirror with different makes and ages of disks on each side of the mirror, to try to avoid both age related failures and model or maker related issues (either firmware or where you find out that the company used some common problematic component). We don't go this far; we generally buy a block of whatever SSD is considered good at the moment, then use them for a while, in pairs, either fresh in newly deployed servers or re-using a pair in a server being re-deployed. One reason we tend to do this is that we generally get 'consumer' drives, and finding decent consumer drives is hard enough at the best of times without having to find two different vendors of them.

(We do have some HDD mirrors, for example on our Prometheus server, but these are also almost always paired disks of the same model, bought at the same time.)

Because we have backups, our redundancy goals are primarily to keep servers operating despite having one disk fail. This means that it's important that the system keep running after a disk failure, that it can still reboot after a disk failure (including of its first, primary disk), and that the disk can be replaced and put into service without downtime (provided that the hardware supports hot swapping the drive). The less this is true, the less useful any system's disk mirroring is to us (including 'hardware' mirroring, which might make you take a trip through the BIOS to trigger a rebuild after a disk replacement, which means downtime). It's also vital that the system be able to tell us when a disk has failed. Not being able to reliably tell us this is how you wind up with systems running on a single drive until that single drive then fails too.
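With Linux software RAID mirrors, for example (mdadm is purely for illustration here), checking state and getting notified is straightforward:

cat /proc/mdstat                      # '[UU]' means both halves of a mirror are in service
mdadm --detail /dev/md0               # per-array state, including failed devices
grep MAILADDR /etc/mdadm/mdadm.conf   # where mdadm's monitor emails failure reports
                                      # (the path is /etc/mdadm.conf on some systems)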

On our ZFS fileservers it would be quite undesirable to have to restore from backups, so we have an elaborate spares system that uses extra disk space on the fileservers (cf) and a monitoring system to rapidly replace failed disks. On our regular servers we don't (currently) bother with this, even on servers where we could add a third disk as a spare to the two system disks.

(We temporarily moved to three way mirrors for system disks on some critical servers back in 2020, for relatively obvious reasons. Since we're now in the office regularly, we've moved back to two way mirrors.)

Our experience so far with both HDDs and SSDs is that we don't really seem to have clear age related or model related failures that take out multiple disks at once. In particular, we've yet to lose both disks of a mirror before one could be replaced, despite our habit of using SSDs and HDDs in basically identical pairs. We have had a modest number of disk failures over the years, but they've happened by themselves.

(It's possible that at some point we'll run a given set of SSDs for long enough that they start hitting lifetime limits. But we tend to grab new SSDs when re-deploying important servers. We also have a certain amount of server generation turnover for important servers, and when we use the latest hardware it also gets brand new SSDs.)

Why we're interested in FreeBSD lately (and how it relates to OpenBSD here)

By: cks
16 September 2024 at 03:09

We have a long and generally happy history of using OpenBSD and PF for firewalls. To condense a long story, we're very happy with the PF part of our firewalls, but we're increasingly not as happy with the OpenBSD part (outside of PF). Part of our lack of cheer is the state of OpenBSD's 10G Ethernet support when combined with PF, but there are other aspects as well; we never got OpenBSD disk mirroring to be really useful and eventually gave up on it.

We wound up looking at FreeBSD after another incident with OpenBSD doing weird and unhelpful hardware things, because we're a little tired of the whole area. Our perception (which may not be reality) is that FreeBSD likely has better driver support for modern hardware, including 10G cards, and has gone further on SMP support for networking, hopefully including PF. The last time we looked at this, OpenBSD PF was more or less limited by single-'core' CPU performance, especially when used in bridging mode (which is what our most important firewall uses). We've seen fairly large bandwidth rates through our OpenBSD PF firewalls (in the 800 MBytes/sec range), but never full 10G wire bandwidth, so we've wound up suspecting that our network speed is partly being limited by OpenBSD's performance.

(To get to this good performance we had to buy servers that focused on single-core CPU performance. This created hassles in our environment, since these special single-core performance servers had to be specially reserved for OpenBSD firewalls. And single-core performance isn't going up all that fast.)

FreeBSD has a version of PF that's close enough to OpenBSD's older versions to accept much or all of the syntax of our pf.conf files (we're not exactly up to the minute on our use of PF features and syntax). We also perceive FreeBSD as likely more normal to operate than OpenBSD has been, making it easier to integrate into our environment (although we'd have to actually operate it for a while to see if that was actually the case). If FreeBSD has great 10G performance on our current generation commodity servers, without needing to buy special servers for it, and fixes other issues we have with OpenBSD, that makes it potentially fairly attractive.

(To be clear, I think that OpenBSD is (still) a great operating system if you're interested in what it has to offer for security and so on. But OpenBSD is necessarily opinionated, since it has a specific focus, and we're not really using OpenBSD for that focus. Our firewalls don't run additional services and don't let people log in, and some of them can only be accessed over a special, unrouted 'firewall' subnet.)

Getting maximum 10G Ethernet bandwidth still seems tricky

By: cks
15 September 2024 at 02:51

For reasons outside the scope of this entry, I've recently been trying to see how FreeBSD performs on 10G Ethernet when acting as a router or a bridge (both with and without PF turned on). This pretty much requires at least two more 10G test machines, so that the FreeBSD server can be put between them. When I set up these test machines, I didn't think much about them so I just grabbed two old servers that were handy (well, reasonably handy), stuck a 10G card into each, and set them up. Then I actually started testing their network performance.

I'm used to 1G Ethernet, where long ago it became trivial to achieve full wire bandwidth, even bidirectional full bandwidth (with test programs; there are many things that can cause real programs to not get this). 10G Ethernet does not seem to be like this today; the best I could do was around 950 MBytes a second in one direction (which is not 10G's top speed). With the right circumstances, bidirectional traffic could total just over 1 GByte a second, which is of course nothing like what we'd like to see.

(This isn't a new problem with 10G Ethernet, but I was hoping this had been solved in the past decade or so.)
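To give a concrete sense of the sort of measurement involved, here it is with iperf3 as an example test program (the tool choice is for illustration, and '--bidir' needs a reasonably recent iperf3):

# on one test machine
iperf3 -s

# on the other machine: one stream, several parallel streams, and bidirectional
iperf3 -c testhost -t 30
iperf3 -c testhost -t 30 -P 4
iperf3 -c testhost -t 30 --bidir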

There's a lot of things that could be contributing to this, like the speed of the CPU (and perhaps RAM), the specific 10G hardware I was using (including if it lacked performance increasing features that more expensive hardware would have had), and Linux kernel or driver issues (although this was Ubuntu 24.04, so I would hope that they were sorted out). I'm especially wondering about CPU limitations, because the kernel's CPU usage did seem to be quite high during my tests and, as mentioned, they're old servers with old CPUs (different old CPUs, even, one of which seemed to perform a bit better than the other).

(For the curious, one was a Celeron G530 in a Dell R210 II and the other a Pentium G6950 in a Dell R310, both of which date from before 2016 and are something like four generations back from our latest servers (we've moved on slightly since 2022).)

Mostly this is something I'm going to have to remember about 10G Ethernet in the future. If I'm doing anything involving testing its performance, I'll want to use relatively modern test machines, possibly several of them to create aggregate traffic, and then I'll want to start out by measuring the raw performance those machines can give me under the best circumstances. Someday perhaps 10G Ethernet will be like 1G Ethernet for this, but that's clearly not the case today (in our environment).

What admin access researchers have to their machines here

By: cks
13 September 2024 at 03:31

Recently on the Fediverse, Stephen Checkoway asked what level of access fellow academics had to 'their' computers to do things like install software (via). This is an issue very relevant to where I work, so I put a short-ish answer in the Fediverse thread and now I'm going to elaborate it at more length. Locally (within the research side of the department) we have a hierarchy of machines for this sort of thing.

At the most restricted end are the shared core machines my group operates in our now-unusual environment, such as the mail server, the IMAP server, the main Unix login server, our SLURM cluster and general compute servers, our general purpose web server, and of course the NFS fileservers that sit behind all of this. For obvious reasons, only core staff have any sort of administrative access to these machines. However, since we operate a general Unix environment, people can install whatever they want to in their own space, and they can request that we install standard Ubuntu packages, which we mostly do (there are some sorts of packages that we'll decline to install). We do have some relatively standard Ubuntu features turned off for security reasons, such as "user namespaces", which somewhat limits what people can do without system privileges. Only our core machines live on our networks with public IPs; all other machines have to go on separate private "sandbox" networks.

The second most restricted are researcher owned machines that want to NFS mount filesystems from our NFS fileservers. By policy, these must be run by the researcher's Point of Contact, operated securely, and only the Point of Contact can have root on those machines. Beyond that, researchers can and do ask their Point of Contact to install all sorts of things on their machines (the Point of Contact effectively works for the researcher or the research group). As mentioned, these machines live on "sandbox" networks. Most often they're servers that the researcher has bought with grant funding, and there are some groups that operate more and better servers than we (the core group) do.

Next are non-NFS machines that people put on research group "sandbox" networks (including networks where some machines have NFS access); people do this with both servers and desktops (and sometimes laptops as well). The policies on who has what power over these machines are up to the research group and what they (and their Point of Contact) feel comfortable with. There are some groups where I believe the Point of Contact runs everything on their sandbox network, and other groups where their sandbox network is wide open with all sorts of people running their own machines, both servers and desktops. Usually if a researcher buys servers, the obvious person to have run them is their Point of Contact, unless the research work being done on the servers is such that other people need root access (or it's easier for the Point of Contact to hand the entire server over to a graduate student and have them run it as they need it).

Finally there are generic laptops and desktops, which normally go on our port-isolated 'laptop' network (called the 'red' network after the colour of network cables we use for it, so that it's clearly distinct from other networks). We (the central group) have no involvement in these machines and I believe they're almost always administered by the person who owns or at least uses them, possibly with help from that person's Point of Contact. These days, some number of laptops (and probably even desktops) don't bother with wired networking and use our wireless network instead, where similar 'it's yours' policies apply.

People who want access to their files from their self-managed desktop or laptop aren't left out in the cold, since we have a SMB (CIFS) server. People who use Unix and want their (NFS, central) home directory mounted can use the 'cifs' (aka 'smb3') filesystem to access it through our SMB server, or even use sshfs if they want to. Mounting via cifs or sshfs is in some cases superior to using NFS, because they can give you access to important shared filesystems that we can't NFS export to machines outside our direct control.
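As a rough sketch of the self-managed Unix case (the server names and paths here are invented):

# mount your home directory through our SMB server
sudo mount -t cifs //smb.example.org/homes/yourlogin /mnt/home \
     -o username=yourlogin,uid=$(id -u),gid=$(id -g)

# or reach it over SSH via a login server with sshfs
sshfs yourlogin@login.example.org:/h/yourlogin /mnt/home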

Rate-limiting failed SMTP authentication attempts in Exim 4.95

By: cks
12 September 2024 at 03:01

Much like with SSH servers, if you have a SMTP server exposed to the Internet that supports SMTP authentication, you'll get a whole lot of attackers showing up to do brute force password guessing. It would be nice to slow these attackers down by rate-limiting their attempts. If you're using Exim, as we are, then this is possible to some degree. If you're using Exim 4.95 on Ubuntu 22.04 (instead of a more recent Exim), it's trickier than it looks.

One of Exim's ACLs, the ACL specified by acl_smtp_auth, is consulted just before Exim accepts a SMTP 'AUTH <something>' command. If this ACL winds up returning a 'reject' or a 'defer' result, Exim will defer or reject the AUTH command and the SMTP client will not be able to try authenticating. So obviously you need to put your ratelimit statement in this ACL, but there are two complications. First, this ACL doesn't have access to the login name the client is trying to authenticate (this information is only sent after Exim accepts the 'AUTH <whatever>' command), so all you can ratelimit is the source IP (or a network area derived from it). Second, this ACL happens before you know what the authentication result is, so you don't want to actually update your ratelimit in it, just check what the ratelimit is.

This leads to the basic SMTP AUTH ACL of:

acl_smtp_auth = acl_check_auth
begin acl
acl_check_auth:
  # We'll cover what this is for later
  warn
    set acl_c_auth = true

  deny
    ratelimit = 10 / 10m / per_cmd / readonly / $sender_host_address
    delay = 10s
    message = You are failing too many authentication attempts.
    # you might also want:
    # log_message = ....

  # don't forget this or you will be sad
  # (because no one will be able to authenticate)
  accept

(The 'delay = 10s' usefully slows down our brute force SMTP authentication attackers because they seem to wait for the reply to their SMTP AUTH command rather than giving up and terminating the session after a couple of seconds.)

This ratelimit is read-only because we don't want to update it unless the SMTP authentication fails; otherwise, you will wind up (harshly) rate-limiting legitimate people who repeatedly connect to you, authenticate, perhaps send an email message, and then disconnect. Since we can't update the ratelimit in the SMTP AUTH ACL, we need to somehow recognize when authentication has failed and update the ratelimit in that place.

In Exim 4.97 and later, there's a convenient and direct way to do this through the events system and the 'auth:fail' event that is raised by an Exim server when SMTP authentication fails. As I understand it, the basic trick is that you make the auth:fail event invoke a special ACL, and have the user ACL update the ratelimit. Unfortunately Ubuntu 22.04 has Exim 4.95, so we must be more clever and indirect, and as a result somewhat imperfect in what we're doing.

To increase the ratelimit when SMTP authentication has failed, we add an ACL that is run at the end of the connection and increases the ratelimit if an authentication was attempted but did not succeed, which we detect by the lack of authentication information. Exim has two possible 'end of session' ACL settings, one that is used if the session is ended with a SMTP QUIT command and one that is ended if the SMTP session is just ended without a QUIT.

So our ACL setup to update our ratelimit looks like this:

[...]
acl_smtp_quit = acl_count_failed_auth
acl_smtp_notquit = acl_count_failed_auth

begin acl
[...]

acl_count_failed_auth:
  warn
    condition = ${if bool{$acl_c_auth} }
    !authenticated = *
    ratelimit = 10 / 10m / per_cmd / strict / $sender_host_address

  accept

Our $acl_c_auth SMTP connection ACL variable tells us whether or not the connection attempted to authenticate (sometimes legitimate people simply connect and don't do anything before disconnecting), and then we also require that the connection not be authenticated now to screen out people who succeeded in their SMTP authentication. The settings for the two 'ratelimit =' settings have to match or I believe you'll get weird results.

(The '10 failures in 10 minutes' setting works for us but may not work for you. If you change the 'deny' to 'warn' in acl_check_auth and comment out the 'message =' bit, you can watch your logs to see what rates real people and your attackers actually use.)

The limitation on this is that we're actually increasing the ratelimit based not on the number of (failed) SMTP authentication attempts but on the number of connections that tried but failed SMTP authentication. If an attacker connects and repeatedly tries to do SMTP AUTH in the session, failing each time, we wind up only counting it as a single 'event' for ratelimiting because we only increase the ratelimit (by one) when the session ends. For the brute force SMTP authentication attackers we see, this doesn't seem to be an issue; as far as I can tell, they disconnect their session when they get a SMTP authentication failure.

I should probably reboot BMCs any time they behave oddly

By: cks
9 September 2024 at 03:13

Today on the Fediverse I said:

It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.

(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)

I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.

We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.

Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
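
For the record, the ipmitool commands involved are straightforward; this is only an illustration, with placeholder network arguments and password handling (yours will vary):

# check and control power over the network
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> chassis power status
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> chassis power on
# cold reset the BMC itself, which is what finally worked here
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> mc reset cold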

As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.

(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)

Using rsync to create a limited ability to write remote files

By: cks
5 September 2024 at 02:56

Suppose that you have an isolated high security machine and you want to back up some of its data on another machine, which is also sensitive in its own way and which doesn't really want to have to trust the high security machine very much. Given the source machine's high security, you need to push the data to the backup host instead of pulling it. Because of the limited trust relationship, you don't want to give the source host very much power on the backup host, just in case. And you'd like to do this with standard tools that you understand.

I will cut to the chase: as far as I can tell, the easiest way to do this is to use rsync's daemon mode on the backup host combined with SSH (to authenticate either end and encrypt the traffic in transit). It appears that another option is rrsync, but I only just discovered it, and we already have prior experience with rsync's daemon mode from read-only replication.

Rsync's daemon mode is controlled by a configuration file that can restrict what it allows the client (your isolated high security source host) to do, particularly where the client can write, and can even chroot if you run things as root. So the first ingredient we need is a suitable rsyncd.conf, which will have at least one 'module' that defines parameters:

[backup-host1]
comment = Backup module for host1
# This will normally have restricted
# directory permissions, such as 0700.
path = /backups/host1
hosts allow = <host1 IP>
# Let's assume we're started out as root
use chroot = yes
uid = <something>
gid = <something>

The rsyncd.conf 'hosts allow' module parameter works even over SSH; rsync will correctly pull out the client IP from the environment variables the SSH daemon sets.

The next ingredient is a shell script that forces the use of this rsyncd.conf:

#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .

As with the read-only replication, this script completely ignores command line arguments that the client may try to use. Very cautious people could inspect the client's command line to look for unexpected things, but we don't bother.

Finally you need a SSH keypair and a .ssh/authorized_keys entry on the backup machine for that keypair that forces using your script:

from="<host1 IP>",command="/backups/host1-script",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]

(Since we're already restricting the rsync module by IP, we definitely want to restrict the key usage as well.)

On the high security host, you transfer files to the backup host with:

rsync -a --rsh="/usr/bin/ssh -i /client/identity" yourfile LOGIN@SERVER::backup-host1/

Depending on what you're backing up and how you want to do things, you might want to set the rsyncd.conf module parameters 'write only = true' and perhaps 'refuse options = delete', if you're sure you don't want the high security machine to be able to retrieve its files once it has put them there. On the other hand, if the high security machine is supposed to be able to routinely retrieve its backups (perhaps to check that they're good), you don't want this.

(If the high security machine is only supposed to read back files very rarely, you can set 'write only = true' until it needs to retrieve a file.)
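
As an illustration, the more locked down version of the module would add something like the following parameters (untested as shown; check rsyncd.conf(5) for whether you also want to refuse the various --delete-* options):

[backup-host1]
# ... the same parameters as before, plus:
write only = true
refuse options = delete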

There are various alternative approaches, but this one is relatively easy to set up, especially if you already have a related rsync daemon setup for read-only replication.

(On the one hand it feels annoying that there isn't a better way to do this sort of thing by now. On the other hand, the problems involved are not trivial. You need encryption, authentication of both ends, a confined transfer protocol, and so on. Here, SSH provides the encryption and authentication and rsync provides the confined transfer protocol, at the cost of having to give access to a Unix account and trust rsync's daemon mode code.)

Some reasons why we mostly collect IPMI sensor data locally

By: cks
28 August 2024 at 02:40

Most servers these days support IPMI and can report various sensor readings through it, which you often want to use. In general, you can collect IPMI sensor readings either on the host itself through the host OS or over the network using standard IPMI networking protocols (there are several generations of them). We have almost always collected this information locally (and then fed it into our Prometheus based monitoring system), for an assortment of reasons, some of them general and some of them specific to us.

When we collect IPMI sensor data locally, we export it through the standard Prometheus host agent, which has a feature where you can give it text files of additional metrics (cf). Although there is a 'standard' third party network IPMI metrics exporter, we ended up rolling our own for various reasons (through a Prometheus exporter that can run scripts for us). So we could collect IPMI sensor data either way, but we almost entirely collect the data locally.

(These days it is a standard part of our general Ubuntu customizations to set up sensor data collection from the IPMI if the machine has one.)
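
As a concrete illustration of the textfile approach, a minimal sketch might look something like this (the metric name, the output path, and the field handling are all assumptions for illustration, not our actual script):

#!/bin/sh
# Turn 'ipmitool sensor' output ('|' separated: name, value, unit, ...)
# into Prometheus textfile collector metrics.
OUT=/var/lib/node_exporter/textfile/ipmi.prom
ipmitool sensor 2>/dev/null |
  awk -F'|' '$3 ~ /degrees C/ && $2 !~ /na/ {
      name = $1; gsub(/^ +| +$/, "", name); gsub(/ /, "_", name)
      printf "ipmi_temperature_celsius{sensor=\"%s\"} %s\n", name, $2 + 0
  }' >"$OUT.new" && mv "$OUT.new" "$OUT"

(The output directory is whatever you've pointed the host agent's --collector.textfile.directory at.)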

The generic reasons for not collecting IPMI sensor data over the network are that your server BMCs might not be on the network at all (perhaps they don't have a dedicated BMC network interface), or you've sensibly put them on a secured network and your monitoring system doesn't have access to it. We have two additional reasons for preferring local IPMI sensor data collection.

First, even when our servers have dedicated management network ports, we don't always bother to wire them up; it's often just extra work for relatively little return (and it exposes the BMC to the network, which is not always a good thing). Second, when we collect IPMI sensor data through the host, we automatically start and stop collecting sensor data for the host when we start or stop monitoring the host in general (and we know for sure that the IPMI sensor data really matches that host). We almost never care about IPMI data when either the host isn't otherwise being monitored or the host is off.

Our system for collecting IPMI sensor data over the network actually dates from when this wasn't true, because we once had some (donated) blade servers that periodically mysteriously locked up under some conditions that seemed related to load (so much so that we built a system to automatically power cycle them via IPMI when they got hung). One of the things we were very interested in was if these blade servers were hitting temperature or fan limits when they hung. Since the machines had hung we couldn't collect IPMI information through their host agent; getting it from the IPMI over the network was our only option.

(This history has created a peculiarity, which is that our script for collecting network IPMI sensor data used what was at the time the existing IPMI user that was already set up to remotely power cycle the C6220 blades. So now anything we want to remotely collect IPMI sensor data from has a weird 'reboot' user, which these days doesn't necessarily have enough IPMI privileges to actually reset the machine.)

PS: We currently haven't built a local IPMI sensor data collection system for our OpenBSD machines, although OpenBSD can certainly talk to a local IPMI, so we collect data from a few of those machines over the network.

JSON is usually the least bad option for machine-readable output formats

By: cks
25 August 2024 at 02:28

Over on the Fediverse, I said something:

In re JSON causing problems, I would rather deal with JSON than yet another bespoke 'simpler' format. I have plenty of tools that can deal with JSON in generally straightforward ways and approximately none that work on your specific new simpler format. Awk may let me build a tool, depending on what your format is, and Python definitely will, but I don't want to.

This is re: <Royce Williams Fediverse post>

This is my view as a system administrator, because as a system administrator I deal with a lot of tools that could each have their own distinct output format, each of which I have to parse separately (for example, smartctl's bespoke output, although that output format sort of gets a pass because it was intended for people, not further processing).

JSON is not my ideal output format. But it has the same virtue as gofmt does; as Rob Pike has said, "gofmt's style is no one's favorite, yet gofmt is everyone's favorite" (source, also), because gofmt is universal and settles the arguments. Everything has to have some output format, so having a single one that is broadly used and supported is better than having N of them. And jq shows the benefit of this universality, because if something outputs JSON, jq can do useful things with it.

(In turn, the existence of jq makes JSON much more attractive to system administrators than it otherwise would be. If I had no ready way to process JSON output, I'd be much less happy about it and it would stop being the easy output format to deal with.)
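
As a small illustration, modern versions of smartctl can emit JSON directly, at which point jq makes ad-hoc extraction easy (the exact field names here are from memory and may vary by drive and smartmontools version):

smartctl -j -a /dev/sda | jq -r '.temperature.current'
smartctl -j -a /dev/sda | jq -r '"\(.model_name): \(.serial_number)"'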

I don't have any particular objection to programs that want to output in their own format (perhaps a simpler one). But I want them to give me an option for JSON too, and most of the time I'm going to go with JSON. I've already written enough ad-hoc text processing things in awk, and a few too many heavy duty text parsing things in Python. I don't really want to write another one just for you. If your program does use only a custom output format, I want there to be a really good reason why you did it, not just that you don't like the aesthetics of JSON. As Rob Pike says, no one likes gofmt's style, but we all like that everyone uses it.

(It's my view that JSON's increased verbosity over alternatives isn't a compelling reason unless there's either a really large amount of data or you have to fit into very constrained space, bandwidth, or other things. In most environments, disk space and bandwidth are much cheaper than people's time and the liability of yet another custom tool that has to be maintained.)

PS: All of this is for output formats that are intended to be further processed. JSON is a terrible format for people to read directly, so terrible that my usual reaction to having to view raw JSON is to feed it through 'jq . | less'. But your tool should almost always also have an option for some machine readable format (trust me, someday system administrators will want to process the information your tool generates).

Some brief notes on 'numfmt' from GNU Coreutils

By: cks
21 August 2024 at 03:20

Many years ago I learned about numfmt (also) from GNU Coreutils (see the comments on this entry and then this entry). An additional source of information is PΓ‘draig Brady's numfmt - A number reformatting utility. Today I was faced with a situation where I wanted to compute and print multi-day, cumulative Amanda dump total sizes for filesystems in a readable way, and the range went from under a GByte to several TBytes, so I didn't want to just convert everything to TBytes (or GBytes) and be done with it. I was doing the summing up in awk and briefly considered doing this 'humanization' in awk (again, I've done it before) before I remembered numfmt and decided to give it a try.

The basic pattern for using numfmt here was:

cat <amanda logs> | awk '...' | sort -nr | numfmt --to iec

This printed out '<size> <what ...>', and then numfmt turned the first field into humanized IEC values. As I did here, it's better to sort before numfmt, using the full precision raw number, rather than after numfmt (with 'sort -h'), with its rounded (printed) values.

Although Amanda records dump sizes in KBytes, I had my awk print them out in bytes. It turns out that I could have kept them in KBytes and had numfmt do the conversion, with 'numfmt --from-unit 1024 --to iec'.

(As far as I can tell, the difference between --from-unit and --to-unit is that the former multiplies the number and the latter divides it, which is probably not going to be useful with IEC units. However, I can see it being useful if you wanted to mass-convert times in sub-second units to seconds, or convert seconds to a larger unit such as hours. Unfortunately numfmt currently has no unit options for time, so you can only do pure numeric shifts.)
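
For example, here is the sort of pure numeric shift I mean:

echo 7200 | numfmt --to-unit 3600     # seconds to hours; prints 2
echo 3 | numfmt --from-unit 1024      # KBytes to bytes; prints 3072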

If left to do its own formatting, numfmt has two issues (at least when doing conversions to IEC units). First, it will print some values with one decimal place and others with no decimal place. This will generally give you a result that can be hard to skim because not everything lines up, like this:

 3.3T [...]
 581G [...]
 532G [...]
 [...]
  11G [...]
 9.8G [...]
 [...]
 1.1G [...]
 540M [...]

I prefer all of the numbers to line up, which means explicitly specifying the number of decimal places that everything gets. I tend to use one decimal place for everything, but none ('.0') is a perfectly okay choice. This is done with the --format argument:

 ... | numfmt --format '%.1f' --to iec

The second issue is that in the process of reformatting your numbers, numfmt will by and large remove any nice initial formatting you may have tried to do in your awk. Depending on how much (re)formatting you want to do, you may want another 'awk' step after the numfmt to pretty-print everything, or you can perhaps get away with --format:

... | numfmt --format '%10.1f  ' --to iec

Here I'm specifying a field width for enough white space and also putting some spaces after the number.

Even with the need to fiddle around with formatting afterward, using numfmt was very much the easiest and fastest way to humanize numbers in this script. Now that I've gone through this initial experience with numfmt, I'll probably use it more in the future.

Workarounds are often forever (unless you work to make them otherwise)

By: cks
16 August 2024 at 02:37

Back in 2018, ZFS on Linux had a bug that could panic the system if you NFS-exported ZFS snapshots. We were setting up ZFS based NFS fileservers and we knew about this bug, so at the time we set things so that only filesystems themselves were NFS exported and available on our servers. Any ZFS snapshots on filesystems were only visible if you directly logged in to the fileservers, which was (and is) something that only core system staff could do. This is somewhat inconvenient; we have to get involved any time people want to get stuff back from snapshots.

It is now 2024. ZFS on Linux became OpenZFS (in 2020) and has long since fixed that issue and released versions with the fix. If I'm retracing Git logs correctly, the fix was in 0.8.0, so it was included (among many others) in Ubuntu 22.04's ZFS 2.1.5 (what our fileservers are currently running) and Ubuntu 24.04's ZFS 2.2.2 (what our new fileservers will run).

When we upgraded the fileservers from 18.04 to 22.04, did we go back to change our special system for generating NFS export entries to allow NFS clients to access ZFS snapshots? You already know the answer to that. We did not, because we had completely forgotten about it. Nor did we go back to do it as we were preparing the 24.04 setup of our ZFS fileservers. It was only today that it came up, as we were dealing with restoring a file from those ZFS snapshots. Since it's come up, we're probably going to test the change and then do it for our future 24.04 fileservers, since it will make things a bit more convenient for some people.

(The good news is that I left comments to myself in one program about why we weren't using the relevant NFS export option, so I could tell for sure that it was this long since fixed bug that had caused us to leave it out.)

It's a trite observation that there's nothing so permanent as a temporary solution, but just because it's trite doesn't mean that it's wrong. A temporary workaround that code comments say we thought we might revert later in the life of our 18.04 fileservers has lasted about six years, despite being unnecessary since no later than when our fileservers moved to Ubuntu 22.04 (admittedly, this wasn't all that long ago).

One moral I take from this is that if I want us to ever remove a 'temporary' workaround, I need to somehow explicitly schedule us reconsidering the workaround. If we don't explicitly schedule things, we probably won't remember (unless it's something sufficiently painful that it keeps poking us until we can get rid of it). The purpose of the schedule isn't necessarily to make us do the thing, it's to remind us that the thing exists and maybe it shouldn't.

(As a corollary, the schedule entry should include pointers to a lot of detail, because when it goes off in a year or two we won't really remember what it's talking about. That's why we have to schedule a reminder.)

Traceroute, firewalls, and the modern Internet: a horrible realization

By: cks
15 August 2024 at 03:11

The venerable traceroute command sort of reports the hops your packets take to reach a host, and in the process can reveal where your packets are getting dropped or diverted. The traditional default way that traceroute works is by sending UDP packets to a series of high UDP ports with increasing IP TTLs, and seeing where each reply comes from. If the TTL runs out on the way, traceroute gets one reply; if the packet reaches the host, traceroute gets another one (assuming that nothing is listening on the particular UDP port on the host, which usually it isn't). Most versions of traceroute can also use ICMP based probes, while some of them can also use TCP based ones.

While writing my entry on using traceroute with a fixed target port, I had a horrible realization: traceroute's UDP probes mostly won't make it through firewalls. Traceroute's UDP probes are made to a series of high UDP ports (often starting at port 33434 and counting up). Most firewalls are set to block unsolicited incoming UDP traffic by default; you normally specifically configure them to pass only some UDP traffic through to limited ports (such as port 53 for DNS queries to your DNS servers). When traceroute's UDP packets, sent to effectively random high ports, arrive at such a firewall, the firewall will discard or reject them and your traceroute will go no further.

(If you're extremely confident no one will ever run something that listens on the UDP port range, you can make your firewall friendly to traceroute by allowing through UDP ports 33434 to 33498 or so. But I wouldn't want to take that risk.)

The best way around this is probably to use ICMP for traceroute (using a fixed UDP port is more variable and not always possible). Most Unix traceroute implementations support '-I' to do this.

This matters in two situations. First, if you're asking outside people to run traceroutes to your machines and send you the results, and you have a firewall; without having them use ICMP, their traceroutes will all look like they fail to reach your machines (although you may be able to tell whether or not their packets reach your firewall). Second, if you're running traceroute against some outside machine that is (probably) behind a firewall, especially if the firewall isn't directly in front of it. In that case, your traceroute will always stop at or just before the firewall.

A note to myself about using traceroute to check for port reachability

By: cks
15 August 2024 at 03:08

Once upon a time, the Internet was a simple place; if you could ping some remote IP, you could probably reach it with anything. The Internet is no longer such a simple place, or rather I should say that various people's networks no longer are. These days there are a profusion of firewalls, IDS/IDR/IPS systems, and so on out there in the world, and some of them may decide to block access only to specific ports (and only some of the time). In this much more complicated world, you can want to check not just whether a machine is responding to pings, but if a machine responds to a specific port and if it doesn't, where your traffic stops.

The general question of 'where does your traffic stop' is mostly answered by the venerable traceroute. If you think there's some sort of general block, you traceroute to the target and then blame whatever is just beyond the last reported hop (assuming that you can traceroute to another IP at the same destination to determine this). I knew that traceroute normally works by sending UDP packets to 'random' ports (with manipulated (IP) TTLs, and the ports are not actually picked randomly) and then looking at what comes back, and I superstitiously remembered that you could fix the target port with the '-p' argument. This is, it turns out, not actually correct (and these days that matters).

There are several common versions of (Unix) traceroute out there; Linux, FreeBSD, and OpenBSD all use somewhat different versions. In all of them, what '-p port' actually does by itself is set the starting port, which is then incremented by one for each additional hop. So if you do 'traceroute -p 53 target', only the first hop will be probed with a UDP packet to port 53.

In Linux traceroute, you get a fixed UDP port by using the additional argument '-U'; -U by itself defaults to using port 53. Linux traceroute can also do TCP traceroutes with -T, and when you do TCP traceroutes the port is always fixed.

In OpenBSD traceroute, as far as I can see you just can't get a fixed UDP port. OpenBSD traceroute also doesn't do TCP traceroutes. On today's Internet, this is actually a potentially significant limitation, so I suspect that you most often want to try ICMP probes ('traceroute -I').

In FreeBSD traceroute, you get a fixed UDP port by turning on 'firewall evasion mode' with the '-e' argument. FreeBSD traceroute sort of supports a TCP traceroute with '-P tcp', but as the manual page says you need to see the BUGS section; it's going to be most useful if you believe your packets are getting filtered well before their destination. Using the TCP mode doesn't automatically turn on fixed port numbers, so in practice you probably want to use, for example, 'traceroute -P tcp -e -p 22 <host>' (with the port number depending on what you care about).
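
To summarize the variations as a cheat sheet (the port numbers are just examples):

# Linux: fixed UDP port, or TCP probes (TCP always uses a fixed port)
traceroute -U -p 53 <host>
traceroute -T -p 22 <host>
# FreeBSD: 'firewall evasion mode' fixes the UDP port; TCP needs -P tcp
traceroute -e -p 53 <host>
traceroute -P tcp -e -p 22 <host>
# OpenBSD: no fixed UDP port and no TCP mode, so fall back to ICMP
traceroute -I <host>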

Having written all of this down, hopefully I will remember it for the next time it comes up (or I can look it up here, to save me reading through manual pages).

Some thoughts on OpenSSH 9.8's PerSourcePenalties feature

By: cks
14 August 2024 at 03:06

One of the features added in OpenSSH 9.8 is a new SSH server security feature to slow down certain sorts of attacks. To quote the release notes:

[T]he server will now block client addresses that repeatedly fail authentication, repeatedly connect without ever completing authentication or that crash the server. [...]

This is the PerSourcePenalties configuration setting and its defaults, and also see PerSourcePenaltyExemptList and PerSourceNetBlockSize. OpenSSH 9.8 isn't yet in anything we can use at work, but it will be in the next OpenBSD release (and then I'll get it on Fedora).
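
We haven't run OpenSSH 9.8 yet, but based on my reading of the sshd_config(5) manual page, tuning this would look something like the following sketch (the non-default durations and the exempt netblock are made up for illustration):

# per-cause penalty durations, plus the minimum accumulation before blocking
PerSourcePenalties authfail:5s noauth:1s min:15s max:10m
# never penalize our own monitoring and management networks
PerSourcePenaltyExemptList 192.0.2.0/24
# aggregate penalties per /32 for IPv4 and /128 for IPv6 (the defaults)
PerSourceNetBlockSize 32:128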

On the one hand, this new option is exciting to me because for the first time it lets us block only rapidly repeating SSH sources that fail to authenticate, as opposed to rapidly repeating SSH sources that are successfully logging in to do a whole succession of tiny little commands. Right now our perimeter firewall is blind to whether a brief SSH connection was successful or not, so all it can do is block on total volume, and this means we need to be conservative in its settings. This is a single machine block (instead of the global block our perimeter firewall can do), but a lot of SSH attackers do seem to target single machines with their attacks (for a single external source IP, at least).

(It's also going to be a standard OpenSSH feature that won't require any configuration, firewall or otherwise, and will slow down rapid attackers.)

On the other hand, this is potentially an issue for anything that makes health checks like 'is this machine responding with a SSH banner' (used in our Prometheus setup) or 'does this machine have the SSH host key we expect' (used in our NFS mount authentication system). Both of these cases will stop before authentication and so fall into the 'noauth' category of PerSourcePenalties. The good news is that the default refusal duration for this penalty is only one second, which is short enough that you're probably not going to run into it in health checks. The exception is if you're trying to verify multiple types of SSH host keys for a server, because you can only verify one host key in a given connection, so if you need to verify both a RSA host key and an Ed25519 host key, you need two connections.

(Even then, the OpenSSH 9.8 default is that you only get blocked once you've built up 15 seconds of penalties. At the default settings, this would be hard with even repeated host key checks, unless the server has multiple IPs and you're checking all of them.)

It's going to be interesting to read practical experience reports with this feature as OpenSSH 9.8 rolls out to more and more people. And on that note I'm certainly going to wait for people's reports before doing things like increasing the 'authfail' penalty duration, as tempting as it is right now (especially since it's not clear from the current documentation how unenforced penalty times accumulate).

Uncertainties and issues in using IPMI temperature data

By: cks
13 August 2024 at 03:24

In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views about how useful it is in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).

The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.

(Out of the 85 of our servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its inlet ambient is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)

The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.

(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)

We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.

(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
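
Concretely, the two ways of getting temperature readings look roughly like this (the network version's host and password handling are placeholders; the 'reboot' user is the historical quirk mentioned above):

# locally, through the host OS (needs the IPMI kernel drivers loaded)
ipmitool sdr type Temperature
# over the network, directly from the BMC
ipmitool -I lanplus -H <bmc-host> -U reboot -f <password-file> sdr type Temperature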

Allocating disk space (and all resources) is ultimately a political decision

By: cks
11 August 2024 at 02:52

In a multi-person or multi-group environment with shared resources, like a common set of fileservers, you often need to allocate resources like disk space between different uses. There are many different technical ways to do this, and you can also often avoid explicitly allocating at all by just shoving everyone into one big pile. Sometimes, you might be tempted to debate the technical merits of any particular approach, and while the technical merits of different ways potentially matter, in the end resource allocation is a political decision (although what is technically possible or feasible does limit the political options).

(Note that not specifically allocating resources is also a political decision; it is the decision to let resources like disk space be allocated on a first come, first served basis instead of anything else.)

In general, "political" is not a bad word. Politics, in the large, is about mediating social disagreements and, hopefully, making people feel okay about the results. Allocating limited resources is an area where there is no perfect answer and any answer that you choose will have unsatisfactory aspects. Weighing those tradeoffs and choosing a set of them is a (hard) social problem, which must be dealt with through a political decision.

Because resource allocation is a political decision, the specific decisions reached in your organization may well constrain your technical choices and, for example, complicate a storage migration (because you've chosen to allocate disk space in a specific way). Over the course of my career, I've come to understand that this isn't bad as such; it's just that social problems are more important and higher level than technical ones. It's more important to solve the social problems than it is to have an ideal technical world, because ultimately the technology exists to help the people.

One aspect of constraining your technical choices is that you may wind up not doing perfectly sensible and useful technical things because they go against the political decisions and goals around resource allocation. These decisions aren't irrational or wrong, exactly, although they can be hard to explain without explaining the political background.

(This doesn't mean that every design or operations decision that affects resource allocation has to be made at the political level in your organization, and in fact they generally can't be; you have to make some of them, even if it's to not specifically allocate resources and let them be used on a first come, first serve basis (or an 'everyone gets whatever portion they can right now'). But even if you make the decision and do so based on technical factors, it's best to remember that you're making a decision with political effects, and perhaps to think about who will be affected and how.)

PS: This aspect of why things work as they do being hard to explain isn't confined to technology; there are aspects of how the recreational bike club I'm part of operates that people have sometimes asked me about (sometimes in the form of 'why doesn't the club do <sensible seeming thing X>') and I've found hard to explain, especially concisely. Part of the answer is that the club has made a social ('political') decision to operate in a certain way.

A surprise with the temperature distribution in our machine room

By: cks
4 August 2024 at 02:37

Our primary machine room is quite old and is set up in an old fashioned way, so that we don't really have separate 'hot aisles' and 'cold aisles'; the closest we come is one aisle where both sides are the front of servers. We have some long standing temperature monitoring in this machine room, and recently (for reasons outside the scope of this entry) we put a second (trustworthy) temperature monitoring unit into the room. The first temperature sensor is relatively near the room's AC unit, while the second unit is about as far away from it as you can get (by our rack of fileservers, not entirely coincidentally).

Before we set up the second temperature unit and started to get readings from it, I would have confidently predicted that it would report a higher temperature than the first unit, given that it was all the way diagonally across the room from the AC unit, and that row of racks sort of backs on to one of the room's walls (with space left for access and air circulation). Instead, it consistently reads lower than the first unit; how much lower depends on where the room is in the AC's cycle, because the second unit sees lower temperature swings than the first one.

(At their farthest apart, the two readings can be over 2 degrees Celsius different; at their closest, they can be only 0.2 C apart. Generally they're closest when the AC is on and the room temperature is at its coolest, and furthest apart when the room is at its warmest and the AC is about to come up for another cycle. Our temperature graphs also suggest that the cold air from the AC being on takes a bit longer to reach the far unit than the near unit.)

Temperature sensors can be fickle things, but this is an industrial unit with a good reputation (and an external sensor on a wire), so I believe the absolute numbers shown by its readings. So one of the lessons I take from this is that I can't predict the temperature distributions of our machine room (or more generally, any of our machine rooms and wiring closets). If we ever need to know where the hot and cold spots are, we can't guess based on factors like the distance from the AC units; we'll need to actively measure with something appropriate.

(I'm not sure what we'd use for relatively rapid temperature readings of the local ambient air temperature, but there are probably things that can be used for this.)

On not automatically reconnecting to IPMI Serial-over-LAN consoles

By: cks
31 July 2024 at 02:25

One of the things that the IPMI (network) protocol supports is Serial over LAN, which can be used to expose a server's serial console over your BMC's management network. These days, servers are starting to drop physical serial ports, making IPMI SOL your only way of getting console serial ports. The conserver serial console management software supports IPMI SOL (if built with the appropriate libraries), and you can directly access SOL serial consoles with IPMI programs. However, as I mentioned in passing in yesterday's entry, IPMI SOL access has a potential problem, which is that only one SOL connection is allowed at a time and if someone makes a new SOL connection, any old one is automatically disconnected. This disconnection is invisible to the IPMI SOL client until (and unless) it attempts to send something to the SOL console, at which point it apparently gets a timeout. This is bad for a program like conserver, which in many situations will only read SOL console output in order to log it, not send any input to the SOL console.

(This BMC behavior may not be universal, based on some comments in FreeIPMI.)

Conserver uses FreeIPMI for IPMI SOL access, which supports a special 'serial keepalive' option (which you can configure in libipmiconsole.conf) to detect and remedy this. As covered in comments in ipmiconsole.h, this option (normally) works by periodically sending a NUL character to the SOL console, which will make the BMC eventually tell you that the serial connection has been broken, at which point you re-create your IPMI SOL session so that you get serial output again.

When I first read about this option I was enthused about putting it into our configuration, so that conserver would automatically re-establish stolen SOL connections. Then I thought about it a bit more and decided that this probably wasn't a good idea. The problem is that there's no way to tell if another IPMI SOL session is active at the moment or not (at least with this option); all we can do is unconditionally take the SOL console back. If one of us has made a SOL connection, done some stuff, and disconnected again, this is fine. If one of us is in the process of using a live SOL connection right now, this is bad.

This is especially so because about the only time when we'd resort to using a direct IPMI SOL connection instead of logging in to the console server and using conserver is when either we can't get to the console server or the console server can't get to the BMC of the machine we want to connect to. These are stressful situations when something is already wrong, so the last thing we want is to compound our problems by having a serial console connection stolen in the middle of our work.

Not configuring FreeIPMI with serial keepalives doesn't completely eliminate this problem; it could still happen if the console server machine is (re)booted or conserver is restarted. Both of these will cause conserver to start up, make a bunch of IPMI SOL connections, and steal any current by-hand SOL connections away from us. But at least it's less likely.

Handling (or not) the serial console of our serial console server

By: cks
30 July 2024 at 02:29

We've had a central serial console server for a long time. It has two purposes; it logs all of the (serial) console output from servers and various other pieces of hardware (which on Linux machines includes things like kernel messages, cf), and it allows us to log in to machines over their serial console. For a long time this server was a hand built one-off machine, but recently we've been rebuilding it on our standard Ubuntu framework (much like our central syslog server). Our standard Ubuntu framework includes setting up a (kernel) serial console, which made me ask myself what we were going to do with the console server's serial console.

We have a matrix of options. We can direct the serial console to either a physical serial port or to the BMC's Serial over LAN system. Once the serial console is somewhere, we can ignore it except when we want to manually use it, connect it to the console server's regular conserver instance, or connect it to a new conserver instance on some other machine (which would have to be using either IPMI Serial-over-LAN or a USB serial port, depending on which serial console we pick).

Connecting the console server's serial console to its own conserver instance would let us log routine serial console output in the same place that we put all of the other serial console output. However, it wouldn't let us capture kernel logs if the machine crashed for some reason (one valuable thing that our current serial console setup gives us), or log in through the serial console if the console server fell off the network. Setting up a backup, single-host conserver on another machine would allow us to do both, at the cost of having a second conserver machine to think about.

Using Serial-over-LAN would allow us to log in to the console server over its serial console from any other machine that had access to what has become our IPMI/BMC network, which is a number of them (it's that way for emergency access purposes). However it requires that the BMC network be up, which is to say that all of the relevant switches are working. A direct (USB) serial connection would only require the other machine to be up and reachable.

Of course we can split the difference. We could have the Linux kernel serial console on the physical serial port and also have logins enabled on the Serial-over-LAN serial port. In a lot of situations this would still give us remote access to the console server, although we wouldn't be able to trigger things like Magic SysRq over the SoL connection since it's not a kernel console.

(Unfortunately you can only have one kernel serial console.)
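
A sketch of this 'split the difference' setup, assuming the physical serial port is ttyS0 and the Serial-over-LAN port shows up as ttyS1 (which varies from server to server):

# kernel command line: kernel console messages go to the physical port
#   console=tty0 console=ttyS0,115200n8
# plus a login getty on the Serial-over-LAN port
systemctl enable --now serial-getty@ttyS1.service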

My current view is that the easiest thing to start with is to set the serial console to the Serial-over-LAN port and then not have anything collecting kernel messages from it. If we decide we want to change that, we can harvest SoL serial console messages from either the console server itself or from another machine. In an emergency, a SoL port can be accessed from any machine with BMC network access, not just from its conserver machine, unlike a physical serial port (which would have to be accessed from the other machine connected to it).

(In our current conserver setup, you don't really want to access the SoL port from another machine if you can avoid it. Doing so will quietly break the connection from conserver on the console server until you restart conserver. It's possible we could work around this with libipmiconsole.conf settings.)

Our slowly growing Unix monoculture

By: cks
29 July 2024 at 02:53

Once upon a time, we ran Ubuntu Linux machines, OpenBSD machines, x86 Solaris machines, and machines running what was then RHEL (in the days of our first generation ZFS fileservers). Over time, Solaris changed to OmniOS (and RHEL to CentOS), but even at the time it was clear that both of those hadn't caught on here and after a while we replaced the OmniOS fileservers and CentOS iSCSI backends with our third generation Ubuntu-based fileservers. Then recently, the final pieces of CentOS have been getting removed, such as our central syslog servers because CentOS as it originally was is dead (the current 'CentOS Stream' doesn't meet our needs).

Our OpenBSD usage has also been dwindling. Originally we used OpenBSD for firewalls, most DNS service, a DHCP server, and several VPN servers (for different VPN protocols). Our internal DNS resolvers now run Bind on Ubuntu and we've been expecting to some day have to move our VPN servers away from OpenBSD in order to get more up to date versions of the various VPNs (although this hasn't happened yet). The OpenBSD DHCP server is fine so far, but we have three DHCP servers and two of them are Ubuntu machines, so I wouldn't be surprised if we switch the third to Ubuntu as well when we next rebuild it.

(There's basically no prospect of us switching away from OpenBSD on the firewalls, but the firewalls are effectively appliances.)

It's probably been plural decades since our users logged in to anything other than x86 Ubuntu machines, and at least a decade since any of them were 32-bit x86 instead of 64-bit x86. It seems unlikely that we'll get ARM-based machines, especially ones that we expose to people to log in to and use. I expect we'll have to switch away from Ubuntu someday, but that will be a switch, not a long term plan of running Ubuntu as well as something else, and the most likely candidate (Debian) won't look particularly different to most people.

The old multi-Unix, multi-architecture days had their significant drawbacks, but sometimes I wonder what we're losing by increasingly becoming a monoculture that runs Ubuntu Linux and (almost) nothing else. I feel that as system administrators, there's something we gain by having exposure to different Unixes that make different choices and have different tools than Ubuntu Linux. To put it one way, I think we get a wider perspective and wind up with more ideas and approaches in our mental toolkit. We have that today because of our history, so hopefully it won't atrophy too badly when we really narrow down to being a monoculture.

How I almost set up a recursive syslog server

By: cks
26 July 2024 at 02:48

Over on the Fediverse, I mentioned an experience I had today:

Today I experienced that when you tell a syslog server to forward syslog to another server, it forwards everything. Including anything it was sent by other servers. And to confuse you, those forwarded messages will often be logged with the original host names, so you can wonder what these weird servers are that are sending you unexpected traffic.

At least I caught this before we had the central syslog server forward to itself. That probably would have been funβ„’.

You might wonder how on earth you do this to yourself without noticing, and the answer is the (dangerous) power of standardized installs.

We've had a central syslog server for a long time, along with another syslog server that we run for machines run by Points of Contact that are on internal sandbox networks. For much of this time, these syslog servers have been completely custom-installed machines; for example, they ran RHEL and then CentOS when we'd switched to Ubuntu for the rest of our machines. The current hardware and OS setup on these machines has been aging, so we've been working on replacing them. This time around, rather than doing a custom install, we decided to make these machines one variant of our standard Ubuntu install, supplemented by a small per-machine customization process. There are some potential downsides to this, since the machines have somewhat less security isolation, but we felt the advantages were worth it (for example, now they'll be part of our standard update system).

Part of our standard Ubuntu install configures the installed machine's syslog daemon to forward a copy of all syslog messages to our central syslog server; specifically this is part of the standard scripts that are run on a machine to give it our general baseline setup. This is standard and so basically invisible, so I didn't think of this syslog forwarding when putting together the post-install customization instructions for these syslog servers. Fortunately, the first syslog server we rebuilt and put into production was the additional syslog server for other people's logs, not the central server for our own logs. It was fortunate that today I had a reason to look at one set of logs on our central syslog server that had low enough log volume that I could spot out of place entries immediately, and then start trying to track them down.
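
With rsyslog (what Ubuntu uses), the difference is roughly the following hypothetical sketch (the central server name is made up, and this isn't our actual configuration):

# what our standard install does: forward everything, including anything
# this machine received in its role as a syslog server
*.*  @@central-syslog.example.org
# what a syslog server that also forwards should probably do instead:
# only forward locally generated messages (rsyslog sets $fromhost-ip to
# 127.0.0.1 for messages from the local socket)
if $fromhost-ip == '127.0.0.1' then @@central-syslog.example.org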

This sort of thing is fairly closely related to the general large environment issue where you have recursive dependencies or recursive relationships between services, often without realizing it. You can even get direct self-dependencies, for example if you don't remember to change your DHCP server away from getting its network configuration by DHCP, although in that sort of case you're probably going to notice the first time you reboot the machine in production (assuming you don't have redundant DHCP servers; if you do, you might not find this out until you're cold-starting your entire environment).

(Some self-usage is harmless and even a good thing. For example, you probably want your internal DNS resolvers to do any necessary DNS lookups through themselves, instead of trying to find some other DNS resolver for them.)

Our giant login server: solving resource problems with brute force

By: cks
22 July 2024 at 03:05

One of the moderately peculiar aspects of our environment is that we still have general Unix multiuser systems that people with accounts can log in to and do stuff on. As part of this we have some general purpose login servers, and in particular we have one that's always been the most popular, partly because it was what you got when you did 'ssh cs.toronto.edu'. For years and years we had a succession of load and usage issues on this server, where someone would log in and start doing something that was CPU or memory intensive, hammering the machine for everyone on it (which was generally a lot of people, and so this could be pretty visible). We spent a non-trivial amount of time keeping an eye on the machine's load, sending email to people, terminating people's heavy-duty processes, and in a few cases having to block logins from specific people until they paid attention to their email.

Then a few years ago we had a chunk of spare money and decided to spend it on getting rid of the problem once and for all. We did this by buying a ridiculously overpowered server to become the new version of our primary login server, with 512 GB of RAM and 112 CPUs (AMD Epyc 7453s); in fact we bought two at once and put the other one into our SLURM cluster, where it was at the time one of the most powerful compute machines there (back in 2022).

By itself this wouldn't be sufficient to protect us from having to care about what people were doing on the machine, because (some) modern software can eat any amount of CPUs and RAM that's available (due to things like auto-sizing how many things it does in parallel based on the available CPU count). So we set up per-user CPU and memory resource limits for all users. Because this server is so big, we can actually give people quite large limits; our current settings are 30 GBytes of RAM and 8 CPUs, which is effectively a reasonable desktop machine (we figure people can't really complain at that point).

(In completely unsurprising news, people do manage to run into the memory limit from time to time and have their giant processes killed.)
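
One way to implement this sort of per-user limit on a modern Ubuntu machine is a systemd drop-in that applies to every user slice; this is only a sketch of the general idea (with the numbers from above), not necessarily how we actually do it:

# /etc/systemd/system/user-.slice.d/90-limits.conf
[Slice]
MemoryMax=30G
# 800% is eight CPUs worth of CPU time
CPUQuota=800%

(You need a 'systemctl daemon-reload' afterward, and the limits apply per user, not per login session.)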

These limits don't completely guarantee avoiding problems, since enough different people doing enough at once could still overload the machine. But this hasn't happened yet, so in practice we've been able to basically stop caring about what people run on our primary login server, and with it we've stopped watching things like its load average and free memory. For people using our primary login server, the benefit is that they can do a lot more than they could before without problems and they don't get affected by what other people are doing.

My home wireless network and convenience versus security

By: cks
21 July 2024 at 02:46

The (more) secure way to do a home wireless network (or networks) is relatively clear. Your wireless network (or networks) should exist on its own network segment, generally cut off from any wired networking you have and definitely cut off from direct access to your means of Internet connectivity. To get out of the network it should always have to go through a secure gateway that firewalls your home infrastructure from the random wireless devices you have to give wifi access to and their random traffic. One of the things that this implies is that you should implement your wireless with a dedicated wireless access point, not with the wifi capabilities of some all in one device.

When I set up my wireless network, I didn't do it this way, and I've kept not doing it this way ever since. My internet connection uses VDSL and when I upgraded to VDSL you couldn't get things that were just a 'VDSL modem'; the best you could do was an all in one router that could have the router bit turned off. My VDSL 'modem' also could be a wifi AP, so when I wanted a wireless network all of a sudden I just turned that on and then set up my home desktop to be a DHCP server, NAT gateway, and so on. This put wifi clients on the same network segment as the VDSL modem, and in fact I lazily used the same subnet rather than running two subnets over the same physical network segment.

(Because all Internet access runs through my desktop, there's always been some security there. I only NAT'd specific IPs that I'd configured, not anything that happened to randomly show up on the network.)

Every so often since then I've thought about changing this situation. I could get a dedicated wifi AP (and it might well have better performance and reach more areas than the current VDSL modem AP does; the VDSL modem doesn't even have an external wifi antenna), and add another network interface to my desktop to segment wifi traffic to the new wifi AP network. It would get its own subnet and client devices wouldn't be able to talk directly to the VDSL modem or potentially snoop (PPPoE) traffic between my desktop and the VDSL modem.

However, much as with other tradeoffs of security versus convenience, in practice I've come down on the side of convenience. Even though it's a bit messy and not as secure as it could be, my current setup works well enough and hasn't caused problems. By sticking with the current situation, I avoid the annoyance of trying to find and buy a decent wifi AP, reorganizing things physically, changing various system configurations, and so on.

(This also avoids adding another little device I'd want to keep powered from my UPS during a power outage. I'm always going to power the VDSL modem, and I'd want to power the wifi AP too because otherwise things like my phone stop being able to use my local Internet connection and have to fall back to potentially congested or unavailable cellular signal.)

SSH has become our universal (Unix) external access protocol

By: cks
18 July 2024 at 02:59

When I noted that brute force attackers seem to go away rapidly if you block them, one reaction was to suggest that SSH shouldn't be exposed to the Internet. While this is viable in some places and arguably broadly sensible (since SSH has a large attack surface, as we've seen recently in CVE-2024-6387), it's not possible for us. Here at a university, SSH has become our universal external access protocol.

One of the peculiarities of universities is that people travel widely, and during that travel they need access to our systems so they can continue working. In general there are a lot of ways to give people external access to things; you can set up VPN servers, you can arrange WireGuard peer to peer connections, and so on. Unfortunately, two issues often surface: our people have widely assorted devices that they want to work from, with widely varying capabilities and ease of using VPN and VPN like things, and their (remote) network environments may or may not like any particular VPN protocol (and they probably don't want to route their entire Internet traffic the long way around through us).

The biggest advantage of SSH is that pretty much everything can do SSH, especially because it's already a requirement for working with our Unix systems when you're on campus and connecting from within the department's networks; this is not necessarily so true of the zoo of different VPN options out there. Because SSH is so pervasive, it's also become a lowest common denominator remote access protocol, one that almost everyone allows people to use to talk to other places. There are a few places where you can't use SSH, but most of them are going to block VPNs too.

In most organizations, even if you use SSH (and IMAP, our other universal external access protocol), you're probably operating with a lot less travel and external access in general, and hopefully a rather more controlled set of client setups. In such an environment you can centralize on a single VPN that works on all of your supported client setups (and meets your security requirements), and then tell everyone that if they need to SSH to something, first they bring up their VPN connection. There's no need to expose SSH to the world, or even let the world know about the existence of specific servers.

(And in a personal environment, the answer today is probably WireGuard, since there are WireGuard clients on most modern things and it's simple enough to only expose SSH on your machines over WireGuard. WireGuard has less exposed attack surface and doesn't suffer from the sort of brute force attacks that SSH does.)

Brute force attackers seem to switch targets rapidly if you block them

By: cks
12 July 2024 at 02:27

Like everyone else, we have a constant stream of attackers trying brute force password guessing against us using SSH or authenticated SMTP, from a variety of source IPs. Some of the source IPs attack us at a low rate (although there can be bursts when a lot of them are trying), but some of them do so at a relatively high rate, high enough to be annoying. When I notice such IPs (ones making hundreds of attempts an hour, for example), I tend to put them in our firewall blocks. After recently starting to pay attention to what happens next, what I've discovered is that at least currently, most such high volume IPs give up almost immediately. Within a few minutes of being blocked their activity typically drops to nothing.

Once I thought about it, this behavior feels like an obvious thing for attackers to do. Attackers clearly have a roster of hosts they've obtained access to and a whole collection of target machines to try brute force attacks against, with very low expectations of success for any particular attack or target machine; to make up for the low success rate, they need to do as much as possible. Wasting resources on unresponsive machines cuts down the number of useful attacks they can make, so over time attackers have likely had a lot of motivation to move on rapidly when their target stops responding. If the target machine comes back some day, well, they have a big list, they'll get around to trying it again sometime.

The useful thing about this attacker behavior is that if attackers are going to entirely stop using an IP to attack you (at least for a reasonable amount of time) within a few minutes of it being blocked, you only need to block attacker IPs for those few minutes. After five or ten or twenty minutes, you can remove the IP block again. Since the attackers use a lot of IPs and their IPs may get reused later for innocent purposes, this is useful for keeping the size of firewall blocks down and limiting the potential impact of a mis-block.

(A traditional problem with putting IPs in your firewall blocks is that often you don't have a procedure to re-assess them periodically and remove them again. So once you block an IP, it can remain blocked for years, even after it gets turned over to someone completely different. This is especially the case with cloud provider IPs, which are both commonly used for attacks and then commonly turn over. Fast and essentially automated expiry helps a lot here.)
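To illustrate the sort of automatic expiry I mean, here's a minimal sketch using nftables timed sets (this isn't what our firewalls actually run; the table name, set name, and IP are made up for the example):

table inet filter {
  set brute_force {
    type ipv4_addr
    flags timeout
  }
  chain input {
    type filter hook input priority 0; policy accept;
    # drop SSH and authenticated SMTP traffic from currently blocked IPs
    ip saddr @brute_force tcp dport { 22, 587 } drop
  }
}

Blocking an attacking IP for twenty minutes is then a single command, and the entry quietly disappears on its own afterward:

nft add element inet filter brute_force '{ 192.0.2.10 timeout 20m }'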

"Out of band" network management is not trivial

By: cks
7 July 2024 at 02:24

One of the recent Canadian news items is that a summary of the official report on the 2022 Rogers Internet and phone outage has been released (see also the CBC summary of the summary, and the Wikipedia page on the outage). This was an extremely major outage that took down both Internet and phone service for a lot of people for roughly a day and caused a series of failures in services and systems that turned out to rely on Rogers for (enough of) their phone and Internet connectivity. In the wake of the report, some people are (correctly) pointing to Rogers not having any "Out of Band" network management capability as one of the major contributing factors. Some people have gone so far as to suggest that out of band network management is an obvious thing that everyone should have. As it happens I have some opinions on this, and the capsule summary is that out of band network management is non-trivial.

(While the outage 'only' cut off an estimated 12 million people, the total population of Canada is about 40 million people, so it directly affected more than one in four Canadians.)

Obviously, doing out of band network management means that you need a dedicated set of physical hardware for your OOB network; separate switches, routers, local network cabling, and long distance fiber runs between locations (whether that is nearby university buildings or different cities). If you're serious, you probably want your OOB fiber runs to have different physical paths than your regular network fiber, so one backhoe accident can't cut both of them. This separate network infrastructure has to run to everything you want to manage and also to everywhere you want to manage your network from. This is potentially a lot of physical hardware and networking, and as they say it can get worse.

(This out of band network also absolutely has to be secure, because it's a back door to your entire network.)

When you set up OOB network management, you have a choice to make: is your OOB network the only way to manage equipment, or can equipment be managed either 'in-band' through your regular network or through the out of band network? If your OOB network is your only way of managing things, you not only have to build a separate network, you have to make sure it is fully redundant, because otherwise you've created a single point of failure for (some) management. If your OOB network is a backup, you don't necessarily need as much redundancy (although you may want some), but now you need to actively monitor and verify that both access paths work. You also have two access paths to keep secure, instead of just one.

Security, or rather access authentication, is another complication for out of band management networks. If you need your OOB network, you have to assume that all other networks aren't working, which means that everything your network routers, switches, and so on need to authenticate your access has to be accessible through the OOB management network (possibly in addition to through your regular networks, if you also have in-band management). This may not be trivial to arrange, depending on what sort of authentication system you're using. You also need to make sure that your overall authentication flow can complete using only OOB network information and services (so, for example, your authentication server can't reach out to a third party provider's MFA service to send push notifications to authentication apps on people's phones).

Locally, we have what I would describe as a discount out of band management network. It has a completely separate set of switches, cabling, and building to building fiber runs, and some things have their management interfaces on it. It doesn't have any redundancy, which is acceptable in our particular environment. Unfortunately, because it's a completely isolated network, it can be a bit awkward to use, especially if you want to put a device on it that would appreciate modern conveniences like the ability to send alert emails if something happens (or even send syslog messages to a remote server; currently our central syslog server isn't on this network, although we should probably fix that).

In many cases I think you're better off having redundant and hardened in-band management, especially with smaller networks. Running an out of band network is effectively having two separate networks to look after instead of just one; if you have limited resources (including time and attention), I think you're further ahead focusing on making a single network solid and redundant rather than splitting your efforts.

Structured log formats are not really "plaintext" logs

By: cks
5 July 2024 at 02:25

As sort of a follow on to how plaintext is not a great format for logs, I said something on the Fediverse:

A hill that I will at least fight on is that text based structured log formats are not 'plain text logs' as people understand them, unless perhaps you have very little metadata attached to your log messages and don't adopt one of the unambiguous encoding formats. Sure you can read them with 'less', sort of, but not really well (much less skim them rapidly).

"Plaintext" logs are a different thing than log formats that are stored using only printable and theoretically readable text. JSON is printable text, but if you dump a sequence of JSON objects into a file and call it a 'plaintext log', I think everyone will disagree with you. For system administrators, a "plaintext log" is something that we can readily view and follow using basic Unix text tools. If we can't really read through log messages with 'less' or follow the log file live with 'tail -f' or similar things, you don't have a plaintext log, you have a text encoded log.

Unfortunately, structured log formats may produce text output but often not plaintext output. Consider, for example:

ts=<...> caller=main.go:190 module=dns_amazonca target=8.8.8.8:53 level=info msg="Beginning probe" probe=dns timeout_seconds=30
ts=<...> caller=dns.go:200 module=dns_amazonca target=8.8.8.8:53 level=info msg="Resolving target address" target=8.8.8.8 ip_protocol=ip4
[...]
ts=<...> caller=dns.go:302 module=dns_amazonca target=8.8.8.8:53 level=info msg="Validating RR" rr="amazon.ca.\t17\tIN\tA\t54.239.18.172"

This is all text. You can sort of read it (especially since I've left out the relatively large timestamps). But trying to read through all of these messages with 'less' at any volume would be painful, especially if you care about the specific values of those 'rr=' things, which you're going to have to mentally decode to see through the '\t's (and other characters that may be quoted in strings).

There are text structured log formats that are somewhat better than this, for example ones that put a series of metadata labels and their values at the front then end the log line with the main log message. At least there you can look at the end of the line in things like 'tail' and 'less' to see the message, although it may not be in a consistent column. But the more labels there are, the more the message text gets pushed aside.

One of the most common examples of a plaintext log format is the traditional syslog format:

Jul  1 17:58:53 HOST sshd[PID]: error: beginning MaxStartups throttling
Jul  1 17:58:53 HOST sshd[PID]: drop connection #10 from [SOMEIP]:36039 on [MYIP]:22 past MaxStartups

This is almost entirely the message with relatively little metadata (and a minimal timestamp that doesn't even include the year). This is what you need to maximize human readability with 'less', 'tail', and so on.

At this point people will note that the information added by structured logging is potentially important and it's useful to represent it relatively unambiguously. Some other people might ask if traditional Apache common log format, or Exim's log format, are 'plaintext logs'. My answer to both is that this illustrates why plaintext is not a great format for logs. True maximally readable plaintext logs are highly constrained and wind up leaving lots of information out or being ambiguous and hard to process or both. The more additional information you include in a clearly structured format, the more potentially useful it is but the less straightforwardly readable the result is and the less you have plaintext logs.

If you want to use a structured log format, where you sit on the spectrum between plaintext logs and JSON blobs appended to something depends on how you expect your logs to be used and consumed (and stored). If people are only ever going to consume them through special tools, you might as well go full JSON or the equivalent. If people will sometimes read your logs in raw format with 'less' or 'tail' or whatever, or your logs will be commingled with logs from other programs in random line-focused formats, you should probably choose a format that's more readable by eye, perhaps some version of logfmt.
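To illustrate the kind of decoding that even logfmt pushes you into, here's a rough sketch of pulling just the timestamp and message text out of lines like the Blackbox ones above with standard Unix tools (the file name is made up, and this will mishandle messages that contain escaped quotes, which is part of the problem):

tail -f probe.log | sed -n 's/^ts=\([^ ]*\) .*msg="\([^"]*\)".*/\1 \2/p'

Once you're doing this routinely, you're decoding a text-encoded log rather than reading a plaintext one.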

Plaintext is not a great format for (system) logs

By: cks
30 June 2024 at 02:32

Recently I saw some grumpiness on the Fediverse about systemd's journal not using 'plain text' for storing logs. I have various feelings here, but one of the probably controversial ones is that in general, plain text is not a great format for logs, especially system logs. This is independent of systemd's journal or of anything else, and in fact looking back I can see signs of this in my own experiences long before the systemd journal showed up (for instance, it's part of giving up on syslog priorities).

The core problem is that log messages themselves almost invariably come with additional metadata, often fairly rich metadata, but if you store things in plain text it's difficult to handle that metadata. You have more or less three things you can do with any particular piece of metadata:

  • You can augment the log message with the metadata in some (text) format. For example, the traditional syslog 'plain text' format augments the basic syslog message with the timestamp, the host name, the program, and possibly the process ID. The downside of this is that it makes log messages themselves harder to pick out and process; the more metadata you add, the more the log message itself becomes obscured.

    (One can see this in syslog messages from certain sorts of modern programs, which augment their log messages with a bunch of internal metadata that they put in the syslog log message as a series of 'key=value' text.)

  • You can store the metadata by implication, for example by writing log messages to separate files based on the metadata. For example, syslog is often configured to use metadata (such as the syslog facility and the log level) to control which files a log message is written to. One of the drawbacks of storing metadata by implication is that it separates out log messages, making it harder to get a global picture of what was going on. Another drawback is that it's hard to store very many different pieces of metadata this way.

  • You can discard the metadata. Once again, the traditional syslog log format is an example, because it normally discards the syslog facility and the syslog log level (unless they're stored by implication).

The more metadata you have, the worse this problem is. Perhaps unsurprisingly, modern systems can often attach rich metadata to log messages, and this metadata can be quite useful for searching and monitoring. But if you write your logs out in plain text, either you get clutter and complexity or you lose metadata.

Of course if you have standard formats for attaching metadata to log messages, you can write tools that strip or manipulate this metadata in order to give you (just) the log messages. But the more you do this and rely on it, the less your logs are really plain text instead of 'structured logs stored in a somewhat readable text format'.

(The ultimate version of this is giving up on readability in the raw and writing everything out as JSON. This is technically just text, but it's not usefully plain text.)
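As a concrete sketch of that, if your logs were JSON objects with hypothetical 'ts', 'level', and 'msg' fields, you could get something skimmable back with:

tail -f app.log | jq -r '[.ts, .level, .msg] | @tsv'

But at that point you're running a decoder over your logs instead of reading them with basic Unix text tools.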

Is blocking outgoing traffic by default a good firewall choice now?

By: cks
28 June 2024 at 02:54

A few years ago I wrote about how HTTP/3 needed us (and other people) to make firewall changes to allow outgoing UDP port 443 traffic. Recently this entry got discussed on lobste.rs, and the discussion made me think about whether our (sort of) default of blocking outgoing traffic is a good idea these days, at least in an environment where we don't know what's happening on our networks.

(If you do know exactly what should be on your networks and what it should be talking to, then blocking everything else is a solid security precaution against various sorts of surprises.)

I say that we 'sort of' block outgoing traffic by default because the composite of our firewall rules (on the firewalls for internal 'sandbox' networks and the perimeter firewall between our overall networks and the university's general network) already default to allowing a lot of things. In practice, mostly we default to blocking access to 'privileged' TCP ports; most or all UDP traffic and most TCP traffic to ports above 1023 is just allowed through. Then of course there is a variegated list of TCP ports that we just always allow through, some of them clearly mostly for historical reasons (we allow gopher (port 70) and finger (port 79), for example).

(Our general allowance for TCP ports above 1023 may have been partly due to FTP, back in the days. Our firewalls and their rules have been there for a long time.)

Historically, ports under 1024 were where interesting services hung out, and so you could block outgoing access to them both to be a good network neighbor and to stop your people from accidentally doing things like using insecure protocols across the Internet (but then, we still allow telnet). These days this logic still sort of applies, but there are a lot of unencrypted and potentially insecure protocols that are found on high TCP ports and so could be accessed fine by people here. And outgoing access to UDP based things (including HTTP/3) is surprisingly open for most of our internal networks (it varies somewhat by network).

There are definitely outgoing low TCP ports that you don't want to let people connect to; the obvious candidate is the constellation of TCP ports associated with Microsoft CIFS (aka 'Samba'). But beyond a few known candidates I'm not sure there's a strong reason to block access to low-numbered TCP ports if we're already allowing access to high ones.
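For illustration, the 'block a few known bad low ports, allow the rest' version of this is simple enough to sketch in nftables syntax (which isn't necessarily what our firewalls actually use; the interface name and subnet are placeholders):

table inet border {
  chain forward {
    type filter hook forward priority 0; policy accept;
    # block outgoing CIFS/SMB and NetBIOS from our networks
    oifname "ext0" ip saddr 198.51.100.0/24 tcp dport { 139, 445 } drop
    oifname "ext0" ip saddr 198.51.100.0/24 udp dport { 137, 138 } drop
  }
}

Real rules would also have to consider things like 135/tcp, logging, and established connections, but the general shape is a short deny list rather than a default deny.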

(Pragmatically we're probably not going to change our firewalls at this point. They work as it is and people aren't complaining. Of course we're making a little contribution to an environment where very few people bother trying to get a low numbered port assigned for their new system, because it often wouldn't do them much good. Instead they'll run it over HTTPS.)

A Prometheus Blackbox gotcha: (UDP) DNS replies have a low size limit

By: cks
23 June 2024 at 01:52

For reasons beyond the scope of this entry, we use our Prometheus setup to monitor if we can resolve certain external host names, by doing Blackbox probes to various DNS servers, both our internal resolvers and external ones. Ever since we added this check, we've had weird issues where one of our internal resolvers would periodically fail the check for one particular host name for tens of minutes. This host name involves a long chain of CNAME records and ends with some A records. According to more detailed Blackbox information, the query wasn't failing, it was just not returning all of the information, omitting the A records that we needed. We came up with all sorts of theories about why our DNS server might not be able to fully resolve the CNAME chain, but couldn't find a smoking gun or a firm fix.

Then the other day I was looking at debug output and noticed this:

[...] level=info msg="Got response" response=";; [...] \n;; flags: qr tc rd ra; [...]

(This is in a very long line that puts 'dig' style output for the entire answer in the message, and this whole collection of diagnostic log information is not normally logged as such, merely visible for a while in the Blackbox web interface.)

Did you notice that 'tc' in the flags? That's the flag that is set to indicate a DNS response that has been truncated because it doesn't fit within the size limit. This truncation is what was actually going wrong in our DNS check. This particular DNS name has a chain of CNAMEs, and the providers involved change the CNAMEs relatively rapidly, and some of the time the CNAMEs used were long enough that they pushed the A records our module was looking for out of the truncated DNS reply from our internal DNS resolvers.

As of Blackbox 0.25.0, the Blackbox DNS prober defaults to using UDP, doesn't set any EDNS options to increase the allowed reply size, and doesn't fall back to retrying queries over TCP if a UDP query is truncated. This means Blackbox has the old default UDP DNS reply size limit of 512 bytes, which can easily be exceeded with a large enough CNAME chain, among other things. Unfortunately, there is currently no probe metric that will tell you this has happened.

(If you are sure you know how many answer, authority, and additional DNS RRs will be returned by the query, you can check those metrics, but that won't distinguish between a truncated reply and the DNS server doing something odd.)

The current Blackbox workaround is to change your Blackbox module to use TCP instead of UDP, which doesn't have this sort of size limit. Unfortunately not all DNS servers we care about accept TCP connections (they're not ours, don't ask), so in practice we had to duplicate our Blackbox module to get a TCP version of it, and then switch our internal DNS servers to using the new TCP query module.
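For illustration, the TCP variant of such a Blackbox DNS module looks something like this (the module name, query, and expected answer here are made up for the example, so check the details against your Blackbox version):

modules:
  dns_amazonca_tcp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: "tcp"
      preferred_ip_protocol: "ip4"
      query_name: "amazon.ca"
      query_type: "A"
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".*\tIN\tA\t.*"

Since the only change needed is the transport_protocol setting, duplicating an existing UDP module this way is straightforward, if a bit annoying.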

I think this behavior has some uses, for example you may want to know if your DNS replies are now too big for non-EDNS UDP clients. However, I think that Blackbox should definitely let you find out if the DNS reply was truncated (ie, had the 'tc' flag set). I also wouldn't mind if a more friendly and modern DNS query process was the Blackbox default, and you had to specifically request a limited version. I suspect that there are various people using Blackbox who don't know just how minimal their DNS probes currently are.

(All of this behavior comes about not directly through Blackbox but through Blackbox doing its DNS queries with github.com/miekg/dns, which documents its behavior in Client.Exchange(). I've filed Blackbox issue #1258 and issue #1259 about this overall situation, so maybe someday we'll be able to see the truncation status in probe metrics and set the EDNS option for a larger message size.)

The IMAP LIST command as it interacts with client prefixes in Dovecot

By: cks
22 June 2024 at 03:53

Good IMAP clients will support the notion of a prefix the client puts on IMAP paths under some name. When I've written about this in the past (cf) I've been abstract about how it worked at the level of IMAP commands, and in particular at the level of the IMAP LIST command, which lists (some of) your folders and mailboxes. The IMAP LIST command is special here because it has two basic arguments, which in RFC 9051 are vaguely described as 'the reference name' and 'the mailbox name with possible wildcards'. This probably makes sense to IMAP experts who understand the difference, but for the rest of us it gets a bit confusing.

(The reason it's confusing is that other IMAP commands involving mailboxes (such as 'SELECT') only have a single argument, and also LIST reports folders and mailboxes as single arguments. This makes it a bit mysterious why LIST has two basic arguments, and how you're supposed to ask for things like 'please list all folders under X'. Or, if you're a system administrator reading Dovecot event logs, it makes it hard to know if a client is behaving badly or is just unusual.)

In particular, you might ask where an IMAP client puts its client prefix in a LIST command. The answer is that it depends on the IMAP client, as we discovered recently during some testing. If the client prefix is 'IMail/' and the client wants to know all of your folders and mailboxes, some clients will send:

x LIST "IMail/" "*"

Other IMAP clients will send:

x LIST "" "IMail/*"

As I have experimentally verified, Dovecot 2.3.16 will accept wildcards in either LIST argument, although this may not be something RFC 9051 requires. In a basic LIST syntax (the two-argument form I've given above), a second argument of "" has a special purpose and needs to be avoided (unless you want the information it provides). On Dovecot, you can write 'LIST "IMail/*" "%"' if you want to put all the heavy wildcard lifting into the first argument.

(Dovecot will let you play special tricks this way. Suppose that you want to find a mailbox called 'Barney' and you know it's somewhere in your mail hierarchy; then you can ask for 'LIST "*" "Barney"' and get just it. I suspect no actual IMAP clients attempt to do this.)

In our logs, we see recursively wildcarded IMAP LIST commands (ie, ones using the '*' wildcard operator) with first arguments both with and without a trailing slash; that is, we see clients generate both 'LIST "mail" "*"' and 'LIST "mail/" "*"'. For the recursive wildcard I think this gets you the same result, but there is a difference between them when you use '%':

a LIST "Imap-List-Test" "%"
* LIST (\Noselect \HasChildren) "/" Imap-List-Test
a OK List completed (0.001 + 0.000 secs).

a LIST "Imap-List-Test/" "%"
* LIST (\Noselect \HasChildren) "/" Imap-List-Test/One
a OK List completed (0.001 + 0.000 secs).

Given this, if I was writing an IMAP client program that used LIST and had a non-blank first argument, I would always put the trailing slash on (technically I think this is the folder separator, which doesn't have to be '/' but always is for us).

(I'm not going to write an actual mail reader, but I do sometimes need to write IMAP tests and similar automated things.)

PS: I'm sure that this is all familiar ground to IMAP mavens but we dip into this area only occasionally and I want to write this down for future use before it falls out of my head again.

(We're currently looking at IMAP client prefixes and how clients handle them for reasons well outside the scope of this entry.)

We don't know what's happening on our networks

By: cks
16 June 2024 at 03:37

In some organizations, a foundational principle of their network security (both internal and external) is that you should know about everything that is happening on the network. No program, no network service, no system should be accepting or sending unknown network traffic, and you should be able to completely inventory your expected traffic patterns. In some environments, this will include not just protocol level knowledge but also things like what DNS names should be being looked up. This detailed knowledge is obviously great for network security and for detecting intrusions; unexpected network traffic can be used to trigger investigations and maybe alerts.

(I suspect that this is often an aspirational goal that is not necessarily achieved.)

This is completely impossible in our (network) environment, which I can broadly describe as providing general networking to the research side of a university department. There are two aspects to this. The first aspect is that in our general network environment, there are plenty of desktops, laptops, phones, and other such devices on various pieces of our network, many of them personally owned by people. All of these devices are often running random software that phones home to random places at random times, doing all sorts of random outbound traffic (and no doubt pulling in some amount of inbound traffic in the process). Often the owner of the device has no idea that this traffic is happening, never mind where it's going to, since modern software feels free to talk to wherever it wants without telling you (and of course, the details change all the time).

The second aspect is that we don't quiz people here on what they're doing or demand that they tell us what they're up to before they do it. More broadly, our entire environment doesn't run neatly contained 'services', which can be inventoried before they're deployed, given security reviews, and so on. Instead, we provide an environment to people and they are free to use it as they like to get their (research) work done. If their work or their software needs to talk to something and our firewalls allow it, then they can just do it without having to slow down to talk to us. So even for servers (either ours or those run by people here), we can't predict the network traffic because it depends on what people are doing with them.

(If our firewalls don't allow some needed traffic, we'll generally change that once we know about the issue. In practice our outbound firewalls are relatively porous so a lot of internally-initiated activity will just work.)

But all of this leads to a broad issue, which is that in a university environment, it is not our business what people are doing, on the network or otherwise. If you want an analogy, we are in effect an ISP with some additional services, like printing (still surprisingly popular), (inbound) network security, email, web hosting, and general purpose computation. To have good knowledge of what was happening on our networks we'd have to be gatekeepers or panopticon observers (or both), and we are neither.

(In addition, many of the people using our environment are not employees of the university.)

Fundamentally we don't operate a tightly controlled network environment. Trying to operate as if we did (or should) to any significant degree would be a great way to cause all sorts of problems and get in the way of people doing a wide variety of reasonable things.

Using prime numbers for our Prometheus scrape intervals

By: cks
14 June 2024 at 03:06

When I wrote about the current size of our Prometheus setup I mentioned that some of our Prometheus scrape intervals (how often metrics are collected from a metrics source) were unusual looking numbers like 59 seconds and 89 seconds, instead of conventional ones like 15, 30, or 60 seconds. These intervals are prime numbers, and we use them deliberately so that our metrics collection and checks can't become synchronized to some regular process that happens, for example, once a minute.

Prometheus already scatters the start times of metrics collection within their interval, so synchronization isn't necessarily very likely, but using prime numbers adds an extra level of insurance. At the same time, using prime numbers that are very close to exact times like '60 seconds' or '90 seconds' means that we have relatively good odds of periodically doing our check at exactly the start of a minute, or a 30 second interval, or the like, so that if there is something that happens at :00 or :30 or the like we'll probably observe it sooner or later (although we may not understand what we're seeing).
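In the Prometheus configuration this is nothing more exotic than unusual-looking scrape_interval values. Here's a minimal sketch, with made-up job names, hosts, and ports (real Blackbox jobs would also need the usual /probe relabeling, which I've left out):

scrape_configs:
  # cumulative metrics can use a conventional interval
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['server1.example.org:9100']

  # point in time checks get a prime interval so they can't
  # synchronize with something that runs exactly once a minute
  - job_name: 'reachability-checks'
    scrape_interval: 89s
    static_configs:
      - targets: ['checker.example.org:9115']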

My feeling is that this irregularity is less important in things that provide cumulative metrics (like most of the metrics from the Prometheus host agent) and more important for 'point in time' metrics of what the current state is, which generally includes Blackbox checks. Cumulative metrics will capture both spikes and quiet periods, but point in time metrics may be distorted by only being collected at busy times (or only at quiet times).

However, our current Prometheus configuration is certainly not being particularly systematic about what has a regular collection interval (like 15, 30, or 60 seconds) and what doesn't. I should probably go back through every collection target, figure out if it falls more into the 'cumulative' category or the 'point in time' category, and set its collection interval to match. This will probably wind up moving some things from being checked every 30 seconds to being checked every 29 (and maybe some from 60 to 59).

(All of this is probably not very important in practice, since the odds of synchronization are relatively low to start with.)

The size of our Prometheus setup as of June 2024

By: cks
12 June 2024 at 03:15

At this point we've been running our Prometheus setup since November 21st 2018, and have still not expired any metrics, so we have full resolution metrics data right back to the beginning. Three years ago, I wrote how big our setup was as of May 2021, and since someone on the Prometheus mailing list was recently asking how big a Prometheus setup you could run, I'm going to do an update on our numbers.

Our core Prometheus server is still a Dell 1U server, with 64 GB of RAM because we could put that much in and it's cheap insurance against high memory usage. The Prometheus time series database (TSDB) is in a mirrored pair of 20 TB HDDs (in 2021 we used 4 TB HDDs, but since then we ran out of space and moved). At the moment we have what 'du -h' says is 6.3 TB of disk space used. The disk space usage has been rising steadily over time; in 2019, 20 days of metrics took 35 GB, and these days they take 104 GB.

(In these two 20-day chunk directories I'm looking at, in 2019 we had 50073266852 samples for 465988 series, and in 2024 we had 130974619588 samples for 1460523 series, which we can broadly approximate as about triple.)

These days, we're running at an ingestion rate of about 73,000 samples a second, scraped from 947 different sample sources; the largest single source of things continues to be Blackbox probes. Our largest single source of samples is no longer Pushgateway (it is now way down) but instead the ZFS exporter we use to get highly detailed ZFS metrics; our most chatty ZFS fileserver generates 95,000 samples from it. Apart from that, the most chatty sample sources are the Prometheus host agents on some of our machines, which can generate up to 19,000 metrics, primarily due to some of our servers having a lot of CPUs. About 700 of our scrape sources generate less than 50 samples per scrape.

(Our scrape rates vary. Host agents and the Cloudflare eBPF exporter are scraped every 15 seconds, we ping most machines every 30 seconds, the ZFS exporter is scraped every 30 seconds, most other Blackbox checks happen every 89 seconds, and a bunch of other scrape targets are every 60 seconds or every 59 seconds (and I should probably regularize that).)

At the moment we're pulling host agent information from 143 machines, doing Blackbox ping checks for 232 different targets, performing 375 assorted Blackbox checks other than pings (a lot of them SSH checks), and running assorted other Prometheus exporters in smaller quantities that we scrape for various things. Every server with a host agent also gets at least two Blackbox checks (ICMP ping and a SSH connection), but we ping and check other things too as you can see from the numbers.

We've grown to 158 alert rules, all running on the default 15 second rule evaluation rate. Evaluation time of all of these alert rules appears to be relatively trivial.

The Prometheus host server has six CPUs and typically runs about 3% user CPU usage. Average inbound bandwidth is about 800 Kbytes/sec. Somewhat to my surprise, this CPU usage does include some amount of Prometheus queries (outside of rules evaluation), because it looks like some people do routinely look at Grafana dashboards and thus trigger Prometheus queries (although I believe it's all for recent data and queries for historical data are relatively rare).

None of this is necessarily a guide to what anyone else could do with Prometheus, or how much resources it would take to handle a particular environment. One of the things that may make our environment unusual is that since we use physical hardware, we don't have hosts coming and going on a regular basis and churning labels like 'instance'. Using Prometheus in the cloud, with a churn of cloud host instances, might have different resource needs.

(But I do feel it's an indication that you don't need a heavy duty server to handle a reasonable Prometheus environment.)

OpenSSH can choose (or force) the 'shell' used for a specific SSH key

By: cks
10 June 2024 at 02:56

One of the perhaps less known and under-utilized features of OpenSSH is that you can make connections using specific authorized SSH keys use specific 'shells', although actually using this may be a little bit tricky. The basic ingredient to do this is a command= setting on the specific key in your .ssh/authorized_keys file, but of course there are some wrinkles and you may not be happy if you just set this to a shell-like program.

The first wrinkle is that sshd runs this command using your regular /etc/passwd shell, as '$SHELL -c <whatever>'. This is presumably done so that you can't evade a restricted shell by writing yourself an authorized_keys file with a more liberal command= command, such as 'command="/bin/bash" ssh-ed25519 ...'. The second wrinkle is that this command is always run with no arguments regardless of how you ran 'ssh', and it's up to the command to work out what you want to do.

If you ran 'ssh u@h echo hi', then the command will be run with a $SSH_ORIGINAL_COMMAND that contains 'echo hi'. If you just ran 'ssh u@h', there will be no $SSH_ORIGINAL_COMMAND in the command's environment. This means that if you simply use a shell as your 'command=', it will only half work. You can do 'ssh u@h' and get a shell environment, but it won't be a login shell, while 'ssh u@h echo hi' won't work at all (it will typically hang). And in both cases, '$SHELL' will be your /etc/passwd login shell. To use this to selectively change your login shell based on the SSH key you use to authenticate, you'll need a cover script.

(Why might you want such a thing? Well, suppose you like to use an alternate shell with unusual behavior, and you also want to use remote access systems, such as Emacs' TRAMP, that are tightly connected to using a standard shell on the remote end. Changing your shell via 'command=' shifts the problem to getting the remote access system to use a specific SSH key, which may be much easier than any alternatives.)

The most straightforward thing to do with a 'command=' script is to narrowly restrict what the particular SSH key can do; we use this in our rsync replication setup. As covered in both the sshd manual page and that entry, you'll need to add additional restrictions to the key to make things solid.
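For illustration, a fully locked down key line can look something like this (the command path, network, and key are placeholders):

command="/opt/sysadmin/rsync-wrapper",restrict,from="192.0.2.0/24" ssh-ed25519 AAAA... replication-key

Here 'restrict' turns off PTY allocation, port and agent forwarding, and the other extras in one option, and 'from=' means the key can only even be tried from the networks you expect.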

A more complex thing is to use the cover script to do additional access control, authorization checks, and logging before you run what you were asked to. For example, if you have a special 'break glass' system access SSH key, you might want to have it forced to a command= that loudly logs its use, perhaps complains (and errors out) if it wasn't used in exactly the way you expected (for instance, you might never intend to use the 'break glass' key to run remote commands, just to log in), and maybe even test that your regular authentication methods are down. If all of the tests pass, you can then invoke a regular shell (probably just '$SHELL', unless you have a reason to want another one). Especially energetic people could run the entire 'break glass' shell session within script(1), so you hopefully have a record of absolutely everything in the session.

(You'd want the record not so much for security as so that you can later reconstruct what you did in that frantic session when you were entirely focused on putting things back together.)

Sidebar: A plausible alternate shell cover script for command=

The following is only lightly tested and it assumes that your shell supports '-l', '-c <string>' and '-i' options, but since I bothered to work it out and test it a bit I'm going to put it down here.

#!/bin/sh
SHELL=<whatever alternate shell you want>
export SHELL
if [ -n "$SSH_ORIGINAL_COMMAND" ]; then
    exec $SHELL -c "$SSH_ORIGINAL_COMMAND"
elif [ -t 0 ]; then
    exec $SHELL -l
else
    exec $SHELL -i
fi

This handles the two most likely cases (running a command and logging in), and defaults to an interactive shell session if you're not on a PTY and didn't supply a command to run.

Operating services versus operating an "adequate environment"

By: cks
9 June 2024 at 02:50

A while back I wrote about how metrics have many different uses, not all of them actionable ones, and used network bandwidth as an example of a non-actionable metric. In a comment, it was suggested that network bandwidth was sort of actionable in that if we reached capacity limits, that should cause us to add more capacity in one of several ways. My first reaction was that this was non-actionable for us because it's mostly something we're not in a position to do. My second reaction was to think about why this is so. My current and not entirely fully baked view is that it comes down to a difference between what we do and what most people are doing. To put it briefly, many people operate services, while we operate an (adequate or good enough) environment.

If you operate services and people have a bad time with them, that is a direct problem. You can probably put some sort of cost figure on the effects of that bad time, and assuming that the services are important (or the numbers large enough), you can probably get funding to fix the problem. If you are running out of bandwidth, you go out and get more bandwidth.

If you operate an environment, you're generally providing the best environment you can with the funding and support you have (and the priorities for where to direct that funding, for example to prioritize network speed over the amount of storage). If this environment is not good enough for some people's tastes, for example because they're running into bandwidth limits, your reaction is probably to shrug sadly. Those people need to talk to the powers that be, and if the powers that be want you to improve some aspect of the environment, they need to either provide more funding or reduce what features you support so that you can focus more money on less surface area.

If you're used to working in a services situation, the environment situation is probably enraging; here you are, having a bad time and no one cares enough to do anything about it. If you're used to operating an adequate environment, the services mindset feels weird. Get more network bandwidth when you run into performance limits? How, and who is going to fund it?

(Generally an environment is adequate if enough people can get enough work done. And I think that in practice there's a spectrum between these two positions, and a mix of situations within a single organization where there are some 'services' in an environment style organization and some 'this environment is good enough' things in a generally service-focused place (I suspect that these are likely to be internal things).)

Some notes on Grafana Loki's new "structured metadata" (as of 3.0.x)

By: cks
28 May 2024 at 02:28

Grafana Loki somewhat bills itself as "Prometheus for logs", and so it's unsurprising that it started with a data model much like Prometheus. Log lines in Loki are stored with some amount of metadata in labels, just as Prometheus metrics values have labels (including the name of the metric, which is sort of a label). Unfortunately Loki made implementation choices that caused this data model to be relatively catastrophic for system logs (either syslog logs or the systemd journal). Unlike Prometheus, Loki stores each set of label values separately and it never compacts its log storage. Your choices are to either throw away a great deal of valuable system log metadata in order to keep label cardinality down, contaminate your log lines with metadata, making them hard to use, or run Loki in a way that causes periodic explosions and is in general very far outside the configurations that Grafana Inc develops Loki for.

Eventually Grafana Inc worked out that this was less than ideal and sort of did something about it, by introducing "structured metadata". The simple way to describe Loki's structured metadata is that it is labels (and label values) that are not used to separate out log line storage. In theory this is just what I've wanted for some time, but in practice as of Loki 3.0.0, structured metadata is undercooked and not something we can use. However, you probably want to use it in a new greenfield development to ingest system logs (via promtail's somewhat underdocumented support for it), although I can't recommend that you use Loki at all, at least in simple configurations.

The first problem is that structured metadata labels are not actual labels as Loki treats them. If you have a structured metadata label 'fred', you cannot write a LogQL query of '{...,fred="value"}'. Instead you must write this as '{....} | fred="value"'. This means that all of your queries care deeply about whether a particular thing is a Loki label or merely a structured metadata label. I feel strongly that your queries should not depend on the details of your database schema, partly because it makes changing your database schema harder. Loki tools are inconsistent about this distinction; for example 'logcli query' will mostly print structured metadata labels as if they were real labels.

Speaking of changing your database schema, that is the other large piece of bad news about structured metadata. If you have an existing Loki environment from before structured metadata, complete with lots of real labels because that's how you had to capture log metadata, there is no obvious way to switch over to using structured metadata for that log metadata. There are some interesting ways to fail to do so, because the current Loki will accept a client submitting 'structured metadata' that the Loki server thinks should be actual labels. If you add some new, higher cardinality structured metadata alongside the labels you'd like to convert, I've seen this add that high cardinality structured metadata as actual labels (the result wasn't pretty). If you want to switch, the easiest way is to stop Loki, delete all of your existing log data, and start from scratch with all clients sending all of the log metadata you care about as structured metadata instead of labels.

I haven't tested what happens in a greenfield configuration if most clients send some client-side labels as structured metadata but one client fumbles things and sends them as labels. I would like to think that Loki rejects this, rather than accepting it and silently converting the structured metadata labels from other clients into real labels (possibly high cardinality ones). Unfortunately this isn't a theoretical mistake, because of an implementation choice in the current (3.0.x) version of Promtail. In Promtail, in order to send syslog or systemd journal fields as structured metadata, you must first materialize them as regular labels (via relabeling) and then convert them to structured metadata in the structured metadata stage. If you do the first but not the second, your Promtail configuration will send that metadata to Loki as actual labels, possibly to your deep regret.
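As a rough sketch of the shape this takes for a systemd journal field (the field and label names here are just examples, and you should verify the stage's exact behavior against your Promtail version and its documentation):

scrape_configs:
  - job_name: journal
    journal:
      labels:
        job: systemd-journal
    relabel_configs:
      # first materialize the journal field as a regular label ...
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
    pipeline_stages:
      # ... and then convert that label to structured metadata
      - structured_metadata:
          unit:

Leave out the pipeline stage and 'unit' goes to Loki as a real (and potentially high cardinality) label, which is exactly the mistake described above.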

I was initially hopeful that structured metadata would let us change our Loki configuration to something closer to a mainstream one. Unfortunately, my investigation has ruled this out for now; we would need to change too many existing queries and there are too many uncertainties over whether we would be able to do it without deleting all of our existing log data (and then living in fear of a cardinality explosion due to an outdated or mis-configured client). Maybe in Loki 4.0.

Flaky alerts are telling you something

By: cks
27 May 2024 at 02:42

Sometimes, monitoring and alerting systems have flaky alerts, either in the form of flapping alerts (where the alert will repeatedly trigger and then go away) or alerts that go off when there is no problem. Broadly speaking, these flaky alerts aren't just noise; they're telling you something.

To put it one way, flaky monitoring system alerts are like flaky tests in programming. Each of these is telling you that your understanding of things is incorrect or that something odd and unusual is going on, and sometimes both. This comes about because you don't generally create either alerts or tests intending them to be flaky; you intend for them to work (or sometimes for tests, to reliably fail before you fix things). If the result of your work is flaky, either you didn't correctly understand how your system (or your code) behaves when you did your work, such that you aren't actually testing what you think you're testing, or there is something going on that genuinely causes unexpected sporadic failures.

(For example, our discovery of OpenSSH sshd's 'MaxStartups' setting came from investigating a 'flaky' alert.)

In both flaky alerts and flaky tests, you can deal with the noise by either disabling the alert or test, or by making it 'try harder' in some way (for alerts this is often 'make this condition have to be true for longer than before'). However, this doesn't change the underlying reality of what is happening, nor does it improve your understanding of the system (at least, not beyond a superficial level of 'I was wrong that this is a reliable signal of ...'). There are obvious drawbacks to this non-approach to the underlying issues.

This doesn't mean that every flaky alert deserves a deep investigation. Sometimes the range of things that might be misunderstood or going wrong is not important enough to justify an investigation. And even if you plan an investigation, it's perfectly reasonable to remove the alert until then, or de-flake it with various 'try harder' brute force mechanisms. For that matter, it's okay to remove a flaky alert if you simply have higher priorities right now. If the flaky alert is trying to tell you about something serious, sooner or later it will probably escalate to obvious, non-flaky symptoms.

(This isn't necessarily how programmers should deal with flaky tests, but system administration is in part an art of tradeoffs. We can never do everything, so we need to pick the important somethings.)

There are multiple uses for metrics (and collecting metrics)

By: cks
24 May 2024 at 02:56

In a comment on my entry on the overhead of the Prometheus host agent's 'perf' collector, a commentator asked a reasonable question:

Not to be annoying, but: is any of the 'perf data' you collect here honestly 'actionable data' ? [...] In my not so humble opinion, you should only collect the type of data that you can actually act on.

It's true that the perf data I might collect isn't actionable data (and thus not actionable metrics), but in my view this is far from the only reason to collect metrics. I can readily see at least three or four different reasons to collect metrics.

The first and obvious purpose is actionable metrics, things that will get you to do things, often by triggering alerts. This can be the metric by itself, such as free disk space on the root of a server (or the expiry time of a TLS certificate), or the metric in combination with other data, such as detecting that the DNS SOA record serial number for one of your DNS zones doesn't match across all of your official DNS servers.

The second reason is to use the metrics to help understand how your systems are behaving; here your systems might be either physical (or at least virtual) servers, or software systems. Often a big reason to look at this information is because something mysterious happened and you want to look at relatively detailed information on what was going on at the time. While you could collect this data only when you're trying to better understand ongoing issues, my view is that you also want to collect it when things are normal so that you have a baseline to compare against.

(And since sometimes things go bad slowly, you want to have a long baseline. We experienced this with our machine room temperatures.)

Sometimes, having 'understanding' metrics available will allow you to head off problems before hand, because metrics that you thought were only going to be for understanding problems as and after they happened can be turned into warning signs of a problem so you can mitigate it. This happened to us when server memory usage information allowed us to recognize and then mitigate a kernel memory leak (there was also a case with SMART drive data).

The third reason is to understand how (and how much) your systems are being used and how that usage is changing over time. This is often most interesting when you look at relatively high level metrics instead of what are effectively low-level metrics from the innards of your systems. One popular sub-field of this is projecting future resource needs, both hardware level things like CPU, RAM, and disk space and larger scale things like the likely future volume of requests and other actions your (software) systems may be called on to handle.

(Both of these two reasons can combine together in exploring casual questions about your systems that are enabled by having metrics available.)

A fourth semi-reason to collect metrics is as an experiment, to see if they're useful or not. You can usually tell what are actionable metrics in advance, but you can't always tell what will be useful for understanding your various systems or understanding how they're used. Sometimes metrics turn out to be uninformative and boring, and sometimes metrics turn out to reveal surprises.

My impression of the modern metrics movement is that the general wisdom is to collect everything that isn't too expensive (either to collect or to store), because more data is better than less data and you're usually not sure in advance what's going to be meaningful and useful. You create alerts carefully and to a limited extent (and in modern practice, often focusing on things that people using your services will notice), but for the underlying metrics, the more the better, potentially.

The trade-offs in not using WireGuard to talk to our cloud server

By: cks
18 May 2024 at 03:42

We recently set up our first cloud server in order to check the external reachability of some of our services, where the cloud server runs a Prometheus Blackbox instance and our Prometheus server talks to it to have it do checks and return the results. Originally, I was planning for there to be a WireGuard tunnel between our Prometheus server and the cloud VM, which Prometheus would use to talk to Blackbox. In the actual realized setup, there's no WireGuard and we use restrictive firewall rules to restrict potentially dangerous access to Blackbox to the Prometheus server.

I had expected to use WireGuard for a combination of access control to Blackbox and to deal with the cloud server having a potentially variable public IP. In practice, this cloud provider gives us a persistent public IP (as far as I can tell from their documentation) and required us to set up firewall rules either way (by default all inbound traffic is blocked), so not doing WireGuard meant a somewhat simpler configuration. Especially, it meant not needing to set up WireGuard on the Prometheus server.

(My plan for WireGuard and the public IP problem was to have the cloud server periodically ping the Prometheus server over WireGuard. This would automatically teach the Prometheus server's WireGuard the current public IP, while the WireGuard internal IP of the cloud server would stay constant. The cloud server's Blackbox would listen only on its internal WireGuard IP, not anything else.)
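For what it's worth, the cloud VM side of that WireGuard setup would have looked roughly like the following wg-quick style sketch (the addresses, hostname, and port are placeholders); a persistent keepalive is one easy way to generate the periodic traffic instead of explicit pings:

[Interface]
PrivateKey = <the cloud VM's private key>
Address = 172.16.99.2/32

[Peer]
# the Prometheus server, which has a stable public endpoint
PublicKey = <the Prometheus server's public key>
Endpoint = prometheus.example.org:51820
AllowedIPs = 172.16.99.1/32
# regular authenticated traffic so the other end always knows our current public IP
PersistentKeepalive = 25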

In some ways the result of relying on a firewall instead of WireGuard is more secure, in that an attacker would have to steal our IP address instead of stealing our WireGuard peer private key. In practice neither are worth worrying about, since all an attacker would get is our Blackbox configuration (and the ability to make assorted Blackbox probes from our cloud VM, which has no special permissions).

The one clear thing we lose in not using WireGuard is that the Prometheus server is now querying Blackbox using unencrypted HTTP over the open Internet. If there is some Intrusion Prevention System (IPS) in the path between us and the cloud server, it may someday decide that it is unhappy with this HTTP traffic (perhaps it trips some detection rule) and that it should block said traffic. An encrypted WireGuard tunnel would hide all of our Prometheus HTTP query traffic (and responses) from any in-path IPS.

(Of course we have alerts that would tell us that we can't talk to the cloud server's Blackbox. But it's better not to have our queries blocked at all.)

There are various ways to work around this, but they all give us a more complicated configuration on at least the cloud server so we aren't doing any of them (yet). And of course we can switch to the WireGuard approach when (if) we have this sort of problem.

Thoughts on (not) automating the setup of our first cloud server

By: cks
17 May 2024 at 02:52

I recently set up our first cloud server, in a flailing way that's probably familiar to anyone who still remembers their first cloud VM (complete with a later discovery of cloud provider 'upsell'). The background for this cloud server is that we want to check external reachability of some of our systems, in addition to the internal reachability already checked by our metrics and monitoring system. The actual implementation of this is quite simple; the cloud server runs an instance of the Prometheus Blackbox agent for service checks, and our Prometheus server performs a subset of our Blackbox service checks through it (in addition to the full set of service checks that are done through our local Blackbox instance).

(Access to the cloud server's Blackbox instance is guarded with firewall rules, because giving access to Blackbox is somewhat risky.)

The proper modern way to set up cloud servers is with some automated provisioning system, so that you wind up with 'cattle' instead of 'pets' (partly because every so often the cloud provider is going to abruptly terminate your server and maybe lose its data). We don't use such an automation system for our existing physical servers, so I opted not to try to learn both a cloud provider's way of doing things and a cloud server automation system at the same time, and set up this cloud server by hand. The good news for us is that the actual setup process for this server is quite simple, since it does so little and reuses our existing Blackbox setup from our main Prometheus server (all of which is stored in our central collection of configuration files and other stuff).

(As a result, this cloud server is installed in a way fairly similar to our other machine build instructions. Since it lives in the cloud and is completely detached from our infrastructure, it doesn't have our standard local setup and customizations.)

In a way this is also the bad news. If this server and its operating environment were more complicated to set up, we would have more motivation to pick one of the cloud server automation systems, learn it, and build our cloud server's configuration in it so we could have, for example, a command line 'rebuild this machine and tell me its new IP' script that we could run as needed. Since rebuilding the machine as needed is so simple and fast, it's probably never going to motivate us into learning a cloud server automation system (at least not by itself; if we had a whole collection of simple cloud VMs we might feel differently, but that's unlikely for various reasons).

Although setting up a new instance of this cloud server is simple enough, it's also not trivial. Doing it by hand means dealing with the cloud vendor's website and going through a bunch of clicking on things to set various settings and options we need. If we had a cloud automation system we knew and already had all set up, it would be better to use it. If we're going to do much more with cloud stuff, I suspect we'll soon want to automate things, both to make us less annoyed at working through websites and to keep everything consistent and visible.

(Also, cloud automation feels like something that I should be learning sooner or later, and now I have a cloud environment I can experiment with. Possibly my very first step should be exploring whatever basic command line tools exist for the particular cloud vendor we're using, since that would save dealing with the web interface in all its annoyance.)

Where NS records show up in DNS replies depends on who you ask

By: cks
12 May 2024 at 02:08

Suppose, not hypothetically, that you're trying to check the NS records for a bunch of subdomains to see if one particular DNS server is listed (because it shouldn't be). In DNS, there are two places that have NS records for a subdomain: the nameservers for the subdomain itself (which list the NS records as part of the zone's full data), and the nameservers for the parent domain, which have to tell resolvers what the authoritative DNS servers for the subdomain are. Today I discovered that these two sorts of DNS servers can return NS records in different parts of the DNS reply.

(These parent domain NS records are technically not glue records, although I think they may commonly be called that and DNS people will most likely understand what you mean if you call them 'NS glue records' or the like.)

A DNS server's answer to your query generally has three sections, although not all of them may be present in any particular reply. The answer section contains the 'resource records' that directly answer your query, the 'authority' section contains NS records of the DNS servers for the domain, and the 'additional' section contains potentially helpful additional data, such as the addresses of some of the DNS servers in the authority section. Now, suppose that you ask a DNS server (one that has the data) for the NS records for a (sub)domain.

If you send your NS record query to either a DNS resolver (a DNS server that will make recursive queries of its own to answer your question) or to an authoritative DNS server for the domain, the NS records will show up in the answer section. You asked a (DNS) question and you got an answer, so this is exactly what you'd expect. However, if you send your NS record query to an authoritative server for the parent domain, its reply may not have any NS records in the answer section (in fact the answer section can be empty); instead, the NS records show up in the authority section. This can be surprising if you're only printing the answer section, for example because you're using 'dig +noall +answer' to get compact, grep'able output.

(If the server you send your query to is authoritative for both the parent domain and the subdomain, I believe you get NS records in the answer section and they come from the subdomain's zone records, not any NS records explicitly listed in the parent.)

This makes a certain amount of sense in the DNS mindset once you (I) think about it. The DNS server is authoritative for the parent domain but not for the subdomain you're asking about, so it can't give you an 'answer'; it doesn't know the answer and isn't going to make a recursive query to the subdomain's listed DNS servers. And the parent domain's DNS server may well have a different list of NS records than the subdomain's authoritative DNS servers have. So all the parent domain's DNS server can do is fill in the authority section with the NS records it knows about and send this back to you.

So if you (I) are querying a parent domain authoritative DNS server for NS records, you (I) should remember to use 'dig +noall +authority +answer', not my handy 'cdig' script that does 'dig +noall +answer'. Using the latter will just lead to some head scratching about how the authoritative DNS server for the university's top level domain doesn't seem to want to tell me about its DNS subdomain delegation data.
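
For illustration, here's roughly how the two cases look with dig (a sketch using made-up names):

# Asking an authoritative server for the subdomain (or a resolver):
# the NS records are in the answer section.
dig +noall +answer NS sub.example.org @ns1.sub.example.org

# Asking an authoritative server for the parent zone: the answer section
# may be empty, so you need to print the authority section as well.
dig +noall +authority +answer NS sub.example.org @ns1.example.org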

All configuration files should support some form of file inclusion

By: cks
9 May 2024 at 03:17

Over on the Fediverse, I said something:

Every configuration file format should have a general 'include this file' feature, and it should support wildcards (for 'include subdir/*.conf'). Sooner or later people are going to need it, especially if your software gets popular.

It's unfortunate that standard YAML does not support this, although it's also sort of inevitable (YAML doesn't require files at all). This leaves everyone using YAML for their configuration file format to come up with various hacks.

(If this feature is hard-coded, it should use file extensions.)

There are a variety of reasons why people wind up wanting to split up a configuration file into multiple pieces. Obvious ones include that it's easier to coordinate multiple people or things wanting to add settings, a single giant file can be hard to read and deal with, and it's easy to write some parts by hand and automatically generate others. A somewhat less obvious reason is that this makes it easy to disable or re-enable an entire cluster of configuration settings; you can do it by simply renaming or moving around a file, instead of having to comment out a whole block in a giant file and then comment it back in later.

(All of these broadly have to do with operating the software in the large, possibly at scale, possibly packaged by one group of people and used by another. I think this is part of why file inclusion is often not an initial feature in configuration file formats.)

One of the great things about modern (Linux) systems and some modern software is the pervasive use of such 'drop-in' included configuration files (or sub-files, or whatever you want to call them). Pretty much everyone loves them and they've turned out to be very useful for eliminating whole classes of practical problems. Implementing them is not without issues, since you wind up having to decide what to do about clashing configuration directives (usually 'the last one read wins', and then you define it so that files are read in name-sorted order) and often you have to implement some sort of section merging (so that parts of some section can be specified in more than one file). But the benefits are worth it.

As mentioned, one subtle drawback of YAML as a configuration file format is that there's no general, direct YAML feature for 'include a file'. Programs that use YAML have to implement this themselves, by defining schemas that have elements with special file inclusion semantics, such as Prometheus's scrape_config_files: section in its configuration file, which lets you include files of scrape_config directives:

# Scrape config files specifies a list of globs.
# Scrape configs are read from all matching files
# and appended to the list of scrape configs.
scrape_config_files:
  [ - <filepath_glob> ... ]

That this only includes scrape_config directives and not anything else shows some of the limitations of this approach. And since it's not a general YAML feature, general YAML linters and so on won't know to look at these included files.
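
(In use this looks something like the following sketch; the glob and path are made up, and I believe each matching file then holds its own scrape configs, but check the current Prometheus documentation for the exact format it expects.)

# prometheus.yml (sketch): pull in additional scrape configs from drop-in files
scrape_config_files:
  - "conf.d/scrape-*.yml"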

However, this sort of inclusion is still much better than not having any sort of inclusion at all. Every YAML based configuration file format should support something like it, at least for any configuration section that gets large (for example, because it can have lots of repeated elements).

Some thoughts on when you can and can't lower OpenSSH's 'LoginGraceTime'

By: cks
8 May 2024 at 01:48

In a comment on my entry on sshd's 'MaxStartups' setting, Etienne Dechamps mentioned that they lowered LoginGraceTime, which defaults to two minutes (which is rather long). At first I was enthusiastic about making a similar change to lower it here, but then I started thinking it through and now I don't think it's so simple. Instead, I think there are three broad situations for how much time you give people connecting to your SSH server to log in.

The best case for a quite short login grace time is if everyone connecting authenticates through an already unlocked and ready SSH keypair. If this is the case, the only thing slowing down logins is the need to bounce a certain number of packets back and forth between the client and you, possibly on slow networks. You're never waiting for people to do something, just for computers to do some calculations and for the traffic to get back and forth. Etienne Dechamps' 20 seconds ought to be long enough for this even under unfavourable network situations and in the face of host load.

(If you do only use keypairs, you can cut off a lot of SSH probes right away by configuring sshd to not even offer password authentication as an option.)
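
A minimal sshd_config sketch for this sort of keypair-only, short grace time setup might look like the following (assuming a reasonably modern OpenSSH; adjust to taste):

# Only public key authentication, so a short grace time is (probably) safe.
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
LoginGraceTime 20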

The intermediate case is if people have to unlock their keypair or hardware token, touch their hardware token to confirm key usage, say yes to an SSH agent prompt, or otherwise take manual action that is normally short. In addition to the network and host delays you had with unlocked and ready keypairs, now you have to give fallible people time to notice the need for action and respond to carry it out accurately. Even if 20 seconds is often enough for this, it feels rushed to me and I think you're likely to see some number of people failing to log in; you really want something longer, although I don't know how much longer.

The worst case is if people authenticate with passwords. Here you have fallible humans carefully typing in their password, getting it wrong (because they have N passwords they've memorized and have to pick the right one, among other things), trying again, and so on. Sometimes this will be a reasonably fast process, much like in the intermediate case, but some of the time it will not be. Setting a mere 20 second timeout on this will definitely cut people off at the knees some of the time. Plus, my view is that you don't want people entering their passwords to feel that they're in a somewhat desperate race against time; that feels like it's going to cause various sorts of mistakes.

For our sins, we have plenty of people who authenticate to us today using passwords. As a result I think we're not in a good position to lower sshd's LoginGraceTime by very much, and so it's probably simpler to leave it at two minutes. Two minutes is fine and generous for people, and it doesn't really cost us anything when dealing with SSH probes (well, once we increase MaxStartups).

What affects what server host key types OpenSSH will offer to you

By: cks
7 May 2024 at 03:17

Today, for reasons beyond the scope of this entry, I was checking the various SSH host keys that some of our servers were using, by connecting to them and trying to harvest their SSH keys. When I tried this with a CentOS 7 host, I discovered that while I could get it to offer its RSA host key, I could not get it to offer an Ed25519 key. At first I wrote this off as 'well, CentOS 7 is old', but then I noticed that this machine actually had an Ed25519 host key in /etc/ssh, and this evening I did some more digging to try to turn up the answer, which turned out to not be what I expected.

(CentOS 7 apparently didn't originally support Ed25519 keys, but it clearly got updated at some point with support for them.)

So, without further delay and as a note to myself, the SSH host key types a remote OpenSSH server will offer to you are controlled by the intersection of three (or four) things:

  • What host key algorithms your client finds acceptable. With modern versions of OpenSSH you can find out your general list with 'ssh -Q HostKeyAlgorithms', although this may not be the algorithms offered for any particular connection. You can see the offered algorithms with 'ssh -vv <host>', in the 'debug2: host key algorithms' line (well, the first line).

    (You may need to alter this, among other settings, to talk to sufficiently old SSH servers.)

  • What host key algorithms the OpenSSH server has been configured to offer in any 'HostKeyAlgorithms' lines in sshd_config, or some default host key algorithm list if you haven't set this. I think it's relatively uncommon to set this, but on some Linuxes this may be affected by things like system-wide cryptography policies that are somewhat opaque and hard to inspect.

  • What host keys on the server are configured in 'HostKey' directives in your sshd_config (et al). If you have no HostKey directives, a default set is used. Once you have any HostKey directive, only explicitly listed keys are ever used. Related to this is that the host key files must actually exist and have the proper permissions.

(I believe that you can see the union of the latter two with 'ssh -vv' in the second 'debug2: host key algorithms:' line. I wish ssh would put 'client' and 'server' into these lines.)
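
For reference, the sort of commands involved in checking all of this look like the following (hostnames are placeholders):

# What host key algorithms my client supports in general:
ssh -Q HostKeyAlgorithms

# The two 'debug2: host key algorithms' lines for a specific connection;
# the first is what the client offers, the second reflects the server side.
ssh -vv somehost 2>&1 | grep 'host key algorithms'

# What host keys a server will actually present, by key type:
ssh-keyscan -t rsa,ecdsa,ed25519 somehost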

This last issue was the problem with this particular CentOS 7 server. Somehow, it had wound up with an /etc/ssh/sshd_config that had explicit HostKey lines but didn't include its Ed25519 key file. It supported Ed25519 fine, but it couldn't offer an Ed25519 key because it didn't have one. Oops, as they say.

(It's possible that this is the result of CentOS 7's post-release addition of Ed25519 keys combined with us customizing this server's /etc/ssh/sshd_config before then, since this server has an sshd_config.rpmnew.)

This also illustrates that your system may generate keys (or have generated keys) for key algorithms it's not going to use. The mere presence of an Ed25519 key in /etc/ssh doesn't mean that it's actually going to get used, or at least used by the server.

Just to be confusing, what SSH key types the OpenSSH ssh program will offer for host-based authentication aren't necessarily the same as what will be offered by the server on the same machine. The OpenSSH ssh doesn't have a 'HostKey' directive and will use any host key it finds using a set of hard-coded names, provided that it's allowed by the client 'HostKeyAlgorithms' setting. So you can have your ssh client trying to use an Ed25519 or ECDSA host key that will never be offered by the OpenSSH server.

PS: Yes, we still have CentOS 7 machines running, although not for much longer. That was sort of why I was looking at the SSH host keys for this machine.

OpenSSH sshd's 'MaxStartups' setting and Internet-accessible machines

By: cks
6 May 2024 at 02:34

Last night, one of our compute servers briefly stopped accepting SSH connections, which set off an alert in our monitoring system. On compute servers, the usual cause for this is that some program (or set of them) has run the system out of memory, but on checking the logs I saw that this wasn't the case. Instead, sshd had logged the following (among other things):

sshd[649662]: error: beginning MaxStartups throttling
sshd[649662]: drop connection #11 from [...]:.. on [...]:22 past MaxStartups

I'm pretty sure I'd seen this error before, but this time I did some reading up on things.

MaxStartups is a sshd configuration setting that controls how many concurrent unauthenticated connections there can be. This can either be a flat number or a setup that triggers random dropping of such connections with a certain probability. According to the manual page (and to comments in the current Ubuntu 22.04 /etc/ssh/sshd_config), the default value is '10:30:100', which drops 30% of new connections if there are already 10 unauthenticated connections and all of them if there are 100 such connections (and a scaled drop probability between those two).

(OpenSSH sshd also can apply a per-'source' limit using PerSourceMaxStartups, where a source can be an individual IPv4 or IPv6 address or a netblock, based on PerSourceNetBlockSize.)

Normal systems probably don't have any issue with this setting and its default, but for our sins some of our systems are exposed to the Internet for SSH logins, and attackers probe them (and these attackers are back in action these days after a pause we noticed in February). Apparently enough attackers were making enough attempts early this morning to trigger this limit. Unfortunately this limit is a global setting, with no way to give internal IPs a higher limit than external ones (MaxStartups is not one of the directives that can be included in Match blocks).

Now that I've looked into this, I think that we may want to increase this setting in our environment. Ten unauthenticated connections is not all that many for an Internet-exposed system that's under constant SSH probes, and our Internet-accessible systems aren't short of resources; they could likely afford a lot more such connections. Our logs suggest we see this periodically across a number of systems, which is more or less what I'd expect if they come from attackers randomly hitting our systems. Probably we want to keep the random drop bit instead of creating a hard wall, but increase the starting point of the random drops to 20 or 30 or so.
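
In sshd_config terms, that might look like the following sketch (the specific numbers are illustrative, and PerSourceMaxStartups needs a reasonably recent OpenSSH):

# Start randomly dropping unauthenticated connections at 30 instead of 10,
# and still refuse everything at 100.
MaxStartups 30:30:100

# Optionally also cap how many unauthenticated connections any single
# source can have at once.
PerSourceMaxStartups 5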

(Unfortunately I don't think sshd reports how many concurrent unauthenticated connections it has until it starts dropping them, so you can't see how often you're coming close to the edge.)

We have our first significant batch of servers that only have UEFI booting

By: cks
5 May 2024 at 02:48

UEFI has been the official future of x86 PC firmware for a very long time, and for much of that time your machine's UEFI firmware has still been willing to boot your systems the traditional way x86 PCs booted before UEFI, with 'BIOS MBR' (generally using UEFI CSM booting). Some people have no doubt switched to booting their servers with UEFI (booting) years ago, but for various reasons we have long preferred BIOS (MBR) booting and almost always configured our servers that way if given a choice. Over the years we've wound up with a modest number of servers which only supported UEFI booting, but the majority of our servers and especially our generic 1U utility servers all supported BIOS MBR booting.

Well, those days are over now. We're refreshing our stock of generic 1U utility servers and the new generation are UEFI booting only. This is probably not surprising to anyone, as Intel has been making noises about getting rid of UEFI CSM booting for some time, and was apparently targeting 'by 2024' for server platforms. Well, it is 2024 and here we are with new Intel based server hardware without what Intel calls 'legacy boot support'.

(I'm aware we're late to this party, and it's quite possible that server vendors dropped legacy boot mode a year or two ago. We don't buy generic 1U servers very often; we tend to buy them in batches when we have the money and this doesn't happen regularly.)

To be honest, I don't expect UEFI booting to make much of a visible difference in our lives, and it may improve them in some ways (for example if our Linux kernels use UEFI to store crash information). I think we were right to completely avoid the early implementations of UEFI booting, but it ought to work fine by now if server vendors are accepting Intel shoving legacy boot support overboard. There will be new things we'll have to do on servers with mirrored system disks when we replace a failed disk, but Ubuntu's multi-disk UEFI boot story is in decent shape these days and our system disks don't fail that often.
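
(As an aside, it's easy to check from a running Linux system whether it actually booted via UEFI, and to look at the firmware's boot entries; the latter needs efibootmgr installed and root.)

# /sys/firmware/efi only exists when the kernel was booted through UEFI.
test -d /sys/firmware/efi && echo "UEFI boot" || echo "legacy BIOS boot"

# List (and, carefully, manipulate) the firmware's boot entries.
efibootmgr -v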

(However, UEFI booting does introduce some new failure modes. We probably won't run into corrupted EFI System Partitions, since their contents don't get changed very often these days.)

Having a machine room can mean having things in your machine room

By: cks
2 May 2024 at 02:09

Today we discovered something:

Apparently our (university) machine room now comes with the bonus of a visiting raccoon. I have nothing against Toronto's charming trash pandas, but I do have a strong preference for them to be outdoors and maybe a bit distant.

(There are so far no signs that the raccoon has decided to be a resident of the machine room. Hopefully it is too cool in the room for it to be interested in that.)

Naturally there is a story here. This past Monday morning (what is now two days ago), we discovered that over the weekend, one of the keyboards we keep sitting around our machine room had been fairly thoroughly smashed, with keycaps knocked off and some scattered some distance around the rack. This was especially alarming because the keyboard (and its associated display) were in our rack of fileservers, which are some of our most critical servers. The keyboard had definitely not been smashed up last Friday, and nothing else seemed to have been disturbed or moved, not even the wires dangling near the keyboard.

Initially we suspected that some contractor had been in the room over the weekend to do work on the air conditioning, wire and fiber runs that go through it (and are partially managed by other people in entirely other groups), or something of that nature, had dropped something on the keyboard, and had decided not to mention it to anyone. Today people poked around the assorted bits of clutter in the corners of the room and discovered increasingly clear evidence of animal presence near our rack of fileservers. The fileserver rack (and the cluttered corner where further evidence was uncovered) are right by a vertical wiring conduit that runs up through the ceiling to higher floors. One speculation is that our (presumed) raccoon was jumping into our fileserver rack in order to climb up to get back into the wiring conduit.

Probably not coincidentally, we had recently had some optical fiber runs between floors suddenly go bad after years of service and with no activity near them that we knew of. One cause we had already been speculating about was animals either directly damaging a fiber strand or bending it enough to cause transmission problems. And in the process of investigating this, last week we'd found out that there was believed to be some degree of animal presence up in the false ceiling of the floor our machine room is on.

We haven't actually seen the miscreant in question, and I hope we don't (trapping it is the job of specialists that the university has already called in). My hope is that the raccoon has decided that our machine room is entirely boring and not worth coming back to, because a raccoon that felt like playing around with the blinking lights and noise-making things could probably do an alarming amount of damage.

(I've always expected that we periodically have mice under the raised floor of our machine room, but the thought of a raccoon is a new one. I'll just consider it a charm of having physical servers in our own modest machine room.)

How I (used to) handle keeping track of how I configured software

By: cks
29 April 2024 at 03:26

Once upon a time, back a reasonable while ago, I used to routinely configure (in the './configure' sense) and build a fair amount of software myself, software that periodically got updates and so needed me to rebuild it. If you've ever done this, you know that one of the annoying things about this process is keeping track of just what configuration options you built the software with, so you can re-run the configuration process as necessary (which may be on new releases of the software, but also when you do things like upgrade your system to a new version of your OS). Since I'm that kind of person, I naturally built a system to handle this for me.

How the system worked was that the actual configuration for each program or package was done by a little shell script snippet that I stored in a directory under '$HOME/lib'. Generally the file name of the snippet was the base name of the source directory I would be building in, so for example 'fvwm-cvs'. Also in this directory was a 'MAPPINGS' file that mapped from full or partial paths of the source directory to the snippet to use for that particular thing. To actually configure a program, I ran a script, inventively called 'doconfig'. Doconfig searched the MAPPINGS file for, well, let me just quote from comments in the script:

Algorithm: we have a file called MAPPINGS.
We search for first the full path of the current directory and then it with successive things sawn off the front; if we get a match, we use the filename named.
Otherwise, we try to use the basename of the directory as a file. Otherwise we error out.
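
A sketch of how a script along these lines can work (this is a reconstruction of the idea for illustration, not my actual doconfig):

#!/bin/sh
# Sketch only: snippets and MAPPINGS live under $HOME/lib/doconfig.
cdir="$HOME/lib/doconfig"
here="$(pwd)"

# Look for the full path in MAPPINGS, then the path with successive
# leading components sawn off the front.
m=""
p="$here"
while [ -z "$m" ]; do
    m="$(awk -v p="$p" '$1 == p { print $2; exit }' "$cdir/MAPPINGS" 2>/dev/null)"
    case "$p" in
        */*) p="${p#*/}" ;;
        *)   break ;;
    esac
done

# Fall back to the basename of the current directory, then give up.
snippet="$cdir/${m:-$(basename "$here")}"
if [ -r "$snippet" ]; then
    . "$snippet"
else
    echo "doconfig: no configuration snippet for $here" 1>&2
    exit 1
fi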

There's nothing particularly special about my script and my system for keeping track of how I built software. There probably are tons of versions and variations of it that people have created for themselves over the years. This is just the sort of thing you want to do when you get tired of trying to re-read 'config.log' files or whatever, and realize that you forgot how you built the software the last time around, and so on.

(Having written this up I've realized that I should still be using it, because these days I'm building or re-building a number of things and I've slid back to the old silly ways of trying to do it all by hand.)

PS: At work we don't have any particular system for keeping track of software build instructions. Generally, if we have to build something from source, we put the relevant command lines and other information in our build instructions.

Autoconf and configure features that people find valuable

By: cks
28 April 2024 at 02:29

In the wake of the XZ Utils backdoor, which involved GNU Autoconf, it has been popular to suggest that Autoconf should go away. Some of the people suggesting this have also been proposing that the replacement for Autoconf and the 'configure' scripts it generates be something simpler. As a system administrator who interacts with configure scripts (and autoconf) and who deals with building projects such as OpenZFS, it is my view that people proposing simpler replacements may not be seeing the features that people like me find valuable in practice.

(For this I'm setting aside the (wasteful) cost of replacing Autoconf.)

Projects such as OpenZFS and others rely on their configuration system to detect various aspects of the system they're being built on that can't simply be assumed. For OpenZFS, this includes various aspects of the (internal) kernel 'API'; for other projects, such as conserver, this covers things like whether or not the system has IPMI libraries available. As a system administrator building these projects, I want them to automatically detect all of this rather than forcing me to do it by hand to set build options (or demanding that I install all of the libraries and so on that they might possibly want to use).

As a system administrator, one large thing that I find valuable about configure is that it doesn't require me to change anything shipped with the software in order to configure the software. I can configure the software using a command line, which means that I can use various means to save and recall that command line, ranging from 'how to build this here' documentation to automated scripts.

Normal configure scripts also let me and other people set the install location for the software. This is a relatively critical feature for programs that may be installed as a Linux distribution package, as a *BSD third party package, by the local system administrator, or by an individual user putting them somewhere in their own home directory, since all four of these typically need different install locations. If a replacement configure system does not accept at least a '--prefix' argument or the equivalent, it becomes much less useful in practice.

Many GNU configure scripts also let the person configuring the software set various options for what features it will include, how it will behave by default, and so on. How much these are used varies significantly between programs (and between people building the program), but some of the time they're critical for selecting defaults and enabling (or disabling) features that not everyone wants. A replacement configure system that doesn't support build options like these is less useful for anyone who wants to build such software with non-standard options, and it may force software to drop build options entirely.

(There are some people who would say that software should not have build options any more than it should have runtime configuration settings, but this is not exactly a popular position.)
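
To make this concrete, what I want to be able to record in build documentation and re-run later boils down to a single command line along these lines (the install location and the option names here are made up):

./configure --prefix=/opt/somepackage \
    --enable-some-feature \
    --without-some-library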

This is my list, so other people may well value other features that are supported by Autoconf and configure (for example, the ability to set C compiler flags, or that it's well supported for building RPMs).

I wish projects would reliably use their release announcements mechanisms

By: cks
27 April 2024 at 03:19

Today, not for the first time, I discovered that one project that we use locally had made a new release (of one component) by updating my local copy of their Git repository and noticing that 'git pull' had fetched a new tag. Like various other projects, this project has an official channel to announce new releases of their various components; in this case, a mailing list. Sadly, the new release had not been announced on that mailing list, although other releases have been in the past.

This isn't the only project that does things like this and as a busy system administrator, I wish that they wouldn't. In some ways it's more frustrating to have an official channel for announcements and then to not use it consistently than to have no such channel and force me to rely on things like how Github based projects have a RSS feed of releases. With no channel (or a channel that never gets used), at least I know that I can't rely on it and I'm on my own. An erratic announcement channel makes me miss things.

(It may also cause me to use a release before it is completely ready. There are projects that publish tags and releases in their VCS repositories before they consider the releases to be officially released and something you should use. If I have to go to the VCS repository to find out about (some) new releases, I'm potentially going to be jumping the gun some of the time. Over the years I've built up a set of heuristics for various projects where I know that, for example, a new major release will almost always be officially announced somehow so I should wait to see that, but a point release may not get anything beyond a VCS tag.)

In today's modern Internet world, some of the projects that do this may have a different (and not well communicated) view of what their normal announcement mechanism actually is. If a project has an announcements mailing list and an online discussion forum, for example, perhaps their online forum is where they expect people to go for this sort of thing and there's a de facto policy that only major releases are sent to the mailing list. I tend not to look at such forums, so I'd be missing this sort of thing.

(Some projects may also have under-documented policies on what is worth 'bothering' people about through their documented announcements mechanism and what isn't. I wish they would announce everything, but perhaps other people disagree.)

Pruning some things out with (GNU) find options

By: cks
25 April 2024 at 02:32

Suppose that you need to scan your filesystems and pass some files with specific names, ownerships, or whatever, except that you want to exclude scanning under /tmp and /var/tmp (as illustrative examples). Perhaps also you're feeding the file names to a shell script, especially in a pipeline, which means that you'd like to screen out directory and file names that have (common) problem characters in them, like spaces.

(If you can use Bash for your shell script, the latter problem can be dealt with because you can get Bash to read NUL-terminated lines that can be produced by 'find ... -print0'.)

Excluding things from 'find' results is done with find's -prune action, which is a little bit tricky to use when you want to exclude absolute paths (well okay it's a little bit tricky in general; see this SO question and answers). To start with, you're going to want to generate a list of filesystems and then scan them by absolute path:

FSES="$(... something ...)"
for fs in $FSES; do
    find "$fs" -xdev [... magic ...]
done

Starting with an absolute path to the filesystem (instead of cd'ing into the root of the filesystem and doing 'find . -xdev [...]') means that we can now use absolute paths in find's -path argument instead of ones relative to the filesystem root:

find "$fs" -xdev '(' -path /tmp -o -path /var/tmp ')' -prune -o ....

With absolute paths, we don't have to worry about what if /var or /tmp (or /var/tmp) are separate filesystems, instead of being directories on the root filesystem. Although it's hard to work out without experimentation, -xdev and -prune combine the way we want.

(If we're running 'find' on a filesystem that doesn't contain either /tmp or /var/tmp, we'll waste a bit of CPU time having 'find' evaluate those -path arguments all the time despite it never being possible for them to match. This is unimportant when compared to having a simpler, less error prone script.)

If we want to exclude paths with spaces in them, this is easily done with '-name "* *"'. If we want to get all whitespace, we need GNU Find and its '-regex' argument, documented best in "Regular Expressions" in the info documentation. Because we want to use a character class to match whitespace, we need to use one of the regular expression types that include this, so:

find "$fs" -regextype grep ... -regex '.*[[:space:]].*' ...

On the whole, 'find' is an awkward tool to use for this sort of filtering. Unfortunately it's sometimes what we turn to because our other options involve things like writing programs that consume and filter NUL-terminated file paths.

(And having 'find' skip entire directory trees is more efficient than letting it descend into them, print all their file paths, and then filtering the file paths out later.)

PS: One of the little annoyances of Unix for system administrators is that so many things in a stock Unix environment fall apart the moment people start putting odd characters in file names, unless you take extreme care and use unusual tools. This often affects sysadmins because we frequently have to deal with other people's almost arbitrary choices of file and directory names, and we may be dealing with actively malicious attackers for extra concern.

Sidebar: Reading null-terminated lines in Bash

Bash's version of the 'read' builtin supports a '-d' argument that can be used to read NUL-terminated lines:

while IFS= read -r -d '' line; do
  [ ... use "$line" ... ]
done

You still have to properly quote "$line" in every single use, especially as you're doing this because you expect your lines (or filenames) to sometimes contain troublesome characters. You should definitely use Shellcheck and pay close attention to its warnings (they're good for you).
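
One wrinkle to remember is that in Bash, piping 'find ... -print0' into this loop runs the loop in a subshell, so any variables you set inside it vanish afterward. Feeding the loop with process substitution avoids that (a sketch):

while IFS= read -r -d '' line; do
    printf 'saw: %s\n' "$line"
done < <(find "$fs" -xdev -type f -print0)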

IPMI connections have privilege levels, not just IPMI users

By: cks
17 April 2024 at 02:56

If you want to connect to a server's IPMI over the network, you normally need to authenticate as some IPMI user. When you set that IPMI user up, you'll give it one of three or four privilege levels; ADMINISTRATOR, OPERATOR, USER, or what I believe is rarely used, CALLBACK. For years, when I tried to set up IPMIs for things like reading sensors over the network, remote power cycling, or Serial over LAN console access, I'd make a special IPMI user for the purpose and try to give it a low privilege level, but the low privilege level basically never worked so I'd give up, grumble, and make yet another ADMINISTRATOR user. Recently I discovered that I had misunderstood what was going on, which is that both IPMI users and IPMI connections have a privilege level.

When you make an IPMI connection with, for example, ipmitool, it will ask for that connection to be at some privilege level. Generally the default privilege level that things ask for is 'ADMINISTRATOR', and it's honestly hard to blame them. As far as I know there is no standard for what operations require what privilege level; instead it's up to the server or BMC vendor to decide what level they want to require for any particular IPMI command. But everyone agrees that 'ADMINISTRATOR' is the highest level, so it's the safest to ask for as the connection privilege level; if the BMC doesn't let you do it at ADMINISTRATOR, you probably can't do it at all.

The flaw in this is that an IPMI user's privilege level constrains what privilege level you can ask for when you authenticate as that user. If you make a 'USER' privileged IPMI user, connect as it, and ask for ADMINISTRATOR privileges, the BMC is going to tell you no. Since ipmitool and other tools were always asking for ADMINISTRATOR by default, they would get errors unless I made my IPMI users have that privilege level. Once I discovered and realized this, I could explicitly tell ipmitool and other things to ask for less privilege and then work out exactly what privilege level I needed for a particular operation on a particular BMC.
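
With ipmitool, asking for a lower privilege level is done with '-L' (host and user names here are placeholders, plus however you normally supply the IPMI password, for example -E or prompting with -a):

# Read sensor data at USER privilege:
ipmitool -I lanplus -H some-bmc -U sensoruser -L USER sdr list

# Serial over LAN at OPERATOR privilege (on BMCs that allow it):
ipmitool -I lanplus -H some-bmc -U soluser -L OPERATOR sol activate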

(It is probably safe to assume that a 'USER' privileged IPMI user (well, connection) can read sensor data. Experimentally, at least one vendor's BMC will do Serial over LAN at 'OPERATOR' privilege, but I wouldn't be surprised if some require 'ADMINISTRATOR' for that, since serial console access is often the keys to the server itself. Hopefully power cycling the server is an 'OPERATOR' level thing, but again perhaps not on some BMCs.)

PS: If there's a way to have ipmitool and other things ask for 'whatever the (maximum) privilege level this user has', it's not obvious to me in things like the ipmitool manual page.

NAT'ing on the firewall versus host routes for public IPs

By: cks
8 April 2024 at 02:38

In a comment on my entry on solving the hairpin NAT problem with policy based routing, Arnaud Gomes suggested an alternative approach:

Since you are adding an IP address to the server anyway, why not simply add the public address to a loopback interface, add a route on the firewall and forgo the DNAT completely? In most situations this leads to a much simpler configuration.

This got me to thinking about using this approach as a general way to expose internal servers on internal networks, as an alternative to NAT'ing them on our external firewall. This approach has some conceptual advantages, including that it doesn't require NAT, but unfortunately it's probably significantly more complex in our network environment and so much less attractive than NAT'ing on the external firewall.

There are two disadvantages of the routing approach in an environment like ours. The first disadvantage is that it probably only works easily for inbound connections. If such an exposed server wants to make outgoing connections that will appear to come from its public IP, it needs to explicitly set the source IP for those connections instead of allowing the system to choose the default. Potentially you can solve this on the external firewall by NAT'ing outgoing connections to its public IP, but then things are getting complicated, since you can have two machines generating traffic with the same IP.
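
For illustration, the machine-side setup plus one way to pin the outgoing source address might look like this sketch (all addresses and interface names are made up, and applications can instead bind to the public IP explicitly):

# On the internal server: accept traffic for the public IP.
ip addr add 203.0.113.10/32 dev lo

# Hint that outgoing traffic should prefer the public IP as its source.
ip route change default via 192.168.100.1 dev eth0 src 203.0.113.10

# On the firewall (and anything else that needs it): a host route pointing
# the public IP at the server's internal address.
ip route add 203.0.113.10/32 via 192.168.100.20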

The second disadvantage is that we'd have to establish and maintain a collection of host source routes in multiple places. Our core router would need the routes, the routing firewall each such machine was behind would need to have the route, and probably we'd want other machines and firewalls to also have these host routes. And every time we added, removed, or changed such a machine we'd have to update these routes. We especially don't like to frequently update our core router precisely because it is our core router.

The advantage of doing bidirectional NAT on our external firewall for these machines is the reverse of these issues. There's only one place in our entire network that really has to know about which internal machine is which public IP. Of course this leaves us with the hairpin NAT problem and needing split horizon DNS, but those are broadly considered solved problems, unlike maintaining a set of host routes.

On the other hand, if we already had a full infrastructure for maintaining and updating routing tables, the non-NAT approach might be easy and natural. I can imagine an environment where you propagate route announcements through your network so that everyone can automatically track and know where certain public IPs are. We'd still need firewall rules to allow only certain sorts of traffic in, though.

An issue with Alertmanager inhibitions and resolved alerts

By: cks
3 April 2024 at 03:02

Prometheus Alertmanager has a feature called inhibitions, where one alert can inhibit other alerts. We use this in a number of situations, such as our special 'there is a large scale problem' alert inhibiting other alerts and some others. Recently I realized that there is a complication in how inhibitions interact with being notified about resolved alerts (due to this mailing list thread).

Suppose that you have an inhibition rule to the effect that alert A ('this host is down') inhibits alert B ('this special host daemon is down'), and you send notifications on resolved alerts. With alert A in effect, every time Alertmanager goes to send out a notification for the alert group that alert B is part of, Alertmanager will see that alert B is inhibited and filter it out (as far as I can tell this is the basic effect of Alertmanager silences, inhibitions, and mutes). Such notifications will (potentially) happen on every group_interval tick.
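
For concreteness, such an inhibition rule might look like this in alertmanager.yml (the alert and label names are made up):

inhibit_rules:
  - source_matchers:
      - alertname = "HostDown"
    target_matchers:
      - alertname = "HostDaemonDown"
    equal: ['host']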

Now suppose that both alert A and alert B resolve at more or less the same time (because the host is back up along with its special daemon). Alertmanager doesn't immediately send notifications for resolved alerts; instead, just like all other alert group re-notifications, they wait for the next group_interval tick. When this tick happens, alert B will be a resolved alert that you should normally be notified about, and alert A will no longer be active and so no longer inhibiting it. You'll receive a potentially surprising notification about the now-resolved alert B, even though it was previously inhibited while it was active (and so you may never have received an initial notification that it was active).

(Although I described it as both alerts resolving around the same time, it doesn't have to be that way; alert A might have ended later than B, with some hand-waving and uncertainty. The necessary condition is for alert A and its inhibition to no longer be in effect when Alertmanager is about to process a notification that includes alert B's resolution.)

The consequence of this is that if you want inhibitions to reliably suppress notification about resolved alerts, you need the inhibiting alert to be active at least one group_interval longer than the alerts it's inhibiting. In some cases this is easy to arrange, but in other cases it may be troublesome and so you may want to simply live with the extra notifications about resolved alerts.

(The longer your 'group_interval' setting is, the worse this gets, but there are a number of reasons you probably want group_interval to be relatively short, including prompt notifications about resolved alerts under normal circumstances.)

What Prometheus Alertmanager's group_interval setting means

By: cks
3 April 2024 at 00:43

One of the configuration settings in Prometheus Alertmanager for 'routes' is the alert group interval, the 'group_interval' setting. The Alertmanager configuration describes the setting this way:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent.

As has come up before more than once, this is not actually accurate. The group interval is not a (minimum) delay; it is instead a timer that ticks every so often (a ticker). If you have group_interval set to five minutes, Alertmanager will potentially send another notification only at every five minute interval after the first notification (what I'll call a tick). If the initial notification happened at 12:10, the first re-notification might happen at 12:15, and then at 12:20, and then at 12:25, and so on.

(The timing of these ticks is based purely on when the first notification for an alert group is sent, so usually they will not be so neatly lined up with the clock.)
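
For reference, this is the setting in the Alertmanager routing configuration that I'm talking about (the receiver name and the specific values are just illustrative):

route:
  receiver: default-email
  group_by: ['alertname', 'host']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h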

If a new alert (or a resolved alert) misses the group_interval tick by even a second, a notification including it won't go out until the next tick. If the initial alert group notification happened at 12:10 and then nothing changed until a new alert was raised at 12:31, Alertmanager will not send another notification until the group_interval tick at 12:35, even though it's been much more than five minutes since the last notification.

This gives you an unfortunate tradeoff between prompt notification of additional alerts in an alert group (or of alerts being resolved) and not receiving a horde of notifications. If you want to receive a prompt notification, you need a short group_interval, but then you can receive a stream of notifications as alert after alert after alert pops up one by one. It would be nicer if Alertmanager didn't have this group_interval tick behavior but would instead treat it as a minimum delay between successive notifications, but I don't expect Alertmanager to change at this point.

(I've written all of this down before in various entries, so this is mostly to have a single entry I can link to in the future when group_interval comes up.)

The power of being able to query your servers for unpredictable things

By: cks
2 April 2024 at 03:04

Today, for reasons beyond the scope of this entry, we wanted to find out how much disk space /var/log/amanda was using on all of our servers. We have a quite capable metrics system that captures the amount of space filesystems are using (among many other things), but /var/log/amanda wasn't covered by this because it wasn't a separate filesystem; instead it was just one directory tree in either the root filesystem (on most servers) or the /var filesystem (on a few fileservers that have a separate /var). Fortunately we don't have too many servers in our fleet and we have a set of tools to run commands across all of them, so answering our question was pretty simple.
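
Stripped of any special tooling, this sort of question comes down to a loop along these lines (a sketch; the host list file is a placeholder for however you enumerate your servers):

for h in $(cat ourhosts.txt); do
    printf '%s: ' "$h"
    ssh -n "$h" 'du -sh /var/log/amanda 2>/dev/null || echo missing'
done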

This isn't the first time we've wanted to know some random thing about some or all of our servers, and it won't be the last time. The reality of life is that routine monitoring can't possibly capture every fact you'll ever want to know, and you shouldn't even try to make it do so (among other issues, you'd be collecting far too much information). Sooner or later you're going to need to get nearly arbitrary information from your servers, using some mechanism.

This mechanism doesn't necessarily need to be SSH, and it doesn't even need to involve connecting to servers, depending in part on how many of them you have. Perhaps you'll normally do it by peering inside one of your immutable system images to answer questions about it. But on a moderate scale my feeling is that 'run a command on some or all of our machines and give me the output' is the basic primitive you're going to wind up wanting, partly because it's so flexible.

(One advantage of using SSH for this is that SSH has a mature, well understood and thoroughly hardened authentication and access control system. Other methods of what are fundamentally remote command or code execution may not be so solid and trustworthy. And if you want to, you can aggressively constrain what a SSH connection can do through additional measures like forcing it to run in a captive environment that only permits certain things.)

PS: The direct answer is that on everything except our Amanda backup servers, /var/log/amanda is at most 20 Mbytes or so, and often a lot less. After the Amanda servers, our fileservers have the largest amount of data there. In our environment, this directory tree is only used for what are basically debugging logs, and I believe that on clients, the amount of debugging logs you wind up with scales with the number of filesystems you're dealing with.

The Prometheus scrape interval mistake people keep making

By: cks
31 March 2024 at 02:22

Prometheus gathers metrics by scraping metrics exporters every so often, which means that it has a concept of the scrape interval, how frequently it should scrape a metrics source (a target). Prometheus also has recording rules and alerting rules, both of which have to be evaluated every so often; these also have an evaluation interval. Every so often, someone shows up on the Prometheus mailing list to say, more or less, 'I have a source of metrics that only updates every half hour, so I set my scrape interval to half an hour and everything went mysteriously wrong'.

The reason everything goes wrong if you set a long scrape interval (or a rule evaluation interval) is that Prometheus has an under-documented idea that metric samples go stale after a while. Or to put it another way, when you make a Prometheus query, it only looks back so far to find 'the current value of a metric'. This period is five minutes by default, and changing it is not at all obvious. If you scrape a metric too slowly, the most recent sample will routinely go stale and stop being visible to your queries and alerts. If you scrape something only every half an hour, your metrics from that scrape will be good for five minutes and then stale (and invisible) for the next 25 (more or less). This is unlikely to be what you want.

(Because recording rules and alerting rules create metrics, their evaluation intervals are also subject to this issue. This is pretty clear with recording rules, since their whole purpose is to create new metrics, but isn't as obvious with alerting rules.)

Unfortunately, Prometheus does nothing to stop you from configuring this by accident or ignorance, and people routinely do. You can set a scrape interval of ten minutes, or a half an hour, or an hour, and get not so much as a vague warning from Prometheus when it checks your configuration and starts up. Nor is there so much as a caution about this in the configuration documentation, at least currently.

(The usual safe recommendation is that your scrape interval be no longer than about two minutes, so that you can miss one scrape without metrics going stale.)

If you have a source of metrics that both change infrequently and are expensive to generate, the usual recommendation is that you generate them under your own control and then publish them somewhere, for example in Pushgateway or as text files that are collected through the Prometheus host agent's 'textfiles' collector. If the metrics merely change infrequently but are cheap to collect, Prometheus is quite efficient about storing unchanged metrics so you might as well scrape frequently.
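
A common pattern for the 'generate under your own control' case is a cron job that writes metrics into the host agent's textfile collector directory and renames the file into place, so Prometheus never scrapes a half-written file (the directory and the generator script here are made up; the directory is whatever you gave the host agent's --collector.textfile.directory option):

#!/bin/sh
# Sketch: regenerate expensive metrics from cron every so often.
dir=/var/lib/node_exporter/textfile
tmp="$dir/expensive.prom.$$"
our-expensive-metrics-generator > "$tmp" && mv "$tmp" "$dir/expensive.prom"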

PS: The way you change this staleness interval is the command line Prometheus switch '--query.lookback-delta', although making it larger will likely have various effects that increase resource usage. I also suspect that Prometheus is not tested very much with larger settings for this, especially ones substantially longer than the default.

The effects of silences (et al) in Prometheus Alertmanager

By: cks
29 March 2024 at 03:11

Prometheus Alertmanager has various features that make it 'silence' alerts. Alerts can be inhibited by other alerts, they can be explicitly silenced, and a route can be muted at certain times or only active at certain times. The Alertmanager documentation generally describes all of these as "suppressing notifications" or causing a route to "not send any notifications". However, this limited description is what I would call under-specified, because there are some questions to ask about exactly what happens when you 'silence' alerts. As of Alertmanager 0.27.0, its actual behavior is somewhat complex and definitely hard to understand.

There are two pieces of behavior that seem straightforward:

  • if an alert starts within the silence and is still in effect at the end, its alert group will receive a new notification at its next group_interval point; this notification will include the new alert (or alerts).

  • if an alert group (of one or more alerts) is created within the silence and all of its alerts end sufficiently before the end of the silence, you will get no notification about the alert group.

The area with big question marks is notifications about resolved alerts (if you have Alertmanager set to send notifications on them at all). If the alert resolves sufficiently early, well before the end of the silence, you appear to get no notification for it. If the alert resolves close enough to the end of the silence and its alert group still has active alerts, you will sometimes get an alert group notification that includes the resolved alert. Sometimes this notification will come immediately, and sometimes it seems to only come if the alert group experiences another change in alert status sufficiently soon after the silence has ended.

(There are a lot of variables here and I haven't experimented extensively. Generally I think the sooner that Alertmanager has some reason to send a notification for the alert group, the higher your chances of hearing about resolved alerts are. One source of such a notification is if there are active alerts that started within the silence.)

What I believe is happening is that Alertmanager is keeping track of what alerts have had notifications delivered about them (through a specific receiver), so that Alertmanager can tell if there are new alerts in an alert group that would cause it to send a notification at the next group_interval point. When a silence, mute, or inhibition is in effect, no affected alerts are marked as 'delivered (to receiver X)'. When the silence ends, any such unmarked alerts that still exist are (once again) considered to be undelivered new alerts and will prompt an alert group notification at the alert group's next group_interval point, just as if they had suddenly shown up after the silence ended.

The complication is resolved alerts, because I believe that resolved alerts only linger in Alertmanager for a certain amount of time. After that time they are quietly removed. If an alert is resolved sufficiently early before the end of the silence, this linger time will end before the silence does and the resolved alert will disappear before its new status could trigger any notifications. If the alert is resolved sufficiently close to the end of the silence, it will still be in Alertmanager when notifications start happening again. I'm pretty sure this explanation is incomplete, but it at least gives me a starting point.

PS: Since all of this is under-documented, Alertmanager's behavior could change in the future, either deliberately or accidentally.

(This somewhat elaborates on some things I said on the Fediverse.)

Some questions to ask about what silencing alerts means

By: cks
28 March 2024 at 03:25

A common desired feature for an alert notification system is that you can silence (some) alert notifications for a while. You might silence alerts about things that are under planned maintenance, or do it generally in the dead of night for things that aren't important enough to wake someone. This sounds straightforward but in practice my simple description here is under-specified and raises some questions about how things behave (or should behave).

The simplest implementation of silencing alert notifications is for the alerting system to go through all of its normal process for sending notifications but not actually deliver the notifications; the notifications are discarded, diverted to /dev/null, or whatever. In the view of the overall system, the alert notifications were successfully delivered, while in your view you didn't get emailed, paged, notified in some chat channel, or whatever.

However, there are a number of situations where you may not want to discard alert notifications this way, but instead defer them until after the silence has ended. Here are some cases:

  • If an alert starts during the silence and is still in effect when the silence ends, many people will want to get an alert notification about it at (or soon after) the end of the silence. Otherwise, you have to remember to look at dashboards or other sources of alert information to see what current problems you have.

  • If an alert started before the silence and ends (resolves) during the silence, some people will want to get an alert notification about the alert having been resolved at the end of the silence. Otherwise you're once again left to look at your dashboards to notice that some things cleaned up during the silence.

    (This assumes you normally send notifications about resolved alerts, which not everyone does.)

  • If an alert both starts and ends during the silence, most people will say that you shouldn't get an alert notification about it afterward. Otherwise silences would simply defer alert notifications about things like planned maintenance, not eliminate them. However, some people would like to get some sort of summary or general notification about alerts that came up and got resolved during the silence.

    (This is perhaps especially likely for the 'silence in the depths of the night' or 'silence over the weekend' sorts of schedule-based silencing. You may still want to know that things happened, just not bother people with them on the spot.)

Whether you want post-silence alert notifications in some or all of these situations will depend in part on what you use alert notifications for (or how the designers of your system expect this to work). In some environments, an alert notification is in effect a message that says 'go look at your dashboards', so you don't need this at the end of a planned maintenance since you're probably already doing that. In other environments, the alert notification is either the primary signal that something is wrong or the primary source of information for what to do about it (by carrying links to runbooks, suggested remediations, relevant dashboards, and so on). Getting an alert notification for 'new' alerts is then vital because that's primarily how you know you have to do something and maybe know what to do.

(And in some environments, getting alert notifications about resolved alerts is the primary method people use to track outstanding alerts, making those important.)
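
To connect the schedule-based case to concrete configuration: in Alertmanager (which is what we use), 'dead of night' or 'over the weekend' silencing is normally expressed with time intervals and a route's mute_time_intervals. A minimal sketch, with made up names and times; exactly what you hear about once the interval ends is the sort of under-specified behavior discussed earlier:

time_intervals:
  - name: overnight
    time_intervals:
      - times:
          - start_time: '00:00'
            end_time: '07:00'

route:
  receiver: team-email
  routes:
    # Alerts matching this route don't get notified about overnight.
    - matchers:
        - severity = "low"
      receiver: team-email
      mute_time_intervals:
        - overnight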

How I would automate monitoring DNS queries in basic Prometheus

By: cks
27 March 2024 at 03:06

Recently I wrote about the problem of using basic Prometheus to monitor DNS query results, which comes about primarily because the Blackbox exporter requires a configuration stanza (a module) for every DNS query you want to make and doesn't expose any labels for what the query type and name are. In a comment, Mike Kohne asked if I'd considered using a script to generate the various configurations needed for this, where you want to check N DNS queries across M different DNS servers. I hadn't really thought about it and we're unlikely to do it, but here is how I would if we did.

The input for the generation system is a list of DNS queries we want to confirm work, which is at least a name and a DNS query type (A, MX, SOA, etc), possibly along with an expected result, and a list of the DNS servers that we want to make these queries against. A full blown system would allow multiple groups of queries and DNS servers, so that you can query your internal DNS servers for internal names as well as external names you want to always be resolvable.

First, I'd run a completely separate Blackbox instance for this purpose, so that its configuration can be entirely script-generated. For each DNS query to be made, the script will work out the Blackbox module's name and then put together the formulaic stanza, for example:

autodns_a_utoronto_something:
  prober: dns
  dns:
    query_name: "utoronto.example.com"
    query_type: "A"
    validate_answer_rrs:
      fail_if_none_matches_regexp:
        - ".*\t[0-9]*\tIN\tA\t.*"

Then your generation program combines all of these stanzas together with some stock front matter and you have this Blackbox instance's configuration file. It only needs to change if you add a new DNS name to query.

The other thing the script generates is a list of scrape targets and labels for them in the format that Prometheus file discovery expects. Since we're automatically generating this file we might as well put all of the smart stuff into labels, including specifying the Blackbox module. This would give us one block for each module that lists all of the DNS servers that will be queried for that module, and the labels necessary. This could be JSON or YAML, and in YAML form it would look like (for one module):

- labels:
    # Hopefully you can directly set __param_module in
    # a label like this.
    __param_module: autodns_a_utoronto_something
    query_name: utoronto.example.com
    query_type: A
    [... additional labels based on local needs ...]
  targets:
  - dns1.example.org:53
  - dns2.example.org:53
  - 8.8.8.8:53
  - 1.1.1.1:53
  [...]

(If we're starting with data in a program it's probably better to generate JSON. Pretty much every language can create JSON by now, and it's a more forgiving format than trying to auto-generate YAML even if the result is less readable. But if I was going to put the result in a version control repository, I'd generate YAML.)

More elaborate breakdowns are possible, for example to separate external DNS servers from internal ones, and other people's DNS names from your DNS names. You'll get an awful lot of stanzas with various mixes of labels, but the whole thing is being generated automatically and you don't have to look at it. In our local configuration we'd wind up with at least a few extra labels and a more complicated set of combinations.

We need the query name and query type available as labels because we're going to write one generic alert rule for all of these Blackbox modules, something like:

- alert: DNSGeneric
  expr: probe_success{probe=~"autodns_.*"} == 0
  for: 5m
  annotations:
    summary: "We cannot get the {{$labels.query_type}} record for {{$labels.query_name}} from the DNS server ..."

(If Blackbox put these labels in DNS probe metrics we could skip adding them in the scrape target configuration. We'd also be able to fold a number of our existing DNS alerts into more generic ones.)
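
As a sketch of the scrape configuration side of this (the job name, file path, and the address of the separate Blackbox instance are all made up), the relabeling below is what ties things together; it passes the target to Blackbox, keeps the DNS server as the instance label, and copies the module name into a plain 'probe' label so the generic alert rule can match on it:

- job_name: autodns
  metrics_path: /probe
  file_sd_configs:
    - files:
        - /etc/prometheus/autodns-targets.yaml
  relabel_configs:
    # Hand the scrape target (the DNS server) to Blackbox as the
    # 'target' URL parameter.
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the DNS server visible as the instance label.
    - source_labels: [__param_target]
      target_label: instance
    # Copy the module name (set in the file_sd labels) into an
    # ordinary 'probe' label for alert rules to use.
    - source_labels: [__param_module]
      target_label: probe
    # Actually scrape the separate Blackbox instance.
    - target_label: __address__
      replacement: 127.0.0.1:9116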

If you go the extra distance to have some DNS lookups require specific results (instead of just 'some A record' or 'some MX record'), then you might need additional labels to let you write a more specific alert rule.

For us, both generated files would be relatively static. As a practical matter we don't add extra names to check or DNS servers to test against very often.

We could certainly write such a configuration file generation system and get more comprehensive coverage of our DNS zones and various nameservers than we currently have. However, my current view is that the extra complexity almost certainly wouldn't be worth it in terms of detecting problems and maintaining the system. We'd make more queries against more DNS servers if it was easier, as it would be with such a generation system, but those queries would almost never detect anything we didn't already know.

Options for diverting alerts in Prometheus

By: cks
26 March 2024 at 02:58

Suppose, not hypothetically, that you have a collection of machines and some machines are less important than others or are of interest only to a particular person. Alerts about normal machines should go to everyone; alerts about the special machines should go elsewhere. There are a number of options to set this up in Prometheus and Alertmanager, so today I want to run down a collection of them for my own future use.

First, you have to decide the approach you'll use in Alertmanager. One option is to specifically configure an early Alertmanager route that knows the names of these machines. This is the most self-contained option, but it has the drawback that Alertmanager routes can often intertwine in complicated ways that are hard to keep track of. For instance, you need to keep your separate notification routes for these machines in sync.

(I should write down in one place the ordering requirements for routes in our Alertmanager configuration, because several times I've made changes that didn't have the effect I wanted because I had the route in the wrong spot.)

The other Alertmanager option is to set up general label-based markers for alerts that should be diverted and rely on Prometheus to get the necessary label on to the alerts about these special machines. My view is that you're going to want to have such 'testing' alerts in general, so sooner or later you're going to wind up with this in your Alertmanager configuration.
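
In Alertmanager, the label-based version looks roughly like the following sketch; the 'send' label, its 'testing' value, and the receiver names are just hypothetical conventions here, not anything standard:

route:
  receiver: everyone
  routes:
    # Divert anything explicitly marked for testing before the
    # normal routes see it.
    - matchers:
        - send = "testing"
      receiver: testing-only

receivers:
  - name: everyone
    email_configs:
      - to: 'sysadmins@example.org'
  - name: testing-only
    email_configs:
      - to: 'one-person@example.org'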

Once Prometheus is responsible for labeling the specific alerts that should be diverted, you have some options:

  • The Prometheus alert rule can specifically add the appropriate label. This works great if it's a testing alert rule that you always want to divert, but less well if it's a general alert that you only want to divert some of the time.

  • You can arrange for metrics from the specific machines to have the special label values necessary. This has three problems. First, it creates additional metrics series if you change how a machine's alerts are handled. Second, it may require ugly contortions to pull some scrape targets out to different sections of a static file, so you can put different labels on them. And lastly, it's error-prone, because you have to make sure all of the scrape targets for the machine have the label on them.

    (You might even be doing special things in your alert rules to create alerts for the machine out of metrics that don't come from scraping it, which can require extra work to add labels to them.)

  • You can add the special label marker in Prometheus alert relabeling, by matching against your 'host' label and creating a new label. This will be something like:

    - source_labels: [host]
      regex: vmhost1
      target_label: send
      replacement: testing
    

    You'll likely want to do this at the end, or at least after any other alert label canonicalization you're doing to clean up host names, map service names to hosts, and so on.

Now that I've sat down and thought about all of these options, the one I think I like the best is alert relabeling. Alert relabeling in Prometheus puts this configuration in one central place, instead of spreading it out over scrape targets and alert rules, and it does so in a setting that doesn't have quite as many complex ordering issues as Alertmanager routes do.
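
In the Prometheus configuration file, this sort of alert relabeling lives in one spot under the 'alerting' section, next to where you point Prometheus at Alertmanager. A minimal sketch (the Alertmanager address is made up):

alerting:
  alert_relabel_configs:
    # Mark all alerts about vmhost1 so that Alertmanager diverts
    # them to the 'testing' destination.
    - source_labels: [host]
      regex: vmhost1
      target_label: send
      replacement: testing
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093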

(Adding labels in alert rules is still the right answer if the alert itself is in testing, in my view.)

The problem of using basic Prometheus to monitor DNS query results

By: cks
16 March 2024 at 02:37

Suppose that you want to make sure that your DNS servers are working correctly, for both your own zones and for outside DNS names that are important to you. If you have your own zones you may also care that outside people can properly resolve them, perhaps both within the organization and genuine outsiders using public DNS servers. The traditional answer to this is the Blackbox exporter, which can send the DNS queries of your choice to the DNS servers of your choice and validate the result. Well, more or less.

What you specifically do with the Blackbox exporter is that you configure some modules and then you provide those modules with targets to check (through your Prometheus configuration). When you're probing DNS, the module's configuration specifies all of the parameters of the DNS query and its validation. This means that if you are checking N different DNS names to see if they give you a SOA record (or an A record or an MX record), you need N different modules. Quite reasonably, the metrics Blackbox generates when you check a target don't (currently) include the DNS name being queried or the query type being made. Why this matters is that it makes it difficult to write a generic alert that will create a specific message that says 'asking for the X type of record for host Y failed'.

You can somewhat get around this by encoding this information into the names of your Blackbox modules and then doing various creative things in your Prometheus configuration. However, you still have to write all of the modules out, even though many of them may be basically cut and paste versions of each other with only the DNS names changed. This has a number of issues, including that it's a disincentive to doing relatively comprehensive cross checks. (I speak from experience with our Prometheus setup.)
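
To make the cut and paste nature concrete, here is a sketch of what two such modules wind up looking like (the module names and DNS names are made up); they differ only in the name being queried:

dns_soa_example_com:
  prober: dns
  dns:
    query_name: "example.com"
    query_type: "SOA"

dns_soa_example_org:
  prober: dns
  dns:
    query_name: "example.org"
    query_type: "SOA"

Multiply this by every name, query type, and validation variant you care about and the module list gets long fast.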

There is a third party dns_exporter that can be set up in a more flexible way where all parts of the DNS check can be provided by Prometheus (although it exposes some metrics that risk label cardinality explosions). However this still leaves you to list in your Prometheus configuration a cross-matrix of every DNS name you want to query and every DNS server you want to query against. What you'll avoid is needing to configure a bunch of Blackbox modules (although what you lose is the ability to verify that the queries returned specific results).

To do better, I think we'd need to write a custom program (perhaps run through the script exporter) that contained at least some of this knowledge, such as what DNS servers to check. Then our Prometheus configuration could just say 'check this DNS name against the usual servers' and the script would know the rest. Unfortunately you probably can't reuse any of the current Blackbox code for this, even if you wrote the core of this script in Go.

(You could make such a program relatively generic by having it take the list of DNS servers to query from a configuration file. You might want to make it support multiple lists of DNS servers, each of them named, and perhaps set various flags on each server, and you can get quite elaborate here if you want to.)

(This elaborates on a Fediverse post of mine.)

You might want to think about if your system serial numbers are sensitive

By: cks
15 March 2024 at 03:03

Recently, a commentator on my entry about what's lost when running the Prometheus host agent as a non-root user on Linux pointed out that if you do this, one of the things omitted (that I hadn't noticed) is part of the system DMI information. Specifically, you lose various serial numbers and the 'product UUID', which is potentially another unique identifier for the system, because Linux makes the /sys/class/dmi/id files with these readable only by root (this appears to have been the case since support for these was added to /sys in 2007). This got me thinking about whether serial numbers are something we should consider sensitive in general.

My tentative conclusion is that for us, serial numbers probably aren't sensitive enough to do anything special about. I don't think any of our system or component serial numbers can be used to issue one time license keys or the like, and while people could probably do some mischief with some of them, this is likely a low risk thing in our academic environment.

(Broadly we don't consider any metrics to be deeply sensitive, or to put it another way we wouldn't want to collect any metrics that are because in our environment it would take a lot of work to protect them. And we do collect DMI information and put it into our metrics system.)

This doesn't mean that serial numbers have no sensitivity even for us; I definitely do consider them something that I generally wouldn't (and don't) put in entries here, for example. Depending on the vendor, revealing serial numbers to the public may let the public do things like see your exact system configuration, when it was delivered, and other potentially somewhat sensitive information. There's also more of a risk that bored Internet people will engage in even minor mischief.

However, your situation is not necessarily like ours. There are probably plenty of environments where serial numbers are potentially more sensitive or more dangerous if exposed (especially if exposed widely). And in some environments, people run semi-hostile software that would love to get its hands on a permanent and unique identifier for the machine. Before you gather or expose serial number information (for systems or for things like disks), you might want to think about this.

At the same time, having relatively detailed hardware configuration information can be important, as in the war story that inspired me to start collecting this information in our metrics system. And serial numbers are a great way to disambiguate exactly which piece of hardware was being used for what, when. We deliberately collect disk drive serial number information from SMART, for example, and put it into our metrics system (sometimes with amusing results).

Why we should care about usage data for our internal services

By: cks
12 March 2024 at 02:47

I recently wrote about some practical-focused thoughts on usage data for your services. But there's a broader issue about usage data for services and having or not having it. My sense is that for a lot of sysadmins, building things to collect usage data feels like accounting work and likely to lead to unpleasant and damaging things, like internal chargebacks (which can create various problems, and also). However, I think we should strongly consider routinely gathering this data anyway, for fundamentally the same reasons as you should collect information on what TLS protocols and ciphers are being used by your people and software.

We periodically face decisions both obvious and subtle about what to do about services and the things they run on. Do we spend the money to buy new hardware, do we spend the time to upgrade the operating system or the version of the third party software, do we need to closely monitor this system or service, does it need to be optimized or be given better hardware, and so on. Conversely, maybe this is now a little-used service that can be scaled down, dropped, or simplified. In general, the big question is do we need to care about this service, and if so how much. High level usage data is what gives you most of the real answers.

(In some environments one fate for narrowly used services is to be made the responsibility of the people or groups who are the service's big users, instead of something that is provided on a larger and higher level.)

Your system and application metrics can provide you some basic information, like whether your systems are using CPU and memory and disk space, and perhaps how that usage is changing over a relatively long time base (if you keep metrics data long enough). But they can't really tell you why that is happening or not happening, or who is using your services, and deriving usage information from things like CPU utilization requires either knowing things about how your systems perform or assuming them (eg, assuming you can estimate service usage from CPU usage because you're sure it uses a visible amount of CPU time). Deliberately collecting actual usage gives you direct answers.

Knowing who is using your services and who is not also gives you the opportunity to talk to both groups about what they like about your current services, what they'd like you to add, what pieces of your service they care about, what they need, and perhaps what's keeping them from using some of your services. If you don't have usage data and don't actually ask people, you're flying relatively blind on all of these questions.

Of course collecting usage data has its traps. One of them is that what usage data you collect is often driven by what sort of usage you think matters, and in turn this can be driven by how you expect people to use your services and what you think they care about. Or to put it another way, you're measuring what you assume matters and you're assuming what you don't measure doesn't matter. You may be wrong about that, which is one reason why talking to people periodically is useful.

PS: In theory, gathering usage data is separate from the question of whether you should pay attention to it, where the answer may well be that you should ignore that shiny new data. In practice, well, people are bad at staying away from shiny things. Perhaps it's not a bad thing to have your usage data require some effort to assemble.

(This is partly written to persuade myself of this, because maybe we want to routinely collect and track more usage data than we currently do.)

Some thoughts on usage data for your systems and services

By: cks
10 March 2024 at 03:10

Some day, you may be called on by decision makers (including yourself) to provide some sort of usage information for things you operate so that you can make decisions about them. I'm not talking about system metrics such as how much CPU is being used (although for some systems that may be part of higher level usage information, for example for our SLURM cluster); this is more on the level of how much things are being used, by who, and perhaps for what. In the very old days we might have called this 'accounting data' (and perhaps disdained collecting it unless we were forced to by things like chargeback policies).

In an ideal world, you will already be generating and retaining the sort of usage information that can be used to make decisions about services. But internal services aren't necessarily automatically instrumented the way revenue generating things are, so you may not have this sort of thing built in from the start. In this case, you'll generally wind up hunting around for creative ways to generate higher level usage information from low level metrics and logs that you do have. When you do this, my first suggestion is write down how you generated your usage information. This probably won't be the last time you need to generate usage information, and also if decision makers (including you in the future) have questions about exactly what your numbers mean, you can go back to look at exactly how you generated them to provide answers.

(Of course, your systems may have changed around by the next time you need to generate usage information, so your old ways don't work or aren't applicable. But at least you'll have something.)

My second suggestion is to look around today to see if there's data you can easily collect and retain now that will let you provide better usage information in the future. This is obviously related to keeping your logs longer, but it also includes making sure that things make it to your logs (or at least to your retained logs, which may mean setting things to send their log data to syslog instead of keeping their own log files). At this point I will sing the praises of things like 'end of session' summary log records that put all of the information about a session in a single place instead of forcing you to put the information together from multiple log lines.

(When you've just been through the exercise of generating usage data is an especially good time to do this, because you'll be familiar with all of the bits that were troublesome or where you could only provide limited data.)

Of course there are privacy implications of retaining lots of logs and usage data. This may be a good time to ask around to get advance agreement on what sort of usage information you want to be able to provide and what sort you definitely don't want to have available for people to ask for. This is also another use for arranging to log your own 'end of session' summary records, because if you're doing it yourself you can arrange to include only the usage information you've decided is okay.

Options for your Grafana panels when your metrics change names

By: cks
2 March 2024 at 04:33

In an ideal world, your metrics never change their names; once you put them into a Grafana dashboard panel, they keep the same name and meaning forever. In the real world, sometimes a change in metric name is forced on you, for example because you might have to move from collecting a metric through one Prometheus exporter to collecting it with another exporter which naturally gives it a different name. And sometimes a metric will be renamed by its source.

In a Prometheus environment, the very brute force way to deal with this is either a recording rule (creating a duplicate metric with the old name) or renaming the metric during ingestion. However I feel that this is generally a mistake. Almost always, your Prometheus metrics should record the true state of affairs, warts and all, and it should be on other things to sort out the results.
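
For completeness, the recording rule version of this brute force approach is only a few lines (the metric names here are made up); the ingestion-time rename is similar in spirit but done with metric relabeling in the scrape configuration:

groups:
  - name: metric-renames
    rules:
      # Recreate the metric under its old name from the new name
      # so that existing dashboard queries keep getting data.
      - record: old_metric_name
        expr: new_metric_name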

(As part of this, I feel that Prometheus metric names should always be honest about where they come from. There's a convention that the name of the exporter is at the start of the metric name, and so you shouldn't generate your own metrics with someone else's name on them. If a metric name starts with 'node_*', it should come from the Prometheus host agent.)

So if your Prometheus metrics get renamed, you need to fix this in your Grafana panels (which can be a pain but is better in the long run). There are at least three approaches I know of. First, you can simply change the name of the metric in all of the panels. This keeps things simple but means that your historical data stops being visible on the dashboards. If you don't keep historical data for very long (or don't care about it much), this is fine; pretty soon the metric's new name will be the only one in your metrics database. In our case, we keep years of data and do want to be able to look back, so this isn't good enough.

The second option is to write your queries in Grafana as basically 'old_name or new_name'. If your queries involve rate() and avg() and other functions, this can be a lot of (manual) repetition, but if you're careful and lucky you can arrange for the old and the new query results to have the same labels as Grafana sees them, so your panel graphs will be continuous over the metrics name boundary.
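
As a sketch of what such a Grafana panel query looks like (the metric and job names are made up), you write both versions and glue them together with 'or':

rate(old_metric_name{job="myapp"}[5m])
  or
rate(new_metric_name{job="myapp"}[5m])

If you manage to get both sides to come out with the same label set (here rate() drops the differing metric names), the graph stays continuous across the rename.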

The third option is to duplicate the query and then change the name of the metric (or the metrics) in the new copy of the query. This is usually straightforward and easy, but it definitely gives you graphs that aren't continuous around the name change boundary. The graphs will have one line for the old metric and then a new second line for your new metric. One advantage of separate queries is that you can someday turn the old query off in Grafana without having to delete it.

Detecting absent Prometheus metrics without knowing their labels

By: cks
29 February 2024 at 03:18

When you have a Prometheus setup, one of the things you sooner or later worry about is important metrics quietly going missing because they're not being reported any more. There can be many reasons for metrics disappearing on you; for example, a network interface you expect to be at 10G speeds may not be there at all any more, because it got renamed at some point, so now you're not making sure the new name is at 10G.

(This happened to us with one machine's network interface, although I'm not sure exactly how except that it involves the depths of PCIe enumeration.)

The standard Prometheus feature for this is the 'absent()' function, or sometimes absent_over_time(). However, both of these have the problem that because of Prometheus's data model, you need to know at least some unique labels that your metrics are supposed to have. Without labels, all you can detect is the total disappearance of the metric, when nothing at all is reporting it. If you want to be alerted when some machine stops reporting a metric, you need to list all of the sources that should have the metric (following a pattern we've seen before):

absent(metric{host="a", device="em0"}) or
 absent(metric{host="b", device="eno1"}) or
 absent(metric{host="c", device="eth2"})

Sometimes you don't know all of the label values that your metric will be present with (or it's tedious to list all of them and keep them up to date), and it's good enough to get a notification if a metric disappears when it was previously there (for a particular set of labels). For example, you might have an assortment of scripts that report their success results somewhere and you don't want to have to keep a list of all of the scripts, but you do want to detect when a script stops reporting its metrics. In this case we can use 'offset' to check current metrics against old metrics. The simplest pattern is:

your_metric offset 1h
  unless your_metric

If the metric was there an hour ago and isn't there now, this will generate the metric as it was an hour ago (with the labels it had then), and you can use that to drive an alert (or at least a notification). If there are labels that might naturally change over time in your_metric, you can exclude them with 'unless ignoring (...)' or use 'unless on (...)' for a very focused result.
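
For example (the 'app_version' label here is a made up stand-in for whatever label churns in your environment):

your_metric offset 1h
  unless ignoring (app_version) your_metric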

As written this has the drawback that it only looks at what versions of the metric were there exactly an hour ago. We can do better by using an *_over_time() function, for example:

max_over_time( your_metric[4h] ) offset 1h
  unless your_metric

Now if your metric existed (with some labels) at any point between five hours ago and one hour ago, and doesn't exist now, this expression will give you a result and you can alert on that. Since we're using *_over_time(), you can also leave off the 'offset 1h' and just extend the time range, and then maybe extend the other time range too:

max_over_time( your_metric[12h] )
  unless max_over_time( your_metric[20m] )

This expression will give you a result if your_metric has been present (with a given set of labels) at some point in the last 12 hours but has not been present within the last 20 minutes.

(You'd pick the particular *_over_time() function to use depending on what use, if any, you have for the value of the metric in your alert. If you have no particular use for the value (or you expect the value to be a constant), either max or min are efficient for Prometheus to compute.)
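
Wrapped up as an alert rule, this looks something like the following sketch (the alert name and annotation wording are made up):

- alert: MetricWentMissing
  expr: max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
  annotations:
    summary: "your_metric from {{ $labels.instance }} has not been reported for at least 20 minutes"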

All of these clever versions have a drawback, which is that after enough time has gone by they shut off on their own. Once the metric has been missing for at least an hour or five hours or 12 hours or however long, even the first part of the expression has nothing and you get no results and no alert. So this is more of a 'notification' than a persistent 'alert'. That's unfortunately the best you can really do. If you need a persistent alert that will last until you take it out of your alert rules, you need to use absent() and explicitly specify the labels you expect and require.

Our probably-typical (lack of) machine inventory situation

By: cks
28 February 2024 at 04:03

As part of thinking about how we configure machines to monitor and what to monitor on them, I mentioned in passing that we don't generate this information from some central machine inventory because we don't have a single source of truth for a machine inventory. This isn't to say that we don't have any inventory of our machines; instead, the problem is that we have too many inventories, each serving somewhat different purposes.

The core reason that we have wound up with many different lists of machines is that we use many different tools and systems that need to have lists of machines, and each of them has a different input format and input sources. It's technically possible to generate all of these different lists of machines for different programs and tools from some single master source, but by and large you get to build, manage, and maintain both the software for the master source and the software to extract and reformat all of the machine lists for the various programs that need them. In many cases (certainly in ours), this adds extra work over just maintaining N lists of machines for N programs and subsystems.

(It also generally means maintaining a bespoke custom system for your environment, which is a constant ongoing expense in various ways.)

So we have all sorts of lists of machines, for a broad view of what a machine is. Here's an incomplete list:

  • DNS entries (all of our servers have static IPs), but not all DNS entries still exist as hardware, much less hardware that is turned on. In addition, we have DNS entries for various IP aliases and other things that aren't unique machines.

    (We'd have more confusion if we used virtual machines, but all of our production machines are on physical hardware.)

  • NFS export permissions for hosts that can do NFS mounts from our fileservers, but not all of our active machines can do this and there are some listed host names that are no longer turned on or perhaps even still in DNS.

    (NFS export permissions aren't uniform between hosts; some have extra privileges.)

  • Hosts that we have established SSH host keys for. This includes hosts that aren't currently in service and may never be in service again.

  • Ubuntu machines that are updated by our bulk updates system, which is driven by another 'list of machines' file that is also used for some other bulk operations. But this data file omits various machines we don't manage that way (or at best only belatedly includes them), and while it tracks some machine characteristics it doesn't have all of them.

    (And sometimes we forget to add machines to this data file, which we at least get a notification about. Well, for Ubuntu machines.)

  • Unix machines that we monitor in various ways in our Prometheus system. These machines may be ping'd, have their SSH port checked to see if it answers, run the Prometheus host agent, and run additional agents to export things like GPU metrics, depending on what the machine is.

    Not all turned-on machines are monitored by Prometheus for various reasons, including that they are test or experimental machines. And temporarily turned off machines tend to be temporarily removed to reduce alert and dashboard noise.

  • Our console server has a whole configuration file of what machines have a serial console and how they're configured and connected up. Turned-off machines that are still connected to the console server remain in this configuration file, and they can then linger even after being de-cabled.

  • We mostly use 'smart' PDUs that can selectively turn outlets off, which means that we track what machine is on what PDU port. This is tracked both in a master file and in the PDU configurations (they have menus that give text labels to ports).

  • A 'server inventory' of where servers are physically located and other basic information about the server hardware, generally including a serial number. Not all racked physical servers are powered on, and not all powered on servers are in production.

  • Some degree of network maps, to track what servers are connected to what switches for troubleshooting purposes.

  • Various forms of server purchase records with details about the physical hardware, including serial numbers, which we have to keep in order to be able to get rid of the hardware later. This doesn't include the current host name (if any) that the hardware is currently being used for, or where the hardware is (currently) located.

If we assigned IPs to servers through DHCP, we'd also have DHCP configuration files. These would have to track servers by another identity, their Ethernet address, which would in turn depend on what networking the server was using. If we switched a server from 1G networking to 10G networking by putting a 10G card in it, we'd have to change the DHCP MAC information for the server but nothing else about it would change.

There's also confusion over what exactly 'a machine' is, partly because different pieces care about different aspects. We assign DNS host names to roles, not to physical hardware, but the role is implemented in some chunk of physical hardware and sometimes the details of that hardware matter. This leads to more potential confusion in physical hardware inventories, because sometimes we want to track that a particular piece of hardware was 'the old <X>' in case we have to fall back to that older OS for some reason.

(And sometimes we have pre-racked spare hardware for some important role and so what hardware is live in that role and what is the spare can swap around.)

We could put all of this information in a single database (probably in multiple tables) and then try to derive all of the various configuration files from it. But it clearly wouldn't be simple (and some of it would always have to be manually maintained, such as the physical location of hardware). If there is off the shelf open source software that will do a good job of handling this, it's quite likely that setting it up (and setting up our inventory schema) would be fairly complex.

Instead, the natural thing to do in our environment when you need a new list of machines for some purpose (for example, when you're setting up a new monitoring system) is to set up a new configuration file for it, possibly deriving the list of machines from another, existing source. This is especially natural if the tool you're working with already has its own configuration file format.

(If our lists of machines had to change a lot it might be tempting to automatically derive some of the configuration files from 'upstream' data. But generally they don't, which means that manual handling is less work because you don't have to build an entire system to handle errors, special exceptions, and so on.)

A recent abrupt change in Internet SSH brute force attacks against us

By: cks
23 February 2024 at 04:00

It's general wisdom in the sysadmin community that if you expose a SSH port to the Internet, people will show up to poke at it, and by 'people' I mean 'attackers that are probably mostly automated'. For several years, the pattern to this that I've noticed was an apparent combination of two activities. There was a constant background pitter-patter of various IPs each making a probe once a minute or less (but for tens of minutes or longer), and then periodic bursts where a single IP would be more active, sometimes significantly so.

(Although I can't be sure, I think the rate of both the background probes and the periodic bursts was significantly up compared to how it was a couple of years ago. Unfortunately making direct comparisons is a bit difficult due to Grafana Loki issues.)

Then there came this past Tuesday, and I noticed something that I reported on the Fediverse:

This is my system administrator's "what is wrong" face when Internet ssh authentication probes against our systems seem to have fallen off a cliff, as reported by system logs. We shouldn't be seeing only two in the last hour.

(The nose dive seems to have started at 6:30 am Eastern and hit 'basically nothing' by 9:30 am.)

After looking at this longer, the pattern I'm now seeing on our systems is basically that the background low-volume probes seem to have gone away. Every so often some attacker will fire up a serious bulk probe, making (for example) 400 attempts over half an hour (often for a random assortment of nonexistent logins); rarely there will be a burst where a dozen IPs will each make an attempt or two and then stop (there are some signs that a lot of the IPs are Tor exit nodes). But for a lot of the time, there's nothing. We can go an hour or three with absolutely no probes at all, which never used to happen; previously a typical baseline rate of probes was around a hundred an hour.

Since the higher-rate SSH probes get through fine, this doesn't seem to be anything in our firewalls or local configurations (I initially wondered about things like a change in logging that came in with an Ubuntu package update). Instead it seems to be a change in attacker behavior, and since it took about two hours to take full effect on Tuesday morning, I wonder if it was something getting progressively shut down or reoriented.
