Our mixed assortment of DNS server software (as of December 2025)
Without deliberately planning it, we've wound up running an assortment of DNS server software on an assortment of DNS servers. A lot of this involves history, so I might as well tell the story of that history in the process. This starts with our three sets of DNS servers: our internal DNS master (with a duplicate) that holds both the internal and external views of our zones, our resolving DNS servers (which use our internal zones), and our public authoritative DNS server (carrying our external zones, along with various relics of the past). These days we also have an additional resolving DNS server that resolves from outside our networks and so gives the people who can use it an external view of our zones.
In the beginning we ran Bind on everything, as was the custom in those days (and I suspect we started out without a separation between the three types of DNS servers, but that predates my time here), and I believe all of the DNS servers were Solaris. Eventually we moved the resolving DNS servers and the public authoritative DNS server to OpenBSD (and the internal DNS master to Ubuntu), still using Bind. Then OpenBSD switched which nameservers they liked from Bind to Unbound and NSD, so we went along with that. Our authoritative DNS server had a relatively easy NSD configuration, but our resolving DNS servers presented some challenges and we wound up with a complex Unbound plus NSD setup. Recently we switched our internal resolvers to using Bind on Ubuntu, and then we switched our public authoritative DNS server from OpenBSD to Ubuntu but kept it still with NSD, since we already had a working NSD configuration for it.
This has wound up with us running the following setups:
- Our internal DNS masters run Bind in a somewhat complex split horizon
configuration.
- Our internal DNS resolvers run Bind in a simpler configuration where
they act as internal authoritative secondary DNS servers for our own
zones and as general resolvers.
- Our public authoritative DNS server (and its hot spare) run NSD as an
authoritative secondary, doing zone transfers from our internal DNS
masters.
- We have an external DNS resolver machine that runs Unbound in an extremely simple configuration. We opted to build this machine with Unbound because we didn't need it to act as anything other than a pure resolver, and Unbound is simple to set up for that.
At one level, this is splitting our knowledge and resources among three different DNS server programs rather than focusing on one. At another level, two of the three are being used in quite simple setups (and we already had the NSD setup written from prior use). All of our complex configurations are Bind based, and we've explicitly picked Bind for complex setups because we feel we understand it fairly well from long experience with it.
(Specifically, I can configure a simple Unbound resolver faster and more easily than I can do the same with Bind. I'm sure there's a simple resolver-only Bind configuration, it's just that I've never built one and I have built several simple and not so simple Unbound setups.)
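To illustrate that simplicity, a more or less complete Unbound resolver configuration is roughly the following hedged sketch (the listen addresses and the client network are made up, and a real setup will want at least a bit more thought):

server:
    # Listen on everything and answer queries from our (made up) networks.
    interface: 0.0.0.0
    interface: ::0
    access-control: 127.0.0.0/8 allow
    access-control: 192.0.2.0/24 allow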
Getting out of being people's secondary authoritative DNS server is hard
Many, many years ago, my department operated one of the university's secondary authoritative DNS servers, which was used by most everyone with a university subdomain and as a result was listed in their DNS NS records. This DNS server was also the authoritative DNS server for our own domains, because this was in the era where servers were expensive and it made perfect sense to do this. At the time, departments who wanted a subdomain pretty much needed to have a Unix system administrator and probably run their own primary DNS server and so on. Over time, the university's DNS infrastructure shifted drastically, with central IT offering more and more support, and more than half a decade ago our authoritative DNS server stopped being a university secondary, after a lot of notice to everyone.
Experienced system administrators can guess what happened next. Or rather, what didn't happen next. References to our DNS server lingered in various places for years, both in the university's root zones as DNS glue records and in people's own DNS zone files as theoretically authoritative records. As late as the middle of last year, when I started grinding away on this, I believe that roughly half of our authoritative DNS server's traffic was for old zones we didn't serve and was getting DNS 'Refused' responses. The situation is much better today, after several rounds of finding other people's zones that were still pointing to us, but it's still not quite over and it took a bunch of tedious work to get this far.
(Why I care about this is that it's hard to see if your authoritative DNS server is correctly answering everything it should if things like tcpdumps of DNS traffic are absolutely flooded with bad traffic that your DNS server is (correctly) rejecting.)
In theory, what we should have done when we stopped being a university secondary authoritative DNS server was to switch the authoritative DNS server for our own domains to another name and another IP address; this would have completely cut off everyone else when we turned the old server off and removed its name from our DNS. In practice the transition was not clearcut, because for a while we kept on being a secondary for some other university zones that have long-standing associations with the department. Also, I think we were optimistic about how responsive people would be (and how many of them we could reach).
(Also, there's a great deal of history tied up in the specific name and IP address of our current authoritative DNS server. It's been there for a very long time.)
PS: Even when no one is incorrectly pointing to us, there's clearly a background Internet radiation of external machines throwing random DNS queries at us. But that's another entry.
Duplicate metric labels and group_*() operations in Prometheus
Suppose that you have an internal master DNS server and a backup for that master server. The two servers are theoretically fed from the same data and so should have the same DNS zone contents, and especially they should have the same DNS zone SOAs for all zones in both of their internal and external views. They both run Bind and you use the Bind exporter, which provides the SOA values for every zone Bind is configured to be a primary or a secondary for. So you can write an alert with an expression like this:
bind_zone_serial{host="backup"}
!= on (view,zone_name)
bind_zone_serial{host="primary"}
This is a perfectly good alert (well, alert rule), but it has lost all of the additional labels you might want in your alert. Especially, it has lost both host names. You could hard-code the host name in your message about the alert, but it would be nice to do better and propagate your standard labels into the alert. To do this you want to use one of group_left() and group_right(), but which one you want depends on where you want the labels to come from.
(Normally you have to choose between the two depending on which side has multiple matches, but in this case we have a one-to-one matching.)
For labels that are duplicated between both sides, the group_*() operators pick which side's labels you get, but backwards from their names. If you use group_right(), the duplicate label values come from the left; if you use group_left(), the duplicate label values come from the right. Here, we might change the backup host's name but we're probably not going to change the primary host's name, so we likely want to preserve the 'host' label from the left side and thus we use group_right():
bind_zone_serial{host="backup"}
!= on (view,zone_name)
group_right (job,host,instance)
bind_zone_serial{host="primary"}
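Dropped into a full alert rule, this might look something like the following hedged sketch, where the alert name, the 'for' duration, the severity label, and the annotation wording are all made up:

groups:
  - name: dns-zone-serials
    rules:
      - alert: ZoneSerialMismatch
        expr: |
          bind_zone_serial{host="backup"}
            != on (view,zone_name) group_right (job,host,instance)
          bind_zone_serial{host="primary"}
        for: 30m
        labels:
          severity: testing
        annotations:
          summary: 'zone {{ $labels.zone_name }} (view {{ $labels.view }}) has a different serial on {{ $labels.host }} than on the primary'

Because of the group_right(), the 'host' that shows up in that summary is the backup's, which is what we want here.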
One reason this little peculiarity is on my mind at the moment is that Cloudflare's excellent pint Prometheus rule linter recently picked up a new 'redundant label' lint rule that complains about this for custom labels such as 'host':
Query is trying to join the 'host' label that is already present on the other side of the query.
(It doesn't complain about job or instance, presumably because it understands why you might do this for those labels. As the pint message will tell you, to silence this you need to disable 'promql/impossible' for this rule.)
When I first saw pint's warning I didn't think about it and removed the 'host' label from the group_right(), but fortunately I actually tested what the result would be and saw that I was now getting the wrong host name.
(This is different from pulling in labels from other metrics, where the labels aren't duplicated.)
PS: I clearly knew this at some point, when I wrote the original alert rule, but then I forgot it by the time I was looking at pint's warning message. PromQL is the kind of complex thing where the details can fall out of my mind if I don't use it often enough, which I don't these days since our alert rules are relatively stable.
BSD PF versus Linux nftables for firewalls for us
One of the reactions I saw to our move from OpenBSD to FreeBSD for firewalls was to wonder why we weren't moving all the way to nftables based Linux firewalls. It's true that this would reduce the number of different Unixes we have to operate and probably get us more or less state of the art 10G network performance. However, I have some negative views on the choice of PF versus nftables, both in our specific situation and in general.
(I've written about this before but it was in the implicit context of Linux iptables.)
In our specific situation:
- We have a lot of existing, relatively complex PF firewall rules;
for example, our perimeter firewall has over 400 non-comment lines
of rules, definitions, and so on. Translating these from OpenBSD
PF to FreeBSD PF is easy, if it's necessary at all. Translating
everything to nftables is a lot more work, and as far as I know
there's no translation tool, especially not one that we could
really trust. We'd probably have to basically rebuild each
firewall from the ground up, which is both a lot of work and a
high-stakes thing. We'd have to be extremely convinced that we
had to do this in order to undertake it.
- We have a lot of well developed tooling around operating, monitoring,
and gathering metrics from PF-based firewalls, most of it locally
created. Much or all of this tooling ports straight over from
OpenBSD to FreeBSD, while we have no equivalent tooling for
nftables and would have to develop (or find) equivalents.
- We already know PF and almost all of that knowledge transfers over from OpenBSD PF to FreeBSD PF (and more will transfer with FreeBSD 15, which has some PF and PF syntax updates from modern OpenBSD).
In general (much of which also applies to our specific situation):
- There are a number of important PF features that nftables at
best has in incomplete, awkward versions. For example, nftables'
version of pflog is awkward and half-baked compared to the real
thing (also). While you may be able to put
together some nftables based rough equivalent of BSD pfsync, casual reading suggests
that it's a lot more involved and complex (and maybe less integrated
with nftables).
- The BSD PF firewall system is straightforward and easy to understand
and predict. The Linux firewall system is much more complex and
harder to understand, and this complexity bleeds through into
nftables configuration, where you need to know chains and tables
and so on. Much of this Linux complexity is not documented in ways
that are particularly accessible.
- Nftables documentation is opaque compared to the BSD pf.conf
manual page
(also). Partly this is
because there is no 'nftables.conf' manual page; instead,
your entry point is the nft manual page, which has to document
both the nft command line tool and the format of nftables rules.
I find that these are two tastes that don't go well together.
(This is somewhat forced by the nftables decision to retain compatibility with adding and removing rules on the fly. PF doesn't give you a choice, you load your entire ruleset from a file.)
- nftables is already the third firewall rule format and system that the Linux kernel has had over the time that I've been writing Linux firewall rules (ipchains, iptables, nftables). I have no confidence that there won't be a fourth before too long. PF has been quite stable by comparison.
What I mostly care about is what I have to write and read to get the IP filtering and firewall setup that we want (and then understand it later), not how it gets compiled down and represented in the kernel (this has come up before). Assuming that the nftables backend is capable enough and the result performs sufficiently well, I'd be reasonably happy with a PF like syntax (and semantics) on top of kernel nftables (although we'd still have things like the pflog and pfsync issues).
Can I get things done in nftables? Certainly, nftables is relatively inoffensive. Do I want to write nftables rules? No, not really, no more than I want to write iptables rules. I do write nftables and iptables rules when I need to do firewall and IP filtering things on a Linux machine, but for a dedicated machine for this purpose I'd rather use a PF-based environment (which is now FreeBSD).
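To make the reading and writing difference concrete, here is a hedged sketch of one small fragment of policy in each syntax; the interface names are made up, the table and set are empty, and neither fragment is a complete or sensible ruleset. The pf.conf version:

ext_if = "ix0"
table <blocked> persist
block in quick on $ext_if from <blocked>
pass in on $ext_if proto tcp from any to any port 22

A roughly equivalent nftables version:

table inet filter {
    set blocked {
        type ipv4_addr
        flags interval
    }
    chain input {
        type filter hook input priority 0; policy accept;
        iifname "eno1" ip saddr @blocked drop
        iifname "eno1" tcp dport 22 accept
    }
}

Even at this toy scale, the nftables version makes you decide about address families, chains, hooks, and priorities before you get to your first actual rule.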
As far as I can tell, the state of Linux IP filtering documentation is partly a result of the fact that Linux doesn't have a unified IP filtering system and environment the way that OpenBSD does and FreeBSD mostly does (or at least successfully appears to so far). When the IP filtering system is multiple more or less separate pieces and subsystems, you naturally tend to get documentation that looks at each piece in isolation and assumes you already know all of the rest.
(Let's also acknowledge that writing good documentation for a complex system is hard, and the Linux IP filtering system has evolved to be very complex.)
PS: There's no real comparison between PF and the older iptables system; PF is clearly far more high level than you can reasonably do in iptables, which by comparison is basically an IP filtering assembly language. I'm willing to tentatively assume that nftables can be used in a higher level way than iptables can (I haven't used it enough to have a well informed view either way); if it can't, then there's again no real comparison between PF and nftables.
We're (now) moving from OpenBSD to FreeBSD for firewalls
A bit over a year ago I wrote about why we'd become interested in FreeBSD; to summarize, FreeBSD appeared promising as a better, easier to manage host operating system for PF-based things. Since then we've done enough with FreeBSD to have decided that we actively prefer it to OpenBSD. It's been relatively straightforward to convert our firewall OpenBSD PF rulesets to FreeBSD PF and the resulting firewalls have clearly better performance on our 10G network than our older OpenBSD ones did (with less tuning).
(It's possible that the very latest OpenBSD has significantly improved bridging and routing firewall performance so that it no longer requires the fastest single-core CPU performance you can get to go decently. But pragmatically it's too late; FreeBSD had that performance earlier and we now have more confidence in FreeBSD's performance in the firewall role than OpenBSD's.)
There are some nice things about FreeBSD, like root on ZFS, and broadly I feel that it's more friendly than OpenBSD. But those are secondary to its firewall network performance (and PF compatibility); if its network performance was no better than OpenBSD's (or worse), we wouldn't be interested. Since it is better, it's now displacing OpenBSD for our firewalls and our latest VPN servers. We've stopped building new OpenBSD machines, so as firewalls come up for replacement they get rebuilt as FreeBSD machines.
(We have a couple of non-firewall OpenBSD machines that will likely turn into Ubuntu machines when we replace them, although we can't be sure until it actually happens.)
Would we consider going back to OpenBSD? Maybe, but probably not. Now that we've migrated a significant number of firewalls, moving the remaining ones to FreeBSD is the easiest approach, even if new OpenBSD firewalls would equal their performance. And the FreeBSD 10G firewall performance we're getting is sufficiently good that it leaves OpenBSD relatively little ground to exceed it.
(There are some things about FreeBSD that we're not entirely enthused about. We're going to be doing more firewall upgrades than we used to with OpenBSD, for one.)
PS: As before, I don't think there's anything wrong with OpenBSD if it meets your needs. We used it happily for years until we started being less happy with its performance on 10G Ethernet. A lot of people don't have that issue.
Containers and giving up on expecting good software installation practices
Over on the Fediverse, I mentioned a grump I have about containers:
As a sysadmin, containers irritate me because they amount to abandoning the idea of well done, well organized, well understood, etc installation of software. Can't make your software install in a sensible way that people can control and limit? Throw it into a container, who cares what it sprays where across the filesystem and how much it wants to be the exclusive owner and controller of everything in sight.
(This is a somewhat irrational grump.)
To be specific, it's by and large abandoning the idea of well done installs of software on shared servers. If you're only installing software inside a container, your software can spray itself all over the (container) filesystem, put itself in hard-coded paths wherever it feels like, and so on, even if you have completely automated instructions for how to get it to do that inside a container image that's being built. Some software doesn't do this and is well mannered when installed outside a container, but some software does, and you'll find notes to the effect that the only supported way of installing it is 'here is this container image' or 'here are the automated instructions for building a container image'.
To be fair to containers, some of this is due to missing Unix APIs (or APIs that theoretically exist but aren't standardized). Do you want multiple Unix logins for your software so that it can isolate different pieces of itself? There's no automated way to do that. Do you run on specific ports? There's generally no machine-readable way to advertise that, and people may want you to build in mechanisms to vary those ports and then specify the new ports to other pieces of your software (that would all be bundled into a container image). And so on. A container allows you to put yourself in an isolated space of Unix UIDs, network ports, and so on, one where you won't conflict with anyone else and won't have to try to get the people who want to use your software to create and manage the various details (because you've supplied either a pre-built image or reliable image building instructions).
But I don't have to be happy that software doesn't necessarily even try, that we seem to be increasingly abandoning much of the idea of running services in shared environments. Shared environments are convenient. A shared Unix environment gives you a lot of power and avoids a lot of complexity that containers create. Fortunately there's still plenty of software that is willing to be installed on shared systems.
(Then there is the related grump that the modern Linux software distribution model seems to be moving toward container-like things, which has a whole collection of issues associated with it.)
A problem for downloading things with curl
For various reasons, I'm working to switch from wget to curl, and generally this has been going okay. However, I've now run into one situation where I don't know how to make curl do what I want. It is, of course, a project that doesn't bother to do easily-fetched downloads, but in a very specific way. In fact it's Django (again).
The Django URLs for downloads look like this:
https://www.djangoproject.com/download/5.2.8/tarball/
The way the websites of many projects turn these into actual files is to provide a filename in the HTTP Content-Disposition header in the reply. In curl, these websites can be handled with the -J (--remote-header-name) option, which uses the filename from the Content-Disposition if there is one.
Unfortunately, Django's current website does not operate this way. Instead, the URL above is a HTTP redirection to the actual .tar.gz file (on media.djangoproject.com). The .tar.gz file is then served without a Content-Disposition header as an application/octet-stream. Wget will handle this with --trust-server-names, but as far as I can tell from searching through the curl manpage, there is no option that will do this in curl.
(In optimistic hope I even tried --location-trusted, but no luck.)
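For comparison, the wget invocation that does what I want here is simply:

wget --trust-server-names https://www.djangoproject.com/download/5.2.8/tarball/

which follows the redirect and names the saved file from the final URL.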
If curl is directed straight to the final URL, 'curl -O' alone is
enough to get the right file name. However, if curl goes through a
redirection, there seems to be no option that will cause it to
re-evaluate the 'remote name' based on the new URL; the name derived
from the initial URL sticks, and you get a file unhelpfully
called 'tarball' (in this case). If you try to be clever by running
the initial curl without -O but capturing any potential redirection
with "-w '%{redirect_url}\n'" so you can manually follow it in
a second curl command, this works (for one level of redirections)
but leaves you with a zero-length file called 'tarball' from the
first curl.
It's possible that this means curl is the wrong tool for the kind of file downloads I want to do from websites like this, and I should get something else entirely. However, that something else should at least be a completely self contained binary so that I can easily drag it around to all of the assorted systems where I need to do this.
(I could always try to write my own in Go, or even take this as an opportunity to learn Rust, but that way lies madness and a lot of exciting discoveries about HTTP downloads in the wild. The more likely answer is that I hold my nose and keep using wget for this specific case.)
PS: I think it's possible to write a complex script using curl that more or less works here, but one of the costs is that you have to first make a HEAD request and then a GET request to the final target, and that irritates me.
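For the record, a hedged sketch of that two-request approach, with no error handling beyond the bare minimum:

#!/bin/sh
# Follow redirects with HEAD requests only and print the final URL,
# then fetch that URL with -O so the file name comes from it.
url="$1"
final=$(curl -sIL -o /dev/null -w '%{url_effective}' "$url") || exit 1
exec curl -O "$final"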
How I handle URLs in my unusual X desktop
I have an unusual X desktop environment that has evolved over a long period, and as part of that I have an equally unusual and slowly evolved set of ways to handle URLs. By 'handle URLs', what I mean is going from an URL somewhere (email, text in a terminal, etc) to having the URL open in one of my several browser environments. Tied into this is handling non-URL things that I also want to open in a browser, for example searching for various sorts of things in various web places.
The simplest place to start is at the end. I have several browser environments and to go along with them I have a script for each that opens URLs provided as command line arguments in a new window of that browser. If there are no command line arguments, the scripts open a default page (usually a blank page, but for my main browser it's a special start page of links). For most browsers this works by running 'firefox <whatever>' and so will start the browser if it's not already running, but for my main browser I use a lightweight program that uses Firefox's X-based remote control protocol, which means I have to start the browser outside of it.
Layered on top of these browser specific scripts is a general script
to open URLs that I call 'openurl'. The purpose of openurl is to
pick a browser environment based on the particular site I'm going
to. For example, if I'm opening the URL of a site where I know I
need JavaScript, the script opens the URL in my special 'just make
it work' JavaScript-enabled Firefox. Most URLs open in my normal,
locked down Firefox. I configure programs like Thunderbird to open
URLs through this openurl script, sometimes directly and sometimes
indirectly.
(I haven't tried to hook openurl into the complex mechanisms
that xdg-open uses to decide how to open URLs. Probably I should but the whole
xdg-open thing irritates me.)
Layered on top of openurl and the specific browser scripts is a
collection of scripts that read the X selection and do a collection
of URL-related things with it. One script reads the X selection,
looks for it being a URL, and either feeds the URL to openurl or
just runs openurl to open my start page. Other scripts feed the
URL to alternate browser environments or do an Internet search for
the selection. Then I have a fvwm menu with
all of these scripts in it and one of my fvwm mouse button bindings brings up this menu. This lets me select a
URL in a terminal window, bring up the menu, and open it in either
the default browser choice or a specific browser choice.
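The selection-reading script at the core of this is tiny. A hedged sketch of it (my real one differs in details, and this version assumes xsel is what's reading the X selection):

#!/bin/sh
# Grab the primary X selection; if it looks like a URL, hand it to
# openurl, otherwise just open my default start page.
sel=$(xsel -o --primary 2>/dev/null)
case "$sel" in
    http://*|https://*) exec openurl "$sel" ;;
    *) exec openurl ;;
esac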
(I also have a menu entry for 'open the selection in my main browser' in one of my main fvwm menus, the one attached to the middle mouse button, which makes it basically reflexive to open a new browser window or open some URL in my normal browser.)
The other way I handle URLs is through dmenu. One of the things my dmenu environment does is recognize URLs and open them in my default browser environment. I also have short dmenu commands to open URLs in my other browser environments, or open URLs based on the parameters I pass the command (such as a 'pd' script that opens Python documentation for a standard library module). Dmenu itself can paste in the current X selection with a keystroke, which makes it convenient to move URLs around. Dmenu is also how I typically open a URL if I'm typing it in instead of copying it from the X selection, rather than opening a new browser window, focusing the URL bar, and entering the URL there.
(I have dmenu set up to also recognize 'about:*' as URLs and have various Firefox about: things pre-configured as hidden completions in dmenu, along with some commonly used website URLs.)
As mentioned, dmenu specifically opens plain URLs in my default
browser environment rather than going through openurl. I may
change this someday but in practice there aren't enough special
sites that it's an issue. Also, I've made dedicated little
dmenu-specific scripts that open up the
various sites I care about in the appropriate browser, so I can
type 'mastodon' in dmenu to open up my Fediverse account in the JavaScript-enabled Firefox
instance.
You can add arbitrary zones to NSD (without any glue records)
Suppose, not hypothetically, that you have a very small DNS server for a captive network situation, where the DNS server exists only to give clients answers for a small set of hosts. One of the ways you can implement this is with an authoritative DNS server, such as NSD, that simply has an extremely minimal set of DNS data. If you're using NSD for this, you might be curious how minimal you can be and how much you need to mimic ordinary DNS structure.
Here, by 'mimic ordinary DNS structure', I mean inserting various levels of NS records so there is a more or less conventional path of NS delegations from the DNS root ('.') down to your name. If you're providing DNS clients with 'dog.example.org', you might conventionally have a NS record for '.', a NS record for 'org.', and a NS record for 'example.org.', mimicking what you'd see in global DNS. Of course all of your NS records are going to point to your little DNS server, but they're present if anything looks.
Perhaps unsurprisingly, NSD doesn't require this and DNS clients normally don't either. If you say:
zone:
    name: example.org
    zonefile: example-stub
and don't have any other DNS data, NSD won't object and it will answer queries for 'dog.example.org' with your minimal stub data. This works for any zone, including completely made up ones:
zone:
    name: beyond.internal
    zonefile: beyond-stub
The actual NSD stub zone files can be quite minimal. An older OpenBSD
NSD appears to be happy with zone files that have only a $ORIGIN,
a $TTL, a '@ IN SOA' record, and what records you care about in
the zone.
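For illustration, a hedged sketch of such a stub zone file, with made-up names, addresses, and SOA values:

$ORIGIN example.org.
$TTL 3600
@    IN SOA  ns.example.org. hostmaster.example.org. (
             1 3600 900 604800 3600 )
dog  IN A    192.0.2.10
cat  IN A    192.0.2.11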
Once I thought about it, I realized I should have expected this. An authoritative DNS server normally only holds data for a small subset of zones and it has to be willing to answer queries about the data it holds. Some authoritative DNS servers (such as Bind) can also be used as resolving name servers so they'd sort of like to have information about at least the root nameservers, but NSD is a pure authoritative server so there's no reason for it to care.
As for clients, they don't normally do DNS resolution starting from the root downward. Instead, they expect to operate by sending the entire query to whatever their configured DNS resolver is, which is going to be your little NSD setup. In a number of configurations, clients either can't talk directly to outside DNS or shouldn't try to do DNS resolution that way because it won't work; they need to send everything to their configured DNS resolver so it can do, for example, "split horizon" DNS.
(Yes, the modern vogue for DNS over HTTPS puts a monkey wrench into split horizon DNS setups. That's DoH's problem, not ours.)
Since this works for a .net zone, you can use it to try to disable DNS over HTTPS resolvers in your stub DNS environment by providing a .net zone with 'use-application-dns CNAME .' or the like, to trigger at least Firefox's canary domain detection.
(I'm not going to address whether you should have such a minimal stub DNS environment or instead count on your firewall to block traffic and have a normal DNS environment, possibly with split horizon or response policy zones to introduce your special names.)
We can't really do progressive rollouts of disruptive things
In a comment on my entry on how we reboot our machines right after updating their kernels, Jukka asked a good question:
While I do not know how many machines there are in your fleet, I wonder whether you do incremental rolling, using a small snapshot for verification before rolling out to the whole fleet?
We do this to some extent but we can't really do it very much. The core problem is that the state of almost all of our machines is directly visible and exposed to people. This is because we mostly operate an old fashioned Unix login server environment, where people specifically use particular servers (either directly by logging in to them or implicitly because their home directory is on a particular NFS fileserver). About the only genuinely generic machines we have are the nodes in our SLURM cluster, where we can take specific unused nodes out of service temporarily without anyone noticing.
(Some of these login servers are in use all of the time; others we might find idle if we're extremely lucky. But it's hard to predict when someone will show up to try to use a currently empty server.)
This means that progressively rolling out a kernel update (and rebooting things) to our important, visible core servers requires multiple people-visible reboots of machines, instead of one big downtime when everything is rebooted. Generally we feel that repeated disruptions are much more annoying and disruptive overall to people; it's better to get the pain of reboot disruptions over all at once. It's also much easier to explain to people, and we don't have to annoy them with repeated notifications that yet another subset of our servers and services will be down for a bit.
(To make an incremental deployment more painful for us, these will normally have to be after-hours downtimes, which means that we'll be repeatedly staying late, perhaps once a week for three or four weeks as we progressively work through a rollout.)
In addition to the nodes of our SLURM cluster, there are a number of servers that can be rebooted in the background to some degree without people noticing much. We will often try the kernel update out on a few of them in advance, and then update others of them earlier in the day (or the day before) both as a final check and to reduce the number of systems we have to cover at the actual out of hours downtime. But a lot of our servers cannot really be tested much in advance, such as our fileservers or our web server (which is under constant load for reasons outside the scope of this entry). We can (and do) update a test fileserver or a test web server, but neither will see a production load and it's under production loads that problems are most likely to surface.
This is a specific example of how the 'cattle' model doesn't fit all situations. To have a transparent rolling update that involves reboots (or anything else that's disruptive on a single machine), you need to be able to transparently move people off of machines and then back on to them. This is hard to get in any environment where people have long term usage of specific machines, where they have login sessions and running compute jobs and so on, and where you have non-redundant resources on a single machine (such as NFS fileservers without transparent failover from server to server).
We (I) need a long range calendar reminder system
About four years ago I wrote an entry about how your SMART drive database of attribute meanings needs regular updates. That entry was written on the occasion of updating the database we use locally on our Ubuntu servers, and at the time we were using a mix of Ubuntu 18.04 and Ubuntu 20.04 servers, both of which had older drive databases that probably dated from early 2018 and early 2020 respectively. It is now late 2025 and we use a mix of Ubuntu 24.04 and 22.04 servers, both of which have drive databases that are from after October of 2021.
Experienced system administrators know where this one is going: today I updated our SMART drive database again, to a version of the SMART database that was more recent than the one shipped with 24.04 instead of older than it.
It's a fact of life that people forget things. People especially forget things that are a long way away, even if they make little notes in their worklog message when recording something that they did (as I did four years ago). It's definitely useful to plan ahead in your documentation and write these notes, but without an external thing to push you or something to explicitly remind you, there's no guarantee that you'll remember.
All of which leads me to the view that it would be useful for us
to have a long range calendar reminder system, something that could
be used to set reminders for more than a year into the future and
ideally allow us to write significant email messages to our future
selves to cover all of the details (although there are hacks around
that, such as putting the details on a web page and having the
calendar mail us a link). Right now the best calendar reminder
system we have is the venerable calendar, which
we can arrange to email one-line notes to our general address that
reaches all sysadmins, but calendar doesn't let you include the
year in the reminder date.
(For SMART drive database updates, we could get away with mailing ourselves once a year in, say, mid-June. It doesn't hurt to update the drive database more than every Ubuntu LTS release. But there are situations where a reminder several years in the future is what we want.)
PS: Of course it's not particularly difficult to build an ad-hoc script system to do this, with various levels of features. But every local ad-hoc script that we write is another little bit of overhead, and I'd like to avoid that kind of thing if at all possible in favour of a standard solution (that isn't a shared cloud provider calendar).
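To be concrete about the sort of ad-hoc script I mean, here's a hedged sketch (with made-up paths and addresses) of a daily cron job that scans a file of dated one-line reminders:

#!/bin/sh
# Lines in the reminders file look like:
#   2028-06-15 update the SMART drive database again
REMINDERS=/var/local/sysadmin/reminders
today=$(date +%Y-%m-%d)
grep "^$today " "$REMINDERS" | while IFS= read -r line; do
    echo "$line" | mail -s "Reminder: ${line#* }" sysadmins@example.org
done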
Uses for DNS server delegation
A commentator on my entry on systemd-resolved's new DNS server delegation feature asked:
My memory might fail me here, but: wasn't something like this a feature introduced in ISC's BIND 8, and then considered to be a bad mistake and dropped again in BIND 9 ?
I don't know about Bind, but what I do know is that this feature is present in other DNS resolvers (such as Unbound) and that it has a variety of uses. Some of those uses can be substituted with other features and some can't be, at least not as-is.
The quick version of 'DNS server delegation' is that you can send all queries under some DNS zone name off to some DNS server (or servers) of your choice, rather than have DNS resolution follow any standard NS delegation chain that may or may not exist in global DNS. In Unbound, this is done through, for example, Forward Zones.
DNS server delegation has at least three uses that I know of. First, you can use it to insert entire internal TLD zones into the view that clients have. People use various top level names for these zones, such as .internal, .kvm, .sandbox (our choice), and so on. In all cases you have some authoritative servers for these zones and you need to direct queries to these servers instead of having your queries go to the root nameservers and be rejected.
(Obviously you will be sad if IANA ever assigns your internal TLD to something, but honestly if IANA allows, say, '.internal', we'll have good reason to question their sanity. The usual 'standard DNS environment' replacement for this is to move your internal TLD to be under your organizational domain and then implement split horizon DNS.)
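To illustrate the first use, the Unbound forward zone for a made-up internal TLD looks roughly like this hedged sketch (the addresses are invented):

forward-zone:
    name: "sandbox."
    forward-addr: 192.0.2.53
    forward-addr: 192.0.2.54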
Second, you can use it to splice in internal zones that don't exist in external DNS without going to the full overkill of split horizon authoritative data. If all of your machines live in 'corp.example.org' and you don't expose this to the outside world, you can have your public example.org servers with your public data and your corp.example.org authoritative servers, and you splice in what is effectively a fake set of NS records through DNS server delegation. Related to this, if you want you can override public DNS simply by having an internal and an external DNS server, without split horizon DNS; you use DNS server delegation to point to the internal DNS server for certain zones.
(This can be replaced with split horizon DNS, although maintaining split horizon DNS is its own set of headaches.)
Finally, you can use this to short-cut global DNS resolution for reliability in cases where you might lose external connectivity. For example, there are within-university ('on-campus' in our jargon) authoritative DNS servers for .utoronto.ca and .toronto.edu. We can use DNS server delegation to point these zones at these servers to be sure we can resolve university names even if the university's external Internet connection goes down. We can similarly point our own sub-zone at our authoritative servers, so even if our link to the university backbone goes down we can resolve our own names.
(This isn't how we actually implement this; we have a more complex split horizon DNS setup that causes our resolving DNS servers to have a complete copy of the inside view of our zones, acting as caching secondaries.)
Keeping notes is for myself too, illustrated (once again)
Yesterday I wrote about restarting or redoing something after a systemd service restarts. The non-hypothetical situation that caused me to look into this was that after we applied a package update to one system, systemd-networkd on it restarted and wiped out some critical policy based routing rules. Since I vaguely remembered this happening before, I sighed and arranged to have our rules automatically reapplied on both systems with policy based routing rules, following the pattern I worked out.
Wait, two systems? And one of them didn't seem to have problems after the systemd-networkd restart? Yesterday I ignored that and forged ahead, but really it should have set off alarm bells. The reason the other system wasn't affected was that I'd already solved the problem the right way back in March of 2024, when we first hit this networkd behavior and I wrote an entry about it.
However, I hadn't left myself (or my co-workers) any notes about that March 2024 fix; I'd put it into place on the first machine (then the only machine we had that did policy based routing) and forgotten about it. My only theory is that I wanted to wait and be sure it actually fixed the problem before documenting it as 'the fix', but if so, I made a mistake by not leaving myself any notes that I had a fix in testing. When I recently built the second machine with policy based routing I copied things from the first machine, but I didn't copy the true networkd fix because I'd forgotten about it.
(It turns out to have been really useful that I wrote that March 2024 entry because it's the only documentation I have, and I'd probably have missed the real fix if not for it. I rediscovered it in the process of writing yesterday's entry.)
I know (and knew) that keeping notes is good, and that my memory is fallible. And I still let this slip through the cracks for whatever reason. Hopefully the valuable lesson I've learned from this will stick a bit so I don't stub my toe again.
(One obvious lesson is that I should make a note to myself any time I'm testing something that I'm not sure will actually work. Since it may not work I may not want to formally document it in our normal system for this, but a personal note will keep me from completely losing track of it. You can see the persistence of things 'in testing' as another example of the aphorism that there's nothing as permanent as a temporary fix.)
Using systems because you know them already
Every so often on the Fediverse, people ask for advice on a monitoring system to run on their machine (desktop or server), and some of the time Prometheus comes up, and when it does I wind up making awkward noises. On the one hand, we run Prometheus (and Grafana) and are happy with it, and I run separate Prometheus setups on my work and home desktops. On the other hand, I don't feel I can recommend picking Prometheus for a basic single-machine setup, despite running it that way myself.
Why do I run Prometheus on my own machines if I don't recommend that you do so? I run it because I already know Prometheus (and Grafana), and in fact my desktops (re)use much of our production Prometheus setup (but they scrape different things). This is a specific instance (and example) of a general thing in system administration, which is that not infrequently it's simpler for you to use something you already know even if it's not necessarily an exact fit (or even a great fit) for the problem. For example, if you're quite familiar with operating PostgreSQL databases, it might be simpler to use PostgreSQL for a new system where SQLite could do perfectly well and other people would find SQLite much simpler. Especially if you have canned setups, canned automation, and so on all ready to go for PostgreSQL, and not for SQLite.
(Similarly, our generic web server hammer is Apache, even if we're doing things that don't necessarily need Apache and could be done perfectly well or perhaps better with nginx, Caddy, or whatever.)
This has a flipside, where you use a tool because you know it even if there might be a significantly better option, one that would actually be easier overall even accounting for needing to learn the new option and build up the environment around it. What we could call "familiarity-driven design" is a thing, and it can even be a confining thing, one where you shape your problems to conform to the tools you already know.
(And you may not have chosen your tools with deep care and instead drifted into them.)
I don't think there's any magic way to know which side of the line you're on. Perhaps the best we can do is be a little bit skeptical about our reflexive choices, especially if we seem to be sort of forcing them in a situation that feels like it should have a simpler or better option (such as basic monitoring of a single machine).
(In a way it helps that I know so much about Prometheus because it makes me aware of various warts, even if I'm used to them and I've climbed the learning curves.)
How part of my email handling drifted into convoluted complexity
Once upon a time, my email handling was relatively
simple. I wasn't on any big mailing lists, so I had almost everything
delivered straight to my inbox (both in the traditional /var/mail
mbox sense and then through to
MH's
own inbox folder directory). I did some mail filtering with procmail,
but it was all for things that I basically never looked at, so I
had procmail write them to mbox
files under $HOME/.mail. I moved email from my Unix /var/mail
inbox to MH's inbox with MH's inc command (either running
it directly or having exmh run it for
me). Rarely, I had a mbox
file procmail had written that I wanted to read, and at that point
I inc'd it either to my MH +inbox or to some other folder.
Later, prompted by wanting to improve my breaks and vacations, I diverted a bunch of mailing lists away
from my inbox. Originally I had
procmail write these diverted messages to mbox files, then later
I'd inc the files to read the messages. Then I found that outside
of vacations, I needed to make this email more readily accessible, so I had procmail put them in MH
folder directories under Mail/inbox (one of MH's nice features is
that your inbox is a regular folder and can have sub-folders, just
like everything else). As I noted at the time, procmail only partially emulates
MH when doing this, and one of the things it doesn't do is keep
track of new, unread ('unseen') messages.
(MH has a general purpose system for keeping track of 'sequences' of messages in a MH folder, so it tracks unread messages based on what is in the special 'unseen' sequence. Inc and other MH commands update this sequence; procmail doesn't.)
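A hedged sketch of the kind of procmail recipe involved, with a made-up mailing list and assuming procmail's MAILDIR points at my MH directory; the trailing '/.' is what makes procmail deliver in MH style, as individually numbered message files:

:0
* ^List-Id:.*exampledev\.lists\.example\.org
inbox/exampledev/.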
Along with this procmail setup I wrote a basic script, called
mlists, to report how many messages each of these 'mailing list'
inboxes had in them. After a while I started diverting lower priority
status emails and so on through this system (and stopped reading
the mailing lists); if I got a type of
email in any volume that I didn't want to read right away during
work, it probably got shunted to these side inboxes. At some point
I made mlists optionally run the MH scan command to show me
what was in each inbox folder (well, for the inbox folders where
this was potentially useful information). The mlists script was
still mostly simple and the whole system still made sense, but it
was a bit more complex than before, especially when it also got a
feature where it auto-reset the current message number in each
folder to the first message.
A couple of years ago, I switched the MH frontend I used from
exmh to MH-E in
GNU Emacs, which changed how I read my email in practice. One of the changes was that I started
using the GNU Emacs Speedbar,
which always displays a count of messages in MH folders and especially
wants to let you know about folders with unread messages. Since I
had the hammer of my mlists script handy, I proceeded to mutate
it to be what a comment in the script describes as "a discount
maintainer of 'unseen'", so that MH-E's speedbar could draw my
attention to inbox folders that had new messages.
This is not the right way to do this. The right way to do this is
to have procmail deliver messages through MH's rcvstore, which as a MH
command can update the 'unseen' sequence properly. But using rcvstore
is annoying, partly because you have to use another program to add
the locking it needs, so at every point the path of least resistance
was to add a bit more hacks to what I already had. I had procmail,
and procmail could deliver to MH folder directories, so I used it
(and at the time the limitations were something I considered a
feature). I had a script to give me basic information, so it could
give me more information, and then it could do one useful thing
while it was giving me information, and then the one useful thing
grew into updating 'unseen'.
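(For contrast, a hedged sketch of what the rcvstore version might look like, again with a made-up list and using procmail's local lockfile for serialization, which may or may not be all the locking rcvstore really needs:

:0 w: exampledev.lock
* ^List-Id:.*exampledev\.lists\.example\.org
| rcvstore +inbox/exampledev

I haven't tested this, which is sort of the point.)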
And since I have all of this, it's not even worth the effort of
switching to the proper rcvstore approach and throwing a
bunch of it away. I'm always going to want the 'tell me stuff'
functionality of my mlists script, so part of it has to stay
anyway.
Can I see similarities between this and how various of our system
tools have evolved, mutated, and become increasingly complex? Of
course. I think it's much the same obvious forces involved, because
each step seems reasonable in isolation, right up until I've built
a discount environment that duplicates much of rcvstore.
Sidebar: an extra bonus bit of complexity
It turns out that part of the time, I want to get some degree of live notification of messages being filed into these inbox folders. I may not look at all or even many of them, but there are some periodic things that I do want to pay attention to. So my discount special hack is basically:
tail -f .mail/procmail-log | egrep -B2 --no-group-separator 'Folder: /u/cks/Mail/inbox/'
(This is a script, of course, and I run it in a terminal window.)
This could be improved in various ways but then I'd be sliding down the convoluted complexity slope and I'm not willing to do that. Yet. Give it a few years and I may be back to write an update.
More on the tools I use to read email affecting my email reading
About two years ago I wrote an entry about how my switch from reading email with exmh to reading it in GNU Emacs with MH-E had affected my email reading behavior more than I expected. As time has passed and I've made more extensive customizations to my MH-E environment, this has continued. One of the recent ways I've noticed is that I'm slowly making more and more use of the fact that GNU Emacs is a multi-window editor ('multi-frame' in Emacs terminology) and reading email with MH-E inside it still leaves me with all of the basic Emacs facilities. Specifically, I can create several Emacs windows (frames) and use this to be working in multiple MH folders at the same time.
Back when I used exmh extensively, I mostly had MH pull my email into the default 'inbox' folder, where I dealt with it all at once. Sometimes I'd wind up pulling some new email into a separate folder, but exmh only really giving me a view of a single folder at a time combined with a system administrator's need to be regularly responding to email made that a bit awkward. At first my use of MH-E mostly followed that; I had a single Emacs MH-E window (frame) and within that window I switched between folders. But lately I've been creating more new windows when I want to spend time reading a non-inbox folder, and in turn this has made me much more willing to put new email directly into different (MH) folders rather than funnel it all into my inbox.
(I don't always make a new window to visit another folder, because I don't spend long on many of my non-inbox folders for new email. But for various mailing lists and so on, reading through them may take at least a bit of time so it's more likely I'll decide I want to keep my MH inbox folder still available.)
One thing that makes this work is that MH-E itself has reasonably good support for displaying and working on multiple folders at once. There are probably ways to get MH-E to screw this up and run MH commands with the wrong MH folder as the current folder, so I'm careful that I don't try to have MH-E carry out its pending MH operations in two MH-E folders at the same time. There are areas where MH-E is less than ideal when I'm also using command-line MH tools, because MH-E changes MH's global notion of the current folder any time I have it do things like show a message in some folder. But at least MH-E is fine (in normal circumstances) if I use MH commands to change the current folder; MH-E will just switch it back the next time I have it show another message.
PS: On a purely pragmatic basis, another change in my email handling is that I'm no longer as irritated with HTML emails because GNU Emacs is much better at displaying HTML than exmh was. I've actually left my MH-E setup showing HTML by default, instead of forcing multipart/alternative email to always show the text version (my exmh setup). GNU Emacs and MH-E aren't up to the level of, say, Thunderbird, and sometimes this results in confusing emails, but it's better than it was.
(The situation that seems tricky for MH-E is that people sometimes include inlined images, for example screenshots as part of problem reports, and MH-E doesn't always give any indication that it's even omitting something.)
Maybe I should add new access control rules at the front of rule lists
Not infrequently I wind up maintaining slowly growing lists of filtering rules to either allow good things or weed out bad things. Not infrequently, traffic can potentially match more than one filtering rule, either because it has multiple bad (or good) characteristics or because some of the match rules overlap. My usual habit has been to add new rules to the end of my rule lists (or the relevant section of them), so the oldest rules are at the top and the newest ones are at the bottom.
After writing about how access control rules need some form of usage counters, it's occurred to me that maybe I want to reverse this, at least in typical systems where the first matching rule wins. The basic idea is that the rules I'm most likely to want to drop are the oldest rules, but by having them first I'm hindering my ability to see if they've been made obsolete by newer rules. If an old rule matches some bad traffic, a new rule matches all of the bad traffic, and the new rule is last, any usage counters will show a mix of the old rule and the new rule, making it look like the old rule is still necessary. If the order was reversed, the new rule would completely occlude the old rule and usage counters would show me that I could weed the old rule out.
(My view is that it's much less likely that I'll add a new rule at the bottom that's completely ineffectual because everything it matches is already matched by something earlier. If I'm adding a new rule, it's almost certainly because something isn't being handled by the collection of existing rules.)
Another possible advantage to this is that it will keep new rules at the top of my attention, because when I look at the rule list (or the section of it) I'll probably start at the top. Currently, the top is full of old rules that I usually ignore, but if I put new rules first I'll naturally see them right away.
(I think that most things I deal with are 'first match wins' systems. A 'last match wins' system would naturally work right here, but it has other confusing aspects. I also have the impression that adding new rules at the end is a common thing, but maybe it's just in the cultural water here.)
Access control rules need some form of usage counters
Today, for reasons outside the scope of this entry, I decided to spend some time maintaining and pruning the access control rules for Wandering Thoughts, this blog. Due to the ongoing crawler plague (and past abuses), Wandering Thoughts has had to build up quite a collection of access control rules, which are mostly implemented as a bunch of things in an Apache .htaccess file (partly 'Deny from ...' for IP address ranges and partly as rewrite rules based on other characteristics). The experience has left me with a renewed view of something, which is that systems with access control rules need some way of letting you see which rules are still being used by your traffic.
It's in the nature of systems with access control rules to accumulate more and more rules over time. You hit another special situation, you add another rule, perhaps to match and block something or perhaps to exempt something from blocking. These rules often interact in various ways, and over time you'll almost certainly wind up with a tangled thicket of rules (because almost no one goes back to carefully check and revisit all existing rules when they add a new one or modify an existing one). The end result is a mess, and one of the ways to reduce the mess is to weed out rules that are now obsolete. One way a rule can be obsolete is that it's not used any more, and often these are the easiest rules to drop once you can recognize them.
(A rule that's still being matched by traffic may be obsolete for other reasons, and rules that aren't currently being matched may still be needed as a precaution. But it's a good starting point.)
If you have the necessary log data, you can sometimes establish if a rule was actually ever used by manually checking your logs. For example, if you have logs of rejected traffic (or logs of all traffic), you can search it for an IP address range to see if a particular IP address rule ever matched anything. But this requires tedious manual effort and that means that only determined people will go through it, especially regularly. The better way is to either have this information provided directly, such as by counters on firewall rules, or to have something in your logs that makes deriving it easy.
(An Apache example would be to augment any log line that was matched by some .htaccess rule with a name or a line number or the like. Then you could go readily through your logs to determine which lines were matched and how often.)
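For example (a hypothetical sketch, not what Wandering Thoughts actually does), a rewrite-based block could tag the request with a rule name that then shows up in the access log:

# .htaccess: block a bad user agent and name the rule via an environment variable
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} "badbot" [NC]
RewriteRule ^ - [E=aclrule:badbot-ua,F]

# main server configuration: log which rule (if any) fired
LogFormat "%h %l %u %t \"%r\" %>s %b rule=%{aclrule}e" acltagged
CustomLog logs/access_log acltagged

Then something as crude as a 'grep -c' per rule name over the logs gives you usage counters.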
The next time I design an access control rule system, I'm hopefully going to remember this and put something in its logging to (optionally) explain its decisions.
(Periodically I write something that has an access control rule system of some sort. Unfortunately all of mine to date have been quiet on this, so I'm not at all without sin here.)
Our too many paths to 'quiet' Prometheus alerts
One of the things our Prometheus environment has is a notion of different sorts of alerts, and in particular of less important alerts that should go to a subset of people (ie, me). There are various reasons for this, including that the alert is in testing, or it concerns a subsystem that only I should have to care about, or that it fires too often for other people (for example, a reboot notification for a machine we routinely reboot).
For historical reasons, there are at least four different ways that this can be done in our Prometheus environment:
- a special label can be attached to the Prometheus alert rule, which is appropriate if the alert rule itself is in testing or otherwise is low priority.
- a special label can be attached to targets in a scrape configuration, although this has some side effects that can be less than ideal. This affects all alerts that trigger based on metrics from, for example, the Prometheus host agent (for that host).
- our Prometheus configuration itself can apply alert relabeling to add the special label for everything from a specific host, as indicated by a "host" label that we add. This is useful if we have a lot of exporters being scraped from a particular host, or if I want to keep metric continuity (ie, the metrics not changing their label set) when a host moves into production.
- our Alertmanager configuration can specifically route certain alerts about certain machines to the 'less important alerts' destination. (A sketch of these last two mechanisms follows this list.)
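As a rough sketch of the last two mechanisms, they might look something like this; the 'notify' label, the host name, and the receiver names here are invented for the example, not our actual configuration.

# prometheus.yml: alert relabeling that downgrades everything from one host
alerting:
  alert_relabel_configs:
    - source_labels: [host]
      regex: fileserver-test
      action: replace
      target_label: notify
      replacement: sysadmin-only

# alertmanager.yml: route alerts carrying that label to a smaller receiver
route:
  receiver: everyone
  routes:
    - matchers:
        - notify = "sysadmin-only"
      receiver: sysadmin-only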
The drawback of these assorted approaches is that now there are at least three places to check and possibly to update when a host moves from being a testing host into being a production host. A further drawback is that some of these (the first two) are used a lot more often than others (the last two). When you have multiple things, some of which are infrequently used, and fallible humans have to remember to check them all, you can guess what can happen next.
And that is the simple version of why alerts about one of our fileservers wouldn't have gone to everyone here for about the past year.
How I discovered the problem was that I got an alert about one of the fileserver's Prometheus exporters restarting, and decided that I should update the alert configuration to make it so that alerts about this service restarting only went to me. As I was in the process of doing this, I realized that the alert already had only gone to me, despite there being no explicit configuration in the alert rule or the scrape configuration. This set me on an expedition into the depths of everything else, where I turned up an obsolete bit in our general Prometheus configuration.
On the positive side, now I've audited our Prometheus and Alertmanager configurations for any other things that shouldn't be there. On the negative side, I'm now not completely sure that there isn't a fifth place that's downgrading (some) alerts about (some) hosts.
The Bash Readline bindings and settings that I want
Normally I use Bash (and Readline in general) in my own environment, where I have a standard .inputrc set up to configure things to my liking (although it turns out that one particular setting doesn't work now (and may never have), and I didn't notice). However, sometimes I wind up using Bash in foreign environments, for example if I'm su'd to root at the moment, and when that happens the differences can be things that I get annoyed by. I spent a bit of today running into this again and being irritated enough that this time I figured out how to fix it on the fly.
The general Bash command to do readline things is 'bind',
and I believe it accepts all of the same syntax as readline init
files do,
both for keybindings and for turning off (mis-)features like
bracketed paste (which we
dislike enough that turning it off for root is a standard feature
of our install framework). This makes it convenient if I forget the
exact syntax, because I can just look at my standard .inputrc and
copy lines from it.
What I want to do is the following:
- Switch Readline to the Unix word erase behavior I want:
set bind-tty-special-chars off
Control-w: backward-kill-word
Both of these are necessary because without the first, Bash will automatically bind Ctrl-w (my normal word-erase character) to 'unix-word-rubout' and not let you override that with your own binding.
(This is the difference that I run into all the time, because I'm very used to being able to use Ctrl-W to delete only the most recent component of a path. I think this partly comes from habit and partly because you tab-complete multi-component paths a component at a time, so if I mis-completed the latest component I want to Ctrl-W just that component. M-Del is a standard Readline binding for this, but it's less convenient to type and not something I remember.)
- Make readline completion treat symbolic links to directories as if they
were directories:
set mark-symlinked-directories on
When completing paths and so on, I mostly don't bother thinking about the difference between an actual directory (such as /usr/bin) and a symbolic link to a directory (such as /bin on modern Linuxes). If I type '/bi<TAB>' I want this to complete to '/bin/', not '/bin', because it's basically guaranteed that I will go on to tab-complete something in '/bin/'. If I actually want the symbolic link, I'll delete the trailing '/' (which does happen every so often, but much less frequently than I want to tab-complete through the symbolic link).
- Make readline forget any random edits I did to past history lines when I
hit Return to finally do something:
set revert-all-at-newline on
The behavior I want from readline is that past history is effectively immutable. If I edit some bit of it and then abandon the edit by moving to another command in the history (or just start a command from scratch), the edited command should revert to what I actually typed, no later than when I hit Return on the current command and start a new one. It infuriates me when I cursor-up (on a fresh command) and don't see exactly the past commands that I typed.
(My notes say I got this from Things You Didn't Know About GNU Readline.)
This is more or less in the order I'm likely to fix them. The different (and to me wrong) behavior of C-w is a relatively constant irritation, while the other two are less frequent.
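Since 'bind' accepts readline init file syntax, fixing all of this on the fly in a foreign Bash session comes down to something like the following (these are just the settings above, typed as 'bind' commands):

# run in the interactive Bash session (for example, while su'd to root)
bind 'set bind-tty-special-chars off'
bind '"\C-w": backward-kill-word'
bind 'set mark-symlinked-directories on'
bind 'set revert-all-at-newline on'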
(If this irritates me enough on a particular system, I can probably
do something in root's .bashrc, if only to add an alias to use
'bind -f ...' on a prepared file. I can't set these in /root/.inputrc,
because my co-workers don't particularly agree with my tastes on
these and would probably be put out if standard readline behavior
they're used to suddenly changed on them.)
(In other Readline things I want to remember, there's Readline's support for fishing out last or first or Nth arguments from earlier commands.)
Giving up on Android devices using IPv6 on our general-access networks
We have a couple of general purpose, general access networks that anyone can use to connect their devices to; one is a wired network (locally, it's called our 'RED' network after the colour of the network cables used for it), and the other is a departmental wireless network that's distinct from the centrally run university-wide network. However, both of these networks have a requirement that we need to be able to more or less identify who is responsible for a machine on them. Currently, this is done through (IPv4) DHCP and registering the Ethernet address of your device. This is a problem for any IPv6 deployment, because the Android developers refuse to support DHCPv6.
We're starting to look more seriously at IPv6, including sort of planning out how our IPv6 subnets will probably work, so I came back to thinking about this issue recently. My conclusion and decision was to give up on letting Android devices use IPv6 on our networks. We can't use SLAAC (StateLess Address AutoConfiguration) because that doesn't require any sort of registration, and while Android devices apparently can use IPv6 Prefix Delegation, that would consume /64s at a prodigious rate using reasonable assumptions. We'd also have to build a system to do it. So there's no straightforward answer, and while I can think of potential hacks, I've decided that none of them are particularly good options compared to the simple choice to not support IPv6 for Android by way of only supporting DHCPv6.
(Our requirement for registering a fixed Ethernet address also means that any device that randomizes its wireless Ethernet address on every connection has to turn that off. Hopefully all such devices actually have such an option.)
I'm only a bit sad about this, because you can only hope that a rock rolls uphill for so long before you give up. IPv6 is still not a critical thing in my corner of the world (as shown by how no one is complaining to us about the lack of it), so some phones continuing to not have IPv6 is not likely to be a big deal to people here.
(Android devices that can be connected to wired networking will be able to get IPv6 on some research group networks. Some research groups ask for their network to be open and not require pre-registration of devices (which is okay if it only exists in access-controlled space), and for IPv6 I expect we'll do this by turning on SLAAC on the research group's network and calling it a day.)
An interesting thing about people showing up to probe new DNS resolvers
Over on the Fediverse, I said something:
It appears to have taken only a few hours (or at most a few hours) from putting a new resolving DNS server into production to seeing outside parties specifically probing it to see if it's an open resolver.
I assume people are snooping activity on authoritative DNS servers and going from there, instead of spraying targeted queries at random IPs, but maybe they are mass scanning.
There turns out to be some interesting aspects to these probes. This new DNS server has two network interfaces, both firewalled off from outside queries, but only one is used as the source IP on queries to authoritative DNS servers. In addition, we have other machines on both networks, with firewalls, so I can get a sense of the ambient DNS probes.
Out of all of these various IPs, the IP that the new DNS server used for querying authoritative DNS servers, and only that IP, very soon saw queries that were specifically tuned for it:
124.126.74.2.54035 > 128.100.X.Y.53: 16797 NS? . (19)
124.126.74.2.7747 > 128.100.X.Y.7: UDP, length 512
124.126.74.2.54035 > 128.100.X.Y.53: 17690 PTR? Y.X.100.128.in-addr.arpa. (47)
This was a consistent pattern from multiple IPs; they all tried to query for the root zone, tried to check the UDP echo port, and then tried a PTR query for the machine's IP itself. Nothing else saw this pattern; not the machine's other IP on a different network, not another IP on the same network, and so on. This pattern, and the lack of it for other IPs, is what's led me to assume that people are somehow identifying probe targets based on what source IPs they see making upstream queries.
(There are a variety of ways that you could do this without having special access to DNS servers. APNIC has long used web ad networks and special captive domains and DNS servers for them to do various sorts of measurements, and you could do similar things to discover who was querying your captive DNS servers.)
How you want to have the Unbound DNS server listen on all interfaces
Suppose, not hypothetically, that you have an Unbound server with multiple network interfaces, at least two (which I will call A and B), and you'd like Unbound to listen on all of the interfaces. Perhaps these are physical interfaces and there are client machines on both, or perhaps they're virtual interfaces and you have virtual machines on them. Let's further assume that these are routed networks, so that in theory people on A can talk to IP addresses on B and vice versa.
The obvious and straightforward way to have Unbound listen on all of your interfaces is with a server stanza like this:
server:
    interface: 0.0.0.0
    interface: ::0
    # ... probably some access-control statements
This approach works 99% of the time, which is probably why it appears all over the documentation. The other 1% of the time is when a DNS client on network A makes a DNS request to Unbound's IP address on network B; when this happens, the network A client will not get any replies. Well, it won't get any replies that it accepts. If you use tcpdump to examine network traffic, you will discover that Unbound is sending replies to the client on network A using its network A IP address as the source address (which is the default behavior if you send packets to a network you're directly attached to; you normally want to use your IP on that network as the source IP). This will fail with almost all DNS client libraries because DNS clients reject replies from unexpected sources, which is to say any IP other than the IP they sent their query to.
(One way this might happen is if the client moves from network B to network A without updating its DNS configuration. Or you might be testing to see if Unbound's network B IP address answers DNS requests.)
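One quick way to see this in action (the addresses here are made up): from a client on network A, query Unbound's network B address and watch the traffic on the server:

# on a network A client, aimed at Unbound's network B address
dig @192.0.2.53 example.org +time=2 +tries=1

# on the Unbound host: replies go out with the network A source address,
# so the client discards them and the dig above times out
tcpdump -n -i <interface-A> udp port 53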
The other way to listen on all interfaces in modern Unbound is to
use 'interface-automatic: yes'
(in server options),
like this:
server:
    interface-automatic: yes
The important bit of what interface-automatic does for you is mentioned in passing in its documentation, and I've emphasized it here:
Listen on all addresses on all (current and future) interfaces, detect the source interface on UDP queries and copy them to replies.
As far as I know, you can't get this 'detect the source interface'
behavior for UDP queries in any other way if you use 'interface:
0.0.0.0' to listen on everything. You get it if you listen on
specific interfaces, perhaps with 'ip-transparent: yes'
for safety:
server:
    interface: 127.0.0.1
    interface: ::1
    interface: <network A>.<my-A-IP>
    interface: <network B>.<my-B-IP>
    # insure we always start
    ip-transparent: yes
Since 'interface-automatic' is marked as an experimental option I'd love to be wrong, but I can't spot an option in skimming the documentation and searching on some likely terms.
(I'm a bit surprised that Unbound doesn't always copy the IP address it received UDP packets on and use that for replies, because I don't think things work if you have the wrong IP there. But this is probably an unusual situation and so it gets papered over, although now I'm curious how this interacts with default routes.)
Servers will apparently run for a while even when quite hot
This past Saturday (yesterday as I write this), a university machine room had an AC failure of some kind:
It's always fun times to see a machine room temperature of 54C and slowly climbing. It's not our machine room but we have switches there, and I have a suspicion that some of them will be ex-switches by the time this is over.
This machine room and its AC has what you could call a history; in 2011 it flooded partly due to an AC failure, then in 2016 it had another AC issue, and another in 2024 (and those are just the ones I remember and can find entries for).
Most of this machine room is a bunch of servers from another department, and my assumption is that they are what created all of the heat when the AC failed. Both we and the other department have switches in the room, but networking equipment is usually relatively low-heat compared to active servers. So I found it interesting that the temperature graph rises in a smooth arc to its maximum temperature (and then drops abruptly, presumably as the AC starts to get fixed). To me this suggests that many of the servers in the room kept running, despite the ambient temperature hitting 54C (and their internal temperatures undoubtedly being much higher). If some servers powered off from the heat, it wasn't enough to stabilize the heat level of the room; it was still increasing right up to when it started dropping rapidly.
(Servers may well have started thermally throttling various things, and it's possible that some of them crashed without powering off and thus potentially without reducing the heat load. I have second hand information that some UPS units reported battery overheating.)
It's one thing to be fairly confident that server thermal limits are set unrealistically high. It's another thing to see servers (probably) keep operating at 54C, rather than fall over with various sorts of failures. For example, I wouldn't have been surprised if power supplies overheated and shut down (or died entirely).
(I think desktop PSUs are often rated as '0C to 50C', but I suspect that neither end of that rating is actually serious, and this was over 50C anyway.)
I rather suspect that running at 50+C for a while has increased the odds of future failures and shortened the lifetime of everything in this machine room (our switches included). But it still amazes me a bit that things didn't fall over and fail, even above 50C.
(When I started writing this entry I thought I could make some fairly confident predictions about the servers keeping running purely from the temperature graph. But the more I think about it, the less I'm sure of that. There are a lot of things that could be going on, including server failures that leave them hung or locked up but still with PSUs running and pumping out heat.)
My policy of semi-transience and why I have to do it
Some time back I read Simon Tatham's Policy of transience (via) and recognized both points of similarity and points of drastic departure between Tatham and me. Both Tatham and I use transient shell history, transient terminal and application windows (sort of for me), and don't save our (X) session state, and in general I am a 'disposable' usage pattern person. However, I depart from Tatham in that I have a permanently running browser and I normally keep my login sessions running until I reboot my desktops. But broadly I'm a 'transient' or 'disposable' person, where I mostly don't keep inactive terminal windows or programs around in case I might want them again, or even immediately re-purpose them from one use to another.
(I do have some permanently running terminal windows, much like I have permanently present other windows on my desktop, but that's because they're 'in use', running some program. And I have one inactive terminal window but that's because exiting that shell ends my entire X session.)
The big way that I depart from Tatham is already visible in my old desktop tour, in the form of a collection of iconified browser windows (in carefully arranged spots so I can in theory keep track of them). These aren't web pages I use regularly, because I have a different collection of schemes for those. Instead they're a collection of URLs that I'm keeping around to read later or in general to do something with. This is anathema to Tatham, who keeps track of URLs to read in other ways, but I've found that it's absolutely necessary for me.
Over and over again I've discovered that if something isn't visible to me, shoved in front of my nose, it's extremely likely to drop completely out of my mind. If I file email into a 'to be dealt with' or 'to be read later' or whatever folder, or if I write down URLs to visit later and explanations of them, or any number of other things, I almost might as well throw those things away. Having a web page in an iconified Firefox window in no way guarantees that I'll ever read it, but writing its URL down in a list guarantees that I won't. So I keep an optimistic collection of iconified Firefox windows around (and every so often I look at some of them and give up on them).
It would be nice if I didn't need to do this and could de-clutter various bits of my electronic life. But by now I've made enough attempts over a long enough period of time to be confident that my mind doesn't work that way and is unlikely to ever change its ways. I need active, ongoing reminders for things to stick, and one of the best forms is to have those reminders right on my desktop.
(And because the reminders need to be active and ongoing, they also need to be non-intrusive. Mailing myself every morning with 'here are the latest N URLs you've saved to read later' wouldn't work, for example.)
PS: I also have various permanently running utility programs and their windows, so my desktop is definitely not minimalistic. A lot of this is from being a system administrator and working with a bunch of systems, where I want various sorts of convenient fast access and passive monitoring of them.
My approach to testing new versions of Exim for our mail servers
When I wrote about how Exim's ${run ...} string expansion
operator changed how it did quoting, I (sort
of) mentioned that I found this when I tested a new version of
Exim. Some people would do testing like
this in a thorough, automated manner, but I don't go that far.
Instead I have a written down test plan, with some resources set
up for it in advance. Well, it's more accurate to say that I have
test plans, because I have a separate test plan for each of our
important mail servers because they have different features and so
need different things tested.
In the beginning I simply tested all of the important features of a particular mail server by hand and from memory when I rebuilt it on a new version of Ubuntu. Eventually I got tired of having to reinvent my test process from scratch (or from vague notes) every time around (for each mail server), so I started writing it down. In the process of writing my test process down the natural set of things happened; I made it more thorough and systematic, and I set up various resources (like saved copies of the EICAR test file) to make testing more cut and paste. Having an organized, written down test plan, even as basic as ours is, has made it easier to test new builds of our Exim servers and made that testing more comprehensive.
I test most of our mail servers primarily by using swaks to send various bits of test email to them and then watching what happens (both in the swaks SMTP session and in the Exim logs). So a lot of the test plan is 'run this swaks command and ...', with various combinations of sending and receiving addresses, starting with the very most basic test of 'can it deliver from a valid dummy address to a valid dummy address'. To do some sorts of testing, such as DNS blocklist tests, I take advantage of the fact that all of the IP-based DNS blocklists we use include 127.0.0.2, so that part of the test plan is 'use swaks on the mail machine itself to connect from 127.0.0.2'.
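For illustration, a couple of entries in a test plan look roughly like this (the addresses and hostnames are a sketch, not our literal commands):

# the most basic test: valid dummy sender to valid dummy recipient
swaks --server mailtest.example.com \
      --from testuser@example.com --to testdest@example.com \
      --header "Subject: basic delivery test"

# DNS blocklist handling: run on the mail machine itself and connect
# from 127.0.0.2, which the IP-based DNS blocklists include
swaks --server 127.0.0.1 --local-interface 127.0.0.2 \
      --from testuser@example.com --to testdest@example.com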
(Some of our mail servers can apply different filtering rules to different local addresses, so I have various pre-configured test addresses set up to make it easy to test that per-address filtering is working.)
The actual test plans are mostly a long list of 'run more or less this swaks command, pointing it at your test server, to test this thing, and you should see the following result'. This is pretty close to cut and paste, which makes it relatively easy and fast for me to run through.
One qualification is that these test plans aren't attempting to be an exhaustive check of everything we do in our Exim configurations. Instead, they're mostly about making sure that the basics work, like delivering straightforward email, and that Exim can interact properly with the outside world, such as talking to ClamAV and rspamd or running external programs (which also tests that the programs themselves work on the new Ubuntu version). Testing every corner of our configurations would be exhausting and my feeling is that it would generally be pointless. Exim is stable software and mostly doesn't change or break things from version to version.
(Part of this is pragmatic experience with Exim and knowledge of what our configuration does conditionally and what it checks all of the time. If Exim does a check all of the time and basic mail delivery works, we know we haven't run into, say, an issue with tainted data.)
Some practical challenges of access management in 'IAM' systems
Suppose that you have a shiny new IAM system, and you take the 'access management' part of it seriously. Global access management is (or should be) simple; if you disable or suspend someone in your IAM system, they should wind up disabled everywhere. Well, they will wind up unable to authenticate. If they have existing credentials that are used without checking with your IAM system (including things like 'an existing SSH login'), you'll need some system to propagate the information that someone has been disabled in your IAM to consumers and arrange that existing sessions, credentials, and so on get shut down and revoked.
(This system will involve both IAM software features and features in the software that uses the IAM to determine identity.)
However, this only covers global access management. You probably have some things that only certain people should have access to, or that treat certain people differently. This is where our experiences with a non-IAM environment suggest to me that things start getting complex. For pure access, the simplest thing probably is if every separate client system or application has a separate ID and directly talks to the IAM, and the IAM can tell it 'this person cannot authenticate (to you)' or 'this person is disabled (for you)'. This starts to go wrong if you ever put two or more services or applications behind the same IAM client ID, for example if you set up a web server for one application (with an ID) and then host another application on the same web server because of convenience (your web server is already there and already set up to talk to the IAM and so on).
This gets worse if there is a layer of indirection involved, so that systems and applications don't talk directly to your IAM but instead talk to, say, an LDAP server or a Radius server or whatever that's fed from your IAM (or is the party that talks to your IAM). I suspect that this is one reason why IAM software has a tendency to directly support a lot of protocols for identity and authentication.
(One thing that's sort of an extra layer of indirection is what people are trying to do, since they may have access permission for some things but not others.)
Another approach is for your IAM to only manage what 'groups' people are in and provide that information to clients, leaving it up to clients to make access decisions based on group membership. On the one hand, this is somewhat more straightforward; on the other hand, your IAM system is no longer directly managing access. It has to count on clients doing the right thing with the group information it hands them. At a minimum this gives you much less central visibility into what your access management rules are.
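As one hypothetical illustration of the 'clients decide based on groups' approach, a web application front-end might check an LDAP group (ultimately fed from the IAM) itself, along these lines (assuming Apache with mod_authnz_ldap; all of the names are invented):

<Location /app>
    AuthType Basic
    AuthName "App"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid"
    Require ldap-group cn=app-users,ou=groups,dc=example,dc=com
</Location>

The IAM only has to keep the group membership correct; whether the check is actually enforced, and enforced this way, is entirely up to the client.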
People not infrequently want complicated access control conditions for individual applications (including things like privilege levels). In any sort of access management system, you need to be able to express these conditions in rules. There's no uniform approach or language for expressing access control conditions, so your IAM will use one, your Unix systems will use one (or more) that you probably get to craft by hand using PAM tricks, your web applications will use one or more depending on what they're written in, and so on and so forth. One of the reasons that these languages differ is that the capabilities and concepts of each system will differ; a mesh VPN has different access control concerns than a web application. Of course these differences make it challenging to handle all of their access management in one single spot in an IAM system, leaving you with the choice of either not being able to do everything you want to but having it all in the IAM or having partially distributed access management.
A change in how Exim's ${run ...} string expansion operator does quoting
The Exim mail server has, among other features,
a string expansion language
with quite a number of expansion operators.
One of those expansion operators is '${run}',
which 'expands' by running a command and substituting in its output.
As is commonly the case, ${run} is given the command to run and
all of its command line arguments as a single string, without any
explicit splitting into separate arguments:
${run {/some/command -a -b foo -c ...} [...]}
Any time a program does this, a very important question to ask is how this string is split up into separate arguments in order to be exec()'d. In Exim's case, the traditional answer is that it was rather complicated and not well documented, in a way that required you to explicitly quote many arguments that came from variables. In my entry on this I called Exim's then current behavior dangerous and wrong but also said it was probably too late to change it. Fortunately, the Exim developers did not heed my pessimism.
In Exim 4.96, this behavior of ${run} changed. To quote from the changelog:
The ${run} expansion item now expands its command string elements after splitting. Previously it was before; the new ordering makes handling zero-length arguments simpler. The old ordering can be obtained by appending a new option "preexpand", after a comma, to the "run".
(The new way is more or less the right way to do it, although it can create problems with some sorts of command string expansions.)
This is an important change because this change is not backward compatible if you used deliberate quoting in your ${run} command string. For example, if you ever expanded a potentially dangerous Exim variable in a ${run} command (for example, one that might have a space in it), you previously had to wrap it in ${quote}:
${run {/some/command \
--subject ${quote:$header_subject:} ...
(As seen in my entry on our attachment type logging with Exim.)
In Exim 4.96 and later, this same ${run} string expansion will add spurious quote marks around the email message's Subject: header as your program sees it. This is because ${quote:...} will add them, since you asked it to generate a quoted version of its argument, and then ${run} won't strip them out as part of splitting the command string apart into arguments because the command string has already been split before the ${quote:} was done. What this shows is that you probably don't need explicit quoting in ${run} command strings any more, unless you're doing tricky expansions with string expressions (in which case you'll have to switch back to the old way of doing it).
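So in Exim 4.96 and later, my understanding is that the same sort of expansion gets written without the ${quote:}, roughly like this (a sketch following the example above):

${run {/some/command \
       --subject $header_subject: ...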
To be clear, I'm all for this change. It makes straightforward and innocent use of ${run} much safer and more reliable (and it plays better with Exim's new rules about 'tainted' strings from the outside world, such as the subject header). Having to remove my use of ${quote:...} is a minor price to pay, and learning this sort of stuff in advance is why I build test servers and have test plans.
(This elaborates on a Fediverse post of mine.)
My system administrator's view of IAM so far (from the outside)
Over on the Fediverse I said something about IAM:
My IAM choices appear to be "bespoke giant monolith" or "DIY from a multitude of OSS pieces", and the natural way of life appears to be that you start with the latter because you don't think you need IAM and then you discover maybe you have to blow up the world to move to the first.
At work we are the latter: /etc/passwd to LDAP to a SAML/OIDC server depending on what generation of software and what needs. With no unified IM or AM, partly because no rules system for expressing it.
Identity and Access Management (IAM) isn't the same thing as (single sign on) authentication, although I believe it's connected to authorization if you take the 'Access' part seriously, and also a bunch of IAM systems will also do some or all of authentication too so everything is in one place. However, all of these things can be separated, and in complex environments they are (for example, the university's overall IAM environment, also).
(If you have an IAM system you're presumably going to want to feed information from it to your authentication system, so that it knows who is (still) valid to authenticate and perhaps how.)
I believe that one thing that makes IAM systems complicated is interfacing with what could be called 'legacy systems', which in this context includes garden variety Unix systems. If you take your IAM system seriously, everything that knows about 'logins' or 'users' needs to somehow be drawing data from the IAM system, and the IAM system has to know how to provide each with the information it needs. Or alternately your legacy systems need to somehow merge local identity information (Unix home directories, UIDs, GIDs, etc) with the IAM information. Since people would like their IAM system to do it all, I think this is one driver of IAM system complexity and those bespoke giant monoliths that want to own everything in your environment.
(The reason to want your IAM system to do it all is that if it doesn't, you're building a bunch of local tools and then your IAM information is fragmented. What UID is this person on your Unix systems? Only your Unix systems know, not your central IAM database. For bonus points, the person might have different UIDs on different Unix systems, depending.)
If you start out with a green field new system, you can probably build in this central IAM from the start (assuming that you can find and operate IAM software that does what you want and doesn't make you back away in terror). But my impression is that central IAM systems are quite hard to set up, so the natural alternative is that you start without an IAM system and then are possibly faced with trying to pull all of your /etc/passwd, Apache authentication data, LDAP data, and so on into a new IAM system that is somehow going to take over the world. I have no idea how you'd pull off this transition, although presumably people have.
(In our case, we started our Unix systems well before IAM systems existed. There are accounts here that have existed since the 1980s, partly because professors and retired professors tend to stick around for a long time.)
The difficulty of moving our environment to anything like an IAM system leaves me looking at the whole thing from the outside. If we had to add an 'IAM system', it would likely be because something else we wanted to do needed to be fed data from some IAM system using some IAM protocol. The IAM system would probably not become the center of identity and access management, but just another thing that we pushed information into and updated information in.
Make Your Own Backup System – Part 2: Forging the FreeBSD Backup Stronghold
New Article on BSD Cafe Journal: WordPress on FreeBSD with BastilleBSD
Realizing we needed two sorts of alerts for our temperature monitoring
We have a long standing system to monitor the temperatures of our machine rooms and alert us if there are problems. A recent discussion about the state of the temperature in one of them made me realize that we want to monitor and alert for two different problems, and because they're different we need two different sorts of alerts in our monitoring system.
The first, obvious problem is a machine room AC failure, where the AC shuts off or becomes almost completely ineffective. In our machine rooms, an AC failure causes a rapid and sustained rise in temperature to well above its normal maximum level (which is typically reached just before the AC starts its next cooling cycle). AC failures are high priority issues that we want to alert about rapidly, because we don't have much time before machines start to cook themselves (and they probably won't shut themselves down before the damage has been done).
The second problem is an AC unit that can't keep up with the room's heat load; perhaps its filters are (too) clogged, or it's not getting enough cooling from the roof chillers, or various other mysterious AC reasons. The AC hasn't failed and it is still able to cool things to some degree and keep the temperature from racing up, but over time the room's temperature steadily drifts upward. Often the AC will still be cycling on and off to some degree and we'll see the room temperature vary up and down as a result; at other times the room temperature will basically reach a level and more or less stay there, presumably with the AC running continuously.
One issue we ran into is that a fast triggering alert that was implicitly written for the AC failure case can wind up flapping up and down if insufficient AC has caused the room to slowly drift close to its triggering temperature level. As the AC works (and perhaps cycles on and off), the room temperature will shift above and then back below the trigger level, and the alert flaps.
We can't detect both situations with a single alert, so we need at least two. Currently, the 'AC is not keeping up' alert looks for sustained elevated temperatures with the temperature always at or above a certain level over (much) more time than the AC should take to bring it down, even if the AC has to avoid starting for a bit of time to not cycle too fast. The 'AC may have failed' alert looks for high temperatures over a relatively short period of time, although we may want to make this an average over a short period of time.
(The advantage of an average is that if the temperature is shooting up, it may trigger faster than a 'the temperature is above X for Y minutes' alert. The drawback is that an average can flap more readily than a 'must be above X for Y time' alert.)
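As a sketch of what these two alerts might look like as Prometheus alert rules (the metric name, room label, thresholds, and durations here are all invented for the example):

groups:
  - name: machineroom-temperature
    rules:
      # 'AC may have failed': high temperature, averaged over a short window
      - alert: MachineRoomACMayHaveFailed
        expr: avg_over_time(sensor_temp_celsius{room="example"}[5m]) > 30
      # 'AC is not keeping up': the temperature never drops below an elevated
      # level for much longer than a normal cooling cycle should take
      - alert: MachineRoomACNotKeepingUp
        expr: min_over_time(sensor_temp_celsius{room="example"}[45m]) > 26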
Checklists are hard (but still a good thing)
We recently had a big downtime at work where part of the work was me doing a relatively complex and touchy thing. Naturally I made a checklist, but also naturally my checklist turned out to be incomplete, with some things I'd forgotten and some steps that weren't quite right or complete. This is a good illustration that checklists are hard to create.
Checklists are hard partly because they require us to try to remember, reconstruct, and understand everything in what's often a relatively complex system that is too big for us to hold in our mind. If your understanding is incomplete you can overlook something and so leave out a step or a part of a step, and even if you write down a step you may not fully remember (and record) why the step has to be there. My view is that this is especially likely in system administration where we may have any number of things that have been quietly sitting in the corner for some time, working away without problems, and so they've slipped out of our minds.
(For example, one of the issues that we ran into in this downtime was not remembering all of the hosts that ran crontab jobs that used one particular filesystem. Of course we thought we did know, so we didn't try to systematically look for such crontab jobs.)
To get a really solid checklist you have to be able to test it, much like all documentation needs testing. Unfortunately, a lot of the checklists I write (or don't write) are for one-off things that we can't really test in advance for various reasons, for example because they involve a large scale change to our live systems (that requires a downtime). If you're lucky you'll realize that you don't know something or aren't confident in something while writing the checklist, so you can investigate it and hopefully get it right, but some of the time you'll be confident you understand the problem but you're wrong.
Despite any imperfections, checklists are still a good thing. An imperfect written down checklist is better than relying on your memory and mind on the fly almost all of the time (the rare exceptions are when you wouldn't even dare do the operation without a checklist but an imperfect checklist tempts you into doing it and fumbling).
(You can try to improve the situation by keeping notes on what was missed in the checklist and then saving or publishing these notes somewhere. You can review these after the fact notes on what was missed in this specific checklist if you have to do the thing again, or look for specific types of things you tend to overlook and should specifically check for the next time you're making a checklist that touches on some area.)
People still use our old-fashioned Unix login servers
Every so often I think about random things, and today's random thing was how our environment might look if it was rebuilt from scratch as a modern style greenfield development. One of the obvious assumptions is that it'd involve a lot of use of containers, which led me to wondering how you handle traditional Unix style login servers. This is a relevant issue for us because we have such traditional login servers and somewhat to our surprise, they still see plenty of use.
We have two sorts of login servers. There's effectively one general purpose login server that people aren't supposed to do heavy duty computation on (and which uses per-user CPU and RAM limits to help with that), and four 'compute' login servers where they can go wild and use up all of the CPUs and memory they can get their hands on (with no guarantees that there will be any, those machines are basically first come, first served; for guaranteed CPUs and RAM people need to use our SLURM cluster). Usage of these servers has declined over time, but they still see a reasonable amount of use, including by people who have only recently joined the department (as graduate students or otherwise).
What people log in to our compute servers to do probably hasn't changed much, at least in one sense; people probably don't log in to a compute server to read their mail with their favorite text mode mail reader (yes, we have Alpine and Mutt users). What people use the general purpose 'application' login server for likely has changed a fair bit over time. It used to be that people logged in to run editors, mail readers, and other text and terminal based programs. However, now a lot of logins seem to be done either to SSH to other machines that aren't accessible from the outside world or to run the back-ends of various development environments like VSCode. Some people still use the general purpose login server for traditional Unix login things (me included), but I think it's rarer these days.
(Another use of both sorts of servers is to run cron jobs; various people have various cron jobs on one or the other of our login servers. We have to carefully preserve them when we reinstall these machines as part of upgrading Ubuntu releases.)
PS: I believe the reason people run IDE backends on our login servers is because they have their code on our fileservers, in their (NFS-mounted) home directories. And in turn I suspect people put the code there partly because they're going to run the code on either or both of our SLURM cluster or the general compute servers. But in general we're not well informed about what people are using our login servers for due to our support model.
What OSes we use here (as of July 2025)
About five years ago I wrote an entry on what OSes we were using at the time. Five years is both a short time and a long time here, and in that time some things have changed.
Our primary OS is still Ubuntu LTS; it's our default and we use it on almost everything. On the one hand, these days 'almost everything' covers somewhat more ground than it did in 2020, as some machines have moved from OpenBSD to Ubuntu. On the other hand, as time goes by I'm less and less confident that we'll still be using Ubuntu in five years, because I expect Canonical to start making (more) unfortunate and unacceptable changes any day now. Our most likely replacement Linux is Debian.
CentOS is dead here, killed by a combination of our desire to not have two Linux variants to deal with and CentOS Stream. We got rid of the last of our CentOS machines last year. Conveniently, our previous commercial anti-spam system vendor effectively got out of the business so we didn't have to find a new Unix that they supported.
We're still using OpenBSD, but it's increasingly looking like a legacy OS that's going to be replaced by FreeBSD as we rebuild the various machines that currently run OpenBSD. Our primary interests are better firewall performance and painless mirrored root disks, but if we're going to run some FreeBSD machines and it can do everything OpenBSD can, we'd like to run fewer Unixes so we'll probably replace all of the OpenBSD machines with FreeBSD ones over time. This is a shift in progress and we'll see how far it goes, but I don't expect the number of OpenBSD machines we run to go up any more; instead it's a question of how far down the number goes.
(Our opinions about not using Linux for firewalls haven't changed. We like PF, it's just we like FreeBSD as a host for it more than OpenBSD.)
We continue to not use containers so we don't have to think about a separate, minimal Linux for container images.
There are a lot of research groups here and they run a lot of machines, so research group machines are most likely running a wide assortment of Linuxes and Unixes. We know that Ubuntu (both LTS and non-LTS) is reasonably popular among research groups, but I'm sure there are people with other distributions and probably some use of FreeBSD, OpenBSD, and so on. I believe there may be a few people still using Solaris machines.
(My office desktop continues to run Fedora, but I wouldn't run it on any production server due to the frequent distribution version updates. We don't want to be upgrading distribution versions every six months.)
Overall I'd say we've become a bit more of an Ubuntu LTS monoculture than we were before, but it's not a big change, partly because we were already mostly Ubuntu. Given our views on things like firewalls, we're probably never going to be all-Ubuntu or all-Linux.
The easiest way to interact with programs is to run them in terminals
I recently wrote about a new little script of mine, which I use to start programs in terminals in a way that I can interact with them (to simplify it). Much of what I start with this tool doesn't need to run in a terminal window at all; the actual program will talk directly to the X server or arrange to talk to my Firefox or the like. I could in theory start them directly from my X session startup script, as I do with other things.
The reason I haven't put these things in my X session startup is that running things in shell sessions in terminal windows is the easiest way to interact with them in all sorts of ways. It's trivial to stop the program or restart it, to look at its output, to rerun it with slightly different arguments if I need to, it automatically inherits various aspects of my current X environment, and so on. You can do all of these things with programs in ways other than using shell sessions in terminals, but it's generally going to be more awkward.
(For instance, on systemd based Linuxes, I could make some of these programs into systemd user services, but I'd still have to use systemd commands to manipulate them. If I run them as standalone programs started from my X session script, it's even more work to stop them, start them again, and so on.)
For well established programs that I expect to never restart or want to look at output from, I'll run them from my X session startup script. But for new programs, like these, they get to spend a while in terminal windows because that's the easiest way. And some will be permanent terminal window occupants because they sometimes produce (text) output.
On the one hand, using terminal windows for this is simple and effective, and I could probably make it better by using a multi-tabbed terminal program, with one tab for each program (or the equivalent in a regular terminal program with screen or tmux). On the other hand, it feels a bit sad that in 2025, our best approach for flexible interaction with a program and monitoring its output is 'put it in a terminal'.
(It's also irritating that with some programs, the easiest and best way to make sure that they really exit when you want them to shut down, rather than "helpfully" lingering on in various ways, is to run them from a terminal and then Ctrl-C them when you're done with them. I have to use a certain video conferencing application that is quite eager to stay running if you tell it to 'quit', and this is my solution to it. Someday I may have to figure out how to put it in a systemd user unit so that it can't stage some sort of great escape into the background.)
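(For the record, one way the 'can't escape into the background' containment might be done on a systemd-based Linux is to run the program in its own transient user scope; the unit name and program here are placeholders:

# start the program in a transient user scope
systemd-run --user --scope --unit=vidconf some-conferencing-app

# later, from another window, kill the scope and everything inside it
systemctl --user stop vidconf.scope

Stopping the scope takes out any helper processes the program forked off, which is the point.)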
On sysadmins (not) changing (OpenSSL) cipher suite strings
Recently I read Apps shouldn't let users enter OpenSSL cipher-suite strings by Frank Denis (via), which advocates for providing at most a high level interface to people that lets them express intentions like 'forward secrecy is required' or 'I have to comply with FIPS 140-3'. As a system administrator, I've certainly been guilty of not keeping OpenSSL cipher suite strings up to date, so I have a good deal of sympathy for the general view of trusting the clients and the libraries (and also possibly the servers). But at the same time, I think that this approach has some issues. In particular, if you're only going to set generic intents, you have to trust that the programs and libraries have good defaults. Unfortunately, historically the time when system administrators have most reached for setting specific OpenSSL cipher suite strings was when something came up all of a sudden and they didn't trust the library or program defaults to be up to date.
The obvious conclusion is that an application or library that wants people to only set high level options needs to commit to agility and fast updates so that it always has good defaults. This needs more than just the upstream developers making prompt updates when issues come up, because in practice a lot of people will get the program or library through their distribution or other packaging mechanism. A library that really wants people to trust it here needs to work with distributions to make sure that this sort of update can rapidly flow through, even for older distribution versions with older versions of the library and so on.
(For obvious reasons, people are generally pretty reluctant to touch TLS libraries and would like to do it as little as possible, leaving it to specialists and even then as much as possible to the upstream. Bad things can and have happened here.)
If I was doing this for a library, I would be tempted to give the library two sets of configuration files. One set, the official public set, would be the high level configuration that system administrators were supposed to use to express high level intents, as covered by Frank Denis. The other set would be internal configuration that expressed all of those low level details about cipher suite preferences, what cipher suites to use when, and so on, and was for use by the library developers and people packaging and distributing the library. The goal is to make it so that emergency cipher changes can be shipped as relatively low risk and easily backported internal configuration file changes, rather than higher risk (and thus slower to update) code changes. In an environment with reproducible binary builds, it'd be ideal if you could rebuild the library package with only the configuration files changed and get library shared objects and so on that were binary identical to the previous versions, so distributions could have quite high confidence in newly-built updates.
(System administrators who opted to edit this second set of files themselves would be on their own. In packaging systems like RPM and Debian .debs, I wouldn't even have these files marked as 'configuration files'.)
A new little shell script to improve my desktop environment
Recently on the Fediverse I posted a puzzle about a little shell script:
A silly little Unix shell thing that I've vaguely wanted for ages but only put together today. See if you can guess what it's for:
#!/bin/sh
trap 'exec $SHELL' 2
"$@"
exec $SHELL
(The use of this is pretty obscure and is due to my eccentric X environment.)
The actual version I now use wound up slightly more complicated,
and I call it 'thenshell'. What it does (as suggested by the name)
is to run something and then after the thing either exits or is
Ctrl-C'd, it runs a shell. This is pointless in normal circumstances
but becomes very relevant if you use this as the command for a
terminal window to run instead of your shell, as in 'xterm -e
thenshell <something>'.
Over time, I've accumulated a number of things I want to run in my eccentric desktop environment, such as my system for opening URLs from remote machines and my alert monitoring. But some of the time I want to stop and restart these (or I need to restart them), and in general I want to notice if they produce some output, so I've been running them in terminal windows. Up until now I've had to manually start a terminal and run these programs each time I restart my desktop environment, which is annoying and sometimes I forget to do it for something. My new 'thenshell' shell script handles this; it runs whatever and then if it's interrupted or exits, starts a shell so I can see things, restart the program, or whatever.
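In other words, the point is that these things can now be started for me from my X session startup, along these lines (a sketch; the window titles and program names are made up):

# hypothetical fragment of an X session startup script
xterm -title "urlcatcher" -e thenshell ~/bin/remote-url-opener &
xterm -title "alertwatch" -e thenshell ~/bin/alert-monitor &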
Thenshell isn't quite a perfect duplicate of the manual version. One obvious limitation is that it doesn't put the command into the shell's command history, so I can't just cursor-up and hit return to restart it. But this is a small thing compared to having all of these things automatically started for me.
(Actually, I think I might be able to get this into a version of
thenshell that knows exactly how my shell
and my environment handle history, but it would be more than a bit
of a hack. I may still try it, partly because it would be nifty.)
My pragmatic view on virtual screens versus window groups
I recently read z3bra's 2014 Avoid workspaces (via) which starts out with the tag "Virtual desktops considered harmful". At one level I don't disagree with z3bra's conclusion that you probably want flexible groupings of windows, and I also (mostly) don't use single-purpose virtual screens. But I do it another way, which I think is easier than z3bra's (2014) approach.
I've written about how I use virtual screens in my desktop environment, although a bit of that is now out of date. The short summary is that I mostly have a main virtual screen and then 'overflow' virtual screens where I move to if I need to do something else without cleaning up the main virtual screen (as a system administrator, I can be quite interrupt-driven or working on more than one thing at once). This sounds a lot like window groups, and I'm sure I could do it with them in another window manager. The advantage to me of fvwm's virtual screens is that it's very easy to move windows from one to another.
If I start a window in one virtual screen, for what I think is going to be one purpose, and it turns out that I need it for another purpose too, on another virtual screen, I don't have to fiddle around with, say, adding or changing its tags. Instead I can simply grab it and move it to the new virtual screen (or, for terminal windows and some others, iconify them on one screen, switch screens, and deiconify them). This makes it fast, fluid, and convenient to shuffle things around, especially for windows where I can do this by iconifying and deiconify them.
This is somewhat specific to (fvwm's idea of) virtual screens, where the screens have a spatial relationship to each other and you can grab windows and move them around to change their virtual screen (either directly or through FvwmPager). In particular, I don't have to switch between virtual screens to drag a window on to my current one; I can grab it in a couple of ways and yank it to where I am now.
In other words, it's the direct manipulation of window grouping that makes this work so nicely. Unfortunately I'm not sure how to get direct manipulation of currently not visible windows without something like virtual screens or virtual desktops. You could have a 'show all windows' feature, but that still requires bouncing between that all-windows view (to tag in new windows) and your regular view. Maybe that would work fluidly enough, especially with today's fast graphics.
Potential issues in running your own identity provider
Over on the Fediverse, Simon Tatham had a comment about (using) cloud identity providers that's sparked some discussion. Yesterday I wrote about the facets of identity providers. Today I'm sort of writing about why you might not want to run your own identity provider, despite the hazards of depending on the security of some outside third party. I'll do this by talking about what I see as being involved in the whole thing.
The hardcore option is to rely on no outside services at all, not even for multi-factor authentication. This pretty much reduces your choices for MFA down to TOTP and perhaps WebAuthn, either with devices or with hardware keys. And of course you're going to have to manage all aspects of your MFA yourself. I'm not sure if there's capable open source software here that will let people enroll multiple second factors, handle invalidating one, and so on.
One facet of being an identity provider is managing identities. There's a wide variety of ways to do this; there's Unix accounts, LDAP databases, and so on. But you need a central system for it, one that's flexible enough to cope with the real world, and that system is load bearing and security sensitive. You will need to keep it secure and you'll want to keep logs and audit records, and also backups so you can restore things if it explodes (or go all the way to redundant systems for this). If the identity service holds what's considered 'personal information' in various jurisdictions, you'll need to worry about an attacker being able to bulk-extract that information, and you'll need to build enough audit trails so you can tell to what extent that happened. Your identity system will need to be connected to other systems in your organization so it knows when people appear and disappear and can react appropriately; this can be complex and may require downstream integrations with other systems (either yours or third parties) to push updates to them.
Obviously you have to handle primary authentication yourself (usually through passwords). This requires you to build and operate a secure password store as well as a way of using it for authentication, either through existing technology like LDAP or something else (this may or may not be integrated with your identity service software, as passwords are often considered part of the identity). Like the identity service but more so, this system will need logs and audit trails so you can find out when and how people authenticated to it. The log and audit information emitted by open source software may not always meet your needs, in which case you may wind up doing some hacks. Depending on how exposed this primary authentication service is, it may need its own ratelimiting and alerting on signs of potential compromised accounts or (brute force) attacks. You will also definitely want to consider reacting in some way to accounts that pass primary authentication but then fail second-factor authentication.
Finally, you will need to operate the 'identity provider' portion of things, which will probably do either or both of OIDC and SAML (but maybe you (also) need Kerberos, or Active Directory, or other things). You will have to obtain the software for this, keep it up to date, worry about its security and the security of the system or systems it runs on, make sure it has logs and audit trails that you capture, and ideally make sure it has ratelimits and other things that monitor for and react to signs of attacks, because it's likely to be a fairly exposed system.
If you're a sufficiently big organization, some or all of these services probably need to be redundant, running on multiple servers (perhaps in multiple locations) so the failure of a single server doesn't lock you out of everything. In general, all of these expose you to all of the complexities of running your own servers and services, and each and all of them are load bearing and highly security sensitive, which probably means that you should be actively paying attention to them more or less all of the time.
If you're lucky you can find suitable all-in-one software that will handle all the facets you need (identity, primary authentication, OIDC/SAML/etc IdP, and perhaps MFA authentication) in a way that works for you and your organization. If not, you're going to have to integrate various different pieces of software, possibly leaving you with quite a custom tangle (this is our situation). The all in one software generally seems to have a reputation of being pretty complex to set up and operate, which is not surprising given how much ground it needs to cover (and how many protocols it may need to support to interoperate with other systems that want to either push data to it or pull data and authentication from it). My impression is that such software, as an all-consuming owner of identity and authentication, is also hard to add to an existing environment after the fact and hard to swap out for anything else.
(So when you pick an all in one open source software for this, you really have to hope that it stays good, reliable software for many years to come. This may mean you need to build up a lot of expertise before you commit so that you really understand your choices, and perhaps even do pilot projects to 'kick the tires' on candidate software. The modular DIY approach is more work but it's potentially easier to swap out the pieces as you learn more and your needs change.)
The obvious advantage of a good cloud identity provider is that they've already built all of these systems and they have the expertise and infrastructure to operate them well. Much like other cloud services, you can treat them as a (reliable) black box that just works. Because the cloud identity provider works at a much bigger scale than you do, they can also afford to invest a lot more into security and monitoring, and they have a lot more visibility into how attackers work and so on. In many organizations, especially smaller ones, looking after your own identity provider is a part time job for a small handful of technical people. In a cloud identity provider, it is the full time job of a bunch of developers, operations, and security specialists.
(This is much like the situation with email (also). The scale at which cloud providers operate dwarfs what you can manage. However, your identity provider is probably more security sensitive and the quality difference between doing it yourself and using a cloud identity provider may not be as large as it is with email.)
Thinking about facets of (cloud) identity providers
Over on the Fediverse, Simon Tatham had a comment about cloud identity providers, and this sparked some thoughts of my own. One of my thoughts is that in today's world, a sufficiently large organization may have a number of facets to its identity provider situation (which is certainly the case for my institution). Breaking up identity provision into multiple facets can leave it unclear if and to what extent you could be said to be using a 'cloud identity provider'.
First off, you may outsource 'multi-factor authentication', which is to say your additional factor, to a specialist SaaS provider who can handle the complexities of modern MFA options, such as phone apps for push-based authentication approval. This SaaS provider can turn off your ability to authenticate, but they probably can't authenticate as a person all by themselves because you 'own' the first factor authentication. Well, unless you have situations where people only authenticate via their additional factor and so your password or other first factor authentication is bypassed.
Next is the potential distinction between an identity provider and an authentication source. The identity provider implements things like OIDC and SAML, and you may have to use a big one in order to get MFA support for things like IMAP. However, the identity provider can delegate authenticating people to something else you run using some technology (which might be OIDC or SAML but also could be something else). In some cases this delegation can be quite visible to people authenticating; they will show up to the cloud identity provider, enter their email address, and wind up on your web-based single sign on system. You can even have multiple identity providers all working from the same authentication source. The obvious exposure here is that a compromised identity provider can manufacture attested identities that never passed through your authentication source.
Along with authentication, someone needs to be (or at least should be) the 'system of record' as to what people actually exist within your organization, what relevant information you know about them, and so on. Your outsourced MFA SaaS and your (cloud) identity providers will probably have their own copies of this data where you push updates to them. Depending on how systems consume the IdP information and what other data sources they check (eg, if they check back in with your system of record), a compromised identity provider could invent new people in your organization out of thin air, or alter the attributes of existing people.
(Small IdP systems often delegate both password validation and knowing who exists and what attributes they have to other systems, like LDAP servers. One practical difference is whether the identity provider system asks you for the password or whether it sends you to something else for that.)
If you have no in-house authentication or 'who exists' identity system and you've offloaded all of these to some external provider (or several external providers that you keep in sync somehow), you're clearly at the mercy of that cloud identity provider. Otherwise, it's less clear and a lot more situational as to when you could be said to be using a cloud identity provider and thus how exposed you are. I think one useful line to look at is to ask whether a particular identity provider is used by third party services or if it's only used for that provider's own services. Or to put it in concrete terms, as an example, do you use Github identities only as part of using Github, or do you authenticate other things through your Github identities?
(With that said, the blast radius of just a Github (identity) compromise might be substantial, or similarly for Google, Microsoft, or whatever large provider of lots of different services that you use.)
I have divided (and partly uninformed) views on OpenTelemetry
OpenTelemetry ('OTel') is one of the current in things in the broad metrics and monitoring space. As I understand it, it's fundamentally a set of standards (ie, specifications) for how things can emit metrics, logs, and traces; the intended purpose is (presumably) so that people writing programs can stop having to decide if they expose Prometheus format metrics, or Influx format metrics, or statsd format metrics, or so on. They expose one standard format, OpenTelemetry, and then everything (theoretically) can consume it. All of this has come on to my radar because Prometheus can increasingly ingest OpenTelemetry format metrics and we make significant use of Prometheus.
If OpenTelemetry is just another metrics format that things will produce and Prometheus will consume just as it consumes Prometheus format metrics today, that seems perfectly okay. I'm pretty indifferent to the metrics formats involved, presuming that they're straightforward to generate and I never have to drop everything and convert all of our things that generate (Prometheus format) metrics to generating OpenTelemetry metrics. This would be especially hard because OpenTelemetry seems to require either Protobuf or (complex) JSON, while the Prometheus metrics format is simple text.
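(For contrast, the Prometheus text format is simple enough to show in a couple of lines; this is an illustrative made-up metric, not one of ours.)

    # HELP demo_requests_total Total requests handled (made-up example metric).
    # TYPE demo_requests_total counter
    demo_requests_total{handler="/status"} 1027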
However, this is where I start getting twitchy. OpenTelemetry certainly gives off the air of being a complex ecosystem, and on top of that it also seems to be an application focused ecosystem, not a system focused one. I don't think that metrics are as highly regarded in application focused ecosystems as logs and traces are, while we care a lot about metrics and not very much about the others, at least in an OpenTelemetry context. To the extent that OpenTelemetry diverts people away from producing simple, easy to use and consume metrics, I'm going to wind up being unhappy with it. If what 'OpenTelemetry support' turns out to mean in practice is that more and more things have minimal metrics but lots of logs and traces, that will be a loss for us.
Or to put it another way, I worry that an application focused OpenTelemetry will pull the air away from the metrics focused things that I care about. I don't know how realistic this worry is. Hopefully it's not.
(Partly I'm underinformed about OpenTelemetry because, as mentioned, I often feel disconnected from the mainstream of 'observability', so I don't particularly try to keep up with it.)
Things are different between system and application monitoring
We mostly run systems, not applications, due to our generally different system administration environment. Many organizations instead run applications. Although these applications may be hosted on some number of systems, the organizations don't care about the systems, not really; they care about how the applications work (and the systems only potentially matter if the applications have problems). It's my increasing feeling that this has created differences in the general field of monitoring such systems (as well as alerting), which is a potential issue for us because most of the attention is focused on the application area of things.
When you run your own applications, you get to give them all of the 'three pillars of observability' (metrics, traces, and logs, see here for example). In fact, emitting logs is sort of the default state of affairs for applications, and you may have to go out of your way to add metrics (my understanding is that traces can be easier). Some people even process logs to generate metrics, something that's supported by various log ingestion pipelines these days. And generally you can send your monitoring output to wherever you want, in whatever format you want, and often you can do things like structuring them.
When what you run is systems, life is a lot different. Your typical Unix system will most easily provide low level metrics about things. To the extent that the kernel and standard applications emit logs, these logs come in a variety of formats that are generally beyond your control and are generally emitted to only a few places, and the overall logs of what's happening on the system are often extremely incomplete (partly because 'what's happening on the system' is a very high volume thing). You can basically forget about having traces. In the modern Linux world of eBPF it's possible to do better if you try hard, but you'll probably be building custom tooling for your extra logs and traces so they'd better be sufficiently important (and you need the relevant expertise, which may include reading kernel and program source code).
The result is that for people like us who run systems, our first stop for monitoring is metrics and they're what we care most about; our overall unstructured logs are at best a secondary thing, and tracing some form of activity is likely to be something done only to troubleshoot problems. Meanwhile, my strong impression is that application people focus on logs and if they have them, traces, with metrics only a distant and much less important third (especially in the actual applications, since metrics can be produced by third party tools from their logs).
(This is part of why I'm so relatively indifferent to smart log searching systems. Our central syslog server is less about searching logs and much more about preserving them in one place for investigations.)
Our Grafana and Loki installs have quietly become 'legacy software' here
At this point we've been running Grafana for quite some time (since late 2018), and (Grafana) Loki for rather less time and on a more ad-hoc and experimental basis. However, over time both have become 'legacy software' here, by which I mean that we (I) have frozen their versions and don't update them any more, and we (I) mostly or entirely don't touch their configurations any more (including, with Grafana, building or changing dashboards).
We froze our Grafana version due to backward compatibility issues. With Loki I could say that I ran out of enthusiasm for going through updates, but part of it was that Loki explicitly deprecated 'promtail' in favour of a more complex solution ('Alloy') that seemed to mostly neglect the one promtail feature we seriously cared about, namely reading logs from the systemd/journald complex. Another factor was that it became increasingly obvious that Loki was not intended for our simple setup and future versions of Loki might well work even worse in it than our current version does.
Part of Grafana and Loki going without updates and becoming 'legacy' is that any future changes in them would be big changes. If we ever have to update our Grafana version, we'll likely have to rebuild a significant number of our current dashboards, because they use panels that aren't supported any more and the replacements have a quite different look and effect, requiring substantial dashboard changes for the dashboards to stay decently usable. With Loki, if the current version stopped working I'd probably either discard the idea entirely (which would make me a bit sad, as I've done useful things through Loki) or switch to something else that had similar functionality. Trying to navigate the rapids of updating to a current Loki is probably roughly as much work (and has roughly as much chance of requiring me to restart our log collection from scratch) as moving to another project.
(People keep mentioning VictoriaLogs (and I know people have had good experiences with it), but my motivation for touching any part of our Loki environment is very low. It works, it hasn't eaten the server it's on and shows no sign of doing that any time soon, and I'm disinclined to do any more work with smart log collection until a clear need shows up. Our canonical source of history for logs continues to be our central syslog server.)
The five platforms we have to cover when planning systems
Suppose, not entirely hypothetically, that you're going to need a 'VPN' system that authenticates through OIDC. What platforms do you need this VPN system to support? In our environment, the answer is that we have five platforms that we need to care about, and they're the obvious four plus one more: Windows, macOS, iOS, Android, and Linux.
We need to cover these five platforms because people here use our services from all of those platforms. Both Windows and macOS are popular on laptops (and desktops, which still linger around), and there are enough people who use Linux that it's something we need to care about. On mobile devices (phones and tablets), obviously iOS and Android are the two big options, with people using either or both. We don't usually worry about the versions of Windows and macOS and suggest that people stick to supported ones, but that may need to change with Windows 10.
Needing to support mobile devices unquestionably narrows our options for what we can use, at least in theory, because there are certain sorts of things you can semi-reasonably do on Linux, macOS, and Windows that are infeasible to do (at least for us) on mobile devices. But we have to support access to various of our services even on iOS and Android, which constrains us to certain sorts of solutions, and ideally ones that can deal with network interruptions (which are quite common on mobile devices in Toronto, as anyone who takes our subways is familiar with).
(And obviously it's easier for open source systems to support Linux, macOS, and Windows than it is for them to extend this support to Android and especially iOS. This extends to us patching and rebuilding them for local needs; with various modern languages, we can produce Windows or macOS binaries from modified open source projects. Not so much for mobile devices.)
In an ideal world it would be easy to find out the support matrix of platforms (and features) for any given project. In this world, the information can sometimes be obscure, especially for what features are supported on what platforms. One of my resolutions to myself is that when I find interesting projects but they seem to have platform limitations, I should note down where in their documentation they discuss this, so I can find it later to see if things have changed (or to discuss with people why certain projects might be troublesome).
Two broad approaches to having Multi-Factor Authentication everywhere
In this modern age, more and more people are facing more and more pressure to have pervasive Multi-Factor Authentication, with every authentication your people perform protected by MFA in some way. I've come to feel that there are two broad approaches to achieving this and one of them is more realistic than the other, although it's also less appealing in some ways and less neat (and arguably less secure).
The 'proper' way to protect everything with MFA is to separately and individually add MFA to everything you have that does authentication. Ideally you will have a central 'single sign on' system, perhaps using OIDC, and certainly your people will want you to have only one form of MFA even if it's not all run through your SSO. What this implies is that you need to add MFA to every service and protocol you have, which ranges from generally easy (websites) through being annoying to people or requiring odd things (SSH) to almost impossible at the moment (IMAP, authenticated SMTP, and POP3). If you opt to set it up with no exemptions for internal access, this approach to MFA ensures that absolutely everything is MFA protected without any holes through which an un-MFA'd authentication can be done.
The other way is to create some form of MFA-protected network access (a VPN, a mesh network, a MFA-authenticated SSH jumphost, there are many options) and then restrict all non-MFA access to coming through this MFA-protected network access. For services where it's easy enough, you might support additional MFA authenticated access from outside your special network. For other services where MFA isn't easy or isn't feasible, they're only accessible from the MFA-protected environment and a necessary step for getting access to them is to bring up your MFA-protected connection. This approach to MFA has the obvious problem that if someone gets access to your MFA-protected network, they have non-MFA access to everything else, and the not as obvious problem that attackers might be able to MFA as one person to the network access and then do non-MFA authentication as another person on your systems and services.
The proper way is quite appealing to system administrators. It gives us an array of interesting challenges to solve, neat technology to poke at, and appealingly strong security guarantees. Unfortunately the proper way has two downsides; there's essentially no chance of it covering your IMAP and authenticated SMTP services any time soon (unless you're willing to accept some significant restrictions), and it requires your people to learn and use a bewildering variety of special purpose, one-off interfaces and sometimes software (and when it needs software, there may be restrictions on what platforms the software is readily available on). Although it's less neat and less nominally secure, the practical advantage of the MFA protected network access approach is that it's universal and it's one single thing for people to deal with (and by extension, as long as the network system itself covers all platforms you care about, your services are fully accessible from all platforms).
(In practice the MFA protected network approach will probably be two things for people to deal with, not one, since if you have websites the natural way to protect them is with OIDC (or if you have to, SAML) through your single sign on system. Hopefully your SSO system is also what's being used for the MFA network access, so people only have to sign on to it once a day or whatever.)
Our need for re-provisioning support in mesh networks (and elsewhere)
In a comment on my entry on how WireGuard mesh networks need a provisioning system, vcarceler pointed me to Innernet (also), an interesting but opinionated provisioning system for WireGuard. However, two bits of it combined made me twitch a bit; Innernet only allows you to provision a given node once, and once a node is assigned an internal IP, that IP is never reused. This lack of support for re-provisioning machines would be a problem for us and we'd likely have to do something about it, one way or another. Nor is this an issue unique to Innernet, as a number of mesh network systems have it.
Our important servers have fixed, durable identities, and in practice these identities are both DNS names and IP addresses (we have some generic machines, but they aren't as important). We also regularly re-provision these servers, which is to say that we reinstall them from scratch, usually on new hardware. In the usual course of events this happens roughly every two years or every four years, depending on whether we're upgrading the machine for every Ubuntu LTS release or every other one. Over time this is a lot of re-provisionings, and we need the re-provisioned servers to keep their 'identity' when this happens.
We especially need to be able to rebuild a dead server as an identical replacement if its hardware completely breaks and eats its system disks. At that point we're already in a crisis; we don't want a worse one because we can't exactly replace the server and instead have to build a new server that fills the same role, which it will only do once DNS is updated, configurations are updated, etc etc.
This is relatively straightforward for regular Linux servers with regular networking; there's the issue of SSH host keys, but there are several solutions. But obviously there's a problem if the server is also a mesh network node and the mesh network system will not let it be re-provisioned under the same name or the same internal IP address. Accepting this limitation would make it difficult to use the mesh network for some things, especially things where we don't want to depend on DNS working (for example, sending system logs via syslog). Working around the limitation requires reverse engineering where the mesh network system stores local state and hopefully being able to save a copy elsewhere and restore it; among other things, this has implications for the mesh network system's security model.
For us, it would be better if mesh networking systems explicitly allowed this re-provisioning. They could make it a non-default setting that took explicit manual action on the part of the network administrator (and possibly required nodes to cooperate and extend more trust than normal to the central provisioning system). Or a system like Innernet could have a separate class of IP addresses, call them 'service addresses', that could be assigned and reassigned to nodes by administrators. A node would always have its unique identity but could also be assigned one or more service addresses.
(Of course our other option is to not use a mesh network system that imposes this restriction, even if it would otherwise make our lives easier. Unless we really need the system for some other reason or its local state management is explicitly documented, this is our more likely choice.)
PS: The other problem with permanently 'consuming' IP addresses as machines are re-provisioned is that you run out of them sooner or later unless you use gigantic network blocks that are many times larger than the number of servers you'll ever have (well, in IPv4, but we're not going to switch to IPv6 just to enable a mesh network provisioning system).
Using WireGuard seriously as a mesh network needs a provisioning system
One thing that my recent experience expanding our WireGuard mesh network has driven home to me is how (and why) WireGuard needs a provisioning system, especially if you're using it as a mesh networking system. In fact I think that if you use a mesh WireGuard setup at any real scale, you're going to wind up either adopting or building such a provisioning system.
In a 'VPN' WireGuard setup with a bunch of clients and one or a small number of gateway servers, adding a new client is mostly a matter of generating and giving it some critical information. However, it's possible to more or less automate this and make it relatively easy for people who want to connect to you to do this. You'll still need to update your WireGuard VPN server too, but at least you only have one of them (probably), and it may well be the host where you generate the client configuration and provide it to the client's owner.
The extra problem with adding a new client to a WireGuard mesh network is that there are many more WireGuard nodes that need to be updated (and also the new client needs a lot more information; it needs to know about all of the other nodes it's supposed to talk to). More broadly, every time you change the mesh network configuration, every node needs to update with the new information. If you add a client, remove a client, or a client changes its keys for some reason (perhaps it had to be re-provisioned because the hardware died), all of these mean that nodes need updates (or at least the nodes that talk to the changed node). In the VPN model, only the VPN server node (and the new client) needed updates.
Our little WireGuard mesh is operating at a small scale, so we can afford to do this by hand. As you have more WireGuard nodes and more changes in nodes, you're not going to want to manually update things one by one, any more than you want to do that for other system administration work. Thus, you're going to want some sort of a provisioning system, where at a minimum you can say 'this is a new node' or 'this node has been removed' and all of your WireGuard configurations are regenerated, propagated to WireGuard nodes, trigger WireGuard configuration reloads, and so on. Some amount of this can be relatively generic in your configuration management system, but not all of it.
(Many configuration systems can propagate client-specific files to clients on changes and then trigger client side actions when the files are updated. But you have to build the per-client WireGuard configuration.)
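(To make the scale of updates concrete, here's roughly what one mesh node's wg-quick style configuration looks like, with made-up names, addresses, and placeholder keys. Every node carries a [Peer] block for each other node it talks to, and those blocks are what have to change across the mesh when a node is added, removed, or re-keyed.)

    # /etc/wireguard/mesh0.conf on node A (illustrative values only)
    [Interface]
    PrivateKey = <node A's private key>
    Address = 192.168.100.1/24
    ListenPort = 51820

    [Peer]
    # node B
    PublicKey = <node B's public key>
    Endpoint = b.example.org:51820
    AllowedIPs = 192.168.100.2/32

    [Peer]
    # node C
    PublicKey = <node C's public key>
    Endpoint = c.example.org:51820
    AllowedIPs = 192.168.100.3/32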
PS: I haven't looked into systems that will do this for you, either as pure WireGuard provisioning systems or as bigger 'mesh networking using WireGuard' software, so I don't have any opinions on how you want to handle this. I don't even know if people have built and published things that are just WireGuard provisioning systems, or if everything out there is a 'mesh networking based on WireGuard' complex system.
Choosing between "it works for now" and "it works in the long term"
A comment on my entry about how Netplan can only have WireGuard peers in one file made me realize one of my implicit system administration views (it's the first one by Jon). That is the tradeoff between something that works now and something that not only works now but is likely to keep working in the long term. In system administration this is a tradeoff, not an obvious choice, because what you want is different depending on the circumstances.
Something that works now is, for example, something that works because of how Netplan's code is currently written, where you can hack around an issue by structuring your code, your configuration files, or your system in a particular way. As a system administrator I do a surprisingly large amount of these, for example to fix or work around issues in systemd units that people have written in less than ideal or simply mistaken ways.
Something that's going to keep working in the longer term is doing things 'correctly', which is to say in whatever way that the software wants you to do and supports. Sometimes this means doing things the hard way when the software doesn't actually implement some feature that would make your life better, even if you could work around it with something that works now but isn't necessarily guaranteed to keep working in the future.
When you need something to work and there's no other way to do it, you have to take a solution that (only) works now. Sometimes you take a 'works now' solution even if there's an alternative because you expect your works-now version to be good enough for the lifetime of this system, this OS release, or whatever; you'll revisit things for the next version (at least in theory, workarounds to get things going can last a surprisingly long time if they don't break anything). You can't always insist on a 'works now and in the future' solution.
On the other hand, sometimes you don't want to do a works-now thing even if you could. A works-now thing is in some sense technical debt, with all that that implies, and this particular situation isn't important enough to justify taking on such debt. You may solve the problem properly, or you may decide that the problem isn't big and important enough to solve at all and you'll leave things in their imperfect state. One of the things I think about when making this decision is how annoying it would be and how much would have to change if my works-now solution broke because of some update.
(Another is how ugly the works-now solution is, including how big of a note we're going to want to write for our future selves so we can understand what this peculiar load bearing thing is. The longer the note, the more I generally wind up questioning the decision.)
It can feel bad to not deal with a problem by taking a works-now solution. After all, it works, and otherwise you're stuck with the problem (or with less pleasant solutions). But sometimes it's the right option and the works-now solution is simply 'too clever'.
(I've undoubtedly made this decision many times over my career. But Jon's comment and my reply to it crystalized the distinction between a 'works now' and a 'works for the long term' solution in my mind in a way that I think I can sort of articulate.)
The complexity of mixing mesh networking and routes to subnets
One of the in things these days is encrypted (overlay) mesh networks, where you have a bunch of nodes and the nodes have encrypted connections to each other that they use for (at least) internal IP traffic. WireGuard is one of the things that can be used for this. A popular thing to add to such mesh network solutions is 'subnet routes', where nodes will act as gateways to specific subnets, not just endpoints in themselves. This way, if you have an internal network of servers at your cloud provider, you can establish a single node on your mesh network and route to the internal network through that node, rather than having to enroll every machine in the internal network.
(There are various reasons not to enroll every machine, including that on some of them it would be a security or stability risk.)
In simple configurations this is easy to reason about and easy to set up through the tools that these systems tend to give you. Unfortunately, our network configuration isn't simple. We have an environment with multiple internal networks, some of which are partially firewalled off from each other, and where people would want to enroll various internal machines in any mesh networking setup (partly so they can be reached directly). This creates problems for a simple 'every node can advertise some routes and you accept the whole bundle' model.
The first problem is what I'll call the direct subnet problem. Suppose that you have a subnet with a bunch of machines on it and two of them are nodes (call them A and B), with one of them (call it A) advertising a route to the subnet so that other machines in the mesh can reach it. The direct subnet problem is that you don't want B to ever send its traffic for the subnet to A; since it's directly connected to the subnet, it should send the traffic directly. Whether or not this happens automatically depends on various implementation choices the setup makes.
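(On Linux, 'what happens automatically' usually comes down to ordinary routing. The following is an illustrative sketch with made-up networks, not any particular mesh system's behaviour.)

    # On node B, which is directly attached to 10.1.1.0/24:
    $ ip route show 10.1.1.0/24
    10.1.1.0/24 dev eth0 proto kernel scope link src 10.1.1.20
    # If the mesh software also installs a 10.1.1.0/24 route via its own
    # interface (or a policy routing rule covering it), which route wins
    # depends on metrics and rule priorities, and that's exactly the
    # implementation choice that decides whether B sends this traffic
    # directly or bounces it through A.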
The second problem is the indirect subnet problem. Suppose that you have a collection of internal networks that can all talk to each other (perhaps through firewalls and somewhat selectively). Not all of the machines on all of the internal networks are part of the mesh, and you want people who are outside of your networks to be able to reach all of the internal machines, so you have a mesh node that advertises routes to all of your internal networks. However, if a mesh node is already inside your perimeter and can reach your internal networks, you don't want it to go through your mesh gateway; you want it to send its traffic directly.
(You especially want this if mesh nodes have different mesh IPs from their normal IPs, because you probably want the traffic to come from the normal IP, not the mesh IP.)
You can handle the direct subnet case with a general rule like 'if you're directly attached to this network, ignore a mesh subnet route to it', or by some automatic system like route priorities. The indirect subnet case can't be handled automatically because it requires knowledge about your specific network configuration and what can reach what without the mesh (and what you want to reach what without the mesh, since some traffic you want to go over the mesh even if there's a non-mesh route between the two nodes). As far as I can see, to deal with this you need the ability to selectively configure or accept (subnet) routes on a mesh node by mesh node basis.
(In a simple topology you can get away with accepting or not accepting all subnet routes, but in a more complex one you can't. You might have two separate locations, each with their own set of internal subnets. Mesh nodes in each location want the other location's subnet routes, but not their own location's subnet routes.)
Tailscale's surprising interaction of DNS settings and 'exit nodes'
Tailscale is a well regarded commercial mesh networking system, based on WireGuard, that can be pressed into service as a VPN as well. As part of its general features, it allows you to set up various sorts of DNS settings for your tailnet (your own particular Tailscale mesh network), including both DNS servers for specific (sub)domains (eg an 'internal.example.org') and all DNS as a whole. As part of optionally being VPN-like, Tailscale also lets you set up exit nodes, which let you route all traffic for the Internet out the exit node (if you want to route just some subnets to somewhere, that's a subnet router, a different thing). If you're a normal person, especially if you're a system administrator, you probably have a guess as to how these two features interact. Unfortunately, you may well be wrong.
As of today, if you use a Tailscale exit node, all of your DNS traffic is routed to the exit node regardless of Tailscale DNS settings. This applies to both DNS servers for specific subdomains and to any global DNS servers you've set for your tailnet (due to, for example, 'split horizon' DNS). Currently this is documented only in one little sentence in small type in the "Use Tailscale DNS settings" portion of the client preferences documentation.
In many Tailscale environments, all this does is make your DNS queries take an extra hop (from you to the exit node and then to the configured DNS servers). Your Tailscale exit nodes are part of your tailnet, so in ordinary configurations they will have your Tailscale DNS settings and be able to query your configured DNS servers (and they will probably get the same answers, although this isn't certain). However, if one of your exit nodes isn't set up this way, potential pain and suffering is ahead of you. Your tailnet nodes that are using this exit node will get wildly different DNS answers than you expect, potentially failing to resolve internal domains and maybe getting different answers for split horizon names than you'd expect.
One reason that you might set an exit node machine to not use your Tailscale DNS settings (or subnet routes) is that you're only using it as an exit node, not as a regular participant in your tailnet. Your exit node machine might be placed on a completely different network (and in a completely different trust environment) than the rest of your tailnet, and you might have walled off its (less-trusted) traffic from the rest of your network. If the only thing the machine is supposed to be is an Internet gateway, there's no reason to have it use internal DNS settings, and it might not normally be able to reach your internal DNS servers (or the rest of your internal servers).
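(For concreteness, with the Tailscale command line client this corresponds to flags like the ones below; the node name is made up. Whether the exit node itself uses your tailnet DNS settings is controlled on that node, not by the clients routing through it.)

    # on an ordinary node: send all traffic, including DNS, via an exit node
    tailscale up --exit-node=exitbox
    # on the exit node itself: opt out of the tailnet's DNS settings
    # (this is the situation described above)
    tailscale up --accept-dns=false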
In my view, a consequence of this is that it's probably best to have any internal DNS servers directly on your tailnet, with their tailnet IP addresses. This makes them as reachable as possible to your nodes, independent of things like subnet routes.
PS: Routing general DNS queries through a tailnet exit node makes sense in this era of geographical DNS results, where you may get different answers depending on where in the world you are and you'd like these to match up with where your exit node is.
(I'm writing this entry because this issue was quite mysterious to us when we ran into it while testing Tailscale and I couldn't find much about it in online searches.)
How I install personal versions of programs (on Unix)
These days, Unixes are quite generous in what they make available through their packaging systems, so you can often get everything you want through packages that someone else worries about building, updating, managing, and so on. However, not everything is available that way; sometimes I want something that isn't packaged, and sometimes (especially on 'long term support' distributions) I want something that's more recent than what the system provides (for example, Ubuntu 22.04 only has Emacs 27.1). Over time, I've evolved my own approach for managing my personal versions of such things, which is somewhat derived from the traditional approach for multi-architecture Unixes here.
The starting point is that I have a ~/lib/<architecture> directory tree. When I build something personally, I tell it that its install prefix is a per-program directory within this tree, for example, '/u/cks/lib/<arch>/emacs-30.1'. These days I only have one active architecture inside ~/lib, but old habits die hard, and someday we may start using ARM machines or FreeBSD. If I install a new version of the program, it goes in a different (versioned) subdirectory, so I have 'emacs-29.4' and 'emacs-30.1' directory trees.
I also have both a general ~/bin directory, for general scripts and other architecture independent things, and a ~/bin/bin.<arch> subdirectory, for architecture dependent things. When I install a program into ~/lib/<arch>/<whatever> and want to use it, I will make either a symbolic link or a cover script in ~/bin/bin.<arch> for it, such as '~/bin/bin.<arch>/emacs'. This symbolic link or cover script always points to what I want to use as the current version of the program, and I update it when I want to switch.
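(In concrete terms the pattern looks something like the following; the $arch value here is just an illustrative stand-in for whatever per-architecture naming convention you use.)

    # pick a per-architecture name; the exact convention doesn't matter
    arch=$(uname -s)-$(uname -m)      # eg 'Linux-x86_64'

    # build and install into a versioned, per-architecture prefix
    ./configure --prefix=$HOME/lib/$arch/emacs-30.1
    make && make install

    # then point the 'current' name at the version I want to use
    mkdir -p $HOME/bin/bin.$arch
    ln -sf $HOME/lib/$arch/emacs-30.1/bin/emacs $HOME/bin/bin.$arch/emacs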
(If I'm building and installing something from the latest development tree, I'll often call the subdirectory something like 'fvwm3-git' and then rename it to have multiple versions around. This is not as good as real versioned subdirectories, but I tend to do this for things that I won't ever run two versions of at the same time; at most I'll switch back and forth.)
Some things I use, such as pipx, normally install programs (or symbolic links to them) into places like ~/.local/bin or ~/.cargo/bin. Because it's not worth fighting city hall on this one, I pretty much let them do so, but I don't add either directory to my $PATH. If I want to use a specific tool that they install and manage, I put in a symbolic link or a cover script in my ~/bin/bin.<arch>. The one exception to this is Go, where I do have ~/go/bin in my $PATH because I use enough Go based programs that it's the path of least resistance.
This setup isn't perfect, because right now I don't have a good
general approach for things that depend on the Ubuntu version (where
an Emacs 30.1 built on 22.04 doesn't run on 24.04). If I ran into
this a lot I'd probably make an additional ~/bin/bin.<something>
directory for the Ubuntu version and then put version specific
things there. And in general, Go and Cargo are not ready for my
home directory to be shared between different binary architectures.
For Go, I would probably wind up setting $GOPATH to something
like ~/lib/<arch>/go. Cargo has a similar system for deciding where
it puts stuff but I haven't looked into it in detail.
(From a quick skim of 'cargo help install' and my ~/.cargo, I
suspect that I'd point $CARGO_INSTALL_ROOT into my ~/lib/<arch>
but leave $CARGO_HOME unset, so that various bits of Cargo's
own data remain shared between architectures.)
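(A sketch of what those settings might look like, reusing the $arch convention from above; this is untested and based only on a quick look at the documentation.)

    # untested sketch of per-architecture Go and Cargo install locations
    export GOPATH=$HOME/lib/$arch/go
    export CARGO_INSTALL_ROOT=$HOME/lib/$arch
    # leave $CARGO_HOME unset so Cargo's own registry caches and so on
    # stay shared between architectures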
(This elaborates a bit on a Fediverse conversation.)
PS: In theory I have a system for keeping track of the command lines used to build things (also, which I'd forgotten when I wrote the more recent entry on this system). In practice I've fallen out of the habit of using it when I build things for my ~/lib, although I should probably get back into it. For GNU Emacs, I put the ./configure command line into a file in ~/lib/<arch>, since I expected to build enough versions of Emacs over time.
Sorting out the ordering of OpenSSH configuration directives
As I discovered recently, OpenSSH makes some unusual choices for the ordering of configuration directives in its configuration files, both sshd_config and ssh_config (and files they include). Today I want to write down what I know about the result (which is partly things I've learned researching this entry).
For sshd_config, the situation is relatively straightforward.
There are what we could call 'global options' (things you set
normally, outside of 'Match' blocks) and 'matching Match options' (things set
in Match blocks that actually matched). Both of them are 'first
mention wins', but Match options take priority over global options
regardless of where the Match option block is in the (aggregate)
configuration file. Sshd makes 'first mention win' work in the
presence of including files from /etc/ssh/sshd_config.d/ by
doing the inclusion at the start of /etc/ssh/sshd_config.
So here's an example with a Match statement:
PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Password authentication is turned off as a global option but then overridden in the address-based Match block to enable it for connections from the local network. If we had a (Unix) group for logins that we wanted to never use passwords even if they were coming from the local network, I believe that we would have to write it like this, which looks somewhat odd:
PasswordAuthentication no
Match Group neverpassword
  PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Then a 'neverpassword' person logging in from the local network would match both Match blocks, and the first block (the group block) would have 'PasswordAuthentication no' win over the second block's 'PasswordAuthentication yes'. Equivalently, you could put the global 'PasswordAuthentication no' after both Match blocks, which might be clearer.
The situation with ssh and ssh_config is one that I find more confusing and harder to follow. The ssh_config manual page says:
Unless noted otherwise, for each parameter, the first obtained value will be used.
It's pretty clear how this works for the various sources of configurations; options on the command line take priority over everything else, and ~/.ssh/config options take priority over the global options from /etc/ssh/ssh_config and its included files. But within a file (such as ~/.ssh/config), I get a little confused.
What I believe this means for any specific option that you want to
give a default value to for all hosts but then override for specific
hosts is that you must put your Host * directive for it at the
end of your configuration file, and the more specific Host or Match directives first. I'm
not sure how this works for matches like 'Match canonical' or
'Match final' that happen 'late' in the processing of your
configuration; the natural reading would be that you have to make
sure that nothing earlier conflicts with them. If this is so, a
natural use for 'Match final' would then be options that you want
to be true defaults that only apply if nothing has overridden them.
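(A small illustration with a made-up host: a specific block has to come before the catch-all 'Host *' for its setting to win.)

    # more specific entries first; their values win because they're seen first
    Host build.example.org
      ForwardAgent yes

    # catch-all defaults last
    Host *
      ForwardAgent no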
Some ssh_config options are special in that you can provide
them multiple times and they'll be merged together; one example is
IdentityFile.
I think this applies even across multiple Host and Match blocks,
and also that there's no way to remove an IdentityFile once you've
added it (which might be an issue if
you have a lot of identity files, because SSH servers only let
you offer so many). Some options let you
modify the default state to, for example, add a non-default key
exchange algorithm;
I haven't tested to see if you can do this multiple times in Host
blocks or if you can only do it once.
(These days you can make things somewhat simpler with 'Match tagged
...' and 'Tag'; one handy
and clear explanation of what you can do with this is OpenSSH
Config Tags How To.)
Typically your /etc/ssh/ssh_config has no active options set
in it and includes /etc/ssh/ssh_config.d/* at the end. On
Debian-derived systems, it does have some options specified (for
'Host *', ie making them defaults), but the inclusion of
/etc/ssh/ssh_config.d/* has been moved to the start so you can
override them.
My own personal ~/.ssh/config setup starts with a 'Host *'
block, but as far as I can tell I don't try to override any of its
settings later in more specific Host blocks. I do have a final
'Host *' block with comments about how I want to do some things
by default if they haven't been set earlier, along with comments in
the file that I was finding all of this confusing. I may at some
point try to redo it into a 'Match tagged' / 'Tag' form to see if
that makes it clearer.
The order of files in /etc/ssh/sshd_config.d/ matters (and may surprise you)
Suppose, not entirely hypothetically, that you have an Ubuntu 24.04
server system where you want to disable SSH passwords for the
Internet but allow them for your local LAN. This looks straightforward
based on sshd_config,
given the PasswordAuthentication and
Match directives:
PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Since I'm an innocent person, I put this in a file in
/etc/ssh/sshd_config.d/ with a nice high ordering number, say
'60-no-passwords.conf'. Then I restarted the SSH daemon and
was rather confused when it didn't work (and I wound up resorting
to manipulating AuthenticationMethods, which
also works).
The culprit is two things combined together. The first is this sentence at the start of sshd_config:
[...] Unless noted otherwise, for each keyword, the first obtained value will be used. [...]
Some configuration systems are 'first mention wins', but I think it's more common to be either 'last mention wins' or 'if it's mentioned more than once, it's an error'. Certainly I was vaguely expecting sshd_config and the files in sshd_config.d to be 'last mention wins', because that would be the obvious way to let you easily override things specified in sshd_config itself. But OpenSSH doesn't work this way.
(You can still override things in sshd_config itself, because the global sshd_config includes all of sshd_config.d/* at the start, before it sets anything, rather than at the end, which is how you often see this done.)
The second culprit is that at least in our environment, Ubuntu 24.04 writes out a '50-cloud-init.conf' file that contains one deadly (for this) line:
PasswordAuthentication yes
Since '50-cloud-init.conf' was read by sshd before my '60-no-passwords.conf', it forced password authentication to be on. My new configuration file was more or less silently ignored.
Renaming my configuration file to be '10-no-passwords.conf' fixed my problem and made things work like I expected.
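(One way to see which value actually wins, and to catch surprises like this sooner, is sshd's extended test mode, which prints the effective configuration; the connection specification below is made up.)

    # dump the effective global configuration
    sshd -T | grep -i passwordauthentication
    # evaluate Match blocks for a hypothetical connection
    sshd -T -C addr=192.168.0.10,user=someuser,host=client.example.com | grep -i passwordauthentication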
Our simple view of 'identity' for our (Unix) accounts
When I wrote about how it's complicated to count how many professors are in our department, I mentioned that the issues involved would definitely complicate the life of any IAM system that tried to understand all of this, but that we had a much simpler view of things. Today I'm going to explain that, with a little bit on its historical evolution (as I understand it).
All Unix accounts on our systems have to be 'sponsored' by someone, their 'sponsor'. Roughly speaking, all professors who supervise graduate students in the department and all professors who are in the department are or can be sponsors, and there are some additional special sponsors (for example, technical and administrative staff also have sponsors). Your sponsor has to approve your account request before it can be created, although some of the time the approval is more or less automatic (for example, for incoming graduate students, who are automatically sponsored by their supervisor).
At one level this requires us to track 'who is a professor'. At another level, we outsource this work; when new professors show up, the administrative staff side of the department will ask us to set up an account for them, at which point we know to either enable them as a sponsor or schedule it in the future at their official start date. And ultimately, 'who can sponsor accounts' is a political decision that's made (if necessary) by the department (generally by the Chair). We're never called on to evaluate the 'who is a professor in the department' question ourselves.
I believe that one reason we use this model is that what is today the department's general research side computing environment originated in part from an earlier organization that included only a subset of the professors here, so that not everyone in the department could get a Unix account on 'CSRI' systems. To get a CSRI account, a professor who was explicitly part of CSRI had to say 'yes, I want this person to have an account', sponsoring it. When this older, more restricted environment expanded to become the department's general research side computing environment, carrying over the same core sponsorship model was natural (or so I believe).
(Back in the days there were other research groups around the department, involving other professors, and they generally had similar policies for who could get an account.)
Using SimpleSAMLphp to set up an identity provider with Duo support
My university has standardized on an institutional MFA system that's based on institutional identifiers and Duo (a SaaS company, as is commonly necessary these days to support push MFA). We have our own logins and passwords, but wanted to add full Duo MFA authentication to (as a first step) various of our web applications. We were eventually able to work out how to do this, which I'm going to summarize here because although this is a very specific need, maybe someone else in the world also has it.
The starting point is SimpleSAMLphp ('SSP'), which we already had an instance of, authenticating only with login and password against an existing LDAP server. SSP is a SAML IdP, but there's a third party module for OIDC OP support, and we wound up using it to make our new IdP support both SAML and OIDC. For Duo support we found a third party module, but to work with SSP 2.x, you need to use a feature branch of it. We run the entire collective stack of things under Apache, because we're already familiar with that.
A rough version of the install process is:
- Set up Apache so it can run PHP, etc etc.
- Obtain SimpleSAMLphp 2.x from the upstream releases. You almost certainly can't use a version packaged by your Linux distribution, because you need to be able to use the 'composer' PHP package manager to add packages to it.
- Unpack this release somewhere, conventionally /var/simplesamlphp.
- Install the 'composer' PHP package manager if it's not already available.
- Install the third party Duo module from the alternate branch. At the top level of your SimpleSAMLphp install, run:
  composer require 0x0fbc/simplesamlphp-module-duouniversal:dev-feature
- Optionally install the OIDC module:
  composer require simplesamlphp/simplesamlphp-module-oidc
Now you can configure SimpleSAMLphp, the Duo module, and the OIDC module following their respective instructions (which are not 'simple' despite the name). If you're using the OIDC module, remember that you'll need to set up the Duo module (and the other things we'll need) in two places, not just one, and you'll almost certainly want to add an Apache alias for '/.well-known/openid-configuration' that redirects it to the actual URL that the OIDC module uses.
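(As a sketch of that last bit, the Apache side can be a simple redirect; the target path here is a guess that you'll need to adapt to your SimpleSAMLphp base URL and the OIDC module's actual endpoint.)

    # hypothetical: send the standard OIDC discovery URL to wherever the
    # OIDC module actually serves its configuration document
    Redirect permanent /.well-known/openid-configuration /simplesaml/module.php/oidc/openid-configuration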
At this point we need to deal with the mismatch between our local logins and the institutional identifiers that Duo uses for MFA. There are at least three options to deal with this:
- Add a LDAP attribute (and schema) that will hold the Duo identifier
(let's call this the 'duoid') for everyone. This attribute will
(probably) be automatically available as a SAML attribute, making it
available to the Duo module.
(If you're not using LDAP for your SimpleSAMLphp authentication module, the module you're using may have its own way to add extra information.)
- Embed the duoid into your GECOS field in LDAP and write a
SimpleSAMLphp 'authproc' with
arbitrary PHP code to
extract the GECOS field and materialize it as a SAML attribute. This
has the advantage that you can share this GECOS field with the Duo PAM
module if you use that.
- Write a SimpleSAMLphp 'authproc' that uses arbitrary PHP code to look up the duoid for a particular login from some data source, which could be an actual database or simply a flat file that you open and search through. This is what we did, mostly because we had such a file sitting around for other reasons (a rough sketch of this approach is just below).
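A minimal sketch of that third approach as a core:PHP authentication processing filter might look like the following. This isn't our actual code; the 'uid' attribute, the file location, its 'login duoid' line format, and the '70' ordering are all assumptions made up for illustration.

  70 => [
      'class' => 'core:PHP',
      'code' => '
          // Look up the duoid for this login in a flat file of
          // "login duoid" lines (file name and format are made up).
          $login = $attributes["uid"][0] ?? null;
          if ($login === null) {
              return;
          }
          $lines = @file("/etc/local/duoids", FILE_IGNORE_NEW_LINES);
          if ($lines === false) {
              return;
          }
          foreach ($lines as $line) {
              $parts = preg_split("/\s+/", trim($line));
              if (count($parts) >= 2 && $parts[0] === $login) {
                  $attributes["duoid"] = [$parts[1]];
                  break;
              }
          }
      ',
  ],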
(Your new SAML attribute will normally be passed through to SAML SPs (clients) that use you as a SAML IdP, but it won't be passed through to OIDC RPs (also clients) unless you configure a new OIDC claim and scope for it and clients ask for that OIDC scope.)
You'll likely also want to augment the SSP Duo module with some additional logging, so you can tell when Duo MFA authentication is attempted for people and when it succeeds. Since the SSP Duo module is more or less moribund, we probably don't have too much to worry about as far as keeping up with upstream updates goes.
I've looked through the SSP Duo module's code and I'm not too worried about development having stopped some time ago. As far as I can see, the module is directly following Duo's guidance for how to use the current Duo Universal SDK and is basically simple glue code to sit between SimpleSAMLphp's API and the Duo SDK API.
Sidebar: Implications of how the Duo module is implemented
To simplify the technical situation, the MFA challenge created by the SSP Duo module is done as an extra step after SimpleSAMLphp has 'authenticated' your login and password against, say, your LDAP server. SSP as a whole has no idea that a person who's passed LDAP is not yet 'fully logged in', and so it will both log things and behave as if you're fully authenticated even before the Duo challenge succeeds. This is the big reason you need additional logging in the Duo module itself.
As far as I can tell, SimpleSAMLphp will also set its 'you are authenticated' IdP session cookie in your browser immediately after you pass LDAP. Conveniently (and critically), authprocs always run when you revisit SimpleSAMLphp even if you're not challenged for a login and password. This does mean that every time you revisit your IdP (for example because you're visiting another website that's protected by it), you'll be sent for a round trip through Duo's site. Generally this is harmless.
US sanctions and your VPN (and certain big US-based cloud providers)
As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which lawyer the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).
Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.
(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)
Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard to diagnose the resulting failures were. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.
The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.
If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).
PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).
(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)
Three ways I know of to authenticate SSH connections with OIDC tokens
Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.
(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)
The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).
Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting
up a SSH certificate authority and having the CA software issue
signed, short-lived SSH client certificates in exchange for OIDC
authentication from your OP. With client side software, these client
certificates will be automatically set up for use by ssh, and on
servers all you need is to trust your SSH CA. I believe you could
even set this up for personal use on servers you SSH to, since you
set up a personally trusted SSH CA. On the positive side, this
requires minimal server changes and no extra server software, and
preserves your ability to directly authenticate with passwords (and
perhaps some MFA challenge). On the negative side, you now have a
SSH CA you have to trust.
(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)
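For a concrete sense of how small the server side of the SSH CA approach is, trusting your CA is a one line sshd_config change (the key file path here is just an example):

  # sshd_config: accept user certificates signed by this CA's public key
  TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub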
The third option, which I've only recently become aware of, is
Cloudflare's recently open-sourced 'opkssh'
(via,
Github). OPKSSH builds on
something called OpenPubkey,
which uses a clever trick to embed a public key you provide in
(signed) OIDC tokens from your OP (for details see here).
OPKSSH uses this to put a basically regular SSH public key into
such an augmented OIDC token, then smuggles it from the client to
the server by embedding the entire token in a SSH (client) certificate;
on the server, it uses an AuthorizedKeysCommand to
verify the token, extract the public key, and tell the SSH server
to use the public key for verification (see How it works
for more details). If you want, as far as I can see OPKSSH still
supports using regular SSH public keys and also passwords (possibly
plus an MFA challenge).
(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)
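I haven't set OPKSSH up myself, but the AuthorizedKeysCommand mechanism it relies on is generic sshd configuration. A sketch of the general pattern looks like the following, where the verifier program path is a placeholder and not opkssh's actual installation layout:

  # sshd_config: ask an external program what public keys to accept
  # for this user (%u = user, %t = key type, %k = base64 key).
  AuthorizedKeysCommand /usr/local/bin/opk-verify %u %t %k
  AuthorizedKeysCommandUser nobody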
It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.
(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)
Some notes on configuring Dovecot to authenticate via OIDC/OAuth2
Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that the token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.
If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
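As a sketch (with a made up IdP hostname), pulling out the one endpoint you care about looks something like this:

  curl -s https://idp.example.org/.well-known/openid-configuration | jq -r '.userinfo_endpoint'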
You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.
(Since I've just peered deep into the Dovecot source code, it appears
that setting 'introspection_mode = post' actually performs an
(unauthenticated) token introspection request. The 'get' mode
seems to be the same as setting tokeninfo_url. I think that
if you set the 'post' mode, you also want to set active_attribute
and perhaps active_value, but I don't know what to, because
otherwise you aren't necessarily fully validating that the token
is still active. Does my head hurt? Yes. The moral here is that you
should use an OIDC IdP that supports OpenID Connect UserInfo.)
If your IdP serves different groups and provides different 'issuer'
('iss') values to them, you may want to set the Dovecot 'issuers
=' to the specific issuer that applies to you. You'll also want
to set 'username_attribute' to whatever OIDC claim is where
your IdP puts what you consider the Dovecot username, which might
be the email address or something else.
It would be nice if Dovecot could discover all of this for itself
when you set openid_configuration_url, but in the current
Dovecot, all this does is put that URL in the JSON of the error
response that's sent to IMAP clients when they fail OAUTHBEARER
authentication. IMAP clients may or may not do anything useful
with it.
As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes, it doesn't verify the presence of specific OIDC claims.
So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:
# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>
# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}
# optionally:
openid_configuration_url = <stick in the URL>
# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt
The OIDC scopes that IMAP clients request when getting tokens should include a scope that provides the username_attribute claim (the 'email' scope if the claim is 'email'), and apparently the requested scopes should also include the offline_access scope.
If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.
(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)
How I got my nose rubbed in my screens having 'bad' areas for me
I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.
In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.
(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)
When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.
(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)
In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.
(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)
Sidebar: Why xload turns green under high load
Xload draws a horizontal tick line for each integer load average step it needs in order to display the maximum load that fits in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.
This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.
A Prometheus gotcha with alerts based on counting things
Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:
count(
node_filesystem_size_bytes{
host = "backupserv",
mountpoint =~ "/dumps/tapes/slot.*" }
) != <some number>
This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.
Unfortunately there's no particularly good fix, especially if you
have multiple identical backup servers and so the real version uses
'host =~ "bserv1|bserv2|..."'. In the single-host case, you can
use either absent()
or vector()
to provide a default value. There's no good solution in the multi-host
case, because there's no version of vector() that lets you set labels.
If there was, you could at least write:
count( ... ) by (host) or vector(0, "host", "bserv1") or vector(0, "host", "bserv2") ....
(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
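For the single-host case, a version of the alert expression that copes with 'no tape filesystems at all' is sketched below; when the count() comes up empty, the 'or vector(0)' side supplies a 0 that will still trip the comparison (the parentheses matter, because comparisons bind more tightly than 'or'):

  (
    count(
      node_filesystem_size_bytes{
        host = "backupserv",
        mountpoint =~ "/dumps/tapes/slot.*" }
    ) or vector(0)
  ) != <some number>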
In my particular case, our backup servers always have some additional
filesystems (like their root filesystem), so I can write a different
version of the count() based alert rule:
count(
node_filesystem_size_bytes{
host =~ "bserv1|bserv2|...",
fstype =~ "ext.*' }
) by (host) != <other number>
In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.
(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)
PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.
What SimpleSAMLphp's core:AttributeAlter does with creating new attributes
SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:
// the 65 is where this is ordered
65 => [
'class' => 'core:AttributeAlter',
'subject' => 'gecos',
'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
'target' => 'mfaid',
'replacement' => '\\1',
],
If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.
(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)
The short version of what happens is that when the target is a
different attribute and the pattern doesn't match, the target will
wind up set but empty. Any previous value is lost. How this happens
(and what happens) starts with that 'attributes' here are actually
arrays of values under the covers (this is '$attributes'). When
core:AttributeAlter has a different target attribute than the source
attribute, it takes all of the source attribute's values, passes
each of them through a regular expression search and replace (using
your replacement), and then gathers up anything that changed and
sets the target attribute to this gathered collection. If the pattern
doesn't match any values of the attribute (in the normal case, a
single value), the array of changed things is empty and your target
attribute is set to an empty PHP array.
(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)
My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).
For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.
(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)
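A rough sketch of what that core:PHP filter could look like is below, using the early 'return' just mentioned. This is illustrative only; the '60' ordering and the 'fifth comma-separated GECOS field' assumption mirror the earlier core:AttributeAlter example rather than being anything official.

  60 => [
      'class' => 'core:PHP',
      'code' => '
          // stop quietly if there is no GECOS value to work with
          if (empty($attributes["gecos"][0])) {
              return;
          }
          $fields = explode(",", $attributes["gecos"][0]);
          // only set mfaid if the fifth field is actually present
          if (count($fields) >= 5 && trim($fields[4]) !== "") {
              $attributes["mfaid"] = [trim($fields[4])];
          }
      ',
  ],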
If you get the chance, always run more extra network fiber cabling
Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is such a thing as an unreasonable amount of fiber to run, but mostly that comes up when you'd have to put in giant fiber patch panels.
The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.
The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.
(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)
Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).
PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.
(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)
MFA's "push notification" authentication method can be easier to integrate
For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.
A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.
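To make the shape of this concrete, here is a minimal sketch of such a synchronous check in PHP. The MfaProviderClient class and its sendPush()/pushStatus() methods are hypothetical stand-ins for whatever SDK your MFA vendor actually provides; the point is only that the whole thing fits behind a single yes-or-no function call.

  function check_mfa_push(string $user): bool {
      // hypothetical vendor SDK; class and method names are made up
      $client = new MfaProviderClient(/* API host and key from your configuration */);
      $txid = $client->sendPush($user);       // push a prompt to the user's device
      for ($i = 0; $i < 30; $i++) {           // poll for up to about a minute
          $status = $client->pushStatus($txid);
          if ($status === 'allow') {
              return true;
          }
          if ($status === 'deny' || $status === 'timeout') {
              return false;
          }
          sleep(2);
      }
      return false;
  }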
(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)
By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.
(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)
JSON has become today's machine-readable output format (on Unix)
Recently, I needed to delete about 1,200 email messages to a
particular destination from the mail queue on one of our systems.
This turned out to be trivial, because this system was using Postfix
and modern versions of Postfix can output mail queue status information
in JSON format. So I could dump the mail queue status, select the
relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the
messages. This experience has left me with the definite view that
everything should have the option to output JSON for 'machine-readable'
output, rather than some bespoke format. For new programs, I think
that you should only bother producing JSON as your machine readable
output format.
(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)
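For illustration, the Postfix queue cleanup mentioned above boils down to a pipeline along these lines; the destination domain is made up, and you'd obviously want to inspect the jq output before feeding it to postsuper:

  # delete every queued message with a recipient at the target domain
  postqueue -j |
    jq -r 'select(any(.recipients[]; .address | endswith("@example.org"))) | .queue_id' |
    postsuper -d -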
This isn't because JSON is the world's best format (JSON is at
best the least bad format). Instead it's
because JSON has a bunch of pragmatic virtues on a modern Unix
system. In general, JSON provides a clear and basically unambiguous
way to represent text data and much numeric data, even if it has
relatively strange characters in it (ie, JSON has escaping rules
that everyone knows and all tools can deal with); it's also generally
extensible to add additional data without causing heartburn in tools
that are dealing with older versions of a program's output. And
on Unix there's an increasingly rich collection of tools to deal
with and process JSON, starting with jq itself (and hopefully
soon GNU Awk in common configurations). Plus, JSON can generally
be transformed to various other formats if you need them.
(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)
There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.
(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)
PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.
(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)
It's good to have offline contact information for your upstream networking
So I said something on the Fediverse:
Current status: it's all fun and games until the building's backbone router disappears.
A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.
(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)
(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)
The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.
(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)
In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.
(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)
Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.
Shared (Unix) hosting and the problem of managing resource limits
Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resource limits to your customers.
(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)
The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use much resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.
(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)
The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).
Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').
It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.
One problem with "shared Unix hosting" was the lack of resource limits
I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.
The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or etc) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.
Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.
(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)
My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.
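(As a modern illustration of what 'real limits' on a customer can look like today, something like the following systemd-run invocation puts hard caps on a single customer's process tree; the user name and the wrapper program are made up for the example.)

  # run a customer's CGI wrapper with caps on memory, CPU, and processes
  systemd-run --uid=customer42 -p MemoryMax=512M -p CPUQuota=25% \
      -p TasksMax=64 /usr/local/bin/run-customer-cgi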
(How much resources to give each customer is also a problem, but that's another entry.)
How you should respond to authentication failures isn't universal
A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.
The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.
(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)
If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.
(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)
There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.
(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)
Everything should be able to ratelimit sources of authentication failures
One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.
(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)
I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.
The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs have very little limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.
Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.
(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)
PS: Having rate limits on failed authentications is also reassuring, at least for me.
The practical (Unix) problems with .cache and its friends
Over on the Fediverse, I said:
Dear everyone writing Unix programs that cache things in dot-directories (.cache, .local, etc): please don't. Create a non-dot directory for it. Because all of your giant cache (sub)directories are functionally invisible to many people using your programs, who wind up not understanding where their disk space has gone because almost nothing tells them about .cache, .local, and so on.
A corollary: if you're making a disk space usage tool, it should explicitly show ~/.cache, ~/.local, etc.
If you haven't noticed, there are an ever increasing number of programs that will cache a bunch of data, sometimes a very large amount of it, in various dot-directories in people's home directories. If you're lucky, these programs put their cache somewhere under ~/.cache; if you're semi-lucky, they use ~/.local, and if you're not lucky they invent their own directory, like ~/.cargo (used by Rust's standard build tool because it wants to be special). It's my view that this is a mistake and that everyone should put their big caches in a clearly visible directory or directory hierarchy, one that people can actually find in practice.
I will freely admit that we are in a somewhat unusual environment where we have shared fileservers, a now very atypical general multi-user environment, a compute cluster, and a bunch of people who are doing various sorts of modern GPU-based 'AI' research and learning (both AI datasets and AI software packages can get very big). In our environment, with our graduate students, it's routine for people to wind up with tens or even hundreds of GBytes of disk space used up for caches that they don't even realize are there because they don't show up in conventional ways to look for space usage.
As noted by Haelwenn /элвэн/, a plain
'du' will find such dotfiles. The problem is that plain 'du'
is more or less useless for most people; to really take advantage
of it, you have to know the right trick
(not just the -h argument but feeding it to sort to find things).
How I think most people use 'du' to find space hogs is they start
in their home directory with 'du -s *' (or maybe 'du -hs *')
and then they look at whatever big things show up. This will
completely miss things in dot-directories in normal usage. And on
Linux desktops, I believe that common GUI file browsers will omit
dot-directories by default and may not even have a particularly
accessible option to change that (this is certainly the behavior
of Cinnamon's 'Files' application and I can't imagine that GNOME
is different, considering their attitude).
(I'm not sure what our graduate students use to try explore their disk usage, but I know that multiple graduate students have been unable to find space being eaten up in dot-directories and surprised that their home directory was using so much.)
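The version of the 'du' trick that actually surfaces these dot-directories is something like the following; the glob deliberately pulls in dot entries (it misses names starting with '..', which are rare), and 'sort -h' needs GNU coreutils:

  # include dot-directories and sort by size; the biggest users end up last
  du -hs ~/* ~/.[!.]* 2>/dev/null | sort -h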
Modern languages and bad packaging outcomes at scale
Recently I read Steinar H. Gunderson's Migrating away from bcachefs (via), where one of the mentioned issues was a strong disagreement between the author of bcachefs and the Debian Linux distribution about how to package and distribute some Rust-based tools that are necessary to work with bcachefs. In the technology circles that I follow, there's a certain amount of disdain for the Debian approach, so today I want to write up how I see the general problem from a system administrator's point of view.
(Saying that Debian shouldn't package the bcachefs tools if they can't follow the wishes of upstream is equivalent to saying that Debian shouldn't support bcachefs. Among other things, this isn't viable for something that's intended to be a serious mainstream Linux filesystem.)
If you're serious about building software under controlled circumstances (and Linux distributions certainly are, as are an increasing number of organizations in general), you want the software build to be both isolated and repeatable. You want to be able to recreate the same software (ideally exactly binary identical, a 'reproducible build') on a machine that's completely disconnected from the Internet and the outside world, and if you build the software again later you want to get the same result. This means that build process can't download things from the Internet, and if you run it three months from now you should get the same result even if things out there on the Internet have changed (such as third party dependencies releasing updated versions).
Unfortunately a lot of the standard build tooling for modern languages is not built to do this. Instead it's optimized for building software on Internet connected machines where you want the latest patchlevel or even entire minor version of your third party dependencies, whatever that happens to be today. You can sometimes lock down specific versions of all third party dependencies, but this isn't necessarily the default and so programs may not be set up this way from the start; you have to patch it in as part of your build customizations.
(Some languages are less optimistic about updating dependencies, but developers tend not to like that. For example, Go is controversial for its approach of 'minimum version selection' instead of 'maximum version selection'.)
The minimum thing that any serious packaging environment needs to do is contain all of the dependencies for any top level artifact, and to force the build process to use these (and only these), without reaching out to the Internet to fetch other things (well, you're going to block all external access from the build environment). How you do this depends on the build system, but it's usually possible; in Go you might 'vendor' all dependencies to give yourself a self-contained source tree artifact. This artifact never changes the dependency versions used in a build even if they change upstream because you've frozen them as part of the artifact creation process.
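In Go, for example, freezing the dependencies into the source artifact is a couple of commands (sketched here; with a vendor directory present, modern Go versions will use it by default):

  go mod vendor                 # copy all module dependencies into ./vendor
  go build -mod=vendor ./...    # build using only the vendored copies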
(Even if you're not a distribution but an organization building your own software using third-party dependencies, you do very much want to capture local copies of them. Upstream things go away or get damaged every so often, and it can be rather bad to not be able to build a new release of some important internal tool because an upstream decided to retire to goat farming rather than deal with the EU CRA. For that matter, you might want to have local copies of important but uncommon third party open source tools you use, assuming you can reasonably rebuild them.)
If you're doing this on a small scale for individual programs you care a lot about, you can stop there. If you're doing this on a distribution's scale you have an additional decision to make: do you allow each top level thing to have its own version of dependencies, or do you try to freeze a common version? If you allow each top level thing to have its own version, you get two problems. First, you're using up more disk space for at least your source artifacts. Second and worse, now you're on the hook for maintaining, checking, and patching multiple versions of a given dependency if it turns out to have a security issue (or a serious bug).
Suppose that you have program A using version 1.2.3 of a dependency, program B using 1.2.7, the current version is 1.2.12, and the upstream releases 1.2.13 to fix a security issue. You may have to investigate both 1.2.3 and 1.2.7 to see if they have the bug and then either patch both with backported fixes or force both program A and program B to be built with 1.2.13, even if the versions of these programs that you're using weren't tested and validated with it (and people routinely break things in patchlevel releases).
If you have a lot of such programs it's certainly tempting to put your foot down and say 'every program that uses dependency X will be set to use a single version of it so we only have to worry about that version'. Even if you don't start out this way you may wind up with it after a few security releases from the dependency and the packagers of programs A and B deciding that they will just force the use of 1.2.13 (or 1.2.15 or whatever) so that they can skip the repeated checking and backporting (especially if both programs are packaged by the same person, who has only so much time to deal with all of this). If you do this inside an organization, probably no one in the outside world knows. If you do this as a distribution, people yell at you.
(Within an organization you may also have more flexibility to update program A and program B themselves to versions that might officially support version 1.2.15 of that dependency, even if the program version updates are a little risky and change some behavior. In a distribution that advertises stability and has no way of contacting people using it to warn them or coordinate changes, things aren't so flexible.)
The tradeoffs of having an internal unauthenticated SMTP server
One of the reactions I saw to my story of being hit by an alarmingly well-prepared phish spammer was surprise that we had an unauthenticated SMTP server, even if it was only available to our internal networks. Part of the reason we have such a server is historical, but I also feel that the tradeoffs involved are not as clear cut as you might think.
One fundamental problem is that people (actual humans) aren't the
only thing that needs to be able to send email. Unless you enjoy
building your own system problem notification system from scratch,
a whole lot of things will try to send you email to tell you about
problems. Cron jobs will email you output, you may want to get
similar email about systemd units,
both Linux software RAID and smartd will want to use email to
tell you about failures, you may have home-grown management systems, and so on. In addition to these programs
on your servers, you may have inconvenient devices like networked
multi-function photocopiers that have scan to email functionality
(and the people who bought them and need to use them have feelings
about being able to do so). In a university environment such as
ours, some of the machines
involved will be run by research groups, graduate students, and so
on, not your core system administrators (and it's a very good idea
if these machines can tell their owners about failed disks and the
like).
Most of these programs will submit their email through the local mailer facilities (whatever they are), and most local mail systems ('MTAs') can be configured to use authentication when they talk to whatever SMTP gateway you point them at. So in theory you could insist on authenticated SMTP for everything. However, this gives you a different problem, because now you must manage this authentication. Do you give each machine its own authentication identity and password, or have some degree of shared authentication? How do you distribute and update this authentication information? How much manual work are you going to need to do as research groups add and remove machines (and as your servers come and go)? Are you going to try to build a system that restricts where a given authentication identity can be used from, so that someone can't make off with the photocopier's SMTP authorization and reuse it from their desktop?
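To give a sense of what each machine would need, here's a hedged sketch of the Postfix client side of authenticated submission to a central gateway (the hostname, credentials, and file paths are illustrative, not our actual setup):

    # /etc/postfix/main.cf (fragment)
    relayhost = [smtp.example.org]:587
    smtp_tls_security_level = encrypt
    smtp_sasl_auth_enable = yes
    smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
    smtp_sasl_security_options = noanonymous

    # /etc/postfix/sasl_passwd (then run 'postmap /etc/postfix/sasl_passwd'):
    # [smtp.example.org]:587    machine-identity:its-password

Multiply that credential handling by every server, research group machine, and photocopier, and the management questions above are where the real work is.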
(If you instead authorize IP addresses without requiring SMTP authentication, you've simply removed the requirement for handling and distributing passwords; you're still going to be updating some form of access list. Also, this has issues if people can use your servers.)
You can solve all of these problems if you want to. But there is no current general, easily deployed solution for them, partly because we don't currently have any general system of secure machine and service identity that programs like MTAs can sit on top of. So we system administrators have to build such things ourselves to let one MTA prove to another MTA who and what it is.
(There are various ways to do this other than SMTP authentication and some of them are generally used in some environments; I understand that mutual TLS is common in some places. And I believe that in theory Kerberos could solve this, if everything used it.)
Every custom piece of software or piece of your environment that you build is an overhead; it has to be developed, maintained, updated, documented, and so on. It's not wrong to look at the amount of work it would require in your environment to have only authenticated SMTP and conclude that the practical risks of having unauthenticated SMTP are low enough that you'll just do that.
PS: requiring explicit authentication or authorization for notifications is itself a risk, because it means that a machine that's in a sufficiently bad or surprising state can't necessarily tell you about it. Your emergency notification system should ideally fail open, not fail closed.
PPS: In general, there are ways to make an unauthenticated SMTP server less risky, depending on what you need it to do. For example, in many environments there's no need to directly send such system notification email to arbitrary addresses outside the organization, so you could restrict what destinations the server accepts, and maybe what sending addresses can be used with it.
Sometimes you need to (or have to) run old binaries of programs
Something that is probably not news to system administrators who've been doing this long enough is that sometimes, you need to or have to run old binaries of programs. I don't mean that you need to run old versions of things (although since the program binaries are old, they will be old versions); I mean that you literally need to run old binaries, ones that were built years ago.
The obvious situation where this can happen is if you have commercial software and the vendor either goes out of business or stops providing updates for the software. In some situations this can result in you needing to keep extremely old systems alive simply to run this old software, and there are lots of stories about 'business critical' software in this situation.
(One possibly apocryphal local story is that the central IT people had to keep a SPARC Solaris machine running for more than a decade past its feasible end of life because it was the only environment that ran a very special printer driver that was used to print payroll checks.)
However, you can also get into this situation with open source software too. Increasingly, rebuilding complex open source software projects is not for the faint of heart and requires complex build environments. Not infrequently, these build environments are 'fragile', in the sense that in practice they depend on and require specific versions of tools, supporting language interpreters and compilers, and so on. If you're trying to (re)build them on a modern version of the OS, you may find some issues (also). You can try to get and run the version of the tools they need, but this can rapidly send you down a difficult rabbit hole.
(If you go back far enough, you can run into 32-bit versus 64-bit issues. This isn't just compilation problems, where code isn't 64-bit safe; you can also have code that produces different results when built as a 64-bit binary.)
This can create two problems. First, historically, it complicates moving between CPU architectures. For a couple of decades that's been a non-issue for most Unix environments, because x86 was so dominant, but now ARM systems are starting to become more and more available and even attractive, and they generally don't run old x86 binaries very well. Second, there are some operating systems that don't promise long term binary compatibility to older versions of themselves; they will update system ABIs, removing the old version of the ABI after a while, and require you to rebuild software to use the new ABIs if you want to run it on the current version of the OS. If you have to use old binaries you're stuck with old versions of the OS and generally no security updates.
(If you think that this is absurd and no one would possibly do that, I will point you to OpenBSD, which does it regularly to help maintain and improve the security of the system. OpenBSD is neither wrong nor right to take their approach; they're making a different set of tradeoffs than, say, Linux, because they have different priorities.)
Some ways to restrict who can log in via OpenSSH and how they authenticate
In yesterday's entry on allowing password authentication from the
Internet for SSH, I mentioned that there
were ways to restrict who this was enabled for or who could log in
through SSH. Today I want to cover some of them, using settings in
/etc/ssh/sshd_config.
The simplest way is to globally restrict logins with AllowUsers, listing only
the specific accounts that may be accessed over SSH. If there are
too many such accounts or they change too often, you can switch to
AllowGroups
and allow only people in a specific group that you maintain, call
it 'sshlogins'.
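As a minimal sketch in sshd_config terms (using the account and group names that appear elsewhere in this entry):

    # Only specific accounts may log in over SSH ...
    AllowUsers cks

    # ... or, if that list gets too large or changes too often,
    # maintain a Unix group instead (don't combine the two; a user
    # would then have to pass both checks).
    #AllowGroups sshlogins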
If you want to allow logins generally but restrict, say, password
based authentication to only people that you expect, what you want
is a Match block
and setting AuthenticationMethods within
it. You would set it up something like this:
AuthenticationMethods publickey

Match User cks
  AuthenticationMethods any
If you want to be able to log in using password from your local networks but not remotely, you could extend this with an additional Match directive that looked at the origin IP address:
Match Address 127.0.0.0/8,<your networks here>
  AuthenticationMethods any
In general, Match directives are your tool for doing relatively complex restrictions. You could, for example, arrange that accounts in a certain Unix group can only log in from the local network, never remotely. Or reverse this so that only logins in some Unix group can log in remotely, and everyone else is only allowed to use SSH within the local network.
However, any time you're doing complex things with Match blocks, you should make sure to test your configuration to make sure it's working the way you want. OpenSSH's sshd_config is a configuration file with some additional capabilities, not a programming language, and there are undoubtedly some subtle interactions and traps you can fall into.
(This is one reason I'm not giving a lot of examples here; I'd have to carefully test them.)
Sidebar: Restricting root logins via OpenSSH
If you permit root logins via OpenSSH at all, one fun thing to do is to restrict where you'll accept them from:
PermitRootLogin no

Match Address 127.0.0.0/8,<your networks here>
  PermitRootLogin prohibit-password  # or 'yes' for some places
A lot of Internet SSH probers direct most of their effort against the root account. With this setting you're assured that all of them will fail no matter what.
Thoughts on having SSH allow password authentication from the Internet
On the Fediverse, I recently saw a poll about whether people left SSH generally accessible on its normal port or if they moved it; one of the replies was that the person left SSH on the normal port but disallowed password based authentication and only allowed public key authentication. This almost led to me posting a hot take, but then I decided that things were a bit more nuanced than my first reaction.
As everyone with an Internet-exposed SSH daemon knows, attackers are constantly attempting password guesses against various accounts. But if you're using a strong password, the odds of an attacker guessing it are extremely low, since doing 'password cracking via SSH' has an extremely low guesses per second number (enforced by your SSH daemon). In this sense, not accepting passwords over the Internet is at most a tiny practical increase in security (with some potential downsides in unusual situations).
Not accepting passwords from the Internet protects you against three other risks, two relatively obvious and one subtle one. First, it stops an attacker that can steal and then crack your encrypted passwords; this risk should be very low if you use strong passwords. Second, you're not exposed if your SSH server turns out to have a general vulnerability in password authentication that can be remotely exploited before a successful authentication. This might not be an authentication bypass; it might be some sort of corruption that leads to memory leaks, code execution, or the like. In practice, (OpenSSH) password authentication is a complex piece of code that interacts with things like your system's random set of PAM modules.
The third risk is that some piece of software will create a generic account with a predictable login name and known default password. These seem to be not uncommon, based on the fact that attackers probe incessantly for them, checking login names like 'ubuntu', 'debian', 'admin', 'testftp', 'mongodb', 'gitlab', and so on. Of course software shouldn't do this, but if something does, not allowing password authenticated SSH from the Internet will block access to these bad accounts. You can mitigate this risk by only accepting password authentication for specific, known accounts, for example only your own account.
The potential downside of only accepting keypair authentication for access to your account is that you might need to log in to your account in a situation where you don't have your keypair available (or can't use it). This is something that I probably care about more than most people, because as a system administrator I want to be able to log in to my desktop even in quite unusual situations. As long as I can use password authentication, I can use anything trustworthy that has a keyboard. Most people probably will only log in to their desktops (or servers) from other machines that they own and control, like laptops, tablets, or phones.
(You can opt to completely disallow password authentication from all other machines, even local ones. This is an even stronger and potentially more limiting restriction, since now you can't even log in from another one of your machines unless that machine has a suitable keypair set up. As a sysadmin, I'd never do that on my work desktop, since I very much want to be able to log in to my regular account from the console of one of our servers if I need to.)
My bug reports are mostly done for work these days
These days, I almost entirely report bugs in open source software as part of my work. A significant part of this is that most of what I stumble over bugs in are things that work uses (such as Ubuntu or OpenBSD), or at least things that I mostly use as part of work. There are some consequences of this that I feel like noting today.
The first is that I do bug investigation and bug reporting on work time during work hours, and I don't work on "work bugs" outside of that, on evenings, weekends, and holidays. This sometimes meshes awkwardly with the time open source projects have available for dealing with bugs (which is often in people's personal time outside of work hours), so sometimes I will reply to things and do additional followup investigation out of hours to keep a bug report moving along, but I mostly avoid it. Certainly the initial investigation and filing of a work bug is a working hours activity.
(I'm not always successful in keeping it to that because there is always the temptation to spend a few more minutes digging a bit more into the problem. This is especially acute when working from home.)
The second thing is that bug filing work is merely one of the claims on my work time. I have a finite amount of work time and a variety of things to get done with varying urgency, and filing and updating bugs is not always the top of the list. And just like any other work activity, I have to be convinced that filing a particular bug is worth spending some of my limited work time on. Work does not pay me to file bugs and make open source better; they pay me to make our stuff work. Sometimes filing a bug is a good way to do this but some of the time it's not, for example because the organization in question doesn't respond to most bug reports.
(Even when it's useful in general to file a bug report because it will result in the issue being fixed at some point in the future, we generally need to deal with the problem today, so filing the bug report may take a back seat to things like developing workarounds.)
Another consequence is that it's much easier for me to make informal Fediverse posts about bugs (often as I discover more and more disconcerting things) or write Wandering Thoughts posts about work bugs than it is to make an actual bug report. Writing for Wandering Thoughts is a personal thing that I do outside of work hours, although I write about stuff from work (and I can often use something to write about, so interesting work bugs are good grist).
(There is also that making bug reports is not necessarily pleasant, and making bad bug reports can be bad. This interacts unpleasantly with the open source valorization of public work. To be blunt, I'm more willing to do unpleasant things when work is paying me than when it's not, although often the bug reports that are unpleasant to make are also the ones that aren't very useful to make.)
PS: All of this leads to a surprisingly common pattern where I'll spend much of a work day running down a bug to the point where I feel I understand it reasonably well, come home after work, write the bug up as a Wandering Thoughts entry (often clarifying my understanding of the bug in the process), and then file a bug report at work the next work day.
IMAP clients can vary in their reactions to IMAP errors
For reasons outside of the scope of this entry, we recently modified our IMAP server so that it would only return 20,000 results from an IMAP LIST command (technically 20,001 results). In our environment, an IMAP LIST operation that generates this many results means that one of the people who can hit this has run into our IMAP server backward compatibility problem. When we made this change, we had a choice for what would happen when the limit was hit, and specifically we had a choice of whether to claim that the IMAP LIST operation had succeeded or had failed. In the end we decided it was better to report that the IMAP LIST operation had failed, which also allowed us to include a text message explaining what had happened (in IMAP these are relatively free form).
(The specifics of the situation are that the IMAP LIST command will report a stream of IMAP folders back to the client and then end the stream after 20,001 entries, with either an 'ok' result or an error result with text. So in the latter case, the IMAP client gets 20,001 folder entries and an error at the end.)
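Schematically, a client that hits the limit sees something like this (the folder names, tag, and error text here are made up for illustration):

    * LIST (\HasNoChildren) "/" "mail/folder-20000"
    * LIST (\HasNoChildren) "/" "mail/folder-20001"
    a42 NO LIST truncated: too many mailboxes, something is probably wrong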
Unsurprisingly, after deploying this change we've seen that IMAP clients (both mail readers and things like server webmail code) vary in their behavior when this limit is hit. The behavior we'd like to see is that the client considers itself to have a partial result and uses it as much as possible, while also telling the person using it that something went wrong. I'm not sure any IMAP client actually does this. One webmail system that we use reports the entire output from the IMAP LIST command as an 'error' (or tries to); since the error message is the last part of the output, this means it's never visible. One mail client appears to throw away all of the LIST results and not report an error to the person using it, which in practice means that all of your folders disappear (apart from your inbox).
(Other mail clients appear to ignore the error and probably show the partial results they've received.)
Since the IMAP server streams the folder list from IMAP LIST to the client as it traverses the folders (ie, Unix directories), we don't immediately know if there are going to be too many results; we only find that out after we've already reported those 20,000 folders. But in hindsight, what we could have done is reported a final synthetic folder with a prominent explanatory name and then claimed that the command succeeded (and stopped). In practice this seems more likely to show something to the person using the mail client, since actually reporting the error text we provide is apparently not anywhere near as common as we might hope.
Using tcpdump to see only incoming or outgoing traffic
In the normal course of events, implementations of 'tcpdump'
report on packets going in both directions, which is to say it
reports both packets received and packets sent. Normally this isn't
confusing and you can readily tell one from the other, but sometimes
situations aren't normal and
you want to see only incoming packets or only outgoing packets
(this has come up before). Modern
versions of tcpdump can do this, but you have to know where to
look.
If you're monitoring regular network interfaces on Linux, FreeBSD, or OpenBSD, this behavior is controlled by a tcpdump command line switch. On modern Linux and on FreeBSD, this is '-Q in' or '-Q out', as covered in the Linux manpage and the FreeBSD manpage. On OpenBSD, you use a different command line switch, '-D in' or '-D out', per the OpenBSD manpage.
(The Linux and FreeBSD tcpdump use '-D' to mean 'list all interfaces'.)
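As a concrete sketch (the interface names and the filter here are illustrative):

    # Linux and FreeBSD: only packets this host receives on the interface
    tcpdump -n -i eth0 -Q in 'port 22'

    # OpenBSD: the same thing, with its different switch
    tcpdump -n -i em0 -D in 'port 22'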
There are network types where the in or out direction can be matched by tcpdump pcap filter rules, but plain Ethernet is not one of them. This implies that you can't write a pcap filter rule that matches some packets only inbound and some packets only outbound at the same time; instead you have to run two tcpdumps.
If you have a (software) bridge interface or bridged collection of interfaces, as far as I know on both OpenBSD and FreeBSD the 'in' and 'out' directions on the underlying physical interfaces work the way you expect. Which is to say, if you have ix0 and ix1 bridged together as bridge0, 'tcpdump -Q in -i ix0' shows packets that ix0 is receiving from the physical network and doesn't include packets forwarded out through ix0 by the bridge interface (which in some sense you could say are 'sent' to ix0 by the bridge).
The PF packet filter system on both OpenBSD and FreeBSD can log packets to a special network interface, normally 'pflog0'. When you tcpdump this interface, both OpenBSD and FreeBSD accept an 'on <interface>' (which these days is a synonym for 'ifname <interface>') clause in pcap filters, which I believe means that the packet was received on the specific interface (per my entry on various filtering options for OpenBSD). Both also have 'inbound' and 'outbound', which I believe match based on whether the particular PF rule that caused them to match was an 'in' or an 'out' rule.
(See the OpenBSD pcap-filter and the FreeBSD pcap-filter manual pages.)
I'm firmly attached to a mouse and (overlapping) windows
In the tech circles I follow, there are a number of people who are firmly in what I could call a 'text mode' camp (eg, also). Over on the Fediverse, I said something in an aside about my personal tastes:
(Having used Unix through serial terminals or modems+emulators thereof back in the days, I am not personally interested in going back to a single text console/window experience, but it is certainly an option for simplicity.)
(Although I didn't put it in my Fediverse post, my experience with this 'single text console' environment extends beyond Unix. Similarly, I've lived without a mouse and now I want one (although I have particular tastes in mice).)
On the surface I might seem like someone who is a good candidate for the single pane of text experience, since I do much of my work in text windows, either terminals or environments (like GNU Emacs) that ape them, and I routinely do odd things like read email from the command line. But under the surface, I'm very much not. I very much like having multiple separate blocks of text around, being able to organize these blocks spatially, having a core area where I mostly work from with peripheral areas for additional things, and being able to overlap these blocks and apply a stacking order to control what is completely visible and what's partly visible.
In one view, you could say that this works partly because I have enough screen space. In another view, it would be better to say that I've organized my computing environment to have this screen space (and the other aspects). I've chosen to use desktop computers instead of portable ones, partly for increased screen space, and I've consistently opted for relatively large screens when I could reasonably get them, steadily moving up in screen size (both physical and resolution wise) over time.
(Over the years I've gone out of my way to have this sort of environment, including using unusual window systems.)
The core reason I reach for windows and a mouse is simple: I find the pure text alternative to be too confining. I can work in it if I have to but I don't like to. Using finer grained graphical windows instead of text based ones (in a text windowing environment, which exist), and being able to use a mouse to manipulate things instead of always having to use keyboard commands, is nicer for me. This extends beyond shell sessions to other things as well; for example, generally I would rather start new (X) windows for additional Emacs or vim activities rather than try to do everything through the text based multi-window features that each has. Similarly, I almost never use screen (or tmux) within my graphical desktop; the only time I reach for either is when I'm doing something critical that I might be disconnected from.
(This doesn't mean that I use a standard Unix desktop environment for my main desktops; I have a quite different desktop environment. I've also written a number of tools to make various aspects of this multi-window environment be easy to use in a work environment that involves routine access to and use of a bunch of different machines.)
If I liked tiling based window environments, it would be easier to switch to a text (console) based environment with text based tiling of 'windows', and I would probably be less strongly attached to the mouse (although it's hard to beat the mouse for selecting text). However, tiling window environments don't appeal to me (also), either in graphical or in text form. I'll use tiling in environments where it's the natural choice (for example, in vim and emacs), but I consider it merely okay.
The TLS certificate multi-file problem (for automatic updates)
In a recent entry on short lived TLS certificates and graceful certificate rollover in web servers, I mentioned that one issue with software automatically reloading TLS certificates was that TLS certificates are almost always stored in multiple files. Typically this is either two files (the TLS certificate's key and a 'fullchain' file with the TLS certificate and intermediate certificates together) or three files (the key, the signed certificate, and a third file with the intermediate chain). The core problem this creates is the same one you have any time information is split across multiple files, namely making 'atomic' changes to the set of files, so that software never sees an inconsistent state with some updated files and some not.
With TLS certificates, a mismatch between the key and the signed certificate will cause the server to be unable to properly prove that it controls the private key for the TLS certificate it presented. Either it will load the new key and the old certificate or the old key and the new certificate, and in either case it won't be able to generate the correct proof (assuming the secure case where your TLS certificate software generates a new key for each TLS certificate renewal, which you want to do since you want to guard against your private key having been compromised).
The potential for a mismatch is obvious if the file with the TLS key and the file with the TLS certificate are updated separately (or a new version is written out and swapped into place separately). At this point your mind might turn to clever tricks like writing all of the new files to a new directory and somehow swapping the whole directory in at once (this is certainly where mine went). Unfortunately, even this isn't good enough because the program has to open the two (or three) files separately, and the time gap between the opens creates an opportunity for a mismatch more or less no matter what we do.
(If the low level TLS software operates by, for example, first loading and parsing the TLS certificate, then loading the private key to verify that it matches, the time window may be bigger than you expect because the parsing may take a bit of time. The minimal time window comes about if you open the two files as close to each other as possible and defer all loading and processing until after both are opened.)
The only completely sure way to get around this is to put everything in one file (and then use an appropriate way to update the file atomically). Short of that, I believe that software could try to compensate by checking that the private key and the TLS certificate match after they're automatically reloaded, and if they don't, it should reload both.
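As an illustration of that compensation, here's a minimal Python sketch (the file names are hypothetical); it leans on the fact that loading a key that doesn't match the certificate fails, and reloads both files if that happens:

    import ssl
    import time

    def load_pair(certfile, keyfile):
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        # load_cert_chain() raises ssl.SSLError if the private key
        # doesn't match the certificate, which is exactly the
        # mismatch we're worried about.
        ctx.load_cert_chain(certfile, keyfile)
        return ctx

    try:
        ctx = load_pair("fullchain.pem", "privkey.pem")
    except ssl.SSLError:
        # We may have caught the renewal mid-update; wait briefly and
        # then reload both files together.
        time.sleep(2)
        ctx = load_pair("fullchain.pem", "privkey.pem")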
(If you control both the software that will use the TLS certificates and the renewal software, you can do other things. For example, you can always update the files in a specific order and then make the server software trigger an automatic reload only when the timestamp changes on the last file to be updated. That way you know the update is 'done' by the time you're loading anything.)
Remembering to make my local changes emit log messages when they act
Over on the Fediverse, I said something:
Current status: respinning an Ubuntu package build (... painfully) because I forgot the golden rule that when I add a hack to something, I should always make it log when my hack was triggered. Even if I can observe the side effects in testing, we'll want to know it happened in production.
(Okay, this isn't applicable to all hacks, but.)
Every so often we change or augment some standard piece of software or standard part of the system to do something special under specific circumstances. A rule I keep forgetting and then either re-learning or reminding myself of is that even if the effects of my change triggering are visible to the person using the system, I want to make it log as well. There are at least two reasons for this.
The first reason is that my change may wind up causing some problem for people, even if we don't think it's going to. Should it cause such problems, it's very useful to have a log message (perhaps shortly before the problem happens) to the effect of 'I did this new thing'. This can save a bunch of troubleshooting, both at the time when we deploy this change and long afterward.
The second reason is that we may turn out to be wrong about how often our change triggers, which is to say how common the specific circumstances are. This can go either way. Our change can trigger a lot more than we expected, which may mean that it's overly aggressive and is affecting people more than we want, and cause us to look for other options. Or it could be that the issue we're trying to deal with is more significant than we expected and justifies us doing even more. Alternatively, our change can trigger a lot less than we expect, which may mean we want to take it out rather than have to maintain a local modification that doesn't actually do much (one that almost invariably makes the system more complex and harder to understand).
In the log message itself, I want to be clear and specific, although probably not as verbose as I would be for an infrequent error message. Especially for things I expect to trigger relatively infrequently, I should probably put as many details about the special circumstances as possible into the log message, because the log message is what my co-workers and I may have to work from in six months, when we've forgotten the details.
PCIe cards we use and have used in our servers
In a comment on my entry on how common (desktop) motherboards are supporting more M.2 NVMe slots but fewer PCIe cards, jmassey was curious about what PCIe cards we needed and used. This is a good and interesting question, especially since some number of our 'servers' are actually built using desktop motherboards for various reasons (for example, a certain number of the GPU nodes in our SLURM cluster, and some of our older compute servers, which we put together ourselves using early generation AMD Threadrippers and desktop motherboards for them).
Today, we have three dominant patterns of PCIe cards. Our SLURM GPU nodes obviously have a GPU card (x16 PCIe lanes) and we've added a single port 10G-T card (which I believe are all PCIe x4) so they can pull data from our fileservers as fast as possible. Most of our firewalls have an extra dual-port 10G card (mostly 10G-T but a few use SFPs). And a number of machines have dual-port 1G cards because they need to be on more networks; our current stock of these cards are physically x4 PCIe, although I haven't looked to see if they use all the lanes.
(We also have single-port 1G cards lying around that sometimes get used in various machines; these are x1 cards. The dual-port 10G cards are probably some mix of x4 and x8, since online checks say they come in both varieties. We have and use a few quad-port 1G cards for semi-exotic situations, but I'm not sure how many PCIe lanes they want, physically or otherwise. In theory they could reasonably be x4, since a single 1G is fine at x1.)
In the past, one generation of our fileserver setup had some machines that needed to use a PCIe SAS controller in order to be able to talk to all of the drives in their chassis, and I believe these cards were PCIe x8; these machines also used a dual 10G-T card. The current generation handles all of their drives through motherboard controllers, but we might need to move back to cards in future hardware configurations (depending on what the available server motherboards handle on the motherboard). The good news, for fileservers, is that modern server motherboards increasingly have at least one onboard 10G port. But in a worst case situation, a large fileserver might need two SAS controller cards and a 10G card.
It's possible that we'll want to add NVMe drives to some servers (parts of our backup system may be limited by SATA write and read speeds today). Since I don't believe any of our current servers support PCIe bifurcation, this would require one or two PCIe x4 cards and slots (two if we want to mirror this fast storage, one if we decide we don't care). Such a server would likely also want 10G; if it didn't have a motherboard 10G port, that would require another x4 card (or possibly a dual-port 10G card at x8).
The good news for us is that servers tend to make all of their available slots be physically large (generally large enough for x8 cards, and maybe even x16 these days), so you can fit in all these cards even if some of them don't get all the PCIe lanes they'd like. And modern server CPUs are also coming with more and more PCIe lanes, so probably we can actually drive many of those slots at their full width.
(I was going to say that modern server motherboards mostly don't design in M.2 slots that reduce the available PCIe lanes, but that seems to depend on what vendor you look at. A random sampling of Supermicro server motherboards suggests that two M.2 slots are not uncommon, while our Dell R350s have none.)
The modern world of server serial ports, BMCs, and IPMI Serial over LAN
Once upon a time, life was relatively simple in the x86 world. Most x86 compatible PCs theoretically had one or two UARTs, which were called COM1 and COM2 by MS-DOS and Windows, ttyS0 and ttyS1 by Linux, 'ttyu0' and 'ttyu1' by FreeBSD, and so on, based on standard x86 IO port addresses for them. Servers had a physical serial port on the back and wired the connector to COM1 (some servers might have two connectors). Then life became more complicated when servers implemented BMCs (Baseboard management controllers) and the IPMI specification added Serial over LAN, to let you talk to your server through what the server believed was a serial port but was actually a connection through the BMC, coming over your management network.
Early BMCs could take very brute force approaches to making this work. The circa 2008 era Sunfire X2200s we used in our first ZFS fileservers wired the motherboard serial port to the BMC and connected the BMC to the physical serial port on the back of the server. When you talked to the serial port after the machine powered on, you were actually talking to the BMC; to get to the server serial port, you had to log in to the BMC and do an arcane sequence to 'connect' to the server serial port. The BMC didn't save or buffer up server serial output from before you connected; such output was just lost.
(Given our long standing console server, we had feelings about having to manually do things to get the real server serial console to show up so we could start logging kernel console output.)
Modern servers and their BMCs are quite intertwined, so I suspect that often both server serial ports are basically implemented by the BMC (cf), or at least are wired to it. The BMC passes one serial port through to the physical connector (if your server has one) and handles the other itself to implement Serial over LAN. There are variants on this design possible; for example, we have one set of Supermicro hardware with no external physical serial connector, just one serial header on the motherboard and a BMC Serial over LAN port. To be unhelpful, the motherboard serial header is ttyS0 and the BMC SOL port is ttyS1.
When the BMC handles both server serial ports and passes one of them through to the physical serial port, it can decide which one to pass through and which one to use as the Serial over LAN port. Being able to change this in the BMC is convenient if you want to have a common server operating system configuration but use a physical serial port on some machines and use Serial over LAN on others. With the BMC switching which server serial port comes out on the external serial connector, you can tell all of the server OS installs to use 'ttyS0' as their serial console, then connect ttyS0 to either Serial over LAN or the physical serial port as you need.
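As a hedged sketch of the two halves of this (the hostname and login are illustrative): the server OS is pointed at ttyS0 as its serial console, and you reach whichever server serial port the BMC has mapped to Serial over LAN with ipmitool:

    # On the server, eg in /etc/default/grub on Linux:
    GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"

    # From the management network, attach to the Serial over LAN console:
    ipmitool -I lanplus -H bmc.example.org -U admin sol activate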
Some BMCs (I'm looking at you, Dell) go to an extra level of indirection. In these, the BMC has an idea of 'serial device 1' and 'serial device 2', with you controlling which of the server's ttyS0 and ttyS1 maps to which 'serial device', and then it has a separate setting for which 'serial device' is mapped to the physical serial connector on the back. This helpfully requires you to look at two separate settings to know if your ttyS0 will be appearing on the physical connector or as a Serial over LAN console (and gives you two settings that can be wrong).
In theory a BMC could share a single server serial port between the physical serial connector and an IPMI Serial over LAN connection, sending output to both and accepting input from each. In practice I don't think most BMCs do this and there are obvious issues of two people interfering with each other that BMCs may not want to get involved in.
PS: I expect more and more servers to drop external serial ports over time, retaining at most an internal serial header on the motherboard. That might simplify BMC and BIOS settings.
My life has been improved by my quiet Prometheus alert status monitor
I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.
(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)
After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I actually get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).
Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).
My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.
(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)
My new solution for quiet monitoring of our Prometheus alerts
Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't doing so. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.
At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.
(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)
The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.
Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.
To support this sort of thing, xlbiff has the notion of a 'check'
program that can print out a number every time it runs, and will
get passed the last invocation's number on the command line (or '0'
at the start). Using this requires boiling down the state of the
current alerts to a single signed 32-bit number. I could have used
something like the count of current alerts, but me being me I decided
to be more clever. The program takes the start time of every current
alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts
a starting epoch to make sure we're not going to overflow, and adds
them all up to be the state number (which I call a 'checksum' in
my code because I started out thinking about more complex tricks
like running my output text through CRC32).
(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)
To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:
ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS
I understand why ALERTS_FOR_STATE doesn't include the alert state,
but sometimes it does force you to go out of your way.
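Pulling this together, here's a condensed Python sketch of such a check program (the Prometheus URL and the epoch value are assumptions, and real code would also do the filtering of flaky pending alerts mentioned above):

    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS = "http://prometheus.example.org:9090"  # assumption
    EPOCH = 1_700_000_000   # arbitrary offset to keep the sum small
    QUERY = "ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS"

    def state_number():
        url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        total = 0
        for r in data["data"]["result"]:
            # The ALERTS_FOR_STATE value is the alert's start time.
            start = int(float(r["value"][1])) - EPOCH
            # Nudge firing alerts so a pending -> firing transition
            # changes the number and forces a redisplay.
            if r["metric"].get("alertstate") == "firing":
                start += 1
            total += start
        return total

    if __name__ == "__main__":
        # xlbiff passes the previous number as an argument; we ignore
        # it and simply print the current one.
        print(state_number())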
PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).
Our Prometheus alerting problem if our central mail server isn't working
Over on the Fediverse, I said something:
Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?
(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)
There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.
The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problem even without an alert actively being triggered. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooting, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.
(In that old version of my desktop I would have
noticed the issue right away, because an xload for the machine
in question was right in the middle of these things. These days
it's way off to the right side, out of my routine view, but I could
change that back.)
One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email to SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access and we'd build some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).
For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).
(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)
IPv6 networks do apparently get probed (and implications for address assignment)
For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.
On my old /64, I used ::1/64 and ::2/64 for static IP addresses,
of which the first was in DNS, and the other IPv6 addresses in it
were the usual SLAAC assignments. The first thing I discovered in
my tcpdump was a surprisingly large number of cloud-based IPv6
addresses that were pinging my ::1 address. Once I excluded that
traffic, I was left with enough volume of port probes that I could
easily see them in a casual tcpdump.
The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.
(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)
As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.
One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.
PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.
Your options for displaying status over time in Grafana 11
A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.
Generally, the best panel type to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.
You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.
(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)
As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the function of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).
Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.
If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.
Finding a good use for keep_firing_for in our Prometheus alerts
A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.
Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clear out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).
The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.
Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another usefully short duration to de-bounce our large scale alert trigger conditions.
(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)
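As an illustration, here is a minimal sketch of the shape such a rule takes; the alert name, expression, and durations are made up for illustration, and our real rule aggregates several conditions:

    groups:
      - name: meta-alerts
        rules:
          - alert: LargeScaleIssue
            # Hypothetical trigger condition; the real one combines several signals.
            expr: count(probe_success == 0) > 15
            # De-bounce brief blips before declaring a large scale issue.
            for: 1m
            # Linger after the condition clears so the inhibition keeps working
            # while normal alerts from the incident drain away.
            keep_firing_for: 15m
            labels:
              severity: inhibit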
I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).
Prometheus makes it annoyingly difficult to add more information to alerts
Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.
While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.
I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.
A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to find a label, I think you need to use a more complex template with a 'with' block; on the positive side, this may let you format a message fragment with multiple labels and even the value.
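To make this concrete, here is a minimal sketch of an annotation that makes a query; the metric, job label, and wording are hypothetical. The Go raw (backquote) string avoids fighting with the double quotes inside the PromQL, the YAML block scalar sidesteps YAML quoting, and the 'with' block keeps the template from complaining if the query returns nothing:

    annotations:
      # How many of a (hypothetical) set of agents are down right now.
      downcount: >-
        {{ with query `count(up{job="node"} == 0)` -}}
          {{ . | first | value }} node agents are down.
        {{- end }}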
More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alert notifications between Alertmanager templates and Prometheus alert rules. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information out again with Go template functions.
What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.
(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels and so on.)
Doing general address matching against varying address lists in Exim
In various Exim setups, you sometimes want to match an email address against a file (or in general a list) of addresses and some sort of address patterns; for example, you might have a file of addresses and so on that you will never accept as sender addresses. Exim has two different mechanisms for doing this, address lists and nwildlsearch lookups in files that are performed through the '${lookup}' string expansion item. Generally it's better to use address lists, because they have a wildcard syntax that's specifically focused on email addresses, instead of the less useful nwildlsearch lookup wildcarding.
Exim has specific features for matching address lists (including in file form) against certain addresses associated with the email message; for example, both ACLs and routers can match against the envelope sender address (the SMTP MAIL FROM) using 'senders = ...'. If you want to match against message addresses that are not available this way, you must use a generic 'condition =' operation and either '${lookup}' or '${if match_address {..}{...}}', depending on whether you want to use a nwildlsearch lookup or an actual address list (likely in a file). As mentioned, normally you'd prefer to use an actual address list.
Now suppose that your file of addresses is, for example, per-user. In a straight 'senders =' match this is no problem, you can just write 'senders = /some/where/$local_part_data/addrs'. Life is not as easy if you want to match a message address that is not directly supported, for example the email address of the 'From:' header. If you have the user (or whatever other varying thing) in $acl_m0_var, you would like to write:
condition = ${if match_address {${address:$h_from:}} {/a/dir/$acl_m0_var/fromaddrs} }
However, match_address (and its friends) have a deliberate limitation, which is that in common Exim build configurations they don't perform string expansion on their second argument.
The way around this turns out to be to use an explicitly defined and named 'addresslist' that has the string expansion:
addresslist badfromaddrs = /a/dir/$acl_m0_var/fromaddrs
[...]
condition = ${if match_address {${address:$h_from:}} {+badfromaddrs} }
This looks weird, since at the point we're setting up badfromaddrs the $acl_m0_var is not even vaguely defined, but it works. The important thing that makes this go is a little sentence at the start of the Exim documentation's Expansion of lists:
Each list is expanded as a single string before it is used. [...]
Although the second argument of match_address is not itself string-expanded, if it specifies a named address list, that address list is string-expanded when it's used, so our $acl_m0_var variable is substituted in and everything works.
Speaking from personal experience, it's easy to miss this sentence and its importance, especially if you normally use address lists (and domain lists and so on) without any string expansion, with fixed arguments.
(Probably the only reason I found it was that I was in the process of writing a question to the Exim mailing list, which of course got me to look really closely at the documentation to make sure I wasn't asking a stupid question.)
Having rate-limits on failed authentication attempts is reassuring
A while back I added rate-limits to failed SMTP authentication attempts. Mostly I did it because I was irritated at seeing all of the failed (SMTP) authentication attempts in logs and activity summaries; I didn't think we were in any actual danger from the usual brute force mass password guessing attacks we see on the Internet. To my surprise, having this rate-limit in place has been quite reassuring, to the point where I no longer even bother looking at the overall rate of SMTP authentication failures or their sources. Attackers are unlikely to make much headway or have much of an impact on the system.
Similarly, we recently updated an OpenBSD machine that has its SSH port open to the Internet from OpenBSD 7.5 to OpenBSD 7.6. One of the things that OpenBSD 7.6 brings with it is the latest version of OpenSSH, 9.8, which has per-source authentication rate limits (although they're not quite described that way and the feature is more general). This was also a reassuring change. Attackers wouldn't be getting into the machine in any case, but I have seen the machine use an awful lot of CPU at times when attackers were pounding away, and now they're not going to be able to do that.
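As a sketch, OpenSSH's per-source penalty settings look something like this in sshd_config; the durations and the exempt network here are illustrative values, not necessarily what we set:

    # Penalize source addresses that repeatedly fail authentication
    # (these particular durations, in seconds, are made up for illustration).
    PerSourcePenalties authfail:10 max:600
    # Hypothetical local network that should never be penalized.
    PerSourcePenaltyExemptList 192.0.2.0/24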
(We've long had firewall rate limits on connections, but they have to be set high for various reasons including that the firewall can't tell connections that fail to authenticate apart from brief ones that did.)
I can wave my hands about why it feels reassuring (and nice) to know that we have rate-limits in place for (some) commonly targeted authentication vectors. I know it doesn't outright eliminate the potential exposure, but I also know that it helps reduce various risks. Overall, I think of it as making things quieter, and in some sense we're no longer getting constantly attacked as much.
(It's also nice to hope that we're frustrating attackers and wasting their time. They do sort of have limits on how much time they have and how many machines they can use and so on, so our rate limits make attacking us more 'costly' and less useful, especially if they trigger our rate limits.)
PS: At the same time, this shows my irrationality, because for a long time I didn't even think about how many SSH or SMTP authentication attempts were being made against us. It was only after I put together some dashboards about this in our metrics system that I started thinking about it (and seeing temporary changes in SSH patterns and interesting SMTP and IMAP patterns). Had I never looked, I would have never thought about it.
Our various different types of Ubuntu installs
In my entry on how we have lots of local customizations I mentioned that the amount of customization we do to any particular Ubuntu server depends on what class or type of machine it is. That's a little abstract, so let's talk about how our various machines are split up by type.
Our general install framework has two pivotal questions that categorize machines. The first question is what degree of NFS mounting the machine will do. The choices are:
- mounting all of the NFS filesystems from our fileservers (more or less),
- NFS mounting just our central administrative filesystem, either with our full set of accounts or with just staff accounts,
- rsync'ing that central administrative filesystem (which implies only staff accounts), or
- being a completely isolated machine that doesn't have even the central administrative filesystem.
Servers that people will use have to have all of our NFS filesystems mounted, as do things like our Samba and IMAP servers. Our fileservers don't cross-mount NFS filesystems from each other, but they do need a replicated copy of our central administrative filesystem and they have to have our full collection of logins and groups for NFS reasons. Many of our more stand-alone, special purpose servers only need our central administrative filesystem, and will either NFS mount it or rsync it depending on how fast we want updates to propagate. For example, our local DNS resolvers don't particularly need fast updates, but our external mail gateway needs to be up to date on what email addresses exist, which is propagated through our central administrative filesystem.
On machines that have all of our NFS mounts, we have a further type choice; we can install them as a general login server (called an 'apps' server for historical reasons), as a 'comps' compute server (which includes our SLURM nodes), or with only a smaller 'base' set of packages (which is not all that small; we used to try to have a 'core' package set and a larger 'base' package set, but over time we found we never installed machines with only the 'core' set). These days the only difference between general login servers and compute servers is some system settings, but in the past they used to have somewhat different package sets.
The general login servers and compute servers are mostly not further customized (there are a few exceptions, and SLURM nodes need a bit of additional setup). Almost all machines that get only the base package set are further customized with additional packages and specific configuration for their purpose, because the base package set by itself doesn't make the machine do anything much or be particularly useful. These further customizations mostly aren't scripted (or otherwise automated) for various reasons. The one big exception is installing our NFS fileservers, which we decided was both a large enough job and something we do enough of that we wanted to script it so that everything came out the same.
As a practical matter, the choice between NFS mounting our central administrative filesystem (with only staff accounts) and rsync'ing it makes almost no difference to the resulting install. We tend to think of the two types of servers it creates as almost equivalent and mostly lump them together. So as far as operating our machines goes, we mostly have 'all NFS mounts' machines and 'only the administrative filesystem' machines, with a few rare machines that don't have anything (and our NFS fileservers, which are special in their own way).
(In the modern Linux world of systemd, much of our customizations aren't Ubuntu specific, or even specific to Debian and derived systems that use apt-get. We could probably switch to Debian relatively easily with only modest changes, and to an RPM based distribution with more work.)
We have lots of local customizations (and how we keep track of them)
In a comment on my entry on forgetting some of our local changes to our Ubuntu installs, pk left an interesting and useful comment on how they manage changes so that the changes are readily visible in one place. This is a very good idea and we do something similar to it, but a general limitation of all such approaches is that it's still hard to remember all of your changes off the top of your head once you've made enough of them. Once you're changing enough things, you generally can't put them all in one directory that you can simply 'ls' to be reminded of everything you change; at best, you're looking at a list of directories where you change things.
Our system for customizing Ubuntu stores the master version of customizations in our central administrative filesystem, although split across several places for convenience. We broadly have one directory hierarchy for Ubuntu release specific files (or at least ones that are potentially version specific; in practice a lot are the same between different Ubuntu releases), a second hierarchy (or two) for files that are generic across Ubuntu versions (or should be), and then a per-machine hierarchy for things specific to a single machine. Each hierarchy mirrors the final filesystem location, so that our systemd unit files will be in, for example, <hierarchy root>/etc/systemd/system.
Our current setup embeds the knowledge of what files will or won't be installed on any particular class of machines into the Ubuntu release specific 'postinstall' script that we run to customize machines, in the form of a whole bunch of shell commands to copy each of the files (or collections of files). This gives us straightforward handling of files that aren't always installed (or that vary between types of machines), at the cost of making it a little unclear whether a particular file in the master hierarchy will actually be installed. We could try to do something more clever, but it would be less obvious than the current straightforward approach, where the postinstall script has a lot of 'cp -a <src>/etc/<file> /etc/<file>' commands and it's easy to see what you need to do to add a file or handle one specially.
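To give a feel for the shape of this, the relevant part of a postinstall script looks something like the following; the paths, file names, and variable are hypothetical illustrations, not our real hierarchy:

    # Root of the (hypothetical) release-specific master hierarchy.
    src=/cs/adm/ubuntu-24.04
    # Files every machine gets.
    cp -a $src/etc/rsyslog.d/90-local.conf /etc/rsyslog.d/
    cp -a $src/etc/systemd/system/local-sync.service /etc/systemd/system/
    # Files only some machine types get; $machtype is set earlier in the script.
    if [ "$machtype" = "apps" ]; then
        cp -a $src/etc/security/access.conf /etc/security/
    fi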
(The obvious alternate approach would be to have a master file that listed all of the files to be installed on each type of machine. However, one advantage of the current approach is that it's easy to have various commentary about the files being installed and why, and it's also easy to run commands, install packages, and so on in between installing various files. We don't install them all at once.)
Based on some brute force approximation, it appears that we install around 100 customization files on a typical Ubuntu machine (we install more on some types of machines than on other types, depending on whether the machine will have all of our NFS mounts and whether or not it's a machine regular people will log in to). Specific machines can be significantly customized beyond this; for example, our ZFS fileservers get an additional scripted customization pass.
PS: The reason we have this stuff scripted and stored in a central filesystem is that we have over a hundred servers and a lot of them are basically identical to each other (most obviously, our SLURM nodes). In aggregate, we install and reinstall a fair number of machines and almost all of them have this common core.
Our local changes to standard (Ubuntu) installs are easy to forget
We have been progressively replacing a number of old one-off Linux machines with up to date replacements that run Ubuntu and so are based on our standard Ubuntu install. One of those machines has a special feature where a group of people are allowed to use passworded sudo to gain access to a common holding account. After we deployed the updated machine, these people got in touch with us to report that something had gone wrong with the sudo system. This was weird to me, because I'd made sure to faithfully replicate the old system's sudo customizations to the new one. When I did some testing, things got weirder; I discovered that sudo was demanding the root password instead of my password. This was definitely not how things were supposed to work for this sudo access (especially since the people with sudo access don't know the root password for the machine).
Whether or not sudo does this is controlled by the setting of 'rootpw' in sudoers or one of the files it includes (at least with Ubuntu's standard sudo configuration). The stock Ubuntu sudoers doesn't set 'rootpw', and of course this machine's sudoers customizations didn't set it either. But when I looked around, I discovered that we had long ago set up an /etc/sudoers.d customization file to set 'rootpw' and made it part of our standard Ubuntu install. When I rebuilt this machine based on our standard Ubuntu setup, the standard install stuff had installed this sudo customization. Since we'd long ago completely forgotten about its existence, I hadn't remembered it while customizing the machine to its new purpose, so it had stayed.
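For illustration, the forgotten customization amounts to a tiny /etc/sudoers.d file along these lines (the file name is hypothetical):

    # /etc/sudoers.d/zz-rootpw: make sudo ask for root's password, not the user's.
    Defaults rootpw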
(We don't normally use passworded sudo, and we definitely want access to root to require someone to know the special root password, not just the password to a sysadmin's account.)
There are probably a lot of things that we've added to our standard install over the years that are like this sudo customization. They exist to make things work (or not work), and as long as they keep quietly doing their jobs it's very easy to forget them and their effects. Then we do something exceptional on a machine and they crop up, whether it's preventing sudo from working like we want it to or almost giving us a recursive syslog server.
(I don't have any particular lesson to draw from this, except that it's surprisingly difficult to de-customize a machine. One might think the answer is to set up the machine from scratch outside our standard install framework, but the reality is that there's a lot from the standard framework that we still want on such machines. Even with issues like this, it's probably easier to install them normally and then fix the issues than do a completely stock Ubuntu server install.)
Some thoughts on why 'inetd activation' didn't catch on
Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them; it dates from the era of 4.3 BSD. In theory inetd can act as a service manager of sorts for daemons like the BSD r* commands, saving them from having to implement things like daemonization, and in fact it turns out that one version of this is how these daemons were run in 4.3 BSD. However, running daemons under inetd never really caught on (even in 4.3 BSD some important daemons ran outside of inetd), and these days it's basically dead. You could ask why, and I have some thoughts on that.
The initial version of inetd only officially supported running TCP services in a mode where each connection ran a new instance of the program (call this the CGI model). On the machines of the 1980s and 1990s, this wasn't a particularly attractive way to run anything but relatively small and simple programs (and ones that didn't have to do much work on startup). In theory you could possibly run TCP services in a mode where they were passed the server socket and then accepted new connections themselves for a while; in practice, no one seems to have really written daemons that supported this. Daemons that supported an 'inetd mode' generally meant the 'run a copy of the program for each connection' mode.
(Possibly some of them supported both modes of inetd operation, but system administrators would pretty much assume that if a daemon's documentation said just 'inetd mode' that it meant the CGI model.)
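For reference, classic inetd.conf entries look like this (the daemons and paths are illustrative); 'nowait' is the run-a-copy-per-connection (CGI model) mode, while 'wait' hands the daemon the socket itself:

    # service  socket  proto  wait/nowait  user  program              arguments
    ftp        stream  tcp    nowait       root  /usr/libexec/ftpd    ftpd -l
    comsat     dgram   udp    wait         root  /usr/libexec/comsat  comsat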
Another issue is that inetd is not a service manager. It will start things for you, but that's it; it won't shut down things for you (although you can get it to stop listening on a port), and it won't tell you what's running (you get to inspect the process list). On Unixes with a System V init system or something like it, running your daemons as standalone things gave you access to start, stop, restart, status, and so on service management options that might even work (depending on the quality of the init.d scripts involved). Since daemons had better usability when run as standalone services, system administrators and others had relatively little reason to push for inetd support, especially in the second mode.
In general, running any important daemon under inetd has many of the same downsides as systemd socket activation of services. As a practical matter, system administrators like to know that important daemons are up and running right away, and that they don't have some hidden issue that will cause them to fail to start just when you want them. The normal CGI-like inetd mode also means that any changes to configuration files and the like take effect right away, which may not be what you want; system administrators tend to like controlling when daemons restart with new configurations.
All of this is likely tied to what we could call 'cultural factors'. I suspect that authors of daemons perceived running standalone as the more serious and prestigious option, the one for serious daemons like named and sendmail, and inetd activation to be at most a secondary feature. If you wrote a daemon that only worked with inetd activation, you'd practically be proclaiming that you saw your program as a low importance thing. This obviously reinforces itself, to the degree that I'm surprised sshd even has an option to run under inetd.
(While some Linuxes are now using systemd socket activation for sshd, they aren't doing it via its '-i' option.)
PS: There are some services that do still generally run under inetd (or xinetd, often the modern replacement, cf). For example, I'm not sure if the Amanda backup system even has an option to run its daemons as standalone things.