Realizing we needed two sorts of alerts for our temperature monitoring
We have a long-standing system to monitor the temperatures of our machine rooms and alert us if there are problems. A recent discussion about the state of the temperature in one of them made me realize that we want to monitor and alert for two different problems, and because they're different, we need two different sorts of alerts in our monitoring system.
The first, obvious problem is a machine room AC failure, where the AC shuts off or becomes almost completely ineffective. In our machine rooms, an AC failure causes a rapid and sustained rise in temperature to well above its normal maximum level (which is typically reached just before the AC starts its next cooling cycle). AC failures are high priority issues that we want to alert about rapidly, because we don't have much time before machines start to cook themselves (and they probably won't shut themselves down before the damage has been done).
The second problem is an AC unit that can't keep up with the room's heat load; perhaps its filters are (too) clogged, or it's not getting enough cooling from the roof chillers, or various other mysterious AC reasons. The AC hasn't failed and it is still able to cool things to some degree and keep the temperature from racing up, but over time the room's temperature steadily drifts upward. Often the AC will still be cycling on and off to some degree and we'll see the room temperature vary up and down as a result; at other times the room temperature will basically reach a level and more or less stay there, presumably with the AC running continuously.
One issue we ran into is that a fast triggering alert that was implicitly written for the AC failure case can wind up flapping up and down if insufficient AC has caused the room to slowly drift close to its triggering temperature level. As the AC works (and perhaps cycles on and off), the room temperature will shift above and then back below the trigger level, and the alert flaps.
We can't detect both situations with a single alert, so we need at least two. Currently, the 'AC is not keeping up' alert looks for sustained elevated temperatures, with the temperature always at or above a certain level for (much) longer than the AC should take to bring it down, even allowing for the AC holding off for a bit so it doesn't cycle too fast. The 'AC may have failed' alert looks for high temperatures over a relatively short period of time, although we may want to make it trigger on a short-term average instead.
(The advantage of an average is that if the temperature is shooting up, it may trigger faster than a 'the temperature is above X for Y minutes' alert. The drawback is that an average can flap more readily than a 'must be above X for Y time' alert.)
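For concreteness, since our alerting runs through Prometheus, the split might look roughly like the following sketch. The metric name, thresholds, and durations here are invented for illustration; they're not our actual values.

groups:
  - name: machineroom-temperature
    rules:
      # 'AC may have failed': well above normal and sustained over a short
      # window, so it pages us quickly. A faster-firing variant would use a
      # short average instead, e.g.
      #   expr: avg_over_time(machineroom_temp_celsius[5m]) > 30
      # which can trigger sooner on a rapid rise but flaps more readily.
      - alert: MachineRoomACFailed
        expr: machineroom_temp_celsius > 30
        for: 5m
        labels:
          severity: page
      # 'AC is not keeping up': merely elevated, but it has stayed at or
      # above that level for much longer than one AC cooling cycle.
      - alert: MachineRoomACNotKeepingUp
        expr: machineroom_temp_celsius >= 26
        for: 2h
        labels:
          severity: warning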
Checklists are hard (but still a good thing)
We recently had a big downtime at work where part of the work was me doing a relatively complex and touchy thing. Naturally I made a checklist, but also naturally my checklist turned out to be incomplete, with some things I'd forgotten and some steps that weren't quite right or complete. This is a good illustration that checklists are hard to create.
Checklists are hard partly because they require us to try to remember, reconstruct, and understand everything in what's often a relatively complex system that is too big for us to hold in our mind. If your understanding is incomplete you can overlook something and so leave out a step or a part of a step, and even if you write down a step you may not fully remember (and record) why the step has to be there. My view is that this is especially likely in system administration where we may have any number of things that have been quietly sitting in the corner for some time, working away without problems, and so they've slipped out of our minds.
(For example, one of the issues that we ran into in this downtime was not remembering all of the hosts that ran crontab jobs that used one particular filesystem. Of course we thought we did know, so we didn't try to systematically look for such crontab jobs.)
To get a really solid checklist you have to be able to test it, much like all documentation needs testing. Unfortunately, a lot of the checklists I write (or don't write) are for one-off things that we can't really test in advance for various reasons, for example because they involve a large scale change to our live systems (that requires a downtime). If you're lucky you'll realize that you don't know something or aren't confident in something while writing the checklist, so you can investigate it and hopefully get it right, but some of the time you'll be confident you understand the problem but you're wrong.
Despite any imperfections, checklists are still a good thing. An imperfect, written-down checklist is better than relying on your memory and improvising on the fly almost all of the time (the rare exceptions are when you wouldn't dare do the operation at all without a checklist, but an imperfect checklist tempts you into doing it and fumbling).
(You can try to improve the situation by keeping notes on what was missed in the checklist and then saving or publishing these notes somewhere. You can review these after-the-fact notes if you have to do the thing again, or look through them for the types of things you tend to overlook and should specifically check for the next time you're making a checklist that touches on the same area.)
People still use our old-fashioned Unix login servers
Every so often I think about random things, and today's random thing was how our environment might look if it was rebuilt from scratch as a modern style greenfield development. One of the obvious assumptions is that it'd involve a lot of use of containers, which led me to wondering how you handle traditional Unix style login servers. This is a relevant issue for us because we have such traditional login servers and somewhat to our surprise, they still see plenty of use.
We have two sorts of login servers. There's effectively one general purpose login server that people aren't supposed to do heavy duty computation on (and which uses per-user CPU and RAM limits to help with that), and four 'compute' login servers where they can go wild and use up all of the CPUs and memory they can get their hands on (with no guarantees that there will be any, those machines are basically first come, first served; for guaranteed CPUs and RAM people need to use our SLURM cluster). Usage of these servers has declined over time, but they still see a reasonable amount of use, including by people who have only recently joined the department (as graduate students or otherwise).
What people log in to our compute servers to do probably hasn't changed much, at least in one sense; people probably don't log in to a compute server to read their mail with their favorite text mode mail reader (yes, we have Alpine and Mutt users). What people use the general purpose 'application' login server for likely has changed a fair bit over time. It used to be that people logged in to run editors, mail readers, and other text and terminal based programs. However, now a lot of logins seem to be done either to SSH to other machines that aren't accessible from the outside world or to run the back-ends of various development environments like VSCode. Some people still use the general purpose login server for traditional Unix login things (me included), but I think it's rarer these days.
(Another use of both sorts of servers is to run cron jobs; various people have various cron jobs on one or the other of our login servers. We have to carefully preserve them when we reinstall these machines as part of upgrading Ubuntu releases.)
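As a sketch of what 'carefully preserve' can involve, assuming Ubuntu's standard location for per-user crontabs and an entirely hypothetical destination path, the save step before a reinstall is roughly:

#!/bin/sh
# Save every user's crontab on this machine before a reinstall.
# On Ubuntu (and Debian), per-user crontabs live in /var/spool/cron/crontabs.
host=$(hostname -s)
dest=/some/hypothetical/safe/place
tar -C /var/spool/cron -czf "$dest/crontabs-$host.tar.gz" crontabs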
PS: I believe the reason people run IDE backends on our login servers is because they have their code on our fileservers, in their (NFS-mounted) home directories. And in turn I suspect people put the code there partly because they're going to run the code on either or both of our SLURM cluster or the general compute servers. But in general we're not well informed about what people are using our login servers for due to our support model.
What OSes we use here (as of July 2025)
About five years ago I wrote an entry on what OSes we were using at the time. Five years is both a short time and a long time here, and in that time some things have changed.
Our primary OS is still Ubuntu LTS; it's our default and we use it on almost everything. On the one hand, these days 'almost everything' covers somewhat more ground than it did in 2020, as some machines have moved from OpenBSD to Ubuntu. On the other hand, as time goes by I'm less and less confident that we'll still be using Ubuntu in five years, because I expect Canonical to start making (more) unfortunate and unacceptable changes any day now. Our most likely replacement Linux is Debian.
CentOS is dead here, killed by a combination of our desire to not have two Linux variants to deal with and CentOS Stream. We got rid of the last of our CentOS machines last year. Conveniently, our previous commercial anti-spam system vendor effectively got out of the business so we didn't have to find a new Unix that they supported.
We're still using OpenBSD, but it's increasingly looking like a legacy OS that's going to be replaced by FreeBSD as we rebuild the various machines that currently run OpenBSD. Our primary interests are better firewall performance and painless mirrored root disks, but if we're going to run some FreeBSD machines and it can do everything OpenBSD can, we'd like to run fewer Unixes so we'll probably replace all of the OpenBSD machines with FreeBSD ones over time. This is a shift in progress and we'll see how far it goes, but I don't expect the number of OpenBSD machines we run to go up any more; instead it's a question of how far down the number goes.
(Our opinions about not using Linux for firewalls haven't changed. We like PF, it's just we like FreeBSD as a host for it more than OpenBSD.)
We continue to not use containers so we don't have to think about a separate, minimal Linux for container images.
There are a lot of research groups here and they run a lot of machines, so research group machines are most likely running a wide assortment of Linuxes and Unixes. We know that Ubuntu (both LTS and non-LTS) is reasonably popular among research groups, but I'm sure there are people with other distributions and probably some use of FreeBSD, OpenBSD, and so on. I believe there may be a few people still using Solaris machines.
(My office desktop continues to run Fedora, but I wouldn't run it on any production server due to the frequent distribution version updates. We don't want to be upgrading distribution versions every six months.)
Overall I'd say we've become a bit more of an Ubuntu LTS monoculture than we were before, but it's not a big change, partly because we were already mostly Ubuntu. Given our views on things like firewalls, we're probably never going to be all-Ubuntu or all-Linux.
The easiest way to interact with programs is to run them in terminals
I recently wrote about a new little script of mine, which I use to start programs in terminals in a way that lets me interact with them (to simplify things a bit). Much of what I start with this tool doesn't need to run in a terminal window at all; the actual program will talk directly to the X server or arrange to talk to my Firefox or the like. I could in theory start them directly from my X session startup script, as I do with other things.
The reason I haven't put these things in my X session startup is that running things in shell sessions in terminal windows is the easiest way to interact with them in all sorts of ways. It's trivial to stop the program or restart it, to look at its output, to rerun it with slightly different arguments if I need to, it automatically inherits various aspects of my current X environment, and so on. You can do all of these things with programs in ways other than using shell sessions in terminals, but it's generally going to be more awkward.
(For instance, on systemd based Linuxes, I could make some of these programs into systemd user services, but I'd still have to use systemd commands to manipulate them. If I run them as standalone programs started from my X session script, it's even more work to stop them, start them again, and so on.)
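As an illustration of that extra indirection, a systemd user service for one of these programs might look like the following sketch (the program name and path are made up):

# ~/.config/systemd/user/urlforwarder.service -- hypothetical name
[Unit]
Description=Open URLs forwarded from remote machines

[Service]
# %h is the user's home directory
ExecStart=%h/bin/urlforwarder
Restart=on-failure

[Install]
WantedBy=default.target

Managing it is then 'systemctl --user enable --now urlforwarder', 'systemctl --user restart urlforwarder', and 'journalctl --user -u urlforwarder' to see its output, all of which is more typing and more to remember than hitting Ctrl-C or scrolling back in a terminal window.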
For well established programs that I expect to never need to restart or look at output from, I'll run them from my X session startup script. But new programs, like these, get to spend a while in terminal windows because that's the easiest way. And some will be permanent terminal window occupants because they sometimes produce (text) output.
On the one hand, using terminal windows for this is simple and effective, and I could probably make it better by using a multi-tabbed terminal program, with one tab for each program (or the equivalent in a regular terminal program with screen or tmux). On the other hand, it feels a bit sad that in 2025, our best approach for flexible interaction with a program and monitoring its output is 'put it in a terminal'.
(It's also irritating that with some programs, the easiest and best way to make sure that they really exit when you want them to shut down, rather than "helpfully" lingering on in various ways, is to run them from a terminal and then Ctrl-C them when you're done with them. I have to use a certain video conferencing application that is quite eager to stay running if you tell it to 'quit', and this is my solution to it. Someday I may have to figure out how to put it in a systemd user unit so that it can't stage some sort of great escape into the background.)
On sysadmins (not) changing (OpenSSL) cipher suite strings
Recently I read Apps shouldn't let users enter OpenSSL cipher-suite strings by Frank Denis (via), which advocates for providing at most a high level interface to people, one that lets them express intentions like 'forward secrecy is required' or 'I have to comply with FIPS 140-3'. As a system administrator, I've certainly been guilty of not keeping OpenSSL cipher suite strings up to date, so I have a good deal of sympathy for the general view of trusting the clients and the libraries (and also possibly the servers). But at the same time, I think that this approach has some issues. In particular, if you're only going to set generic intents, you have to trust that the programs and libraries have good defaults. Unfortunately, historically the time when system administrators have most often reached for setting specific OpenSSL cipher suite strings was when something came up all of a sudden and they didn't trust the library or program defaults to be up to date.
The obvious conclusion is that an application or library that wants people to only set high level options needs to commit to agility and fast updates so that it always has good defaults. This needs more than just the upstream developers making prompt updates when issues come up, because in practice a lot of people will get the program or library through their distribution or other packaging mechanism. A library that really wants people to trust it here needs to work with distributions to make sure that this sort of update can rapidly flow through, even for older distribution versions with older versions of the library and so on.
(For obvious reasons, people are generally pretty reluctant to touch TLS libraries and would like to do it as little as possible, leaving it to specialists and even then as much as possible to the upstream. Bad things can and have happened here.)
If I was doing this for a library, I would be tempted to give the library two sets of configuration files. One set, the official public set, would be the high level configuration that system administrators were supposed to use to express high level intents, as covered by Frank Denis. The other set would be internal configuration that expressed all of those low level details about cipher suite preferences, what cipher suites to use when, and so on, and was for use by the library developers and people packaging and distributing the library. The goal is to make it so that emergency cipher changes can be shipped as relatively low risk and easily backported internal configuration file changes, rather than higher risk (and thus slower to update) code changes. In an environment with reproducible binary builds, it'd be ideal if you could rebuild the library package with only the configuration files changed and get library shared objects and so on that were binary identical to the previous versions, so distributions could have quite high confidence in newly-built updates.
(System administrators who opted to edit this second set of files themselves would be on their own. In packaging systems like RPM and Debian .debs, I wouldn't even have these files marked as 'configuration files'.)
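To make the idea concrete, here's a purely hypothetical sketch of the two layers; the file names, syntax, and specific cipher suites are all invented for illustration.

# Public, for system administrators: high level intents only.
# /etc/tlslib/policy.conf (hypothetical)
forward-secrecy = required
compliance      = none        # or 'fips-140-3'
minimum-tls     = 1.2

# Internal, for the library's developers and packagers: the low level
# details that an emergency update would actually change.
# /usr/lib/tlslib/ciphers.internal (hypothetical)
tls12-suites = ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256
tls13-suites = TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256

An emergency change to drop a cipher suite then touches only the second file, which is the sort of small, reviewable update a distribution can ship quickly.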
A new little shell script to improve my desktop environment
Recently on the Fediverse I posted a puzzle about a little shell script:
A silly little Unix shell thing that I've vaguely wanted for ages but only put together today. See if you can guess what it's for:
#!/bin/sh
trap 'exec $SHELL' 2
"$@"
exec $SHELL
(The use of this is pretty obscure and is due to my eccentric X environment.)
The actual version I now use wound up slightly more complicated, and I call it 'thenshell'. What it does (as suggested by the name) is to run something and then, after the thing either exits or is Ctrl-C'd, run a shell. This is pointless in normal circumstances but becomes very relevant if you use this as the command for a terminal window to run instead of your shell, as in 'xterm -e thenshell <something>'.
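Commented for clarity, the puzzle version above amounts to this (my actual thenshell is, as mentioned, slightly more complicated):

#!/bin/sh
# On SIGINT (Ctrl-C), replace this script with an interactive shell
# instead of dying.
trap 'exec $SHELL' 2
# Run whatever command and arguments we were given.
"$@"
# If the command exits on its own, also replace ourselves with a shell,
# so the terminal window sticks around for inspection and restarting.
exec $SHELL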
Over time, I've accumulated a number of things I want to run in my eccentric desktop environment, such as my system for opening URLs from remote machines and my alert monitoring. But some of the time I want to stop and restart these (or I need to restart them), and in general I want to notice if they produce some output, so I've been running them in terminal windows. Up until now I've had to manually start a terminal and run these programs each time I restart my desktop environment, which is annoying and sometimes I forget to do it for something. My new 'thenshell' shell script handles this; it runs whatever and then if it's interrupted or exits, starts a shell so I can see things, restart the program, or whatever.
Thenshell isn't quite a perfect duplicate of the manual version. One obvious limitation is that it doesn't put the command into the shell's command history, so I can't just cursor-up and hit return to restart it. But this is a small thing compared to having all of these things automatically started for me.
(Actually, I think I might be able to get this into a version of thenshell that knows exactly how my shell and my environment handle history, but it would be more than a bit of a hack. I may still try it, partly because it would be nifty.)
My pragmatic view on virtual screens versus window groups
I recently read z3bra's 2014 Avoid workspaces (via) which starts out with the tag "Virtual desktops considered harmful". At one level I don't disagree with z3bra's conclusion that you probably want flexible groupings of windows, and I also (mostly) don't use single-purpose virtual screens. But I do it another way, which I think is easier than z3bra's (2014) approach.
I've written about how I use virtual screens in my desktop environment, although a bit of that is now out of date. The short summary is that I mostly have a main virtual screen and then 'overflow' virtual screens where I move to if I need to do something else without cleaning up the main virtual screen (as a system administrator, I can be quite interrupt-driven or working on more than one thing at once). This sounds a lot like window groups, and I'm sure I could do it with them in another window manager. The advantage to me of fvwm's virtual screens is that it's very easy to move windows from one to another.
If I start a window in one virtual screen, for what I think is going to be one purpose, and it turns out that I need it for another purpose too, on another virtual screen, I don't have to fiddle around with, say, adding or changing its tags. Instead I can simply grab it and move it to the new virtual screen (or, for terminal windows and some others, iconify them on one screen, switch screens, and deiconify them). This makes it fast, fluid, and convenient to shuffle things around, especially for windows where I can do this by iconifying and deiconifying them.
This is somewhat specific to (fvwm's idea of) virtual screens, where the screens have a spatial relationship to each other and you can grab windows and move them around to change their virtual screen (either directly or through FvwmPager). In particular, I don't have to switch between virtual screens to drag a window on to my current one; I can grab it in a couple of ways and yank it to where I am now.
In other words, it's the direct manipulation of window grouping that makes this work so nicely. Unfortunately I'm not sure how to get direct manipulation of currently not visible windows without something like virtual screens or virtual desktops. You could have a 'show all windows' feature, but that still requires bouncing between that all-windows view (to tag in new windows) and your regular view. Maybe that would work fluidly enough, especially with today's fast graphics.
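For people unfamiliar with fvwm, the configuration side of this is small; a minimal sketch (not my actual configuration) looks something like:

# A 3x1 grid of 'virtual screens' (fvwm pages) on a single desk.
DeskTopSize 3x1
# Pushing the pointer against a screen edge slides over to the next page.
EdgeScroll 100 100
# FvwmPager shows a miniature of all the pages; windows can be dragged
# between pages directly in it.
AddToFunc StartFunction
+ I Module FvwmPager 0 0

Everything else is just grabbing windows and moving them around, with no extra window manager machinery required.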
Potential issues in running your own identity provider
Over on the Fediverse, Simon Tatham had a comment about (using) cloud identity providers that's sparked some discussion. Yesterday I wrote about the facets of identity providers. Today I'm sort of writing about why you might not want to run your own identity provider, despite the hazards of depending on the security of some outside third party. I'll do this by talking about what I see as being involved in the whole thing.
The hardcore option is to rely on no outside services at all, not even for multi-factor authentication. This pretty much reduces your choices for MFA down to TOTP and perhaps WebAuthn, either with devices or with hardware keys. And of course you're going to have to manage all aspects of your MFA yourself. I'm not sure if there's capable open source software here that will let people enroll multiple second factors, handle invalidating one, and so on.
One facet of being an identity provider is managing identities. There's a wide variety of ways to do this; there's Unix accounts, LDAP databases, and so on. But you need a central system for it, one that's flexible enough to cope with the real world, and that system is load bearing and security sensitive. You will need to keep it secure and you'll want to keep logs and audit records, and also backups so you can restore things if it explodes (or go all the way to redundant systems for this). If the identity service holds what's considered 'personal information' in various jurisdictions, you'll need to worry about an attacker being able to bulk-extract that information, and you'll need to build enough audit trails so you can tell to what extent that happened. Your identity system will need to be connected to other systems in your organization so it knows when people appear and disappear and can react appropriately; this can be complex and may require downstream integrations with other systems (either yours or third parties) to push updates to them.
Obviously you have to handle primary authentication yourself (usually through passwords). This requires you to build and operate a secure password store as well as a way of using it for authentication, either through existing technology like LDAP or something else (this may or may not be integrated with your identity service software, as passwords are often considered part of the identity). Like the identity service but more so, this system will need logs and audit trails so you can find out when and how people authenticated to it. The log and audit information emitted by open source software may not always meet your needs, in which case you may wind up doing some hacks. Depending on how exposed this primary authentication service is, it may need its own ratelimiting and alerting on signs of potential compromised accounts or (brute force) attacks. You will also definitely want to consider reacting in some way to accounts that pass primary authentication but then fail second-factor authentication.
Finally, you will need to operate the 'identity provider' portion of things, which will probably do either or both of OIDC and SAML (but maybe you (also) need Kerberos, or Active Directory, or other things). You will have to obtain the software for this, keep it up to date, worry about its security and the security of the system or systems it runs on, make sure it has logs and audit trails that you capture, and ideally make sure it has ratelimits and other things that monitor for and react to signs of attacks, because it's likely to be a fairly exposed system.
If you're a sufficiently big organization, some or all of these services probably need to be redundant, running on multiple servers (perhaps in multiple locations) so the failure of a single server doesn't lock you out of everything. In general, all of these expose you to all of the complexities of running your own servers and services, and each and all of them are load bearing and highly security sensitive, which probably means that you should be actively paying attention to them more or less all of the time.
If you're lucky you can find suitable all-in-one software that will handle all the facets you need (identity, primary authentication, OIDC/SAML/etc IdP, and perhaps MFA authentication) in a way that works for you and your organization. If not, you're going to have to integrate various different pieces of software, possibly leaving you with quite a custom tangle (this is our situation). The all in one software generally seems to have a reputation of being pretty complex to set up and operate, which is not surprising given how much ground it needs to cover (and how many protocols it may need to support to interoperate with other systems that want to either push data to it or pull data and authentication from it). Since such software becomes an all-consuming owner of identity and authentication, my impression is that it's also hard to add to an existing environment after the fact and hard to swap out for anything else.
(So when you pick an all in one open source software for this, you really have to hope that it stays good, reliable software for many years to come. This may mean you need to build up a lot of expertise before you commit so that you really understand your choices, and perhaps even do pilot projects to 'kick the tires' on candidate software. The modular DIY approach is more work but it's potentially easier to swap out the pieces as you learn more and your needs change.)
The obvious advantage of a good cloud identity provider is that they've already built all of these systems and they have the expertise and infrastructure to operate them well. Much like other cloud services, you can treat them as a (reliable) black box that just works. Because the cloud identity provider works at a much bigger scale than you do, they can also afford to invest a lot more into security and monitoring, and they have a lot more visibility into how attackers work and so on. In many organizations, especially smaller ones, looking after your own identity provider is a part time job for a small handful of technical people. In a cloud identity provider, it is the full time job of a bunch of developers, operations, and security specialists.
(This is much like the situation with email (also). The scale at which cloud providers operate dwarfs what you can manage. However, your identity provider is probably more security sensitive and the quality difference between doing it yourself and using a cloud identity provider may not be as large as it is with email.)
Thinking about facets of (cloud) identity providers
Over on the Fediverse, Simon Tatham had a comment about cloud identity providers, and this sparked some thoughts of my own. One of my thoughts is that in today's world, a sufficiently large organization may have a number of facets to its identity provider situation (which is certainly the case for my institution). Breaking up identity provision into multiple facets can make it unclear whether, and to what extent, you could be said to be using a 'cloud identity provider'.
First off, you may outsource 'multi-factor authentication', which is to say your additional factor, to a specialist SaaS provider who can handle the complexities of modern MFA options, such as phone apps for push-based authentication approval. This SaaS provider can turn off your ability to authenticate, but they probably can't authenticate as a person all by themselves because you 'own' the first factor authentication. Well, unless you have situations where people only authenticate via their additional factor and so your password or other first factor authentication is bypassed.
Next is the potential distinction between an identity provider and an authentication source. The identity provider implements things like OIDC and SAML, and you may have to use a big one in order to get MFA support for things like IMAP. However, the identity provider can delegate authenticating people to something else you run using some technology (which might be OIDC or SAML but also could be something else). In some cases this delegation can be quite visible to people authenticating; they will show up to the cloud identity provider, enter their email address, and wind up on your web-based single sign on system. You can even have multiple identity providers all working from the same authentication source. The obvious exposure here is that a compromised identity provider can manufacture attested identities that never passed through your authentication source.
Along with authentication, someone needs to be (or at least should be) the 'system of record' as to what people actually exist within your organization, what relevant information you know about them, and so on. Your outsourced MFA SaaS and your (cloud) identity providers will probably have their own copies of this data where you push updates to them. Depending on how systems consume the IdP information and what other data sources they check (eg, if they check back in with your system of record), a compromised identity provider could invent new people in your organization out of thin air, or alter the attributes of existing people.
(Small IdP systems often delegate both password validation and knowing who exists and what attributes they have to other systems, like LDAP servers. One practical difference is whether the identity provider system asks you for the password or whether it sends you to something else for that.)
If you have no in-house authentication or 'who exists' identity system and you've offloaded all of these to some external provider (or several external providers that you keep in sync somehow), you're clearly at the mercy of that cloud identity provider. Otherwise, it's less clear and a lot more situational as to when you could be said to be using a cloud identity provider and thus how exposed you are. I think one useful line to look at is whether a particular identity provider is used by third party services or only for that provider's own services. Or to put it in concrete terms, as an example, do you use Github identities only as part of using Github, or do you authenticate other things through your Github identities?
(With that said, the blast radius of just a Github (identity) compromise might be substantial, or similarly for Google, Microsoft, or whatever large provider of lots of different services that you use.)
I have divided (and partly uninformed) views on OpenTelemetry
OpenTelemetry ('OTel') is one of the current in things in the broad metrics and monitoring space. As I understand it, it's fundamentally a set of standards (ie, specifications) for how things can emit metrics, logs, and traces; the intended purpose is (presumably) so that people writing programs can stop having to decide if they expose Prometheus format metrics, or Influx format metrics, or statsd format metrics, or so on. They expose one standard format, OpenTelemetry, and then everything (theoretically) can consume it. All of this has come on to my radar because Prometheus can increasingly ingest OpenTelemetry format metrics and we make significant use of Prometheus.
If OpenTelemetry is just another metrics format that things will produce and Prometheus will consume just as it consumes Prometheus format metrics today, that seems perfectly okay. I'm pretty indifferent to the metrics formats involved, presuming that they're straightforward to generate and I never have to drop everything and convert all of our things that generate (Prometheus format) metrics to generating OpenTelemetry metrics. This would be especially hard because OpenTelemetry seems to require either Protobuf or (complex) JSON, while the Prometheus metrics format is simple text.
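For comparison, the Prometheus text format for exposing metrics is just lines like these (the first is in the general style of node_exporter's output, the second is a hypothetical metric):

# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.52
# HELP machineroom_temp_celsius Machine room temperature (hypothetical).
# TYPE machineroom_temp_celsius gauge
machineroom_temp_celsius{room="main"} 24.5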
However, this is where I start getting twitchy. OpenTelemetry certainly gives off the air of being a complex ecosystem, and on top of that it also seems to be an application focused ecosystem, not a system focused one. I don't think that metrics are as highly regarded in application focused ecosystems as logs and traces are, while we care a lot about metrics and not very much about the others, at least in an OpenTelemetry context. To the extent that OpenTelemetry diverts people away from producing simple, easy to use and consume metrics, I'm going to wind up being unhappy with it. If what 'OpenTelemetry support' turns out to mean in practice is that more and more things have minimal metrics but lots of logs and traces, that will be a loss for us.
Or to put it another way, I worry that an application focused OpenTelemetry will pull the air away from the metrics focused things that I care about. I don't know how realistic this worry is. Hopefully it's not.
(Partly I'm underinformed about OpenTelemetry because, as mentioned, I often feel disconnected from the mainstream of 'observability', so I don't particularly try to keep up with it.)
Things are different between system and application monitoring
We mostly run systems, not applications, due to our generally different system administration environment. Many organizations instead run applications. Although these applications may be hosted on some number of systems, the organizations don't care about the systems, not really; they care about how the applications work (and the systems only potentially matter if the applications have problems). It's my increasing feeling that this has created differences in the general field of monitoring such systems (as well as alerting), which is a potential issue for us because most of the attention is focused on the application area of things.
When you run your own applications, you get to give them all of the 'three pillars of observability' (metrics, traces, and logs, see here for example). In fact, emitting logs is sort of the default state of affairs for applications, and you may have to go out of your way to add metrics (my understanding is that traces can be easier). Some people even process logs to generate metrics, something that's supported by various log ingestion pipelines these days. And generally you can send your monitoring output to wherever you want, in whatever format you want, and often you can do things like structuring them.
When what you run is systems, life is a lot different. Your typical Unix system will most easily provide low level metrics about things. To the extent that the kernel and standard applications emit logs, these logs come in a variety of formats that are generally beyond your control and are generally emitted to only a few places, and the overall logs of what's happening on the system are often extremely incomplete (partly because 'what's happening on the system' is a very high volume thing). You can basically forget about having traces. In the modern Linux world of eBPF it's possible to do better if you try hard, but you'll probably be building custom tooling for your extra logs and traces so they'd better be sufficiently important (and you need the relevant expertise, which may include reading kernel and program source code).
The result is that for people like us who run systems, our first stop for monitoring is metrics and they're what we care most about; our overall unstructured logs are at best a secondary thing, and tracing some form of activity is likely to be something done only to troubleshoot problems. Meanwhile, my strong impression is that application people focus on logs and if they have them, traces, with metrics only a distant and much less important third (especially in the actual applications, since metrics can be produced by third party tools from their logs).
(This is part of why I'm so relatively indifferent to smart log searching systems. Our central syslog server is less about searching logs and much more about preserving them in one place for investigations.)
Our Grafana and Loki installs have quietly become 'legacy software' here
At this point we've been running Grafana for quite some time (since late 2018), and (Grafana) Loki for rather less time and on a more ad-hoc and experimental basis. However, over time both have become 'legacy software' here, by which I mean that we (I) have frozen their versions and don't update them any more, and we (I) mostly or entirely don't touch their configurations any more (including, with Grafana, building or changing dashboards).
We froze our Grafana version due to backward compatibility issues. With Loki I could say that I ran out of enthusiasm for going through updates, but part of it was that Loki explicitly deprecated 'promtail' in favour of a more complex solution ('Alloy') that seemed to mostly neglect the one promtail feature we seriously cared about, namely reading logs from the systemd/journald complex. Another factor was it became increasingly obvious that Loki was not intended for our simple setup and future versions of Loki might well work even worse in it than our current version does.
Part of Grafana and Loki going without updates and becoming 'legacy' is that any future changes in them would be big changes. If we ever have to update our Grafana version, we'll likely have to rebuild a significant number of our current dashboards, because they use panels that aren't supported any more and the replacements have a quite different look and effect, requiring substantial dashboard changes for the dashboards to stay decently usable. With Loki, if the current version stopped working I'd probably either discard the idea entirely (which would make me a bit sad, as I've done useful things through Loki) or switch to something else that had similar functionality. Trying to navigate the rapids of updating to a current Loki is probably roughly as much work (and has roughly as much chance of requiring me to restart our log collection from scratch) as moving to another project.
(People keep mentioning VictoriaLogs (and I know people have had good experiences with it), but my motivation for touching any part of our Loki environment is very low. It works, it hasn't eaten the server it's on and shows no sign of doing that any time soon, and I'm disinclined to do any more work with smart log collection until a clear need shows up. Our canonical source of history for logs continues to be our central syslog server.)
The five platforms we have to cover when planning systems
Suppose, not entirely hypothetically, that you're going to need a 'VPN' system that authenticates through OIDC. What platforms do you need this VPN system to support? In our environment, the answer is that we have five platforms that we need to care about, and they're the obvious four plus one more: Windows, macOS, iOS, Android, and Linux.
We need to cover these five platforms because people here use our services from all of those platforms. Both Windows and macOS are popular on laptops (and desktops, which still linger around), and there are enough people who use Linux for it to be something we need to care about. On mobile devices (phones and tablets), obviously iOS and Android are the two big options, with people using either or both. We don't usually worry about the versions of Windows and macOS and suggest that people stick to supported ones, but that may need to change with Windows 10.
Needing to support mobile devices unquestionably narrows our options for what we can use, at least in theory, because there are certain sorts of things you can semi-reasonably do on Linux, macOS, and Windows that are infeasible to do (at least for us) on mobile devices. But we have to support access to various of our services even on iOS and Android, which constrains us to certain sorts of solutions, and ideally ones that can deal with network interruptions (which are quite common on mobile devices in Toronto, as anyone who takes our subways is familiar with).
(And obviously it's easier for open source systems to support Linux, macOS, and Windows than it is for them to extend this support to Android and especially iOS. This extends to us patching and rebuilding them for local needs; with various modern languages, we can produce Windows or macOS binaries from modified open source projects. Not so much for mobile devices.)
In an ideal world it would be easy to find out the support matrix of platforms (and features) for any given project. In this world, the information can sometimes be obscure, especially for what features are supported on what platforms. One of my resolutions to myself is that when I find interesting projects but they seem to have platform limitations, I should note down where in their documentation they discuss this, so I can find it later to see if things have changed (or to discuss with people why certain projects might be troublesome).
Two broad approaches to having Multi-Factor Authentication everywhere
In this modern age, more and more people are facing more and more pressure to have pervasive Multi-Factor Authentication, with every authentication your people perform protected by MFA in some way. I've come to feel that there are two broad approaches to achieving this and one of them is more realistic than the other, although it's also less appealing in some ways and less neat (and arguably less secure).
The 'proper' way to protect everything with MFA is to separately and individually add MFA to everything you have that does authentication. Ideally you will have a central 'single sign on' system, perhaps using OIDC, and certainly your people will want you to have only one form of MFA even if it's not all run through your SSO. What this implies is that you need to add MFA to every service and protocol you have, which ranges from generally easy (websites) through being annoying to people or requiring odd things (SSH) to almost impossible at the moment (IMAP, authenticated SMTP, and POP3). If you opt to set it up with no exemptions for internal access, this approach to MFA ensures that absolutely everything is MFA protected without any holes through which an un-MFA'd authentication can be done.
The other way is to create some form of MFA-protected network access (a VPN, a mesh network, a MFA-authenticated SSH jumphost, there are many options) and then restrict all non-MFA access to coming through this MFA-protected network access. For services where it's easy enough, you might support additional MFA authenticated access from outside your special network. For other services where MFA isn't easy or isn't feasible, they're only accessible from the MFA-protected environment and a necessary step for getting access to them is to bring up your MFA-protected connection. This approach to MFA has the obvious problem that if someone gets access to your MFA-protected network, they have non-MFA access to everything else, and the not as obvious problem that attackers might be able to MFA as one person to the network access and then do non-MFA authentication as another person on your systems and services.
The proper way is quite appealing to system administrators. It gives us an array of interesting challenges to solve, neat technology to poke at, and appealingly strong security guarantees. Unfortunately the proper way has two downsides; there's essentially no chance of it covering your IMAP and authenticated SMTP services any time soon (unless you're willing to accept some significant restrictions), and it requires your people to learn and use a bewildering variety of special purpose, one-off interfaces and sometimes software (and when it needs software, there may be restrictions on what platforms the software is readily available on). Although it's less neat and less nominally secure, the practical advantage of the MFA protected network access approach is that it's universal and it's one single thing for people to deal with (and by extension, as long as the network system itself covers all platforms you care about, your services are fully accessible from all platforms).
(In practice the MFA protected network approach will probably be two things for people to deal with, not one, since if you have websites the natural way to protect them is with OIDC (or if you have to, SAML) through your single sign on system. Hopefully your SSO system is also what's being used for the MFA network access, so people only have to sign on to it once a day or whatever.)
Our need for re-provisioning support in mesh networks (and elsewhere)
In a comment on my entry on how WireGuard mesh networks need a provisioning system, vcarceler pointed me to Innernet (also), an interesting but opinionated provisioning system for WireGuard. However, two bits of it combined made me twitch a bit; Innernet only allows you to provision a given node once, and once a node is assigned an internal IP, that IP is never reused. This lack of support for re-provisioning machines would be a problem for us and we'd likely have to do something about it, one way or another. Nor is this an issue unique to Innernet, as a number of mesh network systems have it.
Our important servers have fixed, durable identities, and in practice these identities are both DNS names and IP addresses (we have some generic machines, but they aren't as important). We also regularly re-provision these servers, which is to say that we reinstall them from scratch, usually on new hardware. In the usual course of events this happens roughly every two years or every four years, depending on whether we're upgrading the machine for every Ubuntu LTS release or every other one. Over time this is a lot of re-provisionings, and we need the re-provisioned servers to keep their 'identity' when this happens.
We especially need to be able to rebuild a dead server as an identical replacement if its hardware completely breaks and eats its system disks. We're already in a crisis at that point; we don't want a worse crisis because other things need to be updated too, since we can't exactly replace the server and instead have to build a new server that fills the same role, or will once DNS is updated, configurations are updated, and so on.
This is relatively straightforward for regular Linux servers with regular networking; there's the issue of SSH host keys, but there's several solutions. But obviously there's a problem if the server is also a mesh network node and the mesh network system will not let it be re-provisioned under the same name or the same internal IP address. Accepting this limitation would make it difficult to use the mesh network for some things, especially things where we don't want to depend on DNS working (for example, sending system logs via syslog). Working around the limitation requires reverse engineering where the mesh network system stores local state and hopefully being able to save a copy elsewhere and restore it; among other things, this has implications for the mesh network system's security model.
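As an aside, for ordinary servers one of those SSH host key solutions is simply to save the keys before the reinstall and put them back afterward. A rough sketch, with a hypothetical destination path:

# before the reinstall (GNU tar strips the leading / on create):
tar -czf /some/hypothetical/place/hostkeys-$(hostname -s).tar.gz /etc/ssh/ssh_host_*
# on the rebuilt machine:
tar -C / -xzf /some/hypothetical/place/hostkeys-$(hostname -s).tar.gz
systemctl restart ssh

The point is that this is durable state we can capture and restore ourselves, which is exactly what's uncertain with a mesh network system's local state.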
For us, it would be better if mesh networking systems explicitly allowed this re-provisioning. They could make it a non-default setting that took explicit manual action on the part of the network administrator (and possibly required nodes to cooperate and extend more trust than normal to the central provisioning system). Or a system like Innernet could have a separate class of IP addresses, call them 'service addresses', that could be assigned and reassigned to nodes by administrators. A node would always have its unique identity but could also be assigned one or more service addresses.
(Of course our other option is to not use a mesh network system that imposes this restriction, even if it would otherwise make our lives easier. Unless we really need the system for some other reason or its local state management is explicitly documented, this is our more likely choice.)
PS: The other problem with permanently 'consuming' IP addresses as machines are re-provisioned is that you run out of them sooner or later unless you use gigantic network blocks that are many times larger than the number of servers you'll ever have (well, in IPv4, but we're not going to switch to IPv6 just to enable a mesh network provisioning system).
Using WireGuard seriously as a mesh network needs a provisioning system
One thing that my recent experience expanding our WireGuard mesh network has driven home to me is how (and why) WireGuard needs a provisioning system, especially if you're using it as a mesh networking system. In fact I think that if you use a mesh WireGuard setup at any real scale, you're going to wind up either adopting or building such a provisioning system.
In a 'VPN' WireGuard setup with a bunch of clients and one or a small number of gateway servers, adding a new client is mostly a matter of generating and giving it some critical information. However, it's possible to more or less automate this and make it relatively easy for people who want to connect to you to do this. You'll still need to update your WireGuard VPN server too, but at least you only have one of them (probably), and it may well be the host where you generate the client configuration and provide it to the client's owner.
The extra problem with adding a new client to a WireGuard mesh network is that there are many more WireGuard nodes that need to be updated (and also the new client needs a lot more information; it needs to know about all of the other nodes it's supposed to talk to). More broadly, every time you change the mesh network configuration, every node needs to update with the new information. If you add a client, remove a client, or a client changes its keys for some reason (perhaps it had to be re-provisioned because the hardware died), all of these mean that nodes need updates (or at least the nodes that talk to the changed node). In the VPN model, only the VPN server node (and the new client) needed updates.
Our little WireGuard mesh is operating at a small scale, so we can afford to do this by hand. As you have more WireGuard nodes and more changes in nodes, you're not going to want to manually update things one by one, any more than you want to do that for other system administration work. Thus, you're going to want some sort of a provisioning system, where at a minimum you can say 'this is a new node' or 'this node has been removed' and all of your WireGuard configurations are regenerated, propagated to WireGuard nodes, trigger WireGuard configuration reloads, and so on. Some amount of this can be relatively generic in your configuration management system, but not all of it.
(Many configuration systems can propagate client-specific files to clients on changes and then trigger client side actions when the files are updated. But you have to build the per-client WireGuard configuration.)
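To make the scale of the problem concrete, every node's WireGuard configuration needs a [Peer] stanza for each other node it talks to, along the lines of this sketch (names, keys, and addresses invented):

# wg0.conf on node A; every node has a file like this, and the
# provisioning system has to regenerate it whenever the mesh changes.
[Interface]
PrivateKey = (node A's private key)
ListenPort = 51820
Address = 172.29.0.1/24        # Address is a wg-quick extension, not part of plain 'wg setconf'

[Peer]
# node B
PublicKey = (node B's public key)
Endpoint = node-b.example.org:51820
AllowedIPs = 172.29.0.2/32

[Peer]
# node C
PublicKey = (node C's public key)
Endpoint = node-c.example.org:51820
AllowedIPs = 172.29.0.3/32

# ...and so on for every other node. Adding, removing, or re-keying a
# node means regenerating a stanza like this in every other node's file.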
PS: I haven't looked into systems that will do this for you, either as pure WireGuard provisioning systems or as bigger 'mesh networking using WireGuard' software, so I don't have any opinions on how you want to handle this. I don't even know if people have built and published things that are just WireGuard provisioning systems, or if everything out there is a 'mesh networking based on WireGuard' complex system.
Choosing between "it works for now" and "it works in the long term"
A comment on my entry about how Netplan can only have WireGuard peers in one file (it's the first comment, by Jon) made me realize one of my implicit system administration views. That is the tradeoff between something that works now and something that not only works now but is likely to keep working in the long term. In system administration this is a tradeoff, not an obvious choice, because what you want is different depending on the circumstances.
Something that works now is, for example, something that works because of how Netplan's code is currently written, where you can hack around an issue by structuring your code, your configuration files, or your system in a particular way. As a system administrator I do a surprisingly large amount of these, for example to fix or work around issues in systemd units that people have written in less than ideal or simply mistaken ways.
Something that's going to keep working in the longer term is doing things 'correctly', which is to say in whatever way that the software wants you to do and supports. Sometimes this means doing things the hard way when the software doesn't actually implement some feature that would make your life better, even if you could work around it with something that works now but isn't necessarily guaranteed to keep working in the future.
When you need something to work and there's no other way to do it, you have to take a solution that (only) works now. Sometimes you take a 'works now' solution even if there's an alternative because you expect your works-now version to be good enough for the lifetime of this system, this OS release, or whatever; you'll revisit things for the next version (at least in theory, workarounds to get things going can last a surprisingly long time if they don't break anything). You can't always insist on a 'works now and in the future' solution.
On the other hand, sometimes you don't want to do a works-now thing even if you could. A works-now thing is in some sense technical debt, with all that that implies, and this particular situation isn't important enough to justify taking on such debt. You may solve the problem properly, or you may decide that the problem isn't big and important enough to solve at all and you'll leave things in their imperfect state. One of the things I think about when making this decision is how annoying it would be and how much would have to change if my works-now solution broke because of some update.
(Another is how ugly the works-now solution is, including how big of a note we're going to want to write for our future selves so we can understand what this peculiar load bearing thing is. The longer the note, the more I generally wind up questioning the decision.)
It can feel bad to not deal with a problem by taking a works-now solution. After all, it works, and otherwise you're stuck with the problem (or with less pleasant solutions). But sometimes it's the right option and the works-now solution is simply 'too clever'.
(I've undoubtedly made this decision many times over my career. But Jon's comment and my reply to it crystalized the distinction between a 'works now' and a 'works for the long term' solution in my mind in a way that I think I can sort of articulate.)
The complexity of mixing mesh networking and routes to subnets
One of the in things these days is encrypted (overlay) mesh networks, where you have a bunch of nodes and the nodes have encrypted connections to each other that they use for (at least) internal IP traffic. WireGuard is one of the things that can be used for this. A popular thing to add to such mesh network solutions is 'subnet routes', where nodes will act as gateways to specific subnets, not just endpoints in themselves. This way, if you have an internal network of servers at your cloud provider, you can establish a single node on your mesh network and route to the internal network through that node, rather than having to enroll every machine in the internal network.
(There are various reasons not to enroll every machine, including that on some of them it would be a security or stability risk.)
In simple configurations this is easy to reason about and easy to set up through the tools that these systems tend to give you. Unfortunately, our network configuration isn't simple. We have an environment with multiple internal networks, some of which are partially firewalled off from each other, and where people would want to enroll various internal machines in any mesh networking setup (partly so they can be reached directly). This creates problems for a simple 'every node can advertise some routes and you accept the whole bundle' model.
The first problem is what I'll call the direct subnet problem. Suppose that you have a subnet with a bunch of machines on it and two of them are nodes (call them A and B), with one of them (call it A) advertising a route to the subnet so that other machines in the mesh can reach it. The direct subnet problem is that you don't want B to ever send its traffic for the subnet to A; since it's directly connected to the subnet, it should send the traffic directly. Whether or not this happens automatically depends on various implementation choices the setup makes.
The second problem is the indirect subnet problem. Suppose that you have a collection of internal networks that can all talk to each other (perhaps through firewalls and somewhat selectively). Not all of the machines on all of the internal networks are part of the mesh, and you want people who are outside of your networks to be able to reach all of the internal machines, so you have a mesh node that advertises routes to all of your internal networks. However, if a mesh node is already inside your perimeter and can reach your internal networks, you don't want it to go through your mesh gateway; you want it to send its traffic directly.
(You especially want this if mesh nodes have different mesh IPs from their normal IPs, because you probably want the traffic to come from the normal IP, not the mesh IP.)
You can handle the direct subnet case with a general rule like 'if you're directly attached to this network, ignore a mesh subnet route to it', or by some automatic system like route priorities. The indirect subnet case can't be handled automatically because it requires knowledge about your specific network configuration and what can reach what without the mesh (and what you want to reach what without the mesh, since some traffic you want to go over the mesh even if there's a non-mesh route between the two nodes). As far as I can see, to deal with this you need the ability to selectively configure or accept (subnet) routes on a mesh node by mesh node basis.
(In a simple topology you can get away with accepting or not accepting all subnet routes, but in a more complex one you can't. You might have two separate locations, each with their own set of internal subnets. Mesh nodes in each location want the other location's subnet routes, but not their own location's subnet routes.)
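To make the direct subnet problem concrete, here is a minimal sketch of how it can play out on a Linux node; the mesh interface name ('mesh0'), the subnet, and the metrics are all hypothetical, and real mesh software may use policy routing or separate routing tables instead of plain metrics.
# Node B is directly attached to 172.16.10.0/24 on eth0, so the kernel
# already has a connected route, something like:
#   172.16.10.0/24 dev eth0 proto kernel scope link metric 100
# If the mesh software installs its subnet route with a worse metric,
# the direct path keeps winning and B never sends subnet traffic to A:
ip route add 172.16.10.0/24 dev mesh0 metric 200
# Check which way traffic to a host on that subnet will actually go:
ip route get 172.16.10.5
Whether your mesh software arranges something like this for you automatically is exactly the sort of implementation choice mentioned above.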
Tailscale's surprising interaction of DNS settings and 'exit nodes'
Tailscale is a well regarded commercial mesh networking system, based on WireGuard, that can be pressed into service as a VPN as well. As part of its general features, it allows you to set up various sorts of DNS settings for your tailnet (your own particular Tailscale mesh network), including both DNS servers for specific (sub)domains (eg an 'internal.example.org') and all DNS as a whole. As part of optionally being VPN-like, Tailscale also lets you set up exit nodes, which let you route all traffic for the Internet out the exit node (if you want to route just some subnets to somewhere, that's a subnet router, a different thing). If you're a normal person, especially if you're a system administrator, you probably have a guess as to how these two features interact. Unfortunately, you may well be wrong.
As of today, if you use a Tailscale exit node, all of your DNS traffic is routed to the exit node regardless of Tailscale DNS settings. This applies to both DNS servers for specific subdomains and to any global DNS servers you've set for your tailnet (due to, for example, 'split horizon' DNS). Currently this is documented only in one little sentence in small type in the "Use Tailscale DNS settings" portion of the client preferences documentation.
In many Tailscale environments, all this does is make your DNS queries take an extra hop (from you to the exit node and then to the configured DNS servers). Your Tailscale exit nodes are part of your tailnet, so in ordinary configurations they will have your Tailscale DNS settings and be able to query your configured DNS servers (and they will probably get the same answers, although this isn't certain). However, if one of your exit nodes isn't set up this way, potential pain and suffering is ahead of you. Tailnet nodes using this exit node will get DNS answers you don't expect: internal domains may not resolve at all, and with split horizon DNS, names may resolve to their external results instead of the internal ones.
One reason that you might set an exit node machine to not use your Tailscale DNS settings (or subnet routes) is that you're only using it as an exit node, not as a regular participant in your tailnet. Your exit node machine might be placed on a completely different network (and in a completely different trust environment) than the rest of your tailnet, and you might have walled off its (less-trusted) traffic from the rest of your network. If the only thing the machine is supposed to be is an Internet gateway, there's no reason to have it use internal DNS settings, and it might not normally be able to reach your internal DNS servers (or the rest of your internal servers).
In my view, a consequence of this is that it's probably best to have any internal DNS servers directly on your tailnet, with their tailnet IP addresses. This makes them as reachable as possible to your nodes, independent of things like subnet routes.
PS: Routing general DNS queries through a tailnet exit node makes sense in this era of geographical DNS results, where you may get different answers depending on where in the world you are and you'd like these to match up with where your exit node is.
(I'm writing this entry because this issue was quite mysterious to us when we ran into it while testing Tailscale and I couldn't find much about it in online searches.)
How I install personal versions of programs (on Unix)
These days, Unixes are quite generous in what they make available through their packaging systems, so you can often get everything you want through packages that someone else worries about building, updating, managing, and so on. However, not everything is available that way; sometimes I want something that isn't packaged, and sometimes (especially on 'long term support' distributions) I want something that's more recent than what the system provides (for example, Ubuntu 22.04 only has Emacs 27.1). Over time, I've evolved my own approach for managing my personal versions of such things, which is somewhat derived from the traditional approach for multi-architecture Unixes here.
The starting point is that I have a ~/lib/<architecture> directory tree. When I build something personally, I tell it that its install prefix is a per-program directory within this tree, for example, '/u/cks/lib/<arch>/emacs-30.1'. These days I only have one active architecture inside ~/lib, but old habits die hard, and someday we may start using ARM machines or FreeBSD. If I install a new version of the program, it goes in a different (versioned) subdirectory, so I have 'emacs-29.4' and 'emacs-30.1' directory trees.
I also have both a general ~/bin directory, for general scripts and other architecture independent things, and a ~/bin/bin.<arch> subdirectory, for architecture dependent things. When I install a program into ~/lib/<arch>/<whatever> and want to use it, I will make either a symbolic link or a cover script in ~/bin/bin.<arch> for it, such as '~/bin/bin.<arch>/emacs'. This symbolic link or cover script always points to what I want to use as the current version of the program, and I update it when I want to switch.
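As a concrete sketch of the pattern, here is roughly what installing a new Emacs looks like under this scheme; the architecture name and version number are just illustrations of my conventions, not anything standard.
# Build and install into a versioned directory under ~/lib/<arch>
arch=linux-x86_64          # whatever I'm calling the current architecture
./configure --prefix=$HOME/lib/$arch/emacs-30.1
make && make install
# Point the 'current' emacs at this version (a small cover script works
# too, if the program needs environment variables set).
ln -sfn $HOME/lib/$arch/emacs-30.1/bin/emacs $HOME/bin/bin.$arch/emacs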
(If I'm building and installing something from the latest development tree, I'll often call the subdirectory something like 'fvwm3-git' and then rename it to have multiple versions around. This is not as good as real versioned subdirectories, but I tend to do this for things that I won't ever run two versions of at the same time; at most I'll switch back and forth.)
Some things I use, such as pipx, normally install programs (or symbolic links to them) into places like ~/.local/bin or ~/.cargo/bin. Because it's not worth fighting city hall on this one, I pretty much let them do so, but I don't add either directory to my $PATH. If I want to use a specific tool that they install and manage, I put in a symbolic link or a cover script in my ~/bin/bin.<arch>. The one exception to this is Go, where I do have ~/go/bin in my $PATH because I use enough Go based programs that it's the path of least resistance.
This setup isn't perfect, because right now I don't have a good general approach for things that depend on the Ubuntu version (where an Emacs 30.1 built on 22.04 doesn't run on 24.04). If I ran into this a lot I'd probably make an additional ~/bin/bin.<something> directory for the Ubuntu version and then put version specific things there. And in general, Go and Cargo are not ready for my home directory to be shared between different binary architectures. For Go, I would probably wind up setting $GOPATH to something like ~/lib/<arch>/go. Cargo has a similar system for deciding where it puts stuff but I haven't looked into it in detail.
(From a quick skim of 'cargo help install' and my ~/.cargo, I suspect that I'd point $CARGO_INSTALL_ROOT into my ~/lib/<arch> but leave $CARGO_HOME unset, so that various bits of Cargo's own data remain shared between architectures.)
(This elaborates a bit on a Fediverse conversation.)
PS: In theory I have a system for keeping track of the command lines used to build things (also, which I'd forgotten when I wrote the more recent entry on this system). In practice I've fallen out of the habit of using it when I build things for my ~/lib, although I should probably get back into it. For GNU Emacs, I put the ./configure command line into a file in ~/lib/<arch>, since I expected to build enough versions of Emacs over time.
Sorting out the ordering of OpenSSH configuration directives
As I discovered recently, OpenSSH makes some unusual choices for the ordering of configuration directives in its configuration files, both sshd_config and ssh_config (and files they include). Today I want to write down what I know about the result (which is partly things I've learned researching this entry).
For sshd_config, the situation is relatively straightforward. There are what we could call 'global options' (things you set normally, outside of 'Match' blocks) and 'matching Match options' (things set in Match blocks that actually matched). Both of them are 'first mention wins', but Match options take priority over global options regardless of where the Match option block is in the (aggregate) configuration file. Sshd makes 'first mention wins' work in the presence of including files from /etc/ssh/sshd_config.d/ by doing the inclusion at the start of /etc/ssh/sshd_config.
So here's an example with a Match statement:
PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Password authentication is turned off as a global option but then overridden in the address-based Match block to enable it for connections from the local network. If we had a (Unix) group for logins that we wanted to never use passwords even if they were coming from the local network, I believe that we would have to write it like this, which looks somewhat odd:
PasswordAuthentication no
Match Group neverpassword
  PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Then a 'neverpassword' person logging in from the local network would match both Match blocks, and the first block (the group block) would have 'PasswordAuthentication no' win over the second block's 'PasswordAuthentication yes'. Equivalently, you could put the global 'PasswordAuthentication no' after both Match blocks, which might be clearer.
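One way to double check the ordering you've actually ended up with is to ask sshd for its effective configuration for a simulated connection with 'sshd -T' (run as root); the user and addresses below are made up for illustration.
# What would sshd use for a LAN connection from this (made up) client?
sshd -T -C user=someone,host=client.example.org,addr=192.168.1.50 | grep -i passwordauthentication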
The situation with ssh and ssh_config is one that I find more confusing and harder to follow. The ssh_config manual page says:
Unless noted otherwise, for each parameter, the first obtained value will be used.
It's pretty clear how this works for the various sources of configurations; options on the command line take priority over everything else, and ~/.ssh/config options take priority over the global options from /etc/ssh/ssh_config and its included files. But within a file (such as ~/.ssh/config), I get a little confused.
What I believe this means for any specific option that you want to give a default value to for all hosts but then override for specific hosts is that you must put your 'Host *' directive for it at the end of your configuration file, and the more specific Host or Match directives first. I'm not sure how this works for matches like 'Match canonical' or 'Match final' that happen 'late' in the processing of your configuration; the natural reading would be that you have to make sure that nothing earlier conflicts with them. If this is so, a natural use for 'Match final' would then be options that you want to be true defaults that only apply if nothing has overridden them.
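If I'm right about this, a ~/.ssh/config with per-host overrides plus defaults looks something like the following sketch (the host name and options here are invented for illustration):
# More specific matches first ...
Host build.example.org
    ForwardAgent yes

# ... and the catch-all defaults last, where they only take effect if
# nothing above has already set the option.
Host *
    ForwardAgent no
    ServerAliveInterval 120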
Some ssh_config options are special in that you can provide them multiple times and they'll be merged together; one example is IdentityFile. I think this applies even across multiple Host and Match blocks, and also that there's no way to remove an IdentityFile once you've added it (which might be an issue if you have a lot of identity files, because SSH servers only let you offer so many). Some options let you modify the default state to, for example, add a non-default key exchange algorithm; I haven't tested to see if you can do this multiple times in Host blocks or if you can only do it once.
(These days you can make things somewhat simpler with 'Match tagged ...' and 'Tag'; one handy and clear explanation of what you can do with this is OpenSSH Config Tags How To.)
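As a small illustration of the tag approach (the tag name and hosts are invented, and as far as I know you need a relatively recent OpenSSH for 'Tag' and 'Match tagged'):
Host *.internal.example.org
    Tag via-gateway

Match tagged via-gateway
    ProxyJump gate.example.org
    ForwardAgent no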
Typically your /etc/ssh/ssh_config has no active options set in it and includes /etc/ssh/ssh_config.d/* at the end. On Debian-derived systems, it does have some options specified (for 'Host *', ie making them defaults), but the inclusion of /etc/ssh/ssh_config.d/* has been moved to the start so you can override them.
My own personal ~/.ssh/config setup starts with a 'Host *' block, but as far as I can tell I don't try to override any of its settings later in more specific Host blocks. I do have a final 'Host *' block with comments about how I want to do some things by default if they haven't been set earlier, along with comments in the file that I was finding all of this confusing. I may at some point try to redo it into a 'Match tagged' / 'Tag' form to see if that makes it clearer.
The order of files in /etc/ssh/sshd_config.d/ matters (and may surprise you)
Suppose, not entirely hypothetically, that you have an Ubuntu 24.04 server system where you want to disable SSH passwords for the Internet but allow them for your local LAN. This looks straightforward based on sshd_config, given the PasswordAuthentication and Match directives:
PasswordAuthentication no
Match Address 127.0.0.0/8,192.168.0.0/16
  PasswordAuthentication yes
Since I'm an innocent person, I put this in a file in /etc/ssh/sshd_config.d/ with a nice high ordering number, say '60-no-passwords.conf'. Then I restarted the SSH daemon and was rather confused when it didn't work (and I wound up resorting to manipulating AuthenticationMethods, which also works).
The culprit is two things combined together. The first is this sentence at the start of sshd_config:
[...] Unless noted otherwise, for each keyword, the first obtained value will be used. [...]
Some configuration systems are 'first mention wins', but I think it's more common to be either 'last mention wins' or 'if it's mentioned more than once, it's an error'. Certainly I was vaguely expecting sshd_config and the files in sshd_config.d to be 'last mention wins', because that would be the obvious way to let you easily override things specified in sshd_config itself. But OpenSSH doesn't work this way.
(You can still override things in sshd_config, because the global sshd_config includes all of sshd_config.d/* at the start, before it sets anything, rather than at the end, the way you often see this done.)
The second culprit is that at least in our environment, Ubuntu 24.04 writes out a '50-cloud-init.conf' file that contains one deadly (for this) line:
PasswordAuthentication yes
Since '50-cloud-init.conf' was read by sshd before my '60-no-passwords.conf', it forced password authentication to be on. My new configuration file was more or less silently ignored.
Renaming my configuration file to be '10-no-passwords.conf' fixed my problem and made things work like I expected.
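Since file ordering is what matters here, it's worth looking at what's actually in the directory and then asking sshd which value won; a quick check is something like the following (depending on your OpenSSH version, you may need to add '-C' connection specifications before Match blocks are taken into account).
# Files are read in name order, so see what comes before yours:
ls /etc/ssh/sshd_config.d/
# Then ask sshd for the effective global setting:
sshd -T | grep -i '^passwordauthentication'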
Our simple view of 'identity' for our (Unix) accounts
When I wrote about how it's complicated to count how many professors are in our department, I mentioned that the issues involved would definitely complicate the life of any IAM system that tried to understand all of this, but that we had a much simpler view of things. Today I'm going to explain that, with a little bit on its historical evolution (as I understand it).
All Unix accounts on our systems have to be 'sponsored' by someone, their 'sponsor'. Roughly speaking, all professors who supervise graduate students in the department and all professors who are in the department are or can be sponsors, and there are some additional special sponsors (for example, technical and administrative staff also have sponsors). Your sponsor has to approve your account request before it can be created, although some of the time the approval is more or less automatic (for example, for incoming graduate students, who are automatically sponsored by their supervisor).
At one level this requires us to track 'who is a professor'. At another level, we outsource this work; when new professors show up, the administrative staff side of the department will ask us to set up an account for them, at which point we know to either enable them as a sponsor or schedule it in the future at their official start date. And ultimately, 'who can sponsor accounts' is a political decision that's made (if necessary) by the department (generally by the Chair). We're never called on to evaluate the 'who is a professor in the department' question ourselves.
I believe that one reason we use this model is that what is today the department's general research side computing environment originated in part from an earlier organization that included only a subset of the professors here, so that not everyone in the department could get a Unix account on 'CSRI' systems. To get a CSRI account, a professor who was explicitly part of CSRI had to say 'yes, I want this person to have an account', sponsoring it. When this older, more restricted environment expanded to become the department's general research side computing environment, carrying over the same core sponsorship model was natural (or so I believe).
(Back in those days there were other research groups around the department, involving other professors, and they generally had similar policies for who could get an account.)
Using SimpleSAMLphp to set up an identity provider with Duo support
My university has standardized on an institutional MFA system that's based on institutional identifiers and Duo (a SaaS MFA vendor, as is more or less necessary these days if you want push MFA). We have our own logins and passwords, but wanted to add full Duo MFA authentication to (as a first step) various of our web applications. We were eventually able to work out how to do this, and I'm going to summarize it here because although this is a very specific need, maybe someone else in the world also has it.
The starting point is SimpleSAMLphp, which we already had an instance of that authenticated only with login and password against an existing LDAP server we had. SSP is a SAML IdP, but there's a third party module for OIDC OP support, and we wound up using it to make our new IdP support both SAML and OIDC. For Duo support we found a third party module, but to work with SSP 2.x, you need to use a feature branch. We run the entire collective stack of things under Apache, because we're already familiar with that.
A rough version of the install process is:
- Set up Apache so it can run PHP, and so on.
- Obtain SimpleSAMLphp 2.x from the upstream releases. You almost certainly can't use a version packaged by your Linux distribution, because you need to be able to use the 'composer' PHP package manager to add packages to it.
- Unpack this release somewhere, conventionally /var/simplesamlphp.
- Install the 'composer' PHP package manager if it's not already available.
- Install the third party Duo module from the alternate branch. At the top level of your SimpleSAMLphp install, run:
composer require 0x0fbc/simplesamlphp-module-duouniversal:dev-feature
- Optionally install the OIDC module:
composer require simplesamlphp/simplesamlphp-module-oidc
Now you can configure SimpleSAMLphp, the Duo module, and the OIDC module following their respective instructions (which are not 'simple' despite the name). If you're using the OIDC module, remember that you'll need to set up the Duo module (and the other things we'll need) in two places, not just one, and you'll almost certainly want to add an Apache alias for '/.well-known/openid-configuration' that redirects it to the actual URL that the OIDC module uses.
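For the well-known URL, something like the following Apache directive is the general idea; the target path here is only a placeholder, since the actual URL the OIDC module serves its discovery document at depends on your install (check it before copying this).
# Hypothetical target path; substitute the real discovery URL of your
# OIDC module installation.
Redirect permanent /.well-known/openid-configuration /simplesaml/module.php/oidc/openid-configuration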
At this point we need to deal with the mismatch between our local logins and the institutional identifiers that Duo uses for MFA. There are at least three options to deal with this:
- Add a LDAP attribute (and schema) that will hold the Duo identifier (let's call this the 'duoid') for everyone. This attribute will (probably) be automatically available as a SAML attribute, making it available to the Duo module.
(If you're not using LDAP for your SimpleSAMLphp authentication module, the module you're using may have its own way to add extra information.)
- Embed the duoid into your GECOS field in LDAP and write a SimpleSAMLphp 'authproc' with arbitrary PHP code to extract the GECOS field and materialize it as a SAML attribute. This has the advantage that you can share this GECOS field with the Duo PAM module if you use that.
- Write a SimpleSAMLphp 'authproc' that uses arbitrary PHP code to look up the duoid for a particular login from some data source, which could be an actual database or simply a flat file that you open and search through. This is what we did, mostly because we had such a file sitting around for other reasons.
(Your new SAML attribute will normally be passed through to SAML SPs (clients) that use you as a SAML IdP, but it won't be passed through to OIDC RPs (also clients) unless you configure a new OIDC claim and scope for it and clients ask for that OIDC scope.)
You'll likely also want to augment the SSP Duo module with some additional logging, so you can tell when Duo MFA authentication is attempted for people and when it succeeds. Since the SSP Duo module is more or less moribund, we probably don't have too much to worry about as far as keeping up with upstream updates goes.
I've looked through the SSP Duo module's code and I'm not too worried about development having stopped some time ago. As far as I can see, the module is directly following Duo's guidance for how to use the current Duo Universal SDK and is basically simple glue code to sit between SimpleSAMLphp's API and the Duo SDK API.
Sidebar: Implications of how the Duo module is implemented
To simplify the technical situation, the MFA challenge created by the SSP Duo module is done as an extra step after SimpleSAMLphp has 'authenticated' your login and password against, say, your LDAP server. SSP as a whole has no idea that a person who's passed LDAP is not yet 'fully logged in', and so it will both log things and behave as if you're fully authenticated even before the Duo challenge succeeds. This is the big reason you need additional logging in the Duo module itself.
As far as I can tell, SimpleSAMLphp will also set its 'you are authenticated' IdP session cookie in your browser immediately after you pass LDAP. Conveniently (and critically), authprocs always run when you revisit SimpleSAMLphp even if you're not challenged for a login and password. This does mean that every time you revisit your IdP (for example because you're visiting another website that's protected by it), you'll be sent for a round trip through Duo's site. Generally this is harmless.
US sanctions and your VPN (and certain big US-based cloud providers)
As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which lawyer the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).
Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.
(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)
Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard to diagnose the resulting failures were. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.
The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.
If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).
PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).
(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)
Three ways I know of to authenticate SSH connections with OIDC tokens
Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.
(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)
The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).
Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting up a SSH certificate authority and having the CA software issue signed, short-lived SSH client certificates in exchange for OIDC authentication from your OP. With client side software, these client certificates will be automatically set up for use by ssh, and on servers all you need is to trust your SSH CA. I believe you could even set this up for personal use on servers you SSH to, since you set up a personally trusted SSH CA. On the positive side, this requires minimal server changes and no extra server software, and preserves your ability to directly authenticate with passwords (and perhaps some MFA challenge). On the negative side, you now have a SSH CA you have to trust.
(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)
The third option, which I've only recently become aware of, is Cloudflare's recently open-sourced 'opkssh' (via, Github). OPKSSH builds on something called OpenPubkey, which uses a clever trick to embed a public key you provide in (signed) OIDC tokens from your OP (for details see here). OPKSSH uses this to put a basically regular SSH public key into such an augmented OIDC token, then smuggles it from the client to the server by embedding the entire token in a SSH (client) certificate; on the server, it uses an AuthorizedKeysCommand to verify the token, extract the public key, and tell the SSH server to use the public key for verification (see How it works for more details). If you want, as far as I can see OPKSSH still supports using regular SSH public keys and also passwords (possibly plus an MFA challenge).
(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)
It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.
(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)
Some notes on configuring Dovecot to authenticate via OIDC/OAuth2
Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that the token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.
If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
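If you want to see this for yourself, you can pull the discovery document apart with curl and jq (the IdP hostname here is made up):
curl -s https://idp.example.org/.well-known/openid-configuration | jq .
# Just the endpoint that Dovecot wants for introspection_url:
curl -s https://idp.example.org/.well-known/openid-configuration | jq -r .userinfo_endpoint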
You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.
(Since I've just peered deep into the Dovecot source code, it appears that setting 'introspection_mode = post' actually performs an (unauthenticated) token introspection request. The 'get' mode seems to be the same as setting tokeninfo_url. I think that if you set the 'post' mode, you also want to set active_attribute and perhaps active_value, but I don't know what values to use, because otherwise you aren't necessarily fully validating that the token is still active. Does my head hurt? Yes. The moral here is that you should use an OIDC IdP that supports OpenID Connect UserInfo.)
If your IdP serves different groups and provides different 'issuer' ('iss') values to them, you may want to set the Dovecot 'issuers =' to the specific issuer that applies to you. You'll also want to set 'username_attribute' to whatever OIDC claim is where your IdP puts what you consider the Dovecot username, which might be the email address or something else.
It would be nice if Dovecot could discover all of this for itself when you set openid_configuration_url, but in the current Dovecot, all this does is put that URL in the JSON of the error response that's sent to IMAP clients when they fail OAUTHBEARER authentication. IMAP clients may or may not do anything useful with it.
As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes, it doesn't verify the presence of specific OIDC claims.
So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:
# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>
# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}
# optionally:
openid_configuration_url = <stick in the URL>
# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt
The OIDC scopes that IMAP clients should request when getting tokens should include a scope that gives the username_attribute claim, which is 'email' if the claim is 'email', and also apparently the requested scopes should include the offline_access scope.
If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.
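An even simpler sanity check, before you involve IMAP at all, is to call the UserInfo endpoint yourself with an access token you've obtained by some means and see what claims come back; the URL below is a placeholder for your IdP's actual userinfo_endpoint.
# $TOKEN is an access token obtained from your IdP by whatever means
curl -s -H "Authorization: Bearer $TOKEN" https://idp.example.org/idp/userinfo | jq .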
(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)
How I got my nose rubbed in my screens having 'bad' areas for me
I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.
In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.
(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)
When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.
(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)
In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.
(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)
Sidebar: Why xload turns green under high load
Xload draws a horizontal tick line for every integer load average level it needs in order to display the maximum load currently in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.
This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.
A Prometheus gotcha with alerts based on counting things
Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:
count( node_filesystem_size_bytes{ host = "backupserv", mountpoint =~ "/dumps/tapes/slot.*" } ) != <some number>
This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.
Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:
count( ... ) by (host) or vector(0, "host", "bserv1") or vector(0, "host", "bserv2") ....
(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
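As a sketch, the single-host vector() version looks something like the following; when no 'tape' filesystems are mounted at all, the count side returns nothing and the expression falls back to the vector(0), so the comparison still produces a result and the alert can fire.
(
  count( node_filesystem_size_bytes{ host = "backupserv", mountpoint =~ "/dumps/tapes/slot.*" } )
  or vector(0)
) != <some number>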
In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:
count( node_filesystem_size_bytes{ host =~ "bserv1|bserv2|...", fstype =~ "ext.*" } ) by (host) != <other number>
In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.
(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)
PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.
What SimpleSAMLphp's core:AttributeAlter does with creating new attributes
SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:
// the 65 is where this is ordered
65 => [
    'class' => 'core:AttributeAlter',
    'subject' => 'gecos',
    'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
    'target' => 'mfaid',
    'replacement' => '\\1',
],
If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.
(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)
The short version of what happens is that when the target is a different attribute and the pattern doesn't match, the target will wind up set but empty. Any previous value is lost. How this happens (and what happens) starts with the fact that 'attributes' here are actually arrays of values under the covers (this is '$attributes'). When core:AttributeAlter has a different target attribute than the source attribute, it takes all of the source attribute's values, passes each of them through a regular expression search and replace (using your replacement), and then gathers up anything that changed and sets the target attribute to this gathered collection. If the pattern doesn't match any values of the attribute (in the normal case, a single value), the array of changed things is empty and your target attribute is set to an empty PHP array.
(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)
My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).
For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.
(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)
If you get the chance, always run more extra network fiber cabling
Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is such a thing as an unreasonable amount of fiber to run, but mostly that comes up when you'd have to put in giant fiber patch panels.
The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.
The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.
(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)
Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).
PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.
(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)
MFA's "push notification" authentication method can be easier to integrate
For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.
A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.
(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)
By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.
(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)
JSON has become today's machine-readable output format (on Unix)
Recently, I needed to delete about 1,200 email messages to a particular destination from the mail queue on one of our systems. This turned out to be trivial, because this system was using Postfix and modern versions of Postfix can output mail queue status information in JSON format. So I could dump the mail queue status, select the relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the messages. This experience has left me with the definite view that everything should have the option to output JSON for 'machine-readable' output, rather than some bespoke format. For new programs, I think that you should only bother producing JSON as your machine readable output format.
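A sketch of the sort of pipeline involved is below; the destination domain is made up, and the exact field names in 'postqueue -j' output are worth double-checking on your own system before you delete anything.
# List the queue as JSON, select messages with any recipient at the
# destination we care about, and feed the queue IDs to postsuper.
postqueue -j |
  jq -r 'select(any(.recipients[]; .address | endswith("@example.net"))) | .queue_id' |
  postsuper -d -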
(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)
This isn't because JSON is the world's best format (JSON is at best the least bad format). Instead it's because JSON has a bunch of pragmatic virtues on a modern Unix system. In general, JSON provides a clear and basically unambiguous way to represent text data and much numeric data, even if it has relatively strange characters in it (ie, JSON has escaping rules that everyone knows and all tools can deal with); it's also generally extensible to add additional data without causing heartburn in tools that are dealing with older versions of a program's output. And on Unix there's an increasingly rich collection of tools to deal with and process JSON, starting with jq itself (and hopefully soon GNU Awk in common configurations). Plus, JSON can generally be transformed to various other formats if you need them.
(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)
There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.
(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)
PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.
(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)
It's good to have offline contact information for your upstream networking
So I said something on the Fediverse:
Current status: it's all fun and games until the building's backbone router disappears.
A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.
(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)
(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)
The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.
(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)
In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.
(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)
Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.
Shared (Unix) hosting and the problem of managing resource limits
Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resources limits to your customers.
(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)
The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use much resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.
(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)
The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).
Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').
It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.
One problem with "shared Unix hosting" was the lack of resource limits
I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.
The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or etc) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.
Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.
(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)
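To make the cgroup point concrete, on a current Linux with systemd you can now wrap a single command in a resource-controlled cgroup fairly easily; a minimal sketch (the specific program and limits are made up):
  # Run a command in its own transient cgroup with hard caps on memory,
  # CPU, and the number of tasks it can create.
  systemd-run --scope -p MemoryMax=256M -p CPUQuota=25% -p TasksMax=64 \
      some-customer-program
This sort of thing being routinely available is quite recent, which is part of the point of this entry.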
My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.
(How much resources to give each customer is also a problem, but that's another entry.)
How you should respond to authentication failures isn't universal
A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.
The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.
(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)
If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.
(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)
There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.
(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)
Everything should be able to ratelimit sources of authentication failures
One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.
(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)
I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.
The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs face very few limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.
Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.
(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)
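In the absence of integrated rate limiting, one blunt and widely available approximation on Linux is to limit new connections per source IP at the firewall with iptables' 'recent' match. This is only a sketch, and it limits connections rather than authentication failures as such:
  # Drop a source that makes more than three new SSH connections in 60 seconds.
  iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
      -m recent --name sshprobe --set
  iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
      -m recent --name sshprobe --update --seconds 60 --hitcount 4 -j DROP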
PS: Having rate limits on failed authentications is also reassuring, at least for me.
The practical (Unix) problems with .cache and its friends
Over on the Fediverse, I said:
Dear everyone writing Unix programs that cache things in dot-directories (.cache, .local, etc): please don't. Create a non-dot directory for it. Because all of your giant cache (sub)directories are functionally invisible to many people using your programs, who wind up not understanding where their disk space has gone because almost nothing tells them about .cache, .local, and so on.
A corollary: if you're making a disk space usage tool, it should explicitly show ~/.cache, ~/.local, etc.
If you haven't noticed, there are an ever increasing number of programs that will cache a bunch of data, sometimes a very large amount of it, in various dot-directories in people's home directories. If you're lucky, these programs put their cache somewhere under ~/.cache; if you're semi-lucky, they use ~/.local, and if you're not lucky they invent their own directory, like ~/.cargo (used by Rust's standard build tool because it wants to be special). It's my view that this is a mistake and that everyone should put their big caches in a clearly visible directory or directory hierarchy, one that people can actually find in practice.
I will freely admit that we are in a somewhat unusual environment where we have shared fileservers, a now very atypical general multi-user environment, a compute cluster, and a bunch of people who are doing various sorts of modern GPU-based 'AI' research and learning (both AI datasets and AI software packages can get very big). In our environment, with our graduate students, it's routine for people to wind up with tens or even hundreds of GBytes of disk space used up for caches that they don't even realize are there because they don't show up in conventional ways to look for space usage.
As noted by Haelwenn /элвэн/, a plain 'du' will find such dotfiles. The problem is that plain 'du' is more or less useless for most people; to really take advantage of it, you have to know the right trick (not just the -h argument but feeding it to sort to find things). How I think most people use 'du' to find space hogs is they start in their home directory with 'du -s *' (or maybe 'du -hs *') and then they look at whatever big things show up. This will completely miss things in dot-directories in normal usage. And on Linux desktops, I believe that common GUI file browsers will omit dot-directories by default and may not even have a particularly accessible option to change that (this is certainly the behavior of Cinnamon's 'Files' application and I can't imagine that GNOME is different, considering their attitude).
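For what it's worth, an invocation that does surface the dot-directories and sorts the result looks something like the following sketch (it assumes GNU du and sort for the -h options, and a shell where the '.[!.]*' glob picks up dot entries):
  # Include top-level dot entries as well as the visible ones, biggest first.
  du -hs ~/* ~/.[!.]* 2>/dev/null | sort -rh | head -20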
(I'm not sure what our graduate students use to try to explore their disk usage, but I know that multiple graduate students have been unable to find space being eaten up in dot-directories and surprised that their home directory was using so much.)
Modern languages and bad packaging outcomes at scale
Recently I read Steinar H. Gunderson's Migrating away from bcachefs (via), where one of the mentioned issues was a strong disagreement between the author of bcachefs and the Debian Linux distribution about how to package and distribute some Rust-based tools that are necessary to work with bcachefs. In the technology circles that I follow, there's a certain amount of disdain for the Debian approach, so today I want to write up how I see the general problem from a system administrator's point of view.
(Saying that Debian shouldn't package the bcachefs tools if they can't follow the wishes of upstream is equivalent to saying that Debian shouldn't support bcachefs. Among other things, this isn't viable for something that's intended to be a serious mainstream Linux filesystem.)
If you're serious about building software under controlled circumstances (and Linux distributions certainly are, as are an increasing number of organizations in general), you want the software build to be both isolated and repeatable. You want to be able to recreate the same software (ideally exactly binary identical, a 'reproducible build') on a machine that's completely disconnected from the Internet and the outside world, and if you build the software again later you want to get the same result. This means that the build process can't download things from the Internet, and if you run it three months from now you should get the same result even if things out there on the Internet have changed (such as third party dependencies releasing updated versions).
Unfortunately a lot of the standard build tooling for modern languages is not built to do this. Instead it's optimized for building software on Internet connected machines where you want the latest patchlevel or even entire minor version of your third party dependencies, whatever that happens to be today. You can sometimes lock down specific versions of all third party dependencies, but this isn't necessarily the default and so programs may not be set up this way from the start; you have to patch it in as part of your build customizations.
(Some languages are less optimistic about updating dependencies, but developers tend not to like that. For example, Go is controversial for its approach of 'minimum version selection' instead of 'maximum version selection'.)
The minimum thing that any serious packaging environment needs to do is contain all of the dependencies for any top level artifact, and to force the build process to use these (and only these), without reaching out to the Internet to fetch other things (well, you're going to block all external access from the build environment). How you do this depends on the build system, but it's usually possible; in Go you might 'vendor' all dependencies to give yourself a self-contained source tree artifact. This artifact never changes the dependency versions used in a build even if they change upstream because you've frozen them as part of the artifact creation process.
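As a concrete sketch of what the Go version of this looks like (the commands are real, but the details of any particular distribution's build setup are not):
  # Capture all third party dependencies into ./vendor when creating
  # the source artifact.
  go mod vendor
  # Later, build offline using only the vendored copies.
  GOFLAGS=-mod=vendor GOPROXY=off go build ./...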
(Even if you're not a distribution but an organization building your own software using third-party dependencies, you do very much want to capture local copies of them. Upstream things go away or get damaged every so often, and it can be rather bad to not be able to build a new release of some important internal tool because an upstream decided to retire to goat farming rather than deal with the EU CRA. For that matter, you might want to have local copies of important but uncommon third party open source tools you use, assuming you can reasonably rebuild them.)
If you're doing this on a small scale for individual programs you care a lot about, you can stop there. If you're doing this on a distribution's scale you have an additional decision to make: do you allow each top level thing to have its own version of dependencies, or do you try to freeze a common version? If you allow each top level thing to have its own version, you get two problems. First, you're using up more disk space for at least your source artifacts. Second and worse, now you're on the hook for maintaining, checking, and patching multiple versions of a given dependency if it turns out to have a security issue (or a serious bug).
Suppose that you have program A using version 1.2.3 of a dependency, program B using 1.2.7, the current version is 1.2.12, and the upstream releases 1.2.13 to fix a security issue. You may have to investigate both 1.2.3 and 1.2.7 to see if they have the bug and then either patch both with backported fixes or force both program A and program B to be built with 1.2.13, even if the version of these programs that you're using weren't tested and validated with this version (and people routinely break things in patchlevel releases).
If you have a lot of such programs it's certainly tempting to put your foot down and say 'every program that uses dependency X will be set to use a single version of it so we only have to worry about that version'. Even if you don't start out this way you may wind up with it after a few security releases from the dependency and the packagers of programs A and B deciding that they will just force the use of 1.2.13 (or 1.2.15 or whatever) so that they can skip the repeated checking and backporting (especially if both programs are packaged by the same person, who has only so much time to deal with all of this). If you do this inside an organization, probably no one in the outside world knows. If you do this as a distribution, people yell at you.
(Within an organization you may also have more flexibility to update program A and program B themselves to versions that might officially support version 1.2.15 of that dependency, even if the program version updates are a little risky and change some behavior. In a distribution that advertises stability and has no way of contacting people using it to warn them or coordinate changes, things aren't so flexible.)
The tradeoffs of having an internal unauthenticated SMTP server
One of the reactions I saw to my story of being hit by an alarming well prepared phish spammer was surprise that we had an unauthenticated SMTP server, even if it was only available to our internal networks. Part of the reason we have such a server is historical, but I also feel that the tradeoffs involved are not as clear cut as you might think.
One fundamental problem is that people (actual humans) aren't the only thing that needs to be able to send email. Unless you enjoy building your own system problem notification system from scratch, a whole lot of things will try to send you email to tell you about problems. Cron jobs will email you output, you may want to get similar email about systemd units, both Linux software RAID and smartd will want to use email to tell you about failures, you may have home-grown management systems, and so on. In addition to these programs on your servers, you may have inconvenient devices like networked multi-function photocopiers that have scan to email functionality (and the people who bought them and need to use them have feelings about being able to do so). In a university environment such as ours, some of the machines involved will be run by research groups, graduate students, and so on, not your core system administrators (and it's a very good idea if these machines can tell their owners about failed disks and the like).
Most of these programs will submit their email through the local mailer facilities (whatever they are), and most local mail systems ('MTAs') can be configured to use authentication when they talk to whatever SMTP gateway you point them at. So in theory you could insist on authenticated SMTP for everything. However, this gives you a different problem, because now you must manage this authentication. Do you give each machine its own authentication identity and password, or have some degree of shared authentication? How do you distribute and update this authentication information? How much manual work are you going to need to do as research groups add and remove machines (and as your servers come and go)? Are you going to try to build a system that restricts where a given authentication identity can be used from, so that someone can't make off with the photocopier's SMTP authorization and reuse it from their desktop?
(If you instead authorize IP addresses without requiring SMTP authentication, you've simply removed the requirement for handling and distributing passwords; you're still going to be updating some form of access list. Also, this has issues if people can use your servers.)
You can solve all of these problems if you want to. But there is no current general, easily deployed solution for them, partly because we don't currently have any general system of secure machine and service identity that programs like MTAs can sit on top of. So system administrators have to build such things ourselves to let one MTA prove to another MTA who and what it is.
(There are various ways to do this other than SMTP authentication and some of them are generally used in some environments; I understand that mutual TLS is common in some places. And I believe that in theory Kerberos could solve this, if everything used it.)
Every custom piece of software or piece of your environment that you build is an overhead; it has to be developed, maintained, updated, documented, and so on. It's not wrong to look at the amount of work it would require in your environment to have only authenticated SMTP and conclude that the practical risks of having unauthenticated SMTP are low enough that you'll just do that.
PS: requiring explicit authentication or authorization for notifications is itself a risk, because it means that a machine that's in a sufficiently bad or surprising state can't necessarily tell you about it. Your emergency notification system should ideally fail open, not fail closed.
PPS: In general, there are ways to make an unauthenticated SMTP server less risky, depending on what you need it to do. For example, in many environments there's no need to directly send such system notification email to arbitrary addresses outside the organization, so you could restrict what destinations the server accepts, and maybe what sending addresses can be used with it.
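For example, if the unauthenticated server ran Postfix (the entry doesn't say what it actually runs), a sketch of this sort of scoping in main.cf might look like the following, with made-up networks and table names:
  # Only accept connections from our internal networks, and only for
  # recipient domains listed in an access table; refuse everything else.
  mynetworks = 127.0.0.0/8, 10.0.0.0/8
  smtpd_client_restrictions = permit_mynetworks, reject
  smtpd_recipient_restrictions =
      check_recipient_access hash:/etc/postfix/internal-domains,
      reject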
Sometimes you need to (or have to) run old binaries of programs
Something that is probably not news to system administrators who've been doing this long enough is that sometimes, you need to or have to run old binaries of programs. I don't mean that you need to run old versions of things (although since the program binaries are old, they will be old versions); I mean that you literally need to run old binaries, ones that were built years ago.
The obvious situation where this can happen is if you have commercial software and the vendor either goes out of business or stops providing updates for the software. In some situations this can result in you needing to keep extremely old systems alive simply to run this old software, and there are lots of stories about 'business critical' software in this situation.
(One possibly apocryphal local story is that the central IT people had to keep a SPARC Solaris machine running for more than a decade past its feasible end of life because it was the only environment that ran a very special printer driver that was used to print payroll checks.)
However, you can also get into this situation with open source software too. Increasingly, rebuilding complex open source software projects is not for the faint of heart and requires complex build environments. Not infrequently, these build environments are 'fragile', in the sense that in practice they depend on and require specific versions of tools, supporting language interpreters and compilers, and so on. If you're trying to (re)build them on a modern version of the OS, you may find some issues (also). You can try to get and run the version of the tools they need, but this can rapidly send you down a difficult rabbit hole.
(If you go back far enough, you can run into 32-bit versus 64-bit issues. This isn't just compilation problems, where code isn't 64-bit safe; you can also have code that produces different results when built as a 64-bit binary.)
This can create two problems. First, historically, it complicates moving between CPU architectures. For a couple of decades that's been a non-issue for most Unix environments, because x86 was so dominant, but now ARM systems are starting to become more and more available and even attractive, and they generally don't run old x86 binaries very well. Second, there are some operating systems that don't promise long term binary compatibility to older versions of themselves; they will update system ABIs, removing the old version of the ABI after a while, and require you to rebuild software to use the new ABIs if you want to run it on the current version of the OS. If you have to use old binaries you're stuck with old versions of the OS and generally no security updates.
(If you think that this is absurd and no one would possibly do that, I will point you to OpenBSD, which does it regularly to help maintain and improve the security of the system. OpenBSD is neither wrong nor right to take their approach; they're making a different set of tradeoffs than, say, Linux, because they have different priorities.)
Some ways to restrict who can log in via OpenSSH and how they authenticate
In yesterday's entry on allowing password authentication from the Internet for SSH, I mentioned that there were ways to restrict who this was enabled for or who could log in through SSH. Today I want to cover some of them, using settings in /etc/ssh/sshd_config.
The simplest way is to globally restrict logins with AllowUsers, listing only specific accounts you want to be accessed over SSH. If there are too many such accounts or they change too often, you can switch to AllowGroups and allow only people in a specific group that you maintain, call it 'sshlogins'.
If you want to allow logins generally but restrict, say, password based authentication to only people that you expect, what you want is a Match block and setting AuthenticationMethods within it. You would set it up something like this:
AuthenticationMethods publickey
Match User cks
  AuthenticationMethods any
If you want to be able to log in using password from your local networks but not remotely, you could extend this with an additional Match directive that looked at the origin IP address:
Match Address 127.0.0.0/8,<your networks here>
  AuthenticationMethods any
In general, Match directives are your tool for doing relatively complex restrictions. You could, for example, arrange that accounts in a certain Unix group can only log in from the local network, never remotely. Or reverse this so that only logins in some Unix group can log in remotely, and everyone else is only allowed to use SSH within the local network.
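As a rough and untested sketch of the second arrangement (the group names and network range here are invented): globally allow only one group to log in, then widen that for connections coming from the local network:
  # Only 'remoteok' members may log in from anywhere; connections from
  # the local network may also come from 'localusers' members.
  AllowGroups remoteok
  Match Address 192.168.0.0/16
      AllowGroups remoteok localusers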
However, any time you're doing complex things with Match blocks, you should make sure to test your configuration to make sure it's working the way you want. OpenSSH's sshd_config is a configuration file with some additional capabilities, not a programming language, and there are undoubtedly some subtle interactions and traps you can fall into.
(This is one reason I'm not giving a lot of examples here; I'd have to carefully test them.)
Sidebar: Restricting root logins via OpenSSH
If you permit root logins via OpenSSH at all, one fun thing to do is to restrict where you'll accept them from:
PermitRootLogin no
Match Address 127.0.0.0/8,<your networks here>
  PermitRootLogin prohibit-password   # or 'yes' for some places
A lot of Internet SSH probers direct most of their effort against the root account. With this setting you're assured that all of them will fail no matter what.
Thoughts on having SSH allow password authentication from the Internet
On the Fediverse, I recently saw a poll about whether people left SSH generally accessible on its normal port or if they moved it; one of the replies was that the person left SSH on the normal port but disallowed password based authentication and only allowed public key authentication. This almost led to me posting a hot take, but then I decided that things were a bit more nuanced than my first reaction.
As everyone with an Internet-exposed SSH daemon knows, attackers are constantly attempting password guesses against various accounts. But if you're using a strong password, the odds of an attacker guessing it are extremely low, since doing 'password cracking via SSH' has an extremely low guesses per second number (enforced by your SSH daemon). In this sense, not accepting passwords over the Internet is at most a tiny practical increase in security (with some potential downsides in unusual situations).
Not accepting passwords from the Internet protects you against three other risks, two relatively obvious and one subtle one. First, it stops an attacker that can steal and then crack your encrypted passwords; this risk should be very low if you use strong passwords. Second, you're not exposed if your SSH server turns out to have a general vulnerability in password authentication that can be remotely exploited before a successful authentication. This might not be an authentication bypass; it might be some sort of corruption that leads to memory leaks, code execution, or the like. In practice, (OpenSSH) password authentication is a complex piece of code that interacts with things like your system's random set of PAM modules.
The third risk is that some piece of software will create a generic account with a predictable login name and known default password. These seem to be not uncommon, based on the fact that attackers probe incessantly for them, checking login names like 'ubuntu', 'debian', 'admin', 'testftp', 'mongodb', 'gitlab', and so on. Of course software shouldn't do this, but if something does, not allowing password authenticated SSH from the Internet will block access to these bad accounts. You can mitigate this risk by only accepting password authentication for specific, known accounts, for example only your own account.
The potential downside of only accepting keypair authentication for access to your account is that you might need to log in to your account in a situation where you don't have your keypair available (or can't use it). This is something that I probably care about more than most people, because as a system administrator I want to be able to log in to my desktop even in quite unusual situations. As long as I can use password authentication, I can use anything trustworthy that has a keyboard. Most people probably will only log in to their desktops (or servers) from other machines that they own and control, like laptops, tablets, or phones.
(You can opt to completely disallow password authentication from all other machines, even local ones. This is an even stronger and potentially more limiting restriction, since now you can't even log in from another one of your machines unless that machine has a suitable keypair set up. As a sysadmin, I'd never do that on my work desktop, since I very much want to be able to log in to my regular account from the console of one of our servers if I need to.)
My bug reports are mostly done for work these days
These days, I almost entirely report bugs in open source software as part of my work. A significant part of this is that most of what I stumble over bugs in are things that work uses (such as Ubuntu or OpenBSD), or at least things that I mostly use as part of work. There are some consequences of this that I feel like noting today.
The first is that I do bug investigation and bug reporting on work time during work hours, and I don't work on "work bugs" outside of that, on evenings, weekends, and holidays. This sometimes meshes awkwardly with the time open source projects have available for dealing with bugs (which is often in people's personal time outside of work hours), so sometimes I will reply to things and do additional followup investigation out of hours to keep a bug report moving along, but I mostly avoid it. Certainly the initial investigation and filing of a work bug is a working hours activity.
(I'm not always successful in keeping it to that because there is always the temptation to spend a few more minutes digging a bit more into the problem. This is especially acute when working from home.)
The second thing is that bug filing work is merely one of the claims on my work time. I have a finite amount of work time and a variety of things to get done with varying urgency, and filing and updating bugs is not always the top of the list. And just like other work activity, filing a particular bug has to convince me that it's worth spending some of my limited work time on this particular activity. Work does not pay me to file bugs and make open source better; they pay me to make our stuff work. Sometimes filing a bug is a good way to do this but some of the time it's not, for example because the organization in question doesn't respond to most bug reports.
(Even when it's useful in general to file a bug report because it will result in the issue being fixed at some point in the future, we generally need to deal with the problem today, so filing the bug report may take a back seat to things like developing workarounds.)
Another consequence is that it's much easier for me to make informal Fediverse posts about bugs (often as I discover more and more disconcerting things) or write Wandering Thoughts posts about work bugs than it is to make an actual bug report. Writing for Wandering Thoughts is a personal thing that I do outside of work hours, although I write about stuff from work (and I can often use something to write about, so interesting work bugs are good grist).
(There is also that making bug reports is not necessarily pleasant, and making bad bug reports can be bad. This interacts unpleasantly with the open source valorization of public work. To be blunt, I'm more willing to do unpleasant things when work is paying me than when it's not, although often the bug reports that are unpleasant to make are also the ones that aren't very useful to make.)
PS: All of this leads to a surprisingly common pattern where I'll spend much of a work day running down a bug to the point where I feel I understand it reasonably well, come home after work, write the bug up as a Wandering Thoughts entry (often clarifying my understanding of the bug in the process), and then file a bug report at work the next work day.
IMAP clients can vary in their reactions to IMAP errors
For reasons outside of the scope of this entry, we recently modified our IMAP server so that it would only return 20,000 results from an IMAP LIST command (technically 20,001 results). In our environment, an IMAP LIST operation that generates this many results happens because one of the people who can hit this has run into our IMAP server backward compatibility problem. When we made this change, we had a choice for what would happen when the limit was hit, and specifically we had a choice of whether to claim that the IMAP LIST operation had succeeded or had failed. In the end we decided it was better to report that the IMAP LIST operation had failed, which also allowed us to include a text message explaining what had happened (in IMAP these are relatively free form).
(The specifics of the situation are that the IMAP LIST command will report a stream of IMAP folders back to the client and then end the stream after 20,001 entries, with either an 'ok' result or an error result with text. So in the latter case, the IMAP client gets 20,001 folder entries and an error at the end.)
Unsurprisingly, after deploying this change we've seen that IMAP clients (both mail readers and things like server webmail code) vary in their behavior when this limit is hit. The behavior we'd like to see is that the client considers itself to have a partial result and uses it as much as possible, while also telling the person using it that something went wrong. I'm not sure any IMAP client actually does this. One webmail system that we use reports the entire output from the IMAP LIST command as an 'error' (or tries to); since the error message is the last part of the output, this means it's never visible. One mail client appears to throw away all of the LIST results and not report an error to the person using it, which in practice means that all of your folders disappear (apart from your inbox).
(Other mail clients appear to ignore the error and probably show the partial results they've received.)
Since the IMAP server streams the folder list from IMAP LIST to the client as it traverses the folders (ie, Unix directories), we don't immediately know if there are going to be too many results; we only find that out after we've already reported those 20,000 folders. But in hindsight, what we could have done is reported a final synthetic folder with a prominent explanatory name and then claimed that the command succeeded (and stopped). In practice this seems more likely to show something to the person using the mail client, since actually reporting the error text we provide is apparently not anywhere near as common as we might hope.
Using tcpdump to see only incoming or outgoing traffic
In the normal course of events, implementations of 'tcpdump' report on packets going in both directions, which is to say it reports both packets received and packets sent. Normally this isn't confusing and you can readily tell one from the other, but sometimes situations aren't normal and you want to see only incoming packets or only outgoing packets (this has come up before). Modern versions of tcpdump can do this, but you have to know where to look.
If you're monitoring regular network interfaces on Linux, FreeBSD, or OpenBSD, this behavior is controlled by a tcpdump command line switch. On modern Linux and on FreeBSD, this is '-Q in' or '-Q out', as covered in the Linux manpage and the FreeBSD manpage. On OpenBSD, you use a different command line switch, '-D in' or '-D out', per the OpenBSD manpage.
(The Linux and FreeBSD tcpdump use '-D' to mean 'list all interfaces'.)
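Concretely, restricting a capture to one direction looks something like this (the interface names are just examples):
  # Linux or FreeBSD: only packets received on the interface
  tcpdump -n -Q in -i eth0
  # OpenBSD: only packets sent out the interface
  tcpdump -n -D out -i em0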
There are network types where the in or out direction can be matched by tcpdump pcap filter rules, but plain Ethernet is not one of them. This implies that you can't write a pcap filter rule that matches some packets only inbound and some packets only outbound at the same time; instead you have to run two tcpdumps.
If you have a (software) bridge interface or bridged collection of interfaces, as far as I know on both OpenBSD and FreeBSD the 'in' and 'out' directions on the underlying physical interfaces work the way you expect. Which is to say, if you have ix0 and ix1 bridged together as bridge0, 'tcpdump -Q in -i ix0' shows packets that ix0 is receiving from the physical network and doesn't include packets forwarded out through ix0 by the bridge interface (which in some sense you could say are 'sent' to ix0 by the bridge).
The PF packet filter system on both OpenBSD and FreeBSD can log packets to a special network interface, normally 'pflog0'. When you tcpdump this interface, both OpenBSD and FreeBSD accept an 'on <interface>' (which these days is a synonym for 'ifname <interface>') clause in pcap filters, which I believe means that the packet was received on the specific interface (per my entry on various filtering options for OpenBSD). Both also have 'inbound' and 'outbound', which I believe match based on whether the particular PF rule that caused them to match was an 'in' or an 'out' rule.
(See the OpenBSD pcap-filter and the FreeBSD pcap-filter manual pages.)
I'm firmly attached to a mouse and (overlapping) windows
In the tech circles I follow, there are a number of people who are firmly in what I could call a 'text mode' camp (eg, also). Over on the Fediverse, I said something in an aside about my personal tastes:
(Having used Unix through serial terminals or modems+emulators thereof back in the days, I am not personally interested in going back to a single text console/window experience, but it is certainly an option for simplicity.)
(Although I didn't put it in my Fediverse post, my experience with this 'single text console' environment extends beyond Unix. Similarly, I've lived without a mouse and now I want one (although I have particular tastes in mice).)
On the surface I might seem like someone who is a good candidate for the single pane of text experience, since I do much of my work in text windows, either terminals or environments (like GNU Emacs) that ape them, and I routinely do odd things like read email from the command line. But under the surface, I'm very much not. I very much like having multiple separate blocks of text around, being able to organize these blocks spatially, having a core area where I mostly work from with peripheral areas for additional things, and being able to overlap these blocks and apply a stacking order to control what is completely visible and what's partly visible.
In one view, you could say that this works partly because I have enough screen space. In another view, it would be better to say that I've organized my computing environment to have this screen space (and the other aspects). I've chosen to use desktop computers instead of portable ones, partly for increased screen space, and I've consistently opted for relatively large screens when I could reasonably get them, steadily moving up in screen size (both physical and resolution wise) over time.
(Over the years I've gone out of my way to have this sort of environment, including using unusual window systems.)
The core reason I reach for windows and a mouse is simple: I find the pure text alternative to be too confining. I can work in it if I have to but I don't like to. Using finer grained graphical windows instead of text based ones (in a text windowing environment, which exist), and being able to use a mouse to manipulate things instead of always having to use keyboard commands, is nicer for me. This extends beyond shell sessions to other things as well; for example, generally I would rather start new (X) windows for additional Emacs or vim activities rather than try to do everything through the text based multi-window features that each has. Similarly, I almost never use screen (or tmux) within my graphical desktop; the only time I reach for either is when I'm doing something critical that I might be disconnected from.
(This doesn't mean that I use a standard Unix desktop environment for my main desktops; I have a quite different desktop environment. I've also written a number of tools to make various aspects of this multi-window environment be easy to use in a work environment that involves routine access to and use of a bunch of different machines.)
If I liked tiling based window environments, it would be easier to switch to a text (console) based environment with text based tiling of 'windows', and I would probably be less strongly attached to the mouse (although it's hard to beat the mouse for selecting text). However, tiling window environments don't appeal to me (also), either in graphical or in text form. I'll use tiling in environments where it's the natural choice (for example, in vim and emacs), but I consider it merely okay.
The TLS certificate multi-file problem (for automatic updates)
In a recent entry on short lived TLS certificates and graceful certificate rollover in web servers, I mentioned that one issue with software automatically reloading TLS certificates was that TLS certificates are almost always stored in multiple files. Typically this is either two files (the TLS certificate's key and a 'fullchain' file with the TLS certificate and intermediate certificates together) or three files (the key, the signed certificate, and a third file with the intermediate chain). The core problem this creates is the same one you have any time information is split across multiple files, namely making 'atomic' changes to the set of files, so that software never sees an inconsistent state with some updated files and some not.
With TLS certificates, a mismatch between the key and the signed certificate will cause the server to be unable to properly prove that it controls the private key for the TLS certificate it presented. Either it will load the new key and the old certificate or the old key and the new certificate, and in both cases they won't be able to generate the correct proof (assuming the secure case where your TLS certificate software generates a new key for each TLS certificate renewal, which you want to do since you want to guard against your private key having been compromised).
The potential for a mismatch is obvious if the file with the TLS key and the file with the TLS certificate are updated separately (or a new version is written out and swapped into place separately). At this point your mind might turn to clever tricks like writing all of the new files to a new directory and somehow swapping the whole directory in at once (this is certainly where mine went). Unfortunately, even this isn't good enough because the program has to open the two (or three) files separately, and the time gap between the opens creates an opportunity for a mismatch more or less no matter what we do.
(If the low level TLS software operates by, for example, first loading and parsing the TLS certificate, then loading the private key to verify that it matches, the time window may be bigger than you expect because the parsing may take a bit of time. The minimal time window comes about if you open the two files as close to each other as possible and defer all loading and processing until after both are opened.)
The only completely sure way to get around this is to put everything in one file (and then use an appropriate way to update the file atomically). Short of that, I believe that software could try to compensate by checking that the private key and the TLS certificate match after they're automatically reloaded, and if they don't, it should reload both.
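One way such a check could work is to compare the public key embedded in the certificate with the public key derived from the private key; here is a sketch using the openssl command line (the file names are examples):
  # These two digests should be identical if the key and certificate match.
  openssl x509 -in fullchain.pem -noout -pubkey | sha256sum
  openssl pkey -in privkey.pem -pubout | sha256sum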
(If you control both the software that will use the TLS certificates and the renewal software, you can do other things. For example, you can always update the files in a specific order and then make the server software trigger an automatic reload only when the timestamp changes on the last file to be updated. That way you know the update is 'done' by the time you're loading anything.)
Remembering to make my local changes emit log messages when they act
Over on the Fediverse, I said something:
Current status: respinning an Ubuntu package build (... painfully) because I forgot the golden rule that when I add a hack to something, I should always make it log when my hack was triggered. Even if I can observe the side effects in testing, we'll want to know it happened in production.
(Okay, this isn't applicable to all hacks, but.)
Every so often we change or augment some standard piece of software or standard part of the system to do something special under specific circumstances. A rule I keep forgetting and then either re-learning or reminding myself of is that even if the effects of my change triggering are visible to the person using the system, I want to make it log as well. There are at least two reasons for this.
The first reason is that my change may wind up causing some problem for people, even if we don't think it's going to. Should it cause such problems, it's very useful to have a log message (perhaps shortly before the problem happens) to the effect of 'I did this new thing'. This can save a bunch of troubleshooting, both at the time when we deploy this change and long afterward.
The second reason is that we may turn out to be wrong about how often our change triggers, which is to say how common the specific circumstances are. This can go either way. Our change can trigger a lot more than we expected, which may mean that it's overly aggressive and is affecting people more than we want, and cause us to look for other options. Or this could be because the issue we're trying to deal with could be more significant than we expect and justifies us doing even more. Alternately, our logging can trigger a lot less than we expect, which may mean we want to take the change out rather than have to maintain a local modification that doesn't actually do much (one that almost invariably makes the system more complex and harder to understand).
In the log message itself, I want to be clear and specific, although probably not as verbose as I would be for an infrequent error message. Especially for things I expect to trigger relatively infrequently, I should probably put as many details about the special circumstances as possible into the log message, because the log message is what me and my co-workers may have to work from in six months when we've forgotten the details.
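For shell-level hacks, even something as simple as logger(1) is enough; a hypothetical sketch (the hack and its details are invented for illustration):
  # Record that our local modification fired and why, so that future
  # troubleshooting (and counting how often it triggers) has something to go on.
  logger -t local-hack -p daemon.notice \
      "quota-bypass: allowed oversized submission for $USER (local change)"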
PCIe cards we use and have used in our servers
In a comment on my entry on how common (desktop) motherboards are supporting more M.2 NVMe slots but fewer PCIe cards, jmassey was curious about what PCIe cards we needed and used. This is a good and interesting question, especially since some number of our 'servers' are actually built using desktop motherboards for various reasons (for example, a certain number of the GPU nodes in our SLURM cluster, and some of our older compute servers, which we put together ourselves using early generation AMD Threadrippers and desktop motherboards for them).
Today, we have three dominant patterns of PCIe cards. Our SLURM GPU nodes obviously have a GPU card (x16 PCIe lanes) and we've added a single port 10G-T card (which I believe are all PCIe x4) so they can pull data from our fileservers as fast as possible. Most of our firewalls have an extra dual-port 10G card (mostly 10G-T but a few use SFPs). And a number of machines have dual-port 1G cards because they need to be on more networks; our current stock of these cards are physically x4 PCIe, although I haven't looked to see if they use all the lanes.
(We also have single-port 1G cards lying around that sometimes get used in various machines; these are x1 cards. The dual-port 10G cards are probably some mix of x4 and x8, since online checks say they come in both varieties. We have and use a few quad-port 1G cards for semi-exotic situations, but I'm not sure how many PCIe lanes they want, physically or otherwise. In theory they could reasonably be x4, since a single 1G is fine at x1.)
In the past, one generation of our fileserver setup had some machines that needed to use a PCIe SAS controller in order to be able to talk to all of the drives in their chassis, and I believe these cards were PCIe x8; these machines also used a dual 10G-T card. The current generation handles all of their drives through motherboard controllers, but we might need to move back to cards in future hardware configurations (depending on what the available server motherboards handle on the motherboard). The good news, for fileservers, is that modern server motherboards increasingly have at least one onboard 10G port. But in a worst case situation, a large fileserver might need two SAS controller cards and a 10G card.
It's possible that we'll want to add NVMe drives to some servers (parts of our backup system may be limited by SATA write and read speeds today). Since I don't believe any of our current servers support PCIe bifurcation, this would require one or two PCIe x4 cards and slots (two if we want to mirror this fast storage, one if we decide we don't care). Such a server would likely also want 10G; if it didn't have a motherboard 10G port, that would require another x4 card (or possibly a dual-port 10G card at x8).
The good news for us is that servers tend to make all of their available slots physically large (generally large enough for x8 cards, and maybe even x16 these days), so you can fit in all these cards even if some of them don't get all the PCIe lanes they'd like. And modern server CPUs are also coming with more and more PCIe lanes, so we can probably drive many of those slots at their full width.
(I was going to say that modern server motherboards mostly don't design in M.2 slots that reduce the available PCIe lanes, but that seems to depend on what vendor you look at. A random sampling of Supermicro server motherboards suggests that two M.2 slots are not uncommon, while our Dell R350s have none.)
The modern world of server serial ports, BMCs, and IPMI Serial over LAN
Once upon a time, life was relatively simple in the x86 world. Most x86 compatible PCs theoretically had one or two UARTs, which were called COM1 and COM2 by MS-DOS and Windows, ttyS0 and ttyS1 by Linux, 'ttyu0' and 'ttyu1' by FreeBSD, and so on, based on standard x86 IO port addresses for them. Servers had a physical serial port on the back and wired the connector to COM1 (some servers might have two connectors). Then life became more complicated when servers implemented BMCs (Baseboard management controllers) and the IPMI specification added Serial over LAN, to let you talk to your server through what the server believed was a serial port but was actually a connection through the BMC, coming over your management network.
Early BMCs could take very brute force approaches to making this work. The circa 2008 era Sunfire X2200s we used in our first ZFS fileservers wired the motherboard serial port to the BMC and connected the BMC to the physical serial port on the back of the server. When you talked to the serial port after the machine powered on, you were actually talking to the BMC; to get to the server serial port, you had to log in to the BMC and do an arcane sequence to 'connect' to the server serial port. The BMC didn't save or buffer up server serial output from before you connected; such output was just lost.
(Given our long standing console server, we had feelings about having to manually do things to get the real server serial console to show up so we could start logging kernel console output.)
Modern servers and their BMCs are quite intertwined, so I suspect that often both server serial ports are basically implemented by the BMC (cf), or at least are wired to it. The BMC passes one serial port through to the physical connector (if your server has one) and handles the other itself to implement Serial over LAN. There are variants on this design possible; for example, we have one set of Supermicro hardware with no external physical serial connector, just one serial header on the motherboard and a BMC Serial over LAN port. To be unhelpful, the motherboard serial header is ttyS0 and the BMC SOL port is ttyS1.
When the BMC handles both server serial ports and passes one of them through to the physical serial port, it can decide which one to pass through and which one to use as the Serial over LAN port. Being able to change this in the BMC is convenient if you want to have a common server operating system configuration but use a physical serial port on some machines and use Serial over LAN on others. With the BMC switching which server serial port comes out on the external serial connector, you can tell all of the server OS installs to use 'ttyS0' as their serial console, then connect ttyS0 to either Serial over LAN or the physical serial port as you need.
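As a rough illustration of that approach (the device names, baud rate, and BMC details below are assumed placeholders, not our settings), the Linux side can be as simple as pointing the kernel and bootloader consoles at ttyS0, with the Serial over LAN side then reached through ipmitool:

# /etc/default/grub: send the kernel and GRUB consoles to ttyS0, whichever
# port the BMC currently maps that to (physical connector or Serial over LAN).
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"

# Reach the Serial over LAN side through the BMC, assuming IPMI over LAN
# is enabled and you have BMC credentials for it:
ipmitool -I lanplus -H bmc-host -U admin -P secret sol activate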
Some BMCs (I'm looking at you, Dell) go to an extra level of indirection. In these, the BMC has an idea of 'serial device 1' and 'serial device 2', with you controlling which of the server's ttyS0 and ttyS1 maps to which 'serial device', and then it has a separate setting for which 'serial device' is mapped to the physical serial connector on the back. This helpfully requires you to look at two separate settings to know if your ttyS0 will be appearing on the physical connector or as a Serial over LAN console (and gives you two settings that can be wrong).
In theory a BMC could share a single server serial port between the physical serial connector and an IPMI Serial over LAN connection, sending output to both and accepting input from each. In practice I don't think most BMCs do this and there are obvious issues of two people interfering with each other that BMCs may not want to get involved in.
PS: I expect more and more servers to drop external serial ports over time, retaining at most an internal serial header on the motherboard. That might simplify BMC and BIOS settings.
My life has been improved by my quiet Prometheus alert status monitor
I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.
(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)
After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I actually get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).
Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).
My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.
(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)
My new solution for quiet monitoring of our Prometheus alerts
Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't doing so. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.
At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.
(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)
The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.
Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.
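As a rough sketch of the fetching and filtering side (the Prometheus URL and the set of ignored pending alerts are placeholders, not our actual configuration), such a program can be quite small:

import requests

PROM_URL = "http://prometheus.example.org:9090/api/v1/query"   # placeholder
IGNORED_PENDING = {"SomeFlakyAlert"}   # hypothetical too-flaky pending alerts

def current_alerts():
    # ALERTS has one time series per pending or firing alert, carrying
    # 'alertname' and 'alertstate' plus the alert rule's own labels.
    r = requests.get(PROM_URL, params={"query": "ALERTS"}, timeout=10)
    r.raise_for_status()
    alerts = []
    for ts in r.json()["data"]["result"]:
        labels = ts["metric"]
        if labels["alertstate"] == "pending" and labels["alertname"] in IGNORED_PENDING:
            continue
        alerts.append(labels)
    return alerts

for a in current_alerts():
    print(a.get("instance", "?"), a["alertname"], a["alertstate"])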
To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).
(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)
To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:
ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS
I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
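Putting the pieces together, the state number computation looks roughly like the following sketch (the starting epoch and URL are illustrative assumptions, not the real values):

import requests

PROM_URL = "http://prometheus.example.org:9090/api/v1/query"   # placeholder
EPOCH = 1_600_000_000   # assumed starting epoch to keep the sum small
QUERY = "ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS"

def state_number():
    r = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    r.raise_for_status()
    total = 0
    for ts in r.json()["data"]["result"]:
        # ALERTS_FOR_STATE's value is the alert's start time (Unix seconds).
        start = int(float(ts["value"][1]))
        if ts["metric"]["alertstate"] == "firing":
            start += 1   # so a pending -> firing transition changes the state
        total += start - EPOCH
    return total & 0x7fffffff   # keep it within a signed 32-bit number

print(state_number())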
PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).
Our Prometheus alerting problem if our central mail server isn't working
Over on the Fediverse, I said something:
Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?
(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)
There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.
The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving any for very long during working hours is unusual enough that someone would notice before too long; in fact my co-worker noticed the problem even without an alert actively being triggered. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooted, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.
(In that old version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)
One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email to SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access and we'd build some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).
For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).
(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)
IPv6 networks do apparently get probed (and implications for address assignment)
For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.
On my old /64, I used ::1/64 and ::2/64 for static IP addresses, of which the first was in DNS, and the other IPv6 addresses in it were the usual SLAAC assignments. The first thing I discovered in my tcpdump was a surprisingly large number of cloud-based IPv6 addresses that were pinging my ::1 address. Once I excluded that traffic, I was left with enough volume of port probes that I could easily see them in a casual tcpdump.
The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.
(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)
As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.
One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.
PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.
Your options for displaying status over time in Grafana 11
A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.
Generally, the best panel type to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.
You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.
(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)
As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the function of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).
Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.
If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.
Finding a good use for keep_firing_for in our Prometheus alerts
A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.
Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clean out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).
The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.
Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another usefully short duration to de-bounce our large scale alert trigger conditions.
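As a sketch of the shape of such a rule (the alert name, metric, and numbers here are made up for illustration, not our actual rules):

groups:
  - name: meta-alerts
    rules:
      - alert: LargeScaleIssue
        # Hypothetical trigger condition: too many ping failures at once.
        expr: count(probe_icmp_failure == 1) > 10
        # De-bounce brief blips before declaring a large scale issue.
        for: 1m
        # Linger after the condition clears so the inhibition keeps
        # suppressing ordinary alerts while the big issue drains away.
        keep_firing_for: 10m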
(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)
I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).
Prometheus makes it annoyingly difficult to add more information to alerts
Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.
While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.
I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.
A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to find a label, I think you need to use a more complex template with a 'with' operation; on the positive side, this may let you format a message fragment with multiple labels and even the value.
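For example, here is a sketch of annotations that pull in a count and a label from side queries (the queries and annotation names are made up; the 'with ... else ... end' structure is what avoids complaints when a query returns nothing):

annotations:
  downcount: >-
    {{ with query "count(up == 0)" }}{{ . | first | value | humanize }}
    scrape targets are down{{ else }}down-target count unavailable{{ end }}
  worstjob: >-
    {{ with query "topk(1, count by (job) (up == 0))" }}hardest hit job:
    {{ . | first | label "job" }} ({{ . | first | value }} down){{ end }}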
More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alerts between Alertmanager and Prometheus alert rules. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information out again with Go template functions.
What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.
(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels and etc.)
Doing general address matching against varying address lists in Exim
In various Exim setups, you sometimes want to match an email address against a file (or in general a list) of addresses and some sort of address patterns; for example, you might have a file of addresses and so on that you will never accept as sender addresses. Exim has two different mechanisms for doing this, address lists and nwildlsearch lookups in files that are performed through the '${lookup}' string expansion item. Generally it's better to use address lists, because they have a wildcard syntax that's specifically focused on email addresses, instead of the less useful nwildlsearch lookup wildcarding.
Exim has specific features for matching address lists (including in file form) against certain addresses associated with the email message; for example, both ACLs and routers can match against the envelope sender address (the SMTP MAIL FROM) using 'senders = ...'. If you want to match against message addresses that are not available this way, you must use a generic 'condition =' operation and either '${lookup}' or '${if match_address {..}{...}}', depending on whether you want to use a nwildlsearch lookup or an actual address list (likely in a file). As mentioned, normally you'd prefer to use an actual address list.
Now suppose that your file of addresses is, for example, per-user. In a straight 'senders =' match this is no problem, you can just write 'senders = /some/where/$local_part_data/addrs'. Life is not as easy if you want to match a message address that is not directly supported, for example the email address of the 'From:' header. If you have the user (or whatever other varying thing) in $acl_m0_var, you would like to write:
condition = ${if match_address {${address:$h_from:}} {/a/dir/$acl_m0_var/fromaddrs} }
However, match_address (and its friends) have a deliberate limitation, which is that in common Exim build configurations they don't perform string expansion on their second argument.
The way around this turns out to be to use an explicitly defined and named 'addresslist' that has the string expansion:
addresslist badfromaddrs = /a/dir/$acl_m0_var/fromaddrs

[...]

condition = ${if match_address {${address:$h_from:}} {+badfromaddrs} }
This looks weird, since at the point we're setting up badfromaddrs the $acl_m0_var is not even vaguely defined, but it works. The important thing that makes this go is a little sentence at the start of the Exim documentation's Expansion of lists:
Each list is expanded as a single string before it is used. [...]
Although the second argument of match_address is not itself string-expanded, if it specifies a named address list, that address list is string-expanded when it's used, so our $acl_m0_var variable is substituted in and everything works.
Speaking from personal experience, it's easy to miss this sentence and its importance, especially if you normally use address lists (and domain lists and so on) without any string expansion, with fixed arguments.
(Probably the only reason I found it was that I was in the process of writing a question to the Exim mailing list, which of course got me to look really closely at the documentation to make sure I wasn't asking a stupid question.)
Having rate-limits on failed authentication attempts is reassuring
A while back I added rate-limits to failed SMTP authentication attempts. Mostly I did it because I was irritated at seeing all of the failed (SMTP) authentication attempts in logs and activity summaries; I didn't think we were in any actual danger from the usual brute force mass password guessing attacks we see on the Internet. To my surprise, having this rate-limit in place has been quite reassuring, to the point where I no longer even bother looking at the overall rate of SMTP authentication failures or their sources. Attackers are unlikely to make much headway or have much of an impact on the system.
Similarly, we recently updated an OpenBSD machine that has its SSH port open to the Internet from OpenBSD 7.5 to OpenBSD 7.6. One of the things that OpenBSD 7.6 brings with it is the latest version of OpenSSH, 9.8, which has per-source authentication rate limits (although they're not quite described that way and the feature is more general). This was also a reassuring change. Attackers wouldn't be getting into the machine in any case, but I have seen the machine use an awful lot of CPU at times when attackers were pounding away, and now they're not going to be able to do that.
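For reference, and as my rough understanding rather than a tested configuration, the OpenSSH 9.8 feature involved is the PerSourcePenalties setting in sshd_config; the durations below are made-up examples and the exact sub-option syntax should be checked against your sshd_config(5):

# Penalize source addresses whose connections keep failing authentication.
PerSourcePenalties authfail:10s max:10m
# Hypothetical exemption for a trusted management network.
PerSourcePenaltyExemptList 192.0.2.0/24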
(We've long had firewall rate limits on connections, but they have to be set high for various reasons including that the firewall can't tell connections that fail to authenticate apart from brief ones that did.)
I can wave my hands about why it feels reassuring (and nice) to know that we have rate-limits in place for (some) commonly targeted authentication vectors. I know it doesn't outright eliminate the potential exposure, but I also know that it helps reduce various risks. Overall, I think of it as making things quieter, and in some sense we're no longer getting constantly attacked as much.
(It's also nice to hope that we're frustrating attackers and wasting their time. They do sort of have limits on how much time they have and how many machines they can use and so on, so our rate limits make attacking us more 'costly' and less useful, especially if they trigger our rate limits.)
PS: At the same time, this shows my irrationality, because for a long time I didn't even think about how many SSH or SMTP authentication attempts were being made against us. It was only after I put together some dashboards about this in our metrics system that I started thinking about it (and seeing temporary changes in SSH patterns and interesting SMTP and IMAP patterns). Had I never looked, I would have never thought about it.
Our various different types of Ubuntu installs
In my entry on how we have lots of local customizations I mentioned that the amount of customization we do to any particular Ubuntu server depends on what class or type of machine they are. That's a little abstract, so let's talk about how our various machines are split up by type.
Our general install framework has two pivotal questions that categorize machines. The first question is what degree of NFS mounting the machine will do, with the choices being all of the NFS filesystems from our fileservers (more or less), NFS mounting just our central administrative filesystem either with our full set of accounts or with just staff accounts, rsync'ing that central administrative filesystem (which implies only staff accounts), or being a completely isolated machine that doesn't have even the central administrative filesystem.
Servers that people will use have to have all of our NFS filesystems mounted, as do things like our Samba and IMAP servers. Our fileservers don't cross-mount NFS filesystems from each other, but they do need a replicated copy of our central administrative filesystem and they have to have our full collection of logins and groups for NFS reasons. Many of our more stand-alone, special purpose servers only need our central administrative filesystem, and will either NFS mount it or rsync it depending on how fast we want updates to propagate. For example, our local DNS resolvers don't particularly need fast updates, but our external mail gateway needs to be up to date on what email addresses exist, which is propagated through our central administrative filesystem.
On machines that have all of our NFS mounts, we have a further type choice; we can install them either as a general login server (called an 'apps' server for historical reasons), as a 'comps' compute server (which includes our SLURM nodes), or only install a smaller 'base' set of packages on them (which is not all that small; we used to try to have a 'core' package set and a larger 'base' package set but over time we found we never installed machines with only the 'core' set). These days the only difference between general login servers and compute servers is some system settings, but in the past they used to have somewhat different package sets.
The general login servers and compute servers are mostly not further customized (there are a few exceptions, and SLURM nodes need a bit of additional setup). Almost all machines that get only the base package set are further customized with additional packages and specific configuration for their purpose, because the base package set by itself doesn't make the machine do anything much or be particularly useful. These further customizations mostly aren't scripted (or otherwise automated) for various reasons. The one big exception is installing our NFS fileservers, which was a big enough job (and we had enough of them) that we decided to script it so that everything came out the same.
As a practical matter, the choice between NFS mounting our central administrative filesystem (with only staff accounts) and rsync'ing it makes almost no difference to the resulting install. We tend to think of the two types of servers it creates as almost equivalent and mostly lump them together. So as far as operating our machines goes, we mostly have 'all NFS mounts' machines and 'only the administrative filesystem' machines, with a few rare machines that don't have anything (and our NFS fileservers, which are special in their own way).
(In the modern Linux world of systemd, much of our customizations aren't Ubuntu specific, or even specific to Debian and derived systems that use apt-get. We could probably switch to Debian relatively easily with only modest changes, and to an RPM based distribution with more work.)
We have lots of local customizations (and how we keep track of them)
In a comment on my entry on forgetting some of our local changes to our Ubuntu installs, pk left an interesting and useful comment on how they manage changes so that the changes are readily visible in one place. This is a very good idea and we do something similar to it, but a general limitation of all such approaches is that it's still hard to remember all of your changes off the top of your head once you've made enough of them. Once you're changing enough things, you generally can't put them all in one directory that you can simply 'ls' to be reminded of everything you change; at best, you're looking at a list of directories where you change things.
Our system for customizing Ubuntu stores the master version of customizations in our central administrative filesystem, although split across several places for convenience. We broadly have one directory hierarchy for Ubuntu release specific files (or at least ones that are potentially version specific; in practice a lot are the same between different Ubuntu releases), a second hierarchy (or two) for files that are generic across Ubuntu versions (or should be), and then a per-machine hierarchy for things specific to a single machine. Each hierarchy mirrors the final filesystem location, so that our systemd unit files will be in, for example, <hierarchy root>/etc/systemd/system.
Our current setup embeds the knowledge of what files will or won't be installed on any particular class of machines into the Ubuntu release specific 'postinstall' script that we run to customize machines, in the form of a whole bunch of shell commands to copy each of the files (or collections of files). This gives us straightforward handling of files that aren't always installed (or that vary between types of machines), at the cost of making it a little unclear whether a particular file in the master hierarchy will actually be installed. We could try to do something more clever, but it would be less obvious than the current straightforward approach, where the postinstall script has a lot of 'cp -a <src>/etc/<file> /etc/<file>' commands and it's easy to see what you need to do to add a file or handle one specially.
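To illustrate the style (with made-up hierarchy, variable, and file names rather than our real ones), the postinstall script is essentially a long run of lines like:

# Install our standard systemd units and general configuration.
cp -a $MASTER/etc/systemd/system/local-sync.service /etc/systemd/system/
cp -a $MASTER/etc/rsyslog.d/99-local.conf /etc/rsyslog.d/
# Machines with all of our NFS mounts also get the automounter setup.
if [ "$NFSTYPE" = "full" ]; then
    cp -a $MASTER/etc/auto.master /etc/auto.master
fi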
(The obvious alternate approach would be to have a master file that listed all of the files to be installed on each type of machine. However, one advantage of the current approach is that it's easy to have various commentary about the files being installed and why, and it's also easy to run commands, install packages, and so on in between installing various files. We don't install them all at once.)
Based on some brute force approximation, it appears that we install around 100 customization files on a typical Ubuntu machine (we install more on some types of machines than on other types, depending on whether the machine will have all of our NFS mounts and whether or not it's a machine regular people will log in to). Specific machines can be significantly customized beyond this; for example, our ZFS fileservers get an additional scripted customization pass.
PS: The reason we have this stuff scripted and stored in a central filesystem is that we have over a hundred servers and a lot of them are basically identical to each other (most obviously, our SLURM nodes). In aggregate, we install and reinstall a fair number of machines and almost all of them have this common core.
Our local changes to standard (Ubuntu) installs are easy to forget
We have been progressively replacing a number of old one-off Linux machines with up to date replacements that run Ubuntu and so are based on our standard Ubuntu install. One of those machines has a special feature where a group of people are allowed to use passworded sudo to gain access to a common holding account. After we deployed the updated machine, these people got in touch with us to report that something had gone wrong with the sudo system. This was weird to me, because I'd made sure to faithfully replicate the old system's sudo customizations to the new one. When I did some testing, things got weirder; I discovered that sudo was demanding the root password instead of my password. This was definitely not how things were supposed to work for this sudo access (especially since the people with sudo access don't know the root password for the machine).
Whether or not sudo does this is controlled by the setting of 'rootpw' in sudoers or one of the files it includes (at least with Ubuntu's standard sudo.conf). The stock Ubuntu sudoers doesn't set 'rootpw', and of course this machine's sudoers customizations didn't set it either. But when I looked around, I discovered that we had long ago set up an /etc/sudoers.d customization file to set 'rootpw' and made it part of our standard Ubuntu install. When I rebuilt this machine based on our standard Ubuntu setup, the standard install stuff had installed this sudo customization. Since we'd long ago completely forgotten about its existence, I hadn't remembered it while customizing the machine to its new purpose, so it had stayed.
(We don't normally use passworded sudo, and we definitely want access to root to require someone to know the special root password, not just the password to a sysadmin's account.)
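For illustration, the setting involved is a single sudoers flag; a drop-in like the following (the file name here is made up) is all it takes, and an explicit 'Defaults !rootpw' later in the include order would turn it back off:

# /etc/sudoers.d/cslab-rootpw (illustrative name)
# Make sudo ask for root's password rather than the invoking user's.
Defaults rootpw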
There are probably a lot of things that we've added to our standard install over the years that are like this sudo customization. They exist to make things work (or not work), and as long as they keep quietly doing their jobs it's very easy to forget them and their effects. Then we do something exceptional on a machine and they crop up, whether it's preventing sudo from working like we want it to or almost giving us a recursive syslog server.
(I don't have any particular lesson to draw from this, except that it's surprisingly difficult to de-customize a machine. One might think the answer is to set up the machine from scratch outside our standard install framework, but the reality is that there's a lot from the standard framework that we still want on such machines. Even with issues like this, it's probably easier to install them normally and then fix the issues than do a completely stock Ubuntu server install.)
Some thoughts on why 'inetd activation' didn't catch on
Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them; it dates from the era of 4.3 BSD. In theory inetd can act as a service manager of sorts for daemons like the BSD r* commands, saving them from having to implement things like daemonization, and in fact it turns out that one version of this is how these daemons were run in 4.3 BSD. However, running daemons under inetd never really caught on (even in 4.3 BSD some important daemons ran outside of inetd), and these days it's basically dead. You could ask why, and I have some thoughts on that.
The initial version of inetd only officially supported running TCP services in a mode where each connection ran a new instance of the program (call this the CGI model). On the machines of the 1980s and 1990s, this wasn't a particularly attractive way to run anything but relatively small and simple programs (and ones that didn't have to do much work on startup). In theory you could possibly run TCP services in a mode where they were passed the server socket and then accepted new connections themselves for a while; in practice, no one seems to have really written daemons that supported this. Daemons that supported an 'inetd mode' generally meant the 'run a copy of the program for each connection' mode.
(Possibly some of them supported both modes of inetd operation, but system administrators would pretty much assume that if a daemon's documentation said just 'inetd mode' that it meant the CGI model.)
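For concreteness, the CGI model is what a classic inetd.conf line expresses; 'nowait' here tells inetd to fork and run a fresh copy of the program for every TCP connection (the daemon path is an illustrative guess and varies between Unixes):

# service  type    proto  wait/nowait  user  program             arguments
ftp        stream  tcp    nowait       root  /usr/libexec/ftpd   ftpd -l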
Another issue is that inetd is not a service manager. It will start things for you, but that's it; it won't shut down things for you (although you can get it to stop listening on a port), and it won't tell you what's running (you get to inspect the process list). On Unixes with a System V init system or something like it, running your daemons as standalone things gave you access to start, stop, restart, status, and so on service management options that might even work (depending on the quality of the init.d scripts involved). Since daemons had better usability when run as standalone services, system administrators and others had relatively little reason to push for inetd support, especially in the second mode.
In general, running any important daemon under inetd has many of the same downsides as systemd socket activation of services. As a practical matter, system administrators like to know that important daemons are up and running right away, and that they don't have some hidden issue that will cause them to fail to start just when you want them. The normal CGI-like inetd mode also means that any changes to configuration files and the like take effect right away, which may not be what you want; system administrators tend to like controlling when daemons restart with new configurations.
All of this is likely tied to what we could call 'cultural factors'. I suspect that authors of daemons perceived running standalone as the more serious and prestigious option, the one for serious daemons like named and sendmail, and inetd activation to be at most a secondary feature. If you wrote a daemon that only worked with inetd activation, you'd practically be proclaiming that you saw your program as a low importance thing. This obviously reinforces itself, to the degree that I'm surprised sshd even has an option to run under inetd.
(While some Linuxes are now using systemd socket activation for sshd, they aren't doing it via its '-i' option.)
PS: There are some services that do still generally run under inetd (or xinetd, often the modern replacement, cf). For example, I'm not sure if the Amanda backup system even has an option to run its daemons as standalone things.
Brief notes on making Prometheus's SNMP exporter use additional SNMP MIB(s)
Suppose, not entirely hypothetically, that you have a DSL modem that exposes information about the state of your DSL link through SNMP, and you would like to get that information into Prometheus so that you could track it over time (for reasons). You could scrape this information by 'hand' using scripts, but Prometheus has an officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only"; how you do things with it if its stock configuration doesn't meet your needs is what I would call rather underdocumented.
The first thing you'll need to do is find out what generally known and unknown SNMP attributes ('OIDs') your device exposes. You can do this using tools like snmpwalk, and see also some general information on reading things over SNMP. Once you've found out what OIDs your device supports, you need to find out if there are public MIBs for them. In my case, my DSL modem exposed information about network interfaces in the standard and widely available 'IF-MIB', and ADSL information in the standard but not widely available 'ADSL-LINE-MIB'. For the rest of this entry I'll assume that you've managed to fetch the ADSL-LINE-MIB and everything it depends on and put them in a directory, /tmp/adsl-mibs.
The SNMP exporter effectively has two configuration files (as I wrote about recently); a compiled ('generated') configuration file (or set of them) that lists in exhausting detail all of the SNMP OIDs to be collected, and an input file to a separate tool, the generator, that creates the compiled main file. To collect information from a new MIB, you need to set up a new SNMP exporter 'module' for it, and specify the root OID or OIDs involved to walk. This looks like:
---
modules:
  # The ADSL-LINE-MIB MIB
  adsl_line_mib:
    walk:
      - 1.3.6.1.2.1.10.94
      # or:
      #- adslMIB
Here adsl_line_mib is the name of the new SNMP exporter module, and we give it the starting OID of the MIB. You can't specify the name of the MIB itself as the OID to walk, although this is how 'snmpwalk' will present it. Instead you have to use the MIB's 'MODULE-IDENTITY' line, such as 'adslMIB'. Alternately, perusal of your MIB and snmpwalk results may suggest alternate names to use, such as 'adslLineMib'. Using the top level OID is probably easier.
The name of your new module is arbitrary, but it's conventional to use the name of the MIB in this form. You can do other things in your module; reading the existing generator.yml is probably the most useful documentation. As various existing modules show, you can walk multiple OIDs in one module.
This configuration file leaves out the 'auths:' section from the main generator.yml, because we only need one of them, and what we're doing is generating an additional configuration file for snmp_exporter that we'll use along with the stock snmp.yml. To actually generate our new snmp-adsl.yml, we do:
cd snmp_exporter/generator
go build
make     # builds ./mibs
./generator generate \
  -m ./mibs \
  -m /tmp/adsl-mibs \
  -g generator-adsl.yml -o /tmp/snmp-adsl.yml
We give the generator both its base set of MIBs, which will define various common things, and the directory with our ADSL-LINE-MIB and all of the MIBs it may depend on. Although the input is small, the snmp-adsl.yml will generally be quite big; in my case, over 2,000 lines.
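With snmp-adsl.yml generated and loaded, the Prometheus side is the usual snmp_exporter scrape setup pointed at the new module; a sketch with placeholder addresses (and note that recent snmp_exporter versions may also want an 'auth' parameter) looks roughly like:

scrape_configs:
  - job_name: 'dsl-modem'
    metrics_path: /snmp
    params:
      module: [adsl_line_mib]
    static_configs:
      - targets: ['192.168.1.1']       # placeholder: the DSL modem
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9116'  # placeholder: where snmp_exporter listens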
As I mentioned the other day, you may find that some of the SNMP OIDs actually returned by your device don't conform to the SNMP MIB. When this happens, your scrape results will not be a success but instead a HTTP 500 error with text that says things like:
An error has occurred while serving metrics:

error collecting metric Desc{fqName: "snmp_error", help: "BITS type was not a BISTRING on the wire.", constLabels: {}, variableLabels: {}}: error for metric adslAturCurrStatus with labels [1]: <nil>
This says that the actual OID value(s) for adslAturCurrStatus from my device didn't match what the MIB claimed. In this case, my raw snmpwalk output for this OID is:
.1.3.6.1.2.1.10.94.1.1.3.1.6.1 = BITS: 00 00 00 01 31
(I don't understand what this means, since I'm not anywhere near an SNMP expert.)
If the information is sufficiently important, you'll need to figure out how to modify either the MIB or the generated snmp-adsl.yml to get the information without snmp_exporter errors. Doing so is far beyond the scope of this entry. If the information is not that important, the simple way is to exclude it with a generator override:
---
modules:
  adsl_line_mib:
    walk:
      # ADSL-LINE-MIB
      #- 1.3.6.1.2.1.10.94
      - adslMIB
    overrides:
      # My SmartRG SR505N produces values for this metric
      # that make the SNMP exporter unhappy.
      adslAturCurrStatus:
        ignore: true
You can at least get the attribute name you need to ignore from the SNMP exporter's error message. Unfortunately this error message is normally visible only in scrape output, and you'll only see it if you scrape manually with something like 'curl'.
Brief notes on how the Prometheus SNMP exporter's configurations work
A variety of devices (including DSL modems) expose interesting information via SNMP (which is not simple, despite its name). If you have a Prometheus environment, it would be nice to get (some of) this information from your SNMP capable devices into Prometheus. You could do this by hand with scripts and commands like 'snmpget', but there is also the officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only". Understanding how to do things even a bit out of standard with it is, well, a bit tricky. So here are some notes.
The SNMP exporter ships with a 'snmp.yml' configuration file that's what the actual 'snmp_exporter' program uses at runtime (possibly augmented by additional files you provide). As you'll read when you look at the file, this file is machine generated. As far as I can tell, the primary purpose of this file is to tell the exporter what SNMP OIDs it could try to read from devices, what metrics generated from them should be called, and how to interpret the various sorts of values it gets back over SNMP (for instance, network interfaces have an 'ifType' that in raw format is a number, but where the various values correspond to different physical network types). These SNMP OIDs are grouped into 'modules', with each module roughly corresponding to a SNMP MIB (the correspondence isn't necessarily exact). When you ask the SNMP exporter to query a SNMP device, you normally tell the exporter what modules to use, which determines what OIDs will be retrieved and what metrics you'll get back.
The generated file is very verbose, which is why it's generated, and its format is pretty underdocumented, which certainly does help contribute to the "no user serviceable parts" feeling. There is very little support for directly writing a new snmp.yml module (which you can at least put in a separate 'snmp-me.yml' file) if you happen to have a few SNMP OIDs that you know directly, don't have a MIB for, and want to scrape and format specifically. Possibly the answer is to try to write a MIB yourself and generate a snmp-me.yml from it, but I haven't had to do this so I have no opinions on which way is better.
The generated file and its modules are created from various known MIBs by a separate program, the generator. The generator has its own configuration file to describe what modules to generate, what OIDs go into each module, and so on. This means that reading generator.yml is the best way to find out what MIBs the SNMP exporter already supports. As far as I know, although generator.yml doesn't necessarily specify OIDs by name, the generator requires MIBs for everything you want to be in the generated snmp.yml file and generate metrics for.
The generator program and its associated data isn't available as part of the pre-built binary SNMP exporter packages. If you need anything beyond the limited selection of MIBs that are compiled into the stock snmp.yml, you need to clone the repository, go to the 'generator' subdirectory, build the generator with 'go build' (currently), run 'make' to fetch and process the MIBs it expects, get (or write) MIBs for your additional metrics, and then write yourself a minimal generator-me.yml of your own to add one or more (new) modules for your new MIBs. You probably don't want to regenerate the main snmp.yml; you might as well build a 'snmp-me.yml' that just has your new modules in it, and run the SNMP exporter with snmp-me.yml as an additional configuration file.
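My understanding (which you should check against your version's documentation) is that recent versions of the exporter will accept the extra file simply by repeating the flag, something like:

./snmp_exporter --config.file=snmp.yml --config.file=snmp-me.yml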
As a practical matter, you may find that your SNMP capable device doesn't necessarily conform to the MIB that theoretically describes it, including OIDs with different data formats (or data) than expected. In the simple case, you can exclude OIDs or named attributes from being fetched so that the non-conformance doesn't cause the SNMP exporter to throw errors:
modules:
  adsl_line_mib:
    [...]
    overrides:
      adslAturCurrStatus:
        ignore: true
More complex mis-matches between the MIB and your device will have you reading whatever you can find for the available options for generator.yml or even for snmp.yml itself. Or you can change your mind and scrape through scripts or programs in other languages instead of the SNMP exporter (it's what we do for some of our machine room temperature sensors).
(I guess another option is editing the MIB so that it corresponds to what your device returns, which should make the generator produce a snmp-me.yml that matches what the SNMP exporter sees from the device.)
PS: A peculiarity of the SNMP exporter is that the SNMP metrics it generates are all named after their SNMP MIB names, which produce metric names that are not at all like conventional Prometheus metric names. It's possible to put a common prefix, such as 'snmp_metric_', on all SNMP metrics to make them at least a little bit better. Technically this is a peculiarity of snmp.yml, but changing it is functionally impossible unless you hand-edit your own version.
The impact of the September 2024 CUPS CVEs depends on your size
The recent information security news is that there are a series of potentially serious issues in CUPS (via), but on the other hand a lot of people think that this isn't an exploit with a serious impact because, based on current disclosures, someone has to print something to a maliciously added new 'printer' (for example). My opinion is that how potentially serious this issue is for you depends on the size and scope of your environment.
Based on what we know, the vulnerability requires the CUPS server to also be running 'cups-browsed'. One of the things that cups-browsed does is allow remote printers to register themselves on the CUPS server; you set up your new printer, point it at your local CUPS print server, and everyone can now use it. As part of this registration, the collection of CUPS issues allows a malicious 'printer' to set up server side data (a CUPS PPD) that contains things that will run commands on the print server when a print job is sent to this malicious 'printer'. In order to get anything to happen, an attacker needs to get someone to do this.
In a personal environment or a small organization, this is probably unlikely. Either you know all the printers that are supposed to be there and a new one showing up is alarming, or at the very least you'll probably assume that the new printer is someone's weird experiment or local printer or whatever, and printing to it won't make either you or the owner very happy. You'll take your print jobs off to the printers you know about, and ignore the new one.
(Of course, an attacker with local knowledge could target their new printer name to try to sidestep this; for example, calling it 'Replacement <some existing printer>' or the like.)
In a larger organization, such as ours, people don't normally know all of the printers that are around and don't generally know when new printers show up. In such an environment, it's perfectly reasonable for people to call up a 'what printer do you want to use' dialog, see a printer that's new to them with an attractive name, and use it (perhaps thinking 'I didn't know they'd put a printer in that room, that's conveniently close'). And since printer names that include locations are perpetually misleading or wrong, most of the time people won't be particularly alarmed if they go to the location where they expect the printer (and their print job) to be and find nothing. They'll shrug, go back, and re-print their job to a regular printer they know.
(There are rare occasions here where people get very concerned when print output can't be found, but in most cases the output isn't sensitive and people don't care if there's an extra printed copy of a technical paper or the like floating around.)
Larger scale environments, possibly with an actual CUPS print server, are also the kind of environment where you might deliberately run cups-browsed. This could be to enable easy addition of new printers to your print server or to allow people's desktops to pick up what printers were available out there without you needing to even have a central print server.
My view is that this set of CVEs shows that you probably can't trust cups-browsed in general and need to stop running it, unless you're very confident that your environment is entirely secure and will never have a malicious attacker able to send packets to cups-browsed.
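On a typical systemd based Linux machine, stopping it is straightforward (the unit name here is the usual one, but check your distribution):

# Stop cups-browsed now and keep it from coming back on reboot.
systemctl disable --now cups-browsed.service
# If you want to be sure nothing re-enables it later:
systemctl mask cups-browsed.service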
(I said versions of this on the Fediverse (1, 2), so I might as well elaborate on it here.)
Our broad reasons for and approach to mirroring disks
When I talked about our recent interest in FreeBSD, I mentioned the issue of disk mirroring. One of the questions this raises is what we use disk mirroring for, and how we approach it in general. The simple answer is that we mirror disks for extra redundancy, not for performance, but we don't go too far to get extra redundancy.
The extremely thorough way to do disk mirroring for redundancy is to mirror with different makes and ages of disks on each side of the mirror, to try to avoid both age related failures and model or maker related issues (either firmware or where you find out that the company used some common problematic component). We don't go this far; we generally buy a block of whatever SSD is considered good at the moment, then use them for a while, in pairs, either fresh in newly deployed servers or re-using a pair in a server being re-deployed. One reason we tend to do this is that we generally get 'consumer' drives, and finding decent consumer drives is hard enough at the best of times without having to find two different vendors of them.
(We do have some HDD mirrors, for example on our Prometheus server, but these are also almost always paired disks of the same model, bought at the same time.)
Because we have backups, our redundancy goals are primarily to keep servers operating despite having one disk fail. This means that it's important that the system keep running after a disk failure, that it can still reboot after a disk failure (including of its first, primary disk), and that the disk can be replaced and put into service without downtime (provided that the hardware supports hot swapping the drive). The less this is true, the less useful any system's disk mirroring is to us (including 'hardware' mirroring, which might make you take a trip through the BIOS to trigger a rebuild after a disk replacement, which means downtime). It's also vital that the system be able to tell us when a disk has failed. Not being able to reliably tell us this is how you wind up with systems running on a single drive until that single drive then fails too.
On our ZFS fileservers it would be quite undesirable to have to restore from backups, so we have an elaborate spares system that uses extra disk space on the fileservers (cf) and a monitoring system to rapidly replace failed disks. On our regular servers we don't (currently) bother with this, even on servers where we could add a third disk as a spare to the two system disks.
(We temporarily moved to three way mirrors for system disks on some critical servers back in 2020, for relatively obvious reasons. Since we're now in the office regularly, we've moved back to two way mirrors.)
Our experience so far with both HDDs and SSDs is that we don't really seem to have clear age related or model related failures that take out multiple disks at once. In particular, we've yet to lose both disks of a mirror before one could be replaced, despite our habit of using SSDs and HDDs in basically identical pairs. We have had a modest number of disk failures over the years, but they've happened by themselves.
(It's possible that at some point we'll run a given set of SSDs for long enough that they start hitting lifetime limits. But we tend to grab new SSDs when re-deploying important servers. We also have a certain amount of server generation turnover for important servers, and when we use the latest hardware it also gets brand new SSDs.)
Why we're interested in FreeBSD lately (and how it relates to OpenBSD here)
We have a long and generally happy history of using OpenBSD and PF for firewalls. To condense a long story, we're very happy with the PF part of our firewalls, but we're increasingly not as happy with the OpenBSD part (outside of PF). Part of our lack of cheer is the state of OpenBSD's 10G Ethernet support when combined with PF, but there are other aspects as well; we never got OpenBSD disk mirroring to be really useful and eventually gave up on it.
We wound up looking at FreeBSD after another incident with OpenBSD doing weird and unhelpful hardware things, because we're a little tired of the whole area. Our perception (which may not be reality) is that FreeBSD likely has better driver support for modern hardware, including 10G cards, and has gone further on SMP support for networking, hopefully including PF. The last time we looked at this, OpenBSD PF was more or less limited by single-'core' CPU performance, especially when used in bridging mode (which is what our most important firewall uses). We've seen fairly large bandwidth rates through our OpenBSD PF firewalls (in the 800 MBytes/sec range), but never full 10G wire bandwidth, so we've wound up suspecting that our network speed is partly being limited by OpenBSD's performance.
(To get to this good performance we had to buy servers that focused on single-core CPU performance. This created hassles in our environment, since these special single-core performance servers had to be specially reserved for OpenBSD firewalls. And single-core performance isn't going up all that fast.)
FreeBSD has a version of PF that's close enough to OpenBSD's older versions to accept much or all of the syntax of our pf.conf files (we're not exactly up to the minute on our use of PF features and syntax). We also perceive FreeBSD as likely more normal to operate than OpenBSD has been, making it easier to integrate into our environment (although we'd have to actually operate it for a while to see if that was actually the case). If FreeBSD has great 10G performance on our current generation commodity servers, without needing to buy special servers for it, and fixes other issues we have with OpenBSD, that makes it potentially fairly attractive.
(To be clear, I think that OpenBSD is (still) a great operating system if you're interested in what it has to offer for security and so on. But OpenBSD is necessarily opinionated, since it has a specific focus, and we're not really using OpenBSD for that focus. Our firewalls don't run additional services and don't let people log in, and some of them can only be accessed over a special, unrouted 'firewall' subnet.)
Getting maximum 10G Ethernet bandwidth still seems tricky
For reasons outside the scope of this entry, I've recently been trying to see how FreeBSD performs on 10G Ethernet when acting as a router or a bridge (both with and without PF turned on). This pretty much requires at least two more 10G test machines, so that the FreeBSD server can be put between them. When I set up these test machines, I didn't think much about them so I just grabbed two old servers that were handy (well, reasonably handy), stuck a 10G card into each, and set them up. Then I actually started testing their network performance.
I'm used to 1G Ethernet, where long ago it became trivial to achieve full wire bandwidth, even bidirectional full bandwidth (with test programs; there are many things that can cause real programs to not get this). 10G Ethernet does not seem to be like this today; the best I could do was get close to 950 MBytes a second in one direction (which is not 10G's top speed). With the right circumstances, bidirectional traffic could total just over 1 GByte a second, which is of course nothing like what we'd like to see.
(This isn't a new problem with 10G Ethernet, but I was hoping this had been solved in the past decade or so.)
There are a lot of things that could be contributing to this, like the speed of the CPU (and perhaps RAM), the specific 10G hardware I was using (including whether it lacked performance increasing features that more expensive hardware would have had), and Linux kernel or driver issues (although this was Ubuntu 24.04, so I would hope that they were sorted out). I'm especially wondering about CPU limitations, because the kernel's CPU usage did seem to be quite high during my tests and, as mentioned, they're old servers with old CPUs (different old CPUs, even, one of which seemed to perform a bit better than the other).
(For the curious, one was a Celeron G530 in a Dell R210 II and the other a Pentium G6950 in a Dell R310, both of which date from before 2016 and are something like four generations back from our latest servers (we've moved on slightly since 2022).)
Mostly this is something I'm going to have to remember about 10G Ethernet in the future. If I'm doing anything involving testing its performance, I'll want to use relatively modern test machines, possibly several of them to create aggregate traffic, and then I'll want to start out by measuring the raw performance those machines can give me under the best circumstances. Someday perhaps 10G Ethernet will be like 1G Ethernet for this, but that's clearly not the case today (in our environment).
What admin access researchers have to their machines here
Recently on the Fediverse, Stephen Checkoway asked what level of access fellow academics had to 'their' computers to do things like install software (via). This is an issue very relevant to where I work, so I put a short-ish answer in the Fediverse thread and now I'm going to elaborate it at more length. Locally (within the research side of the department) we have a hierarchy of machines for this sort of thing.
At the most restricted end are the shared core machines my group operates in our now-unusual environment, such as the mail server, the IMAP server, the main Unix login server, our SLURM cluster and general compute servers, our general purpose web server, and of course the NFS fileservers that sit behind all of this. For obvious reasons, only core staff have any sort of administrative access to these machines. However, since we operate a general Unix environment, people can install whatever they want to in their own space, and they can request that we install standard Ubuntu packages, which we mostly do (there are some sorts of packages that we'll decline to install). We do have some relatively standard Ubuntu features turned off for security reasons, such as "user namespaces", which somewhat limits what people can do without system privileges. Only our core machines live on our networks with public IPs; all other machines have to go on separate private "sandbox" networks.
The second most restricted are researcher owned machines that want to NFS mount filesystems from our NFS fileservers. By policy, these must be run by the researcher's Point of Contact, operated securely, and only the Point of Contact can have root on those machines. Beyond that, researchers can and do ask their Point of Contact to install all sorts of things on their machines (the Point of Contact effectively works for the researcher or the research group). As mentioned, these machines live on "sandbox" networks. Most often they're servers that the researcher has bought with grant funding, and there are some groups that operate more and better servers than we (the core group) do.
Next are non-NFS machines that people put on research group "sandbox" networks (including networks where some machines have NFS access); people do this with both servers and desktops (and sometimes laptops as well). The policies on who has what power over these machines are up to the research group and what they (and their Point of Contact) feel comfortable with. There are some groups where I believe the Point of Contact runs everything on their sandbox network, and other groups where their sandbox network is wide open with all sorts of people running their own machines, both servers and desktops. Usually if a researcher buys servers, the obvious person to run them is their Point of Contact, unless the research work being done on the servers is such that other people need root access (or it's easier for the Point of Contact to hand the entire server over to a graduate student and have them run it as they need it).
Finally there are generic laptops and desktops, which normally go on our port-isolated 'laptop' network (called the 'red' network after the colour of network cables we use for it, so that it's clearly distinct from other networks). We (the central group) have no involvement in these machines and I believe they're almost always administered by the person who owns or at least uses them, possibly with help from that person's Point of Contact. These days, some number of laptops (and probably even desktops) don't bother with wired networking and use our wireless network instead, where similar 'it's yours' policies apply.
People who want access to their files from their self-managed desktop or laptop aren't left out in the cold, since we have a SMB (CIFS) server. People who use Unix and want their (NFS, central) home directory mounted can use the 'cifs' (aka 'smb3') filesystem to access it through our SMB server, or even use sshfs if they want to. Mounting via cifs or sshfs is in some cases superior to using NFS, because they can give you access to important shared filesystems that we can't NFS export to machines outside our direct control.
Rate-limiting failed SMTP authentication attempts in Exim 4.95
Much like with SSH servers, if you have a SMTP server exposed to the Internet that supports SMTP authentication, you'll get a whole lot of attackers showing up to do brute force password guessing. It would be nice to slow these attackers down by rate-limiting their attempts. If you're using Exim, as we are, then this is possible to some degree. If you're using Exim 4.95 on Ubuntu 22.04 (instead of a more recent Exim), it's trickier than it looks.
One of Exim's ACLs, the ACL specified by acl_smtp_auth, is consulted just before Exim accepts a SMTP 'AUTH <something>' command. If this ACL winds up returning a 'reject' or a 'defer' result, Exim will defer or reject the AUTH command and the SMTP client will not be able to try authenticating. So obviously you need to put your ratelimit statement in this ACL, but there are two complications. First, this ACL doesn't have access to the login name the client is trying to authenticate (this information is only sent after Exim accepts the 'AUTH <whatever>' command), so all you can ratelimit is the source IP (or a network area derived from it). Second, this ACL happens before you know what the authentication result is, so you don't want to actually update your ratelimit in it, just check what the ratelimit is.
This leads to the basic SMTP AUTH ACL of:
acl_smtp_auth = acl_check_auth

begin acl

acl_check_auth:
  # We'll cover what this is for later
  warn
    set acl_c_auth = true

  deny
    ratelimit = 10 / 10m / per_cmd / readonly / $sender_host_address
    delay     = 10s
    message   = You are failing too many authentication attempts.
    # you might also want:
    # log_message = ....

  # don't forget this or you will be sad
  # (because no one will be able to authenticate)
  accept
(The 'delay = 10s' usefully slows down our brute force SMTP authentication attackers because they seem to wait for the reply to their SMTP AUTH command rather than giving up and terminating the session after a couple of seconds.)
This ratelimit is read-only because we don't want to update it unless the SMTP authentication fails; otherwise, you will wind up (harshly) rate-limiting legitimate people who repeatedly connect to you, authenticate, perhaps send an email message, and then disconnect. Since we can't update the ratelimit in the SMTP AUTH ACL, we need to somehow recognize when authentication has failed and update the ratelimit in that place.
In Exim 4.97 and later, there's a convenient and direct way to do this through the events system and the 'auth:fail' event that is raised by an Exim server when SMTP authentication fails. As I understand it, the basic trick is that you make the auth:fail event invoke a special ACL, and have that ACL update the ratelimit. Unfortunately Ubuntu 22.04 has Exim 4.95, so we must be more clever and indirect, and as a result somewhat imperfect in what we're doing.
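(For the record, my untested impression of what the Exim 4.97+ version would look like is roughly the following; the ACL name is made up and I haven't been able to try any of this, so treat it strictly as a sketch.)

# Exim 4.97+ only, untested sketch: when the 'auth:fail' event fires,
# run a little ACL that bumps the same ratelimit database.
event_action = ${if eq{$event_name}{auth:fail} {${acl{acl_note_failed_auth}}}}

begin acl

acl_note_failed_auth:
  warn
    ratelimit = 10 / 10m / per_cmd / strict / $sender_host_address
  accept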
To increase the ratelimit when SMTP authentication has failed, we add an ACL that is run at the end of the connection and increases the ratelimit if an authentication was attempted but did not succeed, which we detect by the lack of authentication information. Exim has two possible 'end of session' ACL settings, one that is used if the session ends with a SMTP QUIT command and one that is used if the SMTP session simply ends without a QUIT.
So our ACL setup to update our ratelimit looks like this:
[...]
acl_smtp_quit    = acl_count_failed_auth
acl_smtp_notquit = acl_count_failed_auth

begin acl
[...]

acl_count_failed_auth:
  warn
    condition      = ${if bool{$acl_c_auth} }
    !authenticated = *
    ratelimit      = 10 / 10m / per_cmd / strict / $sender_host_address

  accept
Our $acl_c_auth SMTP connection ACL variable tells us whether or not the connection attempted to authenticate (sometimes legitimate people simply connect and don't do anything before disconnecting), and then we also require that the connection not be authenticated now, to screen out people who succeeded in their SMTP authentication. The parameters of the two 'ratelimit =' statements have to match or I believe you'll get weird results.
(The '10 failures in 10 minutes' setting works for us but may not work for you. If you change the 'deny' to 'warn' in acl_check_auth and comment out the 'message =' bit, you can watch your logs to see what rates real people and your attackers actually use.)
The limitation on this is that we're actually increasing the ratelimit based not on the number of (failed) SMTP authentication attempts but on the number of connections that tried but failed SMTP authentication. If an attacker connects and repeatedly tries to do SMTP AUTH in the session, failing each time, we wind up only counting it as a single 'event' for ratelimiting because we only increase the ratelimit (by one) when the session ends. For the brute force SMTP authentication attackers we see, this doesn't seem to be an issue; as far as I can tell, they disconnect their session when they get a SMTP authentication failure.
I should probably reboot BMCs any time they behave oddly
Today on the Fediverse I said:
It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.
(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)
I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.
We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.
Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
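(For the record, the sort of ipmitool commands involved are roughly these; the exact interface and authentication details vary by BMC.)

# Ask the BMC about (and change) chassis power over the network:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power on
# A BMC cold reset can also be done this way:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> mc reset cold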
As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.
(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)
Using rsync to create a limited ability to write remote files
Suppose that you have an isolated high security machine and you want to back up some of its data on another machine, which is also sensitive in its own way and which doesn't really want to have to trust the high security machine very much. Given the source machine's high security, you need to push the data to the backup host instead of pulling it. Because of the limited trust relationship, you don't want to give the source host very much power on the backup host, just in case. And you'd like to do this with standard tools that you understand.
I will cut to the chase: as far as I can tell, the easiest way to do this is to use rsync's daemon mode on the backup host combined with SSH (to authenticate either end and encrypt the traffic in transit). It appears that another option is rrsync, but I just discovered that and we have prior experience with rsync's daemon mode for read-only replication.
Rsync's daemon mode is controlled by a configuration file that can restrict what it allows the client (your isolated high security source host) to do, particularly where the client can write, and can even chroot if you run things as root. So the first ingredient we need is a suitable rsyncd.conf, which will have at least one 'module' that defines parameters:
[backup-host1]
  comment = Backup module for host1
  # This will normally have restricted
  # directory permissions, such as 0700.
  path = /backups/host1
  hosts allow = <host1 IP>
  # Let's assume we start out as root
  use chroot = yes
  uid = <something>
  gid = <something>
The rsyncd.conf 'hosts allow' module parameter works even over SSH; rsync will correctly pull out the client IP from the environment variables the SSH daemon sets.
The next ingredient is a shell script that forces the use of this rsyncd.conf:
#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .
As with the read-only replication, this script completely ignores command line arguments that the client may try to use. Very cautious people could inspect the client's command line to look for unexpected things, but we don't bother.
Finally you need a SSH keypair and a .ssh/authorized_keys entry on the backup machine for that keypair that forces using your script:
from="<host1 IP>",command="/backups/host1-script",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]
(Since we're already restricting the rsync module by IP, we definitely want to restrict the key usage as well.)
On the high security host, you transfer files to the backup host with:
rsync -a --rsh="/usr/bin/ssh -i /client/identity" yourfile LOGIN@SERVER::backup-host1/
Depending on what you're backing up and how you want to do things, you might want to set the rsyncd.conf module parameters 'write only = true' and perhaps 'refuse options = delete', if you're sure you don't want the high security machine to be able to retrieve its files once it has put them there. On the other hand, if the high security machine is supposed to be able to routinely retrieve its backups (perhaps to check that they're good), you don't want this.
(If the high security machine is only supposed to read back files very rarely, you can set 'write only = true' until it needs to retrieve a file.)
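If you do want those restrictions, they're just more parameters in the module definition, something like this (extending the module from above):

[backup-host1]
  [...]
  # The client can push files but not read them back, and
  # its attempts to use --delete will be refused.
  write only = true
  refuse options = delete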
There are various alternative approaches, but this one is relatively easy to set up, especially if you already have a related rsync daemon setup for read-only replication.
(On the one hand it feels annoying that there isn't a better way to do this sort of thing by now. On the other hand, the problems involved are not trivial. You need encryption, authentication of both ends, a confined transfer protocol, and so on. Here, SSH provides the encryption and authentication and rsync provides the confined transfer protocol, at the cost of having to give access to a Unix account and trust rsync's daemon mode code.)
Some reasons why we mostly collect IPMI sensor data locally
Most servers these days support IPMI and can report various sensor readings through it, which you often want to use. In general, you can collect IPMI sensor readings either on the host itself through the host OS or over the network using standard IPMI networking protocols (there are several generations of them). Locally, we have almost always collected this information locally (and then fed it into our Prometheus based monitoring system), for an assortment of reasons, some of them general and some of them specific to us.
When we collect IPMI sensor data locally, we export it through the standard Prometheus host agent, which has a feature where you can give it text files of additional metrics (cf). Although there is a 'standard' third party network IPMI metrics exporter, we ended up rolling our own for various reasons (through a Prometheus exporter that can run scripts for us). So we could collect IPMI sensor data either way, but we almost entirely collect the data locally.
(These days it is a standard part of our general Ubuntu customizations to set up sensor data collection from the IPMI if the machine has one.)
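As an illustration of the general mechanism (not our actual script), a minimal sketch could look like the following; the output directory, the metric name, and the parsing of 'ipmitool sensor' output are all assumptions that you'd need to adjust for your own systems.

#!/bin/sh
# Hypothetical sketch: turn local 'ipmitool sensor' temperature readings
# into a textfile-collector metrics file for the Prometheus host agent.
OUT=/var/lib/node_exporter/textfile/ipmi.prom

ipmitool sensor 2>/dev/null |
  awk -F'|' '$3 ~ /degrees C/ && $2 !~ /na/ {
      name = $1; gsub(/^ +| +$/, "", name); gsub(/ /, "_", name)
      val = $2;  gsub(/ /, "", val)
      printf "ipmi_temperature_celsius{sensor=\"%s\"} %s\n", name, val
  }' > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"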
The generic reasons for not collecting IPMI sensor data over the network are that your server BMCs might not be on the network at all (perhaps they don't have a dedicated BMC network interface), or you've sensibly put them on a secured network and your monitoring system doesn't have access to it. We have two additional reasons for preferring local IPMI sensor data collection.
First, even when our servers have dedicated management network ports, we don't always bother to wire them up; it's often just extra work for relatively little return (and it exposes the BMC to the network, which is not always a good thing). Second, when we collect IPMI sensor data through the host, we automatically start and stop collecting sensor data for the host when we start or stop monitoring the host in general (and we know for sure that the IPMI sensor data really matches that host). We almost never care about IPMI data when either the host isn't otherwise being monitored or the host is off.
Our system for collecting IPMI sensor data over the network actually dates from when this wasn't true, because we once had some (donated) blade servers that periodically mysteriously locked up under some conditions that seemed related to load (so much so that we built a system to automatically power cycle them via IPMI when they got hung). One of the things we were very interested in was if these blade servers were hitting temperature or fan limits when they hung. Since the machines had hung we couldn't collect IPMI information through their host agent; getting it from the IPMI over the network was our only option.
(This history has created a peculiarity, which is that our script for collecting network IPMI sensor data used what was at the time the existing IPMI user that was already set up to remotely power cycle the C6220 blades. So now anything we want to remotely collect IPMI sensor data from has a weird 'reboot' user, which these days doesn't necessarily have enough IPMI privileges to actually reset the machine.)
PS: We currently haven't built a local IPMI sensor data collection system for our OpenBSD machines, although OpenBSD can certainly talk to a local IPMI, so we collect data from a few of those machines over the network.
JSON is usually the least bad option for machine-readable output formats
Over on the Fediverse, I said something:
In re JSON causing problems, I would rather deal with JSON than yet another bespoke 'simpler' format. I have plenty of tools that can deal with JSON in generally straightforward ways and approximately none that work on your specific new simpler format. Awk may let me build a tool, depending on what your format is, and Python definitely will, but I don't want to.
This is re: <Royce Williams Fediverse post>
This is my view as a system administrator, because as a system administrator I deal with a lot of tools that could each have their own distinct output format, each of which I have to parse separately (for example, smartctl's bespoke output, although that output format sort of gets a pass because it was intended for people, not further processing).
JSON is not my ideal output format. But it has the same virtue as gofmt does; as Rob Pike has said, "gofmt's style is no one's favorite, yet gofmt is everyone's favorite" (source, also), because gofmt is universal and settles the arguments. Everything has to have some output format, so having a single one that is broadly used and supported is better than having N of them. And jq shows the benefit of this universality, because if something outputs JSON, jq can do useful things with it.
(In turn, the existence of jq makes JSON much more attractive to system administrators than it otherwise would be. If I had no ready way to process JSON output, I'd be much less happy about it and it would stop being the easy output format to deal with.)
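(As a small example, modern smartctl can emit JSON and then jq can pull out just the field you care about; the exact field names depend on your device and smartctl version.)

# Pull one field out of smartctl's JSON output with jq.
smartctl --json -a /dev/sda | jq '.temperature.current'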
I don't have any particular objection to programs that want to output in their own format (perhaps a simpler one). But I want them to give me an option for JSON too, and most of the time I'm going to go with JSON. I've already written enough ad-hoc text processing things in awk, and a few too many heavy duty text parsing things in Python. I don't really want to write another one just for you. If your program does use only a custom output format, I want there to be a really good reason why you did it, not just that you don't like the aesthetics of JSON. As Rob Pike says, no one likes gofmt's style, but we all like that everyone uses it.
(It's my view that JSON's increased verbosity over alternates isn't a compelling reason unless there's either a really large amount of data or you have to fit into very constrained space, bandwidth, or other things. In most environments, disk space and bandwidth are much cheaper than people's time and the liability of yet another custom tool that has to be maintained.)
PS: All of this is for output formats that are intended to be further processed. JSON is a terrible format for people to read directly, so terrible that my usual reaction to having to view raw JSON is to feed it through 'jq . | less'. But your tool should almost always also have an option for some machine readable format (trust me, someday system administrators will want to process the information your tool generates).
Some brief notes on 'numfmt' from GNU Coreutils
Many years ago I learned about numfmt (also) from GNU Coreutils (see the comments on this entry and then this entry). An additional source of information is Pádraig Brady's numfmt - A number reformatting utility. Today I was faced with a situation where I wanted to compute and print multi-day, cumulative Amanda dump total sizes for filesystems in a readable way, and the range went from under a GByte to several TBytes, so I didn't want to just convert everything to TBytes (or GBytes) and be done with it. I was doing the summing up in awk and briefly considered doing this 'humanization' in awk (again, I've done it before) before I remembered numfmt and decided to give it a try.
The basic pattern for using numfmt here was:
cat <amanda logs> | awk '...' | sort -nr | numfmt --to iec
This printed out '<size> <what ...>', and then numfmt turned the first field into humanized IEC values. As I did here, it's better to sort before numfmt, using the full precision raw number, rather than after numfmt (with 'sort -h'), with its rounded (printed) values.
Although Amanda records dump sizes in KBytes, I had my awk print them out in bytes. It turns out that I could have kept them in KBytes and had numfmt do the conversion, with 'numfmt --from-unit 1024 --to iec'.
(As far as I can tell, the difference between --from-unit and --to-unit is that the former multiplies the number and the latter divides it, which is probably not going to be useful with IEC units. However, I can see it being useful if you wanted to mass-convert times in sub-second units to seconds, or convert seconds to a larger unit such as hours. Unfortunately numfmt currently has no unit options for time, so you can only do pure numeric shifts.)
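(For example, a pure numeric shift to turn seconds into hours looks like this; note that nothing in the output tells you that the unit is now hours.)

echo 7200 | numfmt --to-unit=3600        # prints '2' (hours)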
If left to do its own formatting, numfmt has two issues (at least when doing conversions to IEC units). First, it will print some values with one decimal place and others with no decimal place. This will generally give you a result that can be hard to skim because not everything lines up, like this:
3.3T [...]
581G [...]
532G [...]
[...]
 11G [...]
9.8G [...]
[...]
1.1G [...]
540M [...]
I prefer all of the numbers to line up, which means explicitly specifying the number of decimal places that everything gets. I tend to use one decimal place for everything, but none ('.0') is a perfectly okay choice. This is done with the --format argument:
... | numfmt --format '%.1f' --to iec
The second issue is that in the process of reformatting your numbers, numfmt will by and large remove any nice initial formatting you may have tried to do in your awk. Depending on how much (re)formatting you want to do, you may want another 'awk' step after the numfmt to pretty-print everything, or you can perhaps get away with --format:
... | numfmt --format '%10.1f ' --to iec
Here I'm specifying a field width for enough white space and also putting some spaces after the number.
Even with the need to fiddle around with formatting afterward, using numfmt was very much the easiest and fastest way to humanize numbers in this script. Now that I've gone through this initial experience with numfmt, I'll probably use it more in the future.
Workarounds are often forever (unless you work to make them otherwise)
Back in 2018, ZFS on Linux had a bug that could panic the system if you NFS-exported ZFS snapshots. We were setting up ZFS based NFS fileservers and we knew about this bug, so at the time we set things so that only filesystems themselves were NFS exported and available on our servers. Any ZFS snapshots on filesystems were only visible if you directly logged in to the fileservers, which was (and is) something that only core system staff could do. This is somewhat inconvenient; we have to get involved any time people want to get stuff back from snapshots.
It is now 2024. ZFS on Linux became OpenZFS (in 2020) and has long since fixed that issue and released versions with the fix. If I'm retracing Git logs correctly, the fix was in 0.8.0, so it was included (among many others) in Ubuntu 22.04's ZFS 2.1.5 (what our fileservers are currently running) and Ubuntu 24.04's ZFS 2.2.2 (what our new fileservers will run).
When we upgraded the fileservers from 18.04 to 22.04, did we go back to change our special system for generating NFS export entries to allow NFS clients to access ZFS snapshots? You already know the answer to that. We did not, because we had completely forgotten about it. Nor did we go back to do it as we were preparing the 24.04 setup of our ZFS fileservers. It was only today that it came up, as we were dealing with restoring a file from those ZFS snapshots. Since it's come up, we're probably going to test the change and then do it for our future 24.04 fileservers, since it will make things a bit more convenient for some people.
(The good news is that I left comments to myself in one program about why we weren't using the relevant NFS export option, so I could tell for sure that it was this long since fixed bug that had caused us to leave it out.)
It's a trite observation that there's nothing so permanent as a temporary solution, but just because it's trite doesn't mean that it's wrong. A temporary workaround that code comments say we thought we might revert later in the life of our 18.04 fileservers has lasted about six years, despite being unnecessary since no later than when our fileservers moved to Ubuntu 22.04 (admittedly, this wasn't all that long ago).
One moral I take from this is that if I want us to ever remove a 'temporary' workaround, I need to somehow explicitly schedule us reconsidering the workaround. If we don't explicitly schedule things, we probably won't remember (unless it's something sufficiently painful that it keeps poking us until we can get rid of it). The purpose of the schedule isn't necessarily to make us do the thing, it's to remind us that the thing exists and maybe it shouldn't.
(As a corollary, the schedule entry should include pointers to a lot of detail, because when it goes off in a year or two we won't really remember what it's talking about. That's why we have to schedule a reminder.)
Traceroute, firewalls, and the modern Internet: a horrible realization
The venerable traceroute command sort of reports the hops your packets take to reach a host, and in the process can reveal where your packets are getting dropped or diverted. The traditional default way that traceroute works is by sending UDP packets to a series of high UDP ports with increasing IP TTLs, and seeing where each reply comes from. If the TTL runs out on the way, traceroute gets one reply; if the packet reaches the host, traceroute gets another one (assuming that nothing is listening on the particular UDP port on the host, which usually it isn't). Most versions of traceroute can also use ICMP based probes, while some of them can also use TCP based ones.
While writing my entry on using traceroute with a fixed target port, I had a horrible realization: traceroute's UDP probes mostly won't make it through firewalls. Traceroute's UDP probes are made to a series of high UDP ports (often starting at port 33434 and counting up). Most firewalls are set to block unsolicited incoming UDP traffic by default; you normally specifically configure them to pass only some UDP traffic through to limited ports (such as port 53 for DNS queries to your DNS servers). When traceroute's UDP packets, sent to effectively random high ports, arrive at such a firewall, the firewall will discard or reject them and your traceroute will go no further.
(If you're extremely confident no one will ever run something that listens on the UDP port range, you can make your firewall friendly to traceroute by allowing through UDP ports 33434 to 33498 or so. But I wouldn't want to take that risk.)
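(For illustration only, such a hole in OpenBSD PF would look something like the rule below, with the interface macro and exact port range up to you; as I said, I wouldn't actually do this.)

# Let traceroute's default UDP probe ports through (not recommended).
pass in quick on $ext_if inet proto udp from any to any port 33434:33498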
The best way around this is probably to use ICMP for traceroute (using a fixed UDP port is more variable and not always possible). Most Unix traceroute implementations support '-I' to do this.
This matters in two situations. First, if you're asking outside people to run traceroutes to your machines and send you the results, and you have a firewall; without having them use ICMP, their traceroutes will all look like they fail to reach your machines (although you may be able to tell whether or not their packets reach your firewall). Second, if you're running traceroute against some outside machine that is (probably) behind a firewall, especially if the firewall isn't directly in front of it. In that case, your traceroute will always stop at or just before the firewall.
A note to myself about using traceroute to check for port reachability
Once upon a time, the Internet was a simple place; if you could ping some remote IP, you could probably reach it with anything. The Internet is no longer such a simple place, or rather I should say that various people's networks no longer are. These days there are a profusion of firewalls, IDS/IDR/IPS systems, and so on out there in the world, and some of them may decide to block access only to specific ports (and only some of the time). In this much more complicated world, you can want to check not just whether a machine is responding to pings, but if a machine responds to a specific port and if it doesn't, where your traffic stops.
The general question of 'where does your traffic stop' is mostly answered by the venerable traceroute. If you think there's some sort of general block, you traceroute to the target and then blame whatever is just beyond the last reported hop (assuming that you can traceroute to another IP at the same destination to determine this). I knew that traceroute normally works by sending UDP packets to 'random' ports (with manipulated (IP) TTLs, and the ports are not actually picked randomly) and then looking at what comes back, and I superstitiously remembered that you could fix the target port with the '-p' argument. This is, it turns out, not actually correct (and these days that matters).
There are several common versions of (Unix) traceroute out there; Linux, FreeBSD, and OpenBSD all use somewhat different versions. In all of them, what '-p port' actually does by itself is set the starting port, which is then incremented by one for each additional hop. So if you do 'traceroute -p 53 target', only the first hop will be probed with a UDP packet to port 53.
In Linux traceroute, you get a fixed UDP port by using the additional argument '-U'; -U by itself defaults to using port 53. Linux traceroute can also do TCP traceroutes with -T, and when you do TCP traceroutes the port is always fixed.
In OpenBSD traceroute, as far as I can see you just can't get a fixed UDP port. OpenBSD traceroute also doesn't do TCP traceroutes. On today's Internet, this is actually a potentially significant limitation, so I suspect that you most often want to try ICMP probes ('traceroute -I').
In FreeBSD traceroute, you get a fixed UDP port by turning on 'firewall evasion mode' with the '-e' argument. FreeBSD traceroute sort of supports a TCP traceroute with '-P tcp', but as the manual page says you need to see the BUGS section; it's going to be most useful if you believe your packets are getting filtered well before their destination. Using the TCP mode doesn't automatically turn on fixed port numbers, so in practice you probably want to use, for example, 'traceroute -P tcp -e -p 22 <host>' (with the port number depending on what you care about).
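To put all of this in one place, the invocations look roughly like this (modulo version differences):

# Linux traceroute: fixed UDP port, or TCP probes (the port stays fixed).
traceroute -U -p 53 <host>
traceroute -T -p 22 <host>
# FreeBSD traceroute: a fixed UDP port needs 'firewall evasion mode'.
traceroute -e -p 53 <host>
traceroute -P tcp -e -p 22 <host>
# OpenBSD traceroute: no fixed UDP port and no TCP mode, so use ICMP.
traceroute -I <host>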
Having written all of this down, hopefully I will remember it for the next time it comes up (or I can look it up here, to save me reading through manual pages).
Some thoughts on OpenSSH 9.8's PerSourcePenalties feature
One of the features added in OpenSSH 9.8 is a new SSH server security feature to slow down certain sorts of attacks. To quote the release notes:
[T]he server will now block client addresses that repeatedly fail authentication, repeatedly connect without ever completing authentication or that crash the server. [...]
This is the PerSourcePenalties configuration setting and its defaults, and also see PerSourcePenaltyExemptList and PerSourceNetBlockSize.
OpenSSH 9.8 isn't yet in anything we can use at work, but it will be in the next OpenBSD release (and then I'll get it on Fedora).
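As a hypothetical illustration of what the knobs look like in sshd_config (the values here are made up; the real syntax and defaults are in your version's sshd_config manual page):

# Hypothetical sshd_config fragment; values are illustrative only.
PerSourcePenalties authfail:10s noauth:1s max:10m
PerSourceNetBlockSize 32:128
PerSourcePenaltyExemptList 192.168.0.0/16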
On the one hand, this new option is exciting to me because for the first time it lets us block only rapidly repeating SSH sources that fail to authenticate, as opposed to rapidly repeating SSH sources that are successfully logging in to do a whole succession of tiny little commands. Right now our perimeter firewall is blind to whether a brief SSH connection was successful or not, so all it can do is block on total volume, and this means we need to be conservative in its settings. This is a single machine block (instead of the global block our perimeter firewall can do), but a lot of SSH attackers do seem to target single machines with their attacks (for a single external source IP, at least).
(It's also going to be a standard OpenSSH feature that won't require any configuration, firewall or otherwise, and will slow down rapid attackers.)
On the other hand, this is potentially an issue for anything that does health checks like 'is this machine responding with a SSH banner' (used in our Prometheus setup) or 'does this machine have the SSH host key we expect' (used in our NFS mount authentication system). Both of these cases will stop before authentication and so fall into the 'noauth' category of PerSourcePenalties. The good news is that the default refusal duration for this penalty is only one second, which is usually not very long and which you're probably not going to run into in health checks. The exception is if you're trying to verify multiple types of SSH host keys for a server, because you can only verify one host key in a given connection, so if you need to verify both a RSA host key and an Ed25519 host key, you need two connections.
(Even then, the OpenSSH 9.8 default is that you only get blocked once you've built up 15 seconds of penalties. At the default settings, this would be hard to reach with even repeated host key checks, unless the server has multiple IPs and you're checking all of them.)
It's going to be interesting to read practical experience reports with this feature as OpenSSH 9.8 rolls out to more and more people. And on that note I'm certainly going to wait for people's reports before doing things like increasing the 'authfail' penalty duration, as tempting as it is right now (especially since it's not clear from the current documentation how unenforced penalty times accumulate).
Uncertainties and issues in using IPMI temperature data
In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views on it in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).
The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.
(Out of our 85 servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its inlet ambient is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)
The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.
(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)
We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.
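(A minimal sketch of the sort of Prometheus alert rule involved is below; the metric and label names are assumptions that depend on how you export your IPMI readings, and the threshold is made up.)

groups:
  - name: machineroom
    rules:
      # Hypothetical alert rule; metric, labels, and threshold are assumptions.
      - alert: InletTemperatureHigh
        expr: ipmi_temperature_celsius{sensor=~"Inlet.*"} > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} inlet temperature has been over 30C for 10 minutes"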
(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
Allocating disk space (and all resources) is ultimately a political decision
In a multi-person or multi-group environment with shared resources, like a common set of fileservers, you often need to allocate resources like disk space between different uses. There are many different technical ways to do this, and you can also choose not to explicitly allocate at all, shoving everyone into one big pile. Sometimes you might be tempted to debate the technical merits of any particular approach, and while the technical merits of different ways potentially matter, in the end resource allocation is a political decision (although what is technically possible or feasible does limit the political options).
(Note that not specifically allocating resources is also a political decision; it is the decision to let resources like disk space be allocated on a first come, first served basis instead of anything else.)
In general, "political" is not a bad word. Politics, in the large, is about mediating social disagreements and, hopefully, making people feel okay about the results. Allocating limited resources is an area where there is no perfect answer and any answer that you choose will have unsatisfactory aspects. Weighing those tradeoffs and choosing a set of them is a (hard) social problem, which must be dealt with through a political decision.
Because resource allocation is a political decision, the specific decisions reached in your organization may well constrain your technical choices and, for example, complicate a storage migration (because you've chosen to allocate disk space in a specific way). Over the course of my career, I've come to understand that this isn't bad as such; it's just that social problems are more important and higher level than technical ones. It's more important to solve the social problems than it is to have an ideal technical world, because ultimately the technology exists to help the people.
One aspect of constraining your technical choices is that you may wind up not doing perfectly sensible and useful technical things because they go against the political decisions and goals around resource allocation. These decisions aren't irrational or wrong, exactly, although they can be hard to explain without explaining the political background.
(This doesn't mean that every design or operations decision that affects resource allocation has to be made at the political level in your organization, and in fact they generally can't be; you have to make some of them, even if it's to not specifically allocate resources and let them be used on a first come, first served basis (or an 'everyone gets whatever portion they can right now' basis). But even if you make the decision and do so based on technical factors, it's best to remember that you're making a decision with political effects, and perhaps to think about who will be affected and how.)
PS: This aspect of why things work as they do being hard to explain isn't confined to technology; there are aspects of how the recreational bike club I'm part of operates that people have sometimes asked me about (sometimes in the form of 'why doesn't the club do <sensible seeming thing X>') and I've found hard to explain, especially concisely. Part of the answer is that the club has made a social ('political') decision to operate in a certain way.
A surprise with the temperature distribution in our machine room
Our primary machine room is quite old and is set up in an old fashioned way, so that we don't really have separate 'hot aisles' and 'cold aisles'; the closest we come is one aisle where both sides are the fronts of servers. We have some long standing temperature monitoring in this machine room, and recently (for reasons outside the scope of this entry) we put a second (trustworthy) temperature monitoring unit into the room. The first temperature sensor is relatively near the room's AC unit, while the second unit is about as far away from it as you can get (by our rack of fileservers, not entirely coincidentally).
Before we set up the second temperature unit and started to get readings from it, I would have confidently predicted that it would report a higher temperature than the first unit, given that it was all the way diagonally across the room from the AC unit, and that row of racks sort of backs on to one of the room's walls (with space left for access and air circulation). Instead, it consistently reads lower than the first unit; how much lower depends on where the room is in the AC's cycle, because the second unit sees lower temperature swings than the first one.
(At their farthest apart, the two readings can be over 2 degrees Celsius different; at their closest, they can be only 0.2 C apart. Generally they're closest when the AC is on and the room temperature is at its coolest, and furthest apart when the room is at its warmest and the AC is about to come up for another cycle. Our temperature graphs also suggest that the cold air from the AC being on takes a bit longer to reach the far unit than the near unit.)
Temperature sensors can be fickle things, but this is an industrial unit with a good reputation (and an external sensor on a wire), so I believe the absolute numbers shown by its readings. So one of the lessons I take from this is that I can't predict the temperature distributions of our machine room (or more generally, any of our machine rooms and wiring closets). If we ever need to know where the hot and cold spots are, we can't guess based on factors like the distance from the AC units; we'll need to actively measure with something appropriate.
(I'm not sure what we'd use for relatively rapid temperature readings of the local ambient air temperature, but there are probably things that can be used for this.)
On not automatically reconnecting to IPMI Serial-over-LAN consoles
One of the things that the IPMI (network) protocol supports is Serial over LAN, which can be used to expose a server's serial console over your BMC's management network. These days, servers are starting to drop physical serial ports, making IPMI SOL your only way of getting console serial ports. The conserver serial console management software supports IPMI SOL (if built with the appropriate libraries), and you can directly access SOL serial consoles with IPMI programs. However, as I mentioned in passing in yesterday's entry, IPMI SOL access has a potential problem, which is that only one SOL connection is allowed at a time and if someone makes a new SOL connection, any old one is automatically disconnected. This disconnection is invisible to the IPMI SOL client until (and unless) it attempts to send something to the SOL console, at which point it apparently gets a timeout. This is bad for a program like conserver, which in many situations will only read SOL console output in order to log it, not send any input to the SOL console.
(This BMC behavior may not be universal, based on some comments in FreeIPMI.)
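(As a concrete illustration of 'directly access SOL serial consoles', a one-off session from the command line looks something like the following. This is a sketch, not our actual practice; the BMC hostname, user, and password are placeholders, and which tool you reach for is mostly a matter of what's installed.)

  # with FreeIPMI's ipmiconsole (the same library conserver uses); '&.' ends the session
  ipmiconsole -h server-bmc.example.net -u ADMIN -p SECRET
  # or with ipmitool; '~.' ends the session
  ipmitool -I lanplus -H server-bmc.example.net -U ADMIN -P SECRET sol activate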
Conserver uses FreeIPMI for IPMI SOL access, and FreeIPMI supports a special 'serial keepalive' option (which you can configure in libipmiconsole.conf) to detect and remedy this. As covered in comments in ipmiconsole.h, this option (normally) works by periodically sending a NUL character to the SOL console. The BMC will then eventually tell you that the serial connection has been broken, at which point the client can re-create its IPMI SOL session and start getting serial output again.
When I first read about this option I was enthused about putting it into our configuration, so that conserver would automatically re-establish stolen SOL connections. Then I thought about it a bit more and decided that this probably wasn't a good idea. The problem is that there's no way to tell if another IPMI SOL session is active at the moment or not (at least with this option); all we can do is unconditionally take the SOL console back. If one of us has made a SOL connection, done some stuff, and disconnected again, this is fine. If one of us is in the process of using a live SOL connection right now, this is bad.
This is especially so because about the only time when we'd resort to using a direct IPMI SOL connection instead of logging in to the console server and using conserver is when either we can't get to the console server or the console server can't get to the BMC of the machine we want to connect to. These are stressful situations when something is already wrong, so the last thing we want is to compound our problems by having a serial console connection stolen in the middle of our work.
Not configuring FreeIPMI with serial keepalives doesn't completely eliminate this problem; it could still happen if the console server machine is (re)booted or conserver is restarted. Both of these will cause conserver to start up, make a bunch of IPMI SOL connections, and steal any current by-hand SOL connections away from us. But at least it's less likely.
Handling (or not) the serial console of our serial console server
We've had a central serial console server for a long time. It has two purposes; it logs all of the (serial) console output from servers and various other pieces of hardware (which on Linux machines includes things like kernel messages, cf), and it allows us to log in to machines over their serial console. For a long time this server was a hand built one-off machine, but recently we've been rebuilding it on our standard Ubuntu framework (much like our central syslog server). Our standard Ubuntu framework includes setting up a (kernel) serial console, which made me ask myself what we were going to do with the console server's serial console.
We have a matrix of options. We can direct the serial console to either a physical serial port or to the BMC's Serial over LAN system. Once the serial console is somewhere, we can ignore it except when we want to manually use it, connect it to the console server's regular conserver instance, or connect it to a new conserver instance on some other machine (which would have to be using either IPMI Serial-over-LAN or a USB serial port, depending on which serial console we pick).
Connecting the console server's serial console to its own conserver instance would let us log routine serial console output in the same place that we put all of the other serial console output. However, it wouldn't allow us to capture kernel logs if the machine crashed for some reason, which is one valuable thing that our current serial console setup has, or log in through the serial console if the console server fell off the network. Setting up a backup, single-host conserver on another machine would allow us to do both, at the cost of having a second conserver machine to think about.
Using Serial-over-LAN would allow us to log in to the console server over its serial console from any other machine that had access to what has become our IPMI/BMC network, which is a number of them (it's that way for emergency access purposes). However it requires that the BMC network be up, which is to say that all of the relevant switches are working. A direct (USB) serial connection would only require the other machine to be up and reachable.
Of course we can split the difference. We could have the Linux kernel serial console on the physical serial port and also have logins enabled on the Serial-over-LAN serial port. In a lot of situations this would still give us remote access to the console server, although we wouldn't be able to trigger things like Magic SysRq over the SoL connection since it's not a kernel console.
(Unfortunately you can only have one kernel serial console.)
My current view is that the easiest thing to start with is to set the serial console to the Serial-over-LAN port and then not have anything collecting kernel messages from it. If we decide we want to change that, we can harvest SoL serial console messages from either the console server itself or from another machine. In an emergency, a SoL port can be accessed from any machine with BMC network access, not just from its conserver machine, unlike a physical serial port (which would have to be accessed from the other machine connected to it).
(In our current conserver setup, you don't really want to access the SoL port from another machine if you can avoid it. Doing so will quietly break the connection from conserver on the console server until you restart conserver. It's possible we could work around this with libipmiconsole.conf settings.)
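(For concreteness, pointing the kernel's serial console at the Serial-over-LAN port on an Ubuntu machine generally comes down to a kernel command line setting along these lines. The device name and speed below are assumptions, since which ttyS the BMC's SoL port shows up as varies by hardware.)

  # in /etc/default/grub, followed by running 'update-grub':
  GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"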
Our slowly growing Unix monoculture
Once upon a time, we ran Ubuntu Linux machines, OpenBSD machines, x86 Solaris machines, and what were then RHEL machines (in the days of our first generation ZFS fileservers). Over time, Solaris changed to OmniOS (and RHEL to CentOS), but even at the time it was clear that neither of those had caught on here, and after a while we replaced the OmniOS fileservers and CentOS iSCSI backends with our third generation Ubuntu-based fileservers. More recently, the final pieces of CentOS, such as our central syslog servers, have been getting removed, because CentOS as it originally was is dead (the current 'CentOS Stream' doesn't meet our needs).
Our OpenBSD usage has also been dwindling. Originally we used OpenBSD for firewalls, most DNS service, a DHCP server, and several VPN servers (for different VPN protocols). Our internal DNS resolvers now run Bind on Ubuntu and we've been expecting to some day have to move our VPN servers away from OpenBSD in order to get more up to date versions of the various VPNs (although this hasn't happened yet). The OpenBSD DHCP server is fine so far, but we have three DHCP servers and two of them are Ubuntu machines, so I wouldn't be surprised if we switch the third to Ubuntu as well when we next rebuild it.
(There's basically no prospect of us switching away from OpenBSD on the firewalls, but the firewalls are effectively appliances.)
It's probably been plural decades since our users logged in to anything other than x86 Ubuntu machines, and at least a decade since any of them were 32-bit x86 instead of 64-bit x86. It seems unlikely that we'll get ARM-based machines, especially ones that we expose to people to log in to and use. I expect we'll have to switch away from Ubuntu someday, but that will be a switch, not a long term plan of running Ubuntu as well as something else, and the most likely candidate (Debian) won't look particularly different to most people.
The old multi-Unix, multi-architecture days had their significant drawbacks, but sometimes I wonder what we're losing by increasingly becoming a monoculture that runs Ubuntu Linux and (almost) nothing else. I feel that as system administrators, there's something we gain by having exposure to different Unixes that make different choices and have different tools than Ubuntu Linux. To put it one way, I think we get a wider perspective and wind up with more ideas and approaches in our mental toolkit. We have that today because of our history, so hopefully it won't atrophy too badly when we really narrow down to being a monoculture.
How I almost set up a recursive syslog server
Over on the Fediverse, I mentioned an experience I had today:
Today I experienced that when you tell a syslog server to forward syslog to another server, it forwards everything. Including anything it was sent by other servers. And to confuse you, those forwarded messages will often be logged with the original host names, so you can wonder what these weird servers are that are sending you unexpected traffic.
At least I caught this before we had the central syslog server forward to itself. That probably would have been fun™.
You might wonder how on earth you do this to yourself without noticing, and the answer is the (dangerous) power of standardized installs.
We've had a central syslog server for a long time, along with another syslog server that we run for machines run by Points of Contact that are on internal sandbox networks. For much of this time, these syslog servers have been completely custom-installed machines; for example, they ran RHEL and then CentOS when we'd switched to Ubuntu for the rest of our machines. The current hardware and OS setup on these machines has been aging, so we've been working on replacing them. This time around, rather than doing a custom install, we decided to make these machines one variant of our standard Ubuntu install, supplemented by a small per-machine customization process. There are some potential downsides to this, since the machines have somewhat less security isolation, but we felt the advantages were worth it (for example, now they'll be part of our standard update system).
Part of our standard Ubuntu install configures the installed machine's syslog daemon to forward a copy of all syslog messages to our central syslog server; specifically this is part of the standard scripts that are run on a machine to give it our general baseline setup. This is standard and so basically invisible, so I didn't think of this syslog forwarding when putting together the post-install customization instructions for these syslog servers. Fortunately, the first syslog server we rebuilt and put into production was the additional syslog server for other people's logs, not the central server for our own logs. It also helped that today I had a reason to look at one set of logs on our central syslog server with a low enough volume that I could spot out of place entries immediately, and then start trying to track them down.
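(To make this concrete, the forwarding itself is a one-line thing in the syslog daemon's configuration. The sketch below assumes rsyslog, which the entry doesn't specify, and a made-up central server name; the second form is the sort of guard that would have avoided my surprise, by forwarding only messages that originated on the machine itself.)

  # forward everything, including anything other hosts relayed to us:
  *.*  @@syslog-central.example.org:514
  # forward only locally generated messages (local syslog arrives from 127.0.0.1):
  if $fromhost-ip == '127.0.0.1' then @@syslog-central.example.org:514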
This sort of thing is fairly closely related to the general large environment issue where you have recursive dependencies or recursive relationships between services, often without realizing it. You can even get direct self-dependencies, for example if you don't remember to change your DHCP server away from getting its network configuration by DHCP, although in that sort of case you're probably going to notice the first time you reboot the machine in production (assuming you don't have redundant DHCP servers; if you do, you might not find this out until you're cold-starting your entire environment).
(Some self-usage is harmless and even a good thing. For example, you probably want your internal DNS resolvers to do any necessary DNS lookups through themselves, instead of trying to find some other DNS resolver for them.)
Our giant login server: solving resource problems with brute force
One of the moderately peculiar aspects of our environment is that we still have general Unix multiuser systems that people with accounts can log in to and do stuff on. As part of this we have some general purpose login servers, and in particular we have one that's always been the most popular, partly because it was what you got when you did 'ssh cs.toronto.edu'. For years and years we had a succession of load and usage issues on this server, where someone would log in and start doing something that was CPU or memory intensive, hammering the machine for everyone on it (which was generally a lot of people, and so this could be pretty visible). We spent a non-trivial amount of time keeping an eye on the machine's load, sending email to people, terminating people's heavy-duty processes, and in a few cases having to block logins from specific people until they paid attention to their email.
Then a few years ago we had a chunk of spare money and decided to spend it on getting rid of the problem once and for all. We did this by buying a ridiculously overpowered server to become the new version of our primary login server, with 512 GB of RAM and 112 CPUs (AMD Epyc 7453s); in fact we bought two at once and put the other one into our SLURM cluster, where it was at the time one of the most powerful compute machines there (back in 2022).
By itself this wouldn't be sufficient to protect us from having to care about what people were doing on the machine, because (some) modern software can eat however many CPUs and however much RAM is available (due to things like auto-sizing how many things it does in parallel based on the available CPU count). So we set up per-user CPU and memory resource limits for all users. Because this server is so big, we can actually give people quite large limits; our current settings are 30 GBytes of RAM and 8 CPUs, which is effectively a reasonable desktop machine (we figure people can't really complain at that point).
(In completely unsurprising news, people do manage to run into the memory limit from time to time and have their giant processes killed.)
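(The entry doesn't say how the limits are implemented; one common way on a modern systemd-based Ubuntu machine is a drop-in that applies to every per-user slice, along these lines.)

  # /etc/systemd/system/user-.slice.d/50-limits.conf (a sketch)
  [Slice]
  # roughly 8 CPUs' worth of CPU time and 30 GB of RAM per user
  CPUQuota=800%
  MemoryMax=30G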
These limits don't completely guarantee avoiding problems, since enough different people doing enough at once could still overload the machine. But this hasn't happened yet, so in practice we've been able to basically stop caring about what people run on our primary login server, and with it we've stopped watching things like its load average and free memory. For people using our primary login server, the benefit is that they can do a lot more than they could before without problems and they don't get affected by what other people are doing.
My home wireless network and convenience versus security
The (more) secure way to do a home wireless network (or networks) is relatively clear. Your wireless network (or networks) should exist on its own network segment, generally cut off from any wired networking you have and definitely cut off from direct access to your means of Internet connectivity. To get out of the network, traffic should always have to go through a secure gateway that firewalls your home infrastructure from the random wireless devices you have to give wifi access to and from their random traffic. One of the things that this implies is that you should implement your wireless with a dedicated wireless access point, not with the wifi capabilities of some all-in-one device.
When I set up my wireless network, I didn't do it this way, and I've kept not doing it this way ever since. My internet connection uses VDSL, and when I upgraded to VDSL you couldn't get things that were just a 'VDSL modem'; the best you could do was an all-in-one router that could have the router bit turned off. My VDSL 'modem' could also be a wifi AP, so when I wanted a wireless network all of a sudden, I just turned that on and then set up my home desktop to be a DHCP server, NAT gateway, and so on. This put wifi clients on the same network segment as the VDSL modem, and in fact I lazily used the same subnet rather than running two subnets over the same physical network segment.
(Because all Internet access runs through my desktop, there's always been some security there. I only NAT'd specific IPs that I'd configured, not anything that happened to randomly show up on the network.)
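(For illustration, NAT'ing only specific IPs is the kind of thing that looks like this with Linux iptables; the client addresses and the PPPoE interface name here are placeholders, not my actual setup.)

  # masquerade only known client IPs out the PPPoE link; anything else doesn't get NAT'd
  iptables -t nat -A POSTROUTING -s 192.168.1.20 -o ppp0 -j MASQUERADE
  iptables -t nat -A POSTROUTING -s 192.168.1.21 -o ppp0 -j MASQUERADE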
Every so often since then I've thought about changing this situation. I could get a dedicated wifi AP (and it might well have better performance and reach more areas than the current VDSL modem AP does; the VDSL modem doesn't even have an external wifi antenna), and add another network interface to my desktop to segment wifi traffic to the new wifi AP network. It would get its own subnet and client devices wouldn't be able to talk directly to the VDSL modem or potentially snoop (PPPoE) traffic between my desktop and the VDSL modem.
However, much as with other tradeoffs of security versus convenience, in practice I've come down on the side of convenience. Even though it's a bit messy and not as secure as it could be, my current setup works well enough and hasn't caused problems. By sticking with the current situation, I avoid the annoyance of trying to find and buy a decent wifi AP, reorganizing things physically, changing various system configurations, and so on.
(This also avoids adding another little device I'd want to keep powered from my UPS during a power outage. I'm always going to power the VDSL modem, and I'd want to power the wifi AP too because otherwise things like my phone stop being able to use my local Internet connection and have to fall back to potentially congested or unavailable cellular signal.)
SSH has become our universal (Unix) external access protocol
When I noted that brute force attackers seem to go away rapidly if you block them, one reaction was to suggest that SSH shouldn't be exposed to the Internet. While this is viable in some places and arguably broadly sensible (since SSH has a large attack surface, as we've seen recently in CVE-2024-6387), it's not possible for us. Here at a university, SSH has become our universal external access protocol.
One of the peculiarities of universities is that people travel widely, and during that travel they need access to our systems so they can continue working. In general there are a lot of ways to give people external access to things; you can set up VPN servers, you can arrange WireGuard peer to peer connections, and so on. Unfortunately, two issues often surface: our people have widely assorted devices that they want to work from, with widely varying capabilities and ease of using VPN and VPN-like things, and their (remote) network environments may or may not like any particular VPN protocol (and they probably don't want to route their entire Internet traffic the long way around through us).
The biggest advantage of SSH is that pretty much everything can do SSH, especially because it's already a requirement for working with our Unix systems when you're on campus and connecting from within the department's networks; this is not necessarily so true of the zoo of different VPN options out there. Because SSH is so pervasive, it's also become a lowest common denominator remote access protocol, one that almost everyone allows people to use to talk to other places. There are a few places where you can't use SSH, but most of them are going to block VPNs too.
In most organizations, even if you use SSH (and IMAP, our other universal external access protocol), you're probably operating with a lot less travel and external access in general, and hopefully a rather more controlled set of client setups. In such an environment you can centralize on a single VPN that works on all of your supported client setups (and meets your security requirements), and then tell everyone that if they need to SSH to something, first they bring up their VPN connection. There's no need to expose SSH to the world, or even let the world know about the existence of specific servers.
(And in a personal environment, the answer today is probably WireGuard, since there are WireGuard clients on most modern things and it's simple enough to only expose SSH on your machines over WireGuard. WireGuard has less exposed attack surface and doesn't suffer from the sort of brute force attacks that SSH does.)
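(As a sketch of that personal setup: a WireGuard interface on the machine you want to reach, with sshd then bound only to the WireGuard address so SSH isn't exposed to the Internet at all. The keys, addresses, and port below are placeholders.)

  # /etc/wireguard/wg0.conf on the machine you SSH to
  [Interface]
  Address = 10.9.9.1/24
  ListenPort = 51820
  PrivateKey = <server private key>

  [Peer]
  # your laptop or phone; it can only reach 10.9.9.1 through the tunnel
  PublicKey = <client public key>
  AllowedIPs = 10.9.9.2/32

  # plus, in sshd_config: ListenAddress 10.9.9.1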
Brute force attackers seem to switch targets rapidly if you block them
Like everyone else, we have a constant stream of attackers trying brute force password guessing against us using SSH or authenticated SMTP, from a variety of source IPs. Some of the source IPs attack us at a low rate (although there can be bursts when a lot of them are trying), but some of them do so at a relatively high rate, high enough to be annoying. When I notice such IPs (ones making hundreds of attempts an hour, for example), I tend to put them in our firewall blocks. After recently starting to pay attention to what happens next, what I've discovered is that at least currently, most such high volume IPs give up almost immediately. Within a few minutes of being blocked their activity typically drops to nothing.
Once I thought about it, this behavior feels like an obvious thing for attackers to do. Attackers clearly have a roster of hosts they've obtained access to and a whole collection of target machines to try brute force attacks against, with very low expectations of success for any particular attack or target machine; to make up for the low success rate, they need to do as much as possible. Wasting resources on unresponsive machines cuts down the number of useful attacks they can make, so over time attackers have likely had a lot of motivation to move on rapidly when their target stops responding. If the target machine comes back some day, well, they have a big list, they'll get around to trying it again sometime.
The useful thing about this attacker behavior is that if attackers are going to entirely stop using an IP to attack you (at least for a reasonable amount of time) within a few minutes of it being blocked, you only need to block attacker IPs for those few minutes. After five or ten or twenty minutes, you can remove the IP block again. Since the attackers use a lot of IPs and their IPs may get reused later for innocent purposes, this is useful for keeping the size of firewall blocks down and limiting the potential impact of a mis-block.
(A traditional problem with putting IPs in your firewall blocks is that often you don't have a procedure to re-assess them periodically and remove them again. So once you block an IP, it can remain blocked for years, even after it gets turned over to someone completely different. This is especially the case with cloud provider IPs, which are both commonly used for attacks and then commonly turn over. Fast and essentially automated expiry helps a lot here.)
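(One way to get this sort of automatic, no-bookkeeping expiry is timed set elements; the sketch below uses nftables, although I'm not saying our firewall blocks actually work this way, and it assumes an existing 'inet filter' table with an 'input' chain.)

  # a set whose members expire on their own, plus a rule dropping traffic from them
  nft add set inet filter ssh_blocks '{ type ipv4_addr; flags timeout; }'
  nft add rule inet filter input ip saddr @ssh_blocks drop
  # block an attacking IP for 20 minutes; it silently ages out of the set afterward
  nft add element inet filter ssh_blocks '{ 192.0.2.10 timeout 20m }'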
"Out of band" network management is not trivial
One of the Canadian news items of the time interval is that a summary of the official report on the 2022 Rogers Internet and phone outage has been released (see also the CBC summary of the summary, and the Wikipedia page on the outage). This was an extremely major outage that took down both Internet and phone service for a lot of people for roughly a day and caused a series of failures in services and systems that turned out to rely on Rogers for (enough of) their phone and Internet connectivity. In the wake of the report, some people are (correctly) pointing to Rogers not having any "Out of Band" network management capability as one of the major contributing factors. Some people have gone so far as to suggest that out of band network management is an obvious thing that everyone should have. As it happens I have some opinions on this, and the capsule summary is that out of band network management is non-trivial.
(While the outage 'only' cut off an estimated 12 million people, the total population of Canada is about 40 million people, so it directly affected more than one in four Canadians.)
Obviously, doing out of band network management means that you need a dedicated set of physical hardware for your OOB network; separate switches, routers, local network cabling, and long distance fiber runs between locations (whether that is nearby university buildings or different cities). If you're serious, you probably want your OOB fiber runs to have different physical paths than your regular network fiber, so one backhoe accident can't cut both of them. This separate network infrastructure has to run to everything you want to manage and also to everywhere you want to manage your network from. This is potentially a lot of physical hardware and networking, and as they say it can get worse.
(This out of band network also absolutely has to be secure, because it's a back door to your entire network.)
When you set up OOB network management, you have a choice to make: is your OOB network the only way to manage equipment, or can you manage equipment either 'in-band' through your regular network or through the out of band network? If your OOB network is your only way of managing things, you not only have to build a separate network, you have to make sure it is fully redundant, because otherwise you've created a single point of failure for (some) management. If your OOB network is a backup, you don't necessarily need as much redundancy (although you may want some), but now you need to actively monitor and verify that both access paths work. You also have two access paths to keep secure, instead of just one.
Security, or rather access authentication, is another complication for out of band management networks. If you need your OOB network, you have to assume that all other networks aren't working, which means that everything your network routers, switches, and so on need to authenticate your access has to be accessible through the OOB management network (possibly in addition to through your regular networks, if you also have in-band management). This may not be trivial to arrange, depending on what sort of authentication system you're using. You also need to make sure that your overall authentication flow can complete using only OOB network information and services (so, for example, your authentication server can't reach out to a third party provider's MFA service to send push notifications to authentication apps on people's phones).
Locally, we have what I would describe as a discount out of band management network. It has a completely separate set of switches, cabling, and building to building fiber runs, and some things have their management interfaces on it. It doesn't have any redundancy, which is acceptable in our particular environment. Unfortunately, because it's a completely isolated network, it can be a bit awkward to use, especially if you want to put a device on it that would appreciate modern conveniences like the ability to send alert emails if something happens (or even send syslog messages to a remote server; currently our central syslog server isn't on this network, although we should probably fix that).
In many cases I think you're better off having redundant and hardened in-band management, especially with smaller networks. Running an out of band network is effectively having two separate networks to look after instead of just one; if you have limited resources (including time and attention), I think you're further ahead focusing on making a single network solid and redundant rather than splitting your efforts.
Structured log formats are not really "plaintext" logs
As sort of a follow on to how plaintext is not a great format for logs, I said something on the Fediverse:
A hill that I will at least fight on is that text based structured log formats are not 'plain text logs' as people understand them, unless perhaps you have very little metadata attached to your log messages and don't adopt one of the unambiguous encoding formats. Sure you can read them with 'less', sort of, but not really well (much less skim them rapidly).
"Plaintext" logs are a different thing than log formats that are stored using only printable and theoretically readable text. JSON is printable text, but if you dump a sequence of JSON objects into a file and call it a 'plaintext log', I think everyone will disagree with you. For system administrators, a "plaintext log" is something that we can readily view and follow using basic Unix text tools. If we can't really read through log messages with 'less' or follow the log file live with 'tail -f' or similar things, you don't have a plaintext log, you have a text encoded log.
Unfortunately, structured log formats may produce text output but often not plaintext output. Consider, for example:
ts=<...> caller=main.go:190 module=dns_amazonca target=8.8.8.8:53 level=info msg="Beginning probe" probe=dns timeout_seconds=30
ts=<...> caller=dns.go:200 module=dns_amazonca target=8.8.8.8:53 level=info msg="Resolving target address" target=8.8.8.8 ip_protocol=ip4
[...]
ts=<...> caller=dns.go:302 module=dns_amazonca target=8.8.8.8:53 level=info msg="Validating RR" rr="amazon.ca.\t17\tIN\tA\t54.239.18.172"
This is all text. You can sort of read it (especially since I've left out the relatively large timestamps). But trying to read through all of these messages with 'less' at any volume would be painful, especially if you care about the specific values of those 'rr=' things, which you're going to have to mentally decode to see through the '\t's (and other characters that may be quoted in strings).
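(You can get partway there with ordinary Unix tools, but even pulling out just the message field is a crude approximation; something like the following breaks the moment a message contains an escaped quote, which is exactly the sort of ambiguity that real logfmt parsers exist to deal with. The 'probe.log' file name is a placeholder.)

  # naive extraction of the msg="..." field from logfmt-ish lines
  grep -o 'msg="[^"]*"' probe.log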
There are text structured log formats that are somewhat better than this, for example ones that put a series of metadata labels and their values at the front and then end the log line with the main log message. At least there you can look at the end of the line in things like 'tail' and 'less' to see the message, although it may not be in a consistent column. But the more labels there are, the more the message text gets pushed aside.
One of the most common examples of a plaintext log format is the traditional syslog format:
Jul 1 17:58:53 HOST sshd[PID]: error: beginning MaxStartups throttling
Jul 1 17:58:53 HOST sshd[PID]: drop connection #10 from [SOMEIP]:36039 on [MYIP]:22 past MaxStartups
This is almost entirely the message with relatively little metadata (and a minimal timestamp that doesn't even include the year). This is what you need to maximize human readability with 'less', 'tail', and so on.
At this point people will note that the information added by structured logging is potentially important and it's useful to represent it relatively unambiguously. Some other people might ask if traditional Apache common log format, or Exim's log format, are 'plaintext logs'. My answer to both is that this illustrates why plaintext is not a great format for logs. True maximally readable plaintext logs are highly constrained and wind up leaving lots of information out or being ambiguous and hard to process or both. The more additional information you include in a clearly structured format, the more potentially useful it is but the less straightforwardly readable the result is and the less you have plaintext logs.
If you want to use a structured log format, where you sit on the spectrum between plaintext logs and JSON blobs appended to something depends on how you expect your logs to be used and consumed (and stored). If people are only ever going to consume them through special tools, you might as well go full JSON or the equivalent. If people will sometimes read your logs in raw format with 'less' or 'tail' or whatever, or your logs will be comingled with logs from other programs in random line-focused formats, you should probably choose a format that's more readable by eye, perhaps some version of logfmt.
Plaintext is not a great format for (system) logs
Recently I saw some grumpiness on the Fediverse about systemd's journal not using 'plain text' for storing logs. I have various feelings here, but one of the probably controversial ones is that in general, plain text is not a great format for logs, especially system logs. This is independent of systemd's journal or of anything else, and in fact looking back I can see signs of this in my own experiences long before the systemd journal showed up (for instance, it's part of giving up on syslog priorities).
The core problem is that log messages themselves almost invariably come with additional metadata, often fairly rich metadata, but if you store things in plain text it's difficult to handle that metadata. You have more or less three things you can do with any particular piece of metadata:
- You can augment the log message with the metadata in some (text) format. For example, the traditional syslog 'plain text' format augments the basic syslog message with the timestamp, the host name, the program, and possibly the process ID. The downside of this is that it makes log messages themselves harder to pick out and process; the more metadata you add, the more the log message itself becomes obscured.
(One can see this in syslog messages from certain sorts of modern programs, which augment their log messages with a bunch of internal metadata that they put in the syslog log message as a series of 'key=value' text.)
- You can store the metadata by implication, for example by writing log messages to separate files based on the metadata. Syslog is often configured this way, using metadata (such as the syslog facility and the log level) to control which files a log message is written to. One of the drawbacks of storing metadata by implication is that it separates out log messages, making it harder to get a global picture of what was going on. Another drawback is that it's hard to store very many different pieces of metadata this way.
- You can discard the metadata. Once again, the traditional syslog log format is an example, because it normally discards the syslog facility and the syslog log level (unless they're stored by implication).
The more metadata you have, the worse this problem is. Perhaps unsurprisingly, modern systems can often attach rich metadata to log messages, and this metadata can be quite useful for searching and monitoring. But if you write your logs out in plain text, either you get clutter and complexity or you lose metadata.
Of course if you have standard formats for attaching metadata to log messages, you can write tools that strip or manipulate this metadata in order to give you (just) the log messages. But the more you do this and rely on it, the less your logs are really plain text instead of 'structured logs stored in a somewhat readable text format'.
(The ultimate version of this is giving up on readability in the raw and writing everything out as JSON. This is technically just text, but it's not usefully plain text.)
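(To illustrate, the sshd MaxStartups throttling example from the syslog format discussion above might come out something like the following as a one-record-per-line JSON log; this is a made-up rendering rather than any particular program's actual output.)

  {"time": "<...>", "host": "HOST", "program": "sshd", "pid": "PID", "facility": "auth", "severity": "err", "msg": "error: beginning MaxStartups throttling"}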
Is blocking outgoing traffic by default a good firewall choice now?
A few years ago I wrote about how HTTP/3 needed us (and other people) to make firewall changes to allow outgoing UDP port 443 traffic. Recently this entry got discussed on lobste.rs, and the discussion made me think about whether our (sort of) default of blocking outgoing traffic is a good idea these days, at least in an environment where we don't know what's happening on our networks.
(If you do know exactly what should be on your networks and what it should be talking to, then blocking everything else is a solid security precaution against various sorts of surprises.)
I say that we 'sort of' block outgoing traffic by default because the composite of our firewall rules (on the firewalls for internal 'sandbox' networks and the perimeter firewall between our overall networks and the university's general network) already default to allowing a lot of things. In practice, mostly we default to blocking access to 'privileged' TCP ports; most or all UDP traffic and most TCP traffic to ports above 1023 is just allowed through. Then of course there is a variegated list of TCP ports that we just always allow through, some of them clearly mostly for historical reasons (we allow gopher (port 70) and finger (port 79), for example).
(Our general allowance for TCP ports above 1023 may have been partly due to FTP, back in the day. Our firewalls and their rules have been there for a long time.)
Historically, ports under 1024 were where interesting services hung out, and so you could block outgoing access to them for a combination of being a good network neighbor and stopping your people from accidentally doing things like using insecure protocols across the Internet (but then, we still allow telnet). These days this logic still sort of applies, but there are a lot of unencrypted and potentially insecure protocols that are found on high TCP ports and so could be accessed fine by people here. And outgoing access to UDP based things (including HTTP/3) is surprisingly open for most of our internal networks (it varies somewhat by network).
There are definitely outgoing low TCP ports that you don't want to let people connect to; the obvious candidate is the constellation of TCP ports associated with Microsoft CIFS (aka 'Samba'). But beyond a few known candidates I'm not sure there's a strong reason to block access to low-numbered TCP ports if we're already allowing access to high ones.
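(As a sketch of the 'few known candidates' approach in pf-style syntax, it's roughly the following; the external interface macro is a placeholder and the exact port list is a matter of taste.)

  # block outgoing CIFS/SMB and NetBIOS, while passing other outgoing ports by default
  block out quick on $ext_if proto tcp from any to any port { 139, 445 }
  block out quick on $ext_if proto udp from any to any port { 137, 138 }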
(Pragmatically we're probably not going to change our firewalls at this point. They work as it is and people aren't complaining. Of course we're making a little contribution to an environment where very few people bother trying to get a low numbered port assigned for their new system, because it often wouldn't do them much good. Instead they'll run it over HTTPS.)