
US sanctions and your VPN (and certain big US-based cloud providers)

By: cks
28 March 2025 at 02:43

As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which one the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).

Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.

(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)

Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us with an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard the resulting failures were to diagnose. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.

The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.

If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).

PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).

(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)

Three ways I know of to authenticate SSH connections with OIDC tokens

By: cks
27 March 2025 at 02:56

Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.

(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)

The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).

Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting up a SSH certificate authority and having the CA software issue signed, short-lived SSH client certificates in exchange for OIDC authentication from your OP. With client side software, these client certificates will be automatically set up for use by ssh, and on servers all you need is to trust your SSH CA. I believe you could even set this up for personal use on servers you SSH to, since you set up a personally trusted SSH CA. On the positive side, this requires minimal server changes and no extra server software, and preserves your ability to directly authenticate with passwords (and perhaps some MFA challenge). On the negative side, you now have a SSH CA you have to trust.

(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)
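
To give a concrete sense of how little the servers need, trusting a SSH CA is basically one sshd_config directive. This is only a sketch, and the file path here is an arbitrary example:

# Trust user certificates signed by our SSH CA (the path is just an example)
TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub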

The third option, which I've only recently become aware of, is Cloudflare's recently open-sourced 'opkssh' (via, Github). OPKSSH builds on something called OpenPubkey, which uses a clever trick to embed a public key you provide in (signed) OIDC tokens from your OP (for details see here). OPKSSH uses this to put a basically regular SSH public key into such an augmented OIDC token, then smuggles it from the client to the server by embedding the entire token in a SSH (client) certificate; on the server, it uses an AuthorizedKeysCommand to verify the token, extract the public key, and tell the SSH server to use the public key for verification (see How it works for more details). If you want, as far as I can see OPKSSH still supports using regular SSH public keys and also passwords (possibly plus an MFA challenge).
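
The server side hookup for this style of scheme is the stock OpenSSH AuthorizedKeysCommand mechanism. As a sketch only (the verifier program, its arguments, and the user name here are placeholders, not OPKSSH's actual command line; see its documentation for that):

# %u is the login name, %t the key type, %k the base64-encoded key;
# run the verifier as a dedicated unprivileged user
AuthorizedKeysCommand /usr/local/bin/opkssh-verify %u %t %k
AuthorizedKeysCommandUser opkssh-verifier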

(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)

It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.

(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)

Some notes on configuring Dovecot to authenticate via OIDC/OAuth2

By: cks
15 March 2025 at 03:01

Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that the token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.

If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
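
As a concrete sketch, with 'idp.example.org' standing in for your actual IdP, this is just:

curl -s https://idp.example.org/.well-known/openid-configuration | jq .
# or to pull out only the piece Dovecot needs:
curl -s https://idp.example.org/.well-known/openid-configuration | \
    jq -r .userinfo_endpoint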

You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.

(Since I've just peered deep into the Dovecot source code, it appears that setting 'introspection_mode = post' actually performs an (unauthenticated) token introspection request. The 'get' mode seems to be the same as setting tokeninfo_url. I think that if you set the 'post' mode, you also want to set active_attribute and perhaps active_value, but I don't know what to set them to, because otherwise you aren't necessarily fully validating that the token is still active. Does my head hurt? Yes. The moral here is that you should use an OIDC IdP that supports OpenID Connect UserInfo.)

If your IdP serves different groups and provides different 'issuer' ('iss') values to them, you may want to set the Dovecot 'issuers =' to the specific issuer that applies to you. You'll also want to set 'username_attribute' to whatever OIDC claim is where your IdP puts what you consider the Dovecot username, which might be the email address or something else.

It would be nice if Dovecot could discover all of this for itself when you set openid_configuration_url, but in the current Dovecot, all this does is put that URL in the JSON of the error response that's sent to IMAP clients when they fail OAUTHBEARER authentication. IMAP clients may or may not do anything useful with it.

As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes, it doesn't verify the presence of specific OIDC claims.

So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:

# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>

# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}

# optionally:
openid_configuration_url = <stick in the URL>

# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt

The OIDC scopes that IMAP clients should request when getting tokens should include a scope that provides the username_attribute claim (which is the 'email' scope if the claim is 'email'), and apparently the requested scopes should also include the offline_access scope.

If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.

(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)

How I got my nose rubbed in my screens having 'bad' areas for me

By: cks
10 March 2025 at 02:50

I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.

In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.

(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)

When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.

(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)

In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.

(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)

Sidebar: Why xload turns green under high load

Xload draws a horizontal tick line for every integer step of load average that it needs in order to display the maximum load average in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.

This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.

A Prometheus gotcha with alerts based on counting things

By: cks
6 March 2025 at 04:39

Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:

count(
  node_filesystem_size_bytes{
         host = "backupserv",
         mountpoint =~ "/dumps/tapes/slot.*" }
) != <some number>

This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.

Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:

count( ... ) by (host)
  or vector(0, "host", "bserv1")
  or vector(0, "host", "bserv2")
  ....

(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
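
For completeness, here is a sketch of the workable single-host version, with the comparison number a placeholder as before; the 'or vector(0)' supplies a 0 when the count() result would otherwise be empty:

(
  count(
    node_filesystem_size_bytes{
           host = "backupserv",
           mountpoint =~ "/dumps/tapes/slot.*" }
  ) or vector(0)
) != <some number>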

In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:

count(
  node_filesystem_size_bytes{
         host =~ "bserv1|bserv2|...",
         fstype =~ "ext.*" }
) by (host) != <other number>

In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.

(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)

PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.

What SimpleSAMLphp's core:AttributeAlter does with creating new attributes

By: cks
5 March 2025 at 03:41

SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:

 // the 65 is where this is ordered
 65 => [
    'class' => 'core:AttributeAlter',
    'subject' => 'gecos',
    'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
    'target' => 'mfaid',
    'replacement' => '\\1',
 ],

If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.

(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)

The short version of what happens is that when the target is a different attribute and the pattern doesn't match, the target will wind up set but empty. Any previous value is lost. How this happens (and what happens) starts with the fact that 'attributes' here are actually arrays of values under the covers (this is '$attributes'). When core:AttributeAlter has a different target attribute than the source attribute, it takes all of the source attribute's values, passes each of them through a regular expression search and replace (using your replacement), and then gathers up anything that changed and sets the target attribute to this gathered collection. If the pattern doesn't match any values of the attribute (in the normal case, a single value), the array of changed things is empty and your target attribute is set to an empty PHP array.

(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)

My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).

For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.

(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)
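
As a rough, untested sketch of what that core:PHP filter could look like (reusing the 'mfaid' target and the GECOS field position from the earlier example; treat this as illustration, not working configuration):

 65 => [
    'class' => 'core:PHP',
    'code' => '
        if (!empty($attributes["gecos"][0])) {
            $fields = explode(",", $attributes["gecos"][0]);
            // Only set mfaid if the fifth comma-separated field is there.
            if (isset($fields[4]) && $fields[4] !== "") {
                $attributes["mfaid"] = [$fields[4]];
            }
        }
    ',
 ],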

If you get the chance, always run more extra network fiber cabling

By: cks
4 March 2025 at 04:22

Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is an unreasonable amount of fiber to run, but mostly it comes up when you'd have to put in giant fiber patch panels.

The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.

The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.

(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)

Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).

PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.

(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)

MFA's "push notification" authentication method can be easier to integrate

By: cks
26 February 2025 at 03:59

For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.

A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.
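
As an illustration of the shape of this, a synchronous check function might look something like the following sketch, where the MFA client object and its method names are made-up stand-ins for whatever your vendor's SDK or HTTP API actually provides:

import time

def check_push_mfa(mfa_client, username, timeout=60, poll_interval=2):
    # Hypothetical synchronous push-MFA check: start a push for the user,
    # poll the provider for the outcome, and return True or False.
    # 'mfa_client' and its method names are placeholders, not a real API.
    txn = mfa_client.start_push(username)
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = mfa_client.poll_result(txn)
        if status == "approved":
            return True
        if status == "denied":
            return False
        time.sleep(poll_interval)
    # Timing out counts as a failed authentication.
    return False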

(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)

By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.

(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)

JSON has become today's machine-readable output format (on Unix)

By: cks
24 February 2025 at 04:26

Recently, I needed to delete about 1,200 email messages to a particular destination from the mail queue on one of our systems. This turned out to be trivial, because this system was using Postfix and modern versions of Postfix can output mail queue status information in JSON format. So I could dump the mail queue status, select the relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the messages. This experience has left me with the definite view that everything should have the option to output JSON for 'machine-readable' output, rather than some bespoke format. For new programs, I think that you should only bother producing JSON as your machine readable output format.

(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)
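
For the record, the Postfix queue cleanup was roughly the following pipeline, with 'example.com' standing in for the real destination; the jq selection will vary depending on what you need to match:

# Postfix 3.1+ emits one JSON object per queued message with 'postqueue -j'
postqueue -j | \
  jq -r 'select(any(.recipients[]; .address | endswith("@example.com"))) | .queue_id' | \
  postsuper -d -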

This isn't because JSON is the world's best format (JSON is at best the least bad format). Instead it's because JSON has a bunch of pragmatic virtues on a modern Unix system. In general, JSON provides a clear and basically unambiguous way to represent text data and much numeric data, even if it has relatively strange characters in it (ie, JSON has escaping rules that everyone knows and all tools can deal with); it's also generally extensible to add additional data without causing heartburn in tools that are dealing with older versions of a program's output. And on Unix there's an increasingly rich collection of tools to deal with and process JSON, starting with jq itself (and hopefully soon GNU Awk in common configurations). Plus, JSON can generally be transformed to various other formats if you need them.

(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)

There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.

(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)

PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.

(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)

It's good to have offline contact information for your upstream networking

By: cks
21 February 2025 at 03:42

So I said something on the Fediverse:

Current status: it's all fun and games until the building's backbone router disappears.

A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.

(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)

(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)

The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.

(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)

In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.

(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)

Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.

Shared (Unix) hosting and the problem of managing resource limits

By: cks
20 February 2025 at 03:14

Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resource limits to your customers.

(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)

The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use much resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.

(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)

The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).

Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').

It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.

One problem with "shared Unix hosting" was the lack of resource limits

By: cks
19 February 2025 at 04:04

I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.

The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or etc) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.

Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.
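
As an illustration of what those whole-group limits look like today on Linux, you can now do something like this with systemd's cgroup integration (the user name, program path, and the specific numbers here are arbitrary placeholders):

# Run a customer's program as a transient unit with hard caps on
# memory, CPU, and process count (cgroup v2 resource controls).
systemd-run --uid=customer1 -p MemoryMax=512M -p CPUQuota=50% \
    -p TasksMax=100 /path/to/their/program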

(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)

My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.

(How much resources to give each customer is also a problem, but that's another entry.)

How you should respond to authentication failures isn't universal

By: cks
13 February 2025 at 02:55

A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.

The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.

(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)

If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.

(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)

There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.

(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)

Everything should be able to ratelimit sources of authentication failures

By: cks
11 February 2025 at 03:54

One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.

(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)
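
For illustration, an Exim ACL ratelimit condition looks roughly like the following, with the last slash-separated piece being the key; the numbers and key here are arbitrary, and you should check the Exim documentation for the details before relying on this:

deny  ratelimit = 10 / 1h / strict / authfail-$sender_host_address
      message   = too many failures from your IP, please slow down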

I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.

The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs have very little limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.

Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.

(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)

PS: Having rate limits on failed authentications is also reassuring, at least for me.

The practical (Unix) problems with .cache and its friends

By: cks
5 February 2025 at 03:53

Over on the Fediverse, I said:

Dear everyone writing Unix programs that cache things in dot-directories (.cache, .local, etc): please don't. Create a non-dot directory for it. Because all of your giant cache (sub)directories are functionally invisible to many people using your programs, who wind up not understanding where their disk space has gone because almost nothing tells them about .cache, .local, and so on.

A corollary: if you're making a disk space usage tool, it should explicitly show ~/.cache, ~/.local, etc.

If you haven't noticed, there are an ever increasing number of programs that will cache a bunch of data, sometimes a very large amount of it, in various dot-directories in people's home directories. If you're lucky, these programs put their cache somewhere under ~/.cache; if you're semi-lucky, they use ~/.local, and if you're not lucky they invent their own directory, like ~/.cargo (used by Rust's standard build tool because it wants to be special). It's my view that this is a mistake and that everyone should put their big caches in a clearly visible directory or directory hierarchy, one that people can actually find in practice.

I will freely admit that we are in a somewhat unusual environment where we have shared fileservers, a now very atypical general multi-user environment, a compute cluster, and a bunch of people who are doing various sorts of modern GPU-based 'AI' research and learning (both AI datasets and AI software packages can get very big). In our environment, with our graduate students, it's routine for people to wind up with tens or even hundreds of GBytes of disk space used up for caches that they don't even realize are there because they don't show up in conventional ways to look for space usage.

As noted by Haelwenn /элвэн/, a plain 'du' will find such dotfiles. The problem is that plain 'du' is more or less useless for most people; to really take advantage of it, you have to know the right trick (not just the -h argument but feeding it to sort to find things). How I think most people use 'du' to find space hogs is they start in their home directory with 'du -s *' (or maybe 'du -hs *') and then they look at whatever big things show up. This will completely miss things in dot-directories in normal usage. And on Linux desktops, I believe that common GUI file browsers will omit dot-directories by default and may not even have a particularly accessible option to change that (this is certainly the behavior of Cinnamon's 'Files' application and I can't imagine that GNOME is different, considering their attitude).
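
For what it's worth, the version of the trick that actually surfaces these caches is roughly:

# the usual starting point, which misses dot-directories entirely:
du -hs ~/* | sort -rh | head -20
# a version that also sweeps in ~/.cache, ~/.local, ~/.cargo, and so on:
du -hs ~/* ~/.[!.]* 2>/dev/null | sort -rh | head -20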

(I'm not sure what our graduate students use to try to explore their disk usage, but I know that multiple graduate students have been unable to find space being eaten up in dot-directories and surprised that their home directory was using so much.)

Modern languages and bad packaging outcomes at scale

By: cks
1 February 2025 at 03:30

Recently I read Steinar H. Gunderson's Migrating away from bcachefs (via), where one of the mentioned issues was a strong disagreement between the author of bcachefs and the Debian Linux distribution about how to package and distribute some Rust-based tools that are necessary to work with bcachefs. In the technology circles that I follow, there's a certain amount of disdain for the Debian approach, so today I want to write up how I see the general problem from a system administrator's point of view.

(Saying that Debian shouldn't package the bcachefs tools if they can't follow the wishes of upstream is equivalent to saying that Debian shouldn't support bcachefs. Among other things, this isn't viable for something that's intended to be a serious mainstream Linux filesystem.)

If you're serious about building software under controlled circumstances (and Linux distributions certainly are, as are an increasing number of organizations in general), you want the software build to be both isolated and repeatable. You want to be able to recreate the same software (ideally exactly binary identical, a 'reproducible build') on a machine that's completely disconnected from the Internet and the outside world, and if you build the software again later you want to get the same result. This means that the build process can't download things from the Internet, and if you run it three months from now you should get the same result even if things out there on the Internet have changed (such as third party dependencies releasing updated versions).

Unfortunately a lot of the standard build tooling for modern languages is not built to do this. Instead it's optimized for building software on Internet connected machines where you want the latest patchlevel or even entire minor version of your third party dependencies, whatever that happens to be today. You can sometimes lock down specific versions of all third party dependencies, but this isn't necessarily the default and so programs may not be set up this way from the start; you have to patch it in as part of your build customizations.

(Some languages are less optimistic about updating dependencies, but developers tend not to like that. For example, Go is controversial for its approach of 'minimum version selection' instead of 'maximum version selection'.)

The minimum thing that any serious packaging environment needs to do is contain all of the dependencies for any top level artifact, and to force the build process to use these (and only these), without reaching out to the Internet to fetch other things (well, you're going to block all external access from the build environment). How you do this depends on the build system, but it's usually possible; in Go you might 'vendor' all dependencies to give yourself a self-contained source tree artifact. This artifact never changes the dependency versions used in a build even if they change upstream because you've frozen them as part of the artifact creation process.
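
In Go, for instance, this capture-and-freeze step is roughly the following for a module-based project:

# copy the exact versions of all dependencies into ./vendor
go mod vendor
# build using only the vendored copies, with no network access
go build -mod=vendor ./...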

(Even if you're not a distribution but an organization building your own software using third-party dependencies, you do very much want to capture local copies of them. Upstream things go away or get damaged every so often, and it can be rather bad to not be able to build a new release of some important internal tool because an upstream decided to retire to goat farming rather than deal with the EU CRA. For that matter, you might want to have local copies of important but uncommon third party open source tools you use, assuming you can reasonably rebuild them.)

If you're doing this on a small scale for individual programs you care a lot about, you can stop there. If you're doing this on a distribution's scale you have an additional decision to make: do you allow each top level thing to have its own version of dependencies, or do you try to freeze a common version? If you allow each top level thing to have its own version, you get two problems. First, you're using up more disk space for at least your source artifacts. Second and worse, now you're on the hook for maintaining, checking, and patching multiple versions of a given dependency if it turns out to have a security issue (or a serious bug).

Suppose that you have program A using version 1.2.3 of a dependency, program B using 1.2.7, the current version is 1.2.12, and the upstream releases 1.2.13 to fix a security issue. You may have to investigate both 1.2.3 and 1.2.7 to see if they have the bug and then either patch both with backported fixes or force both program A and program B to be built with 1.2.13, even if the version of these programs that you're using weren't tested and validated with this version (and people routinely break things in patchlevel releases).

If you have a lot of such programs it's certainly tempting to put your foot down and say 'every program that uses dependency X will be set to use a single version of it so we only have to worry about that version'. Even if you don't start out this way you may wind up with it after a few security releases from the dependency and the packagers of programs A and B deciding that they will just force the use of 1.2.13 (or 1.2.15 or whatever) so that they can skip the repeated checking and backporting (especially if both programs are packaged by the same person, who has only so much time to deal with all of this). If you do this inside an organization, probably no one in the outside world knows. If you do this as a distribution, people yell at you.

(Within an organization you may also have more flexibility to update program A and program B themselves to versions that might officially support version 1.2.15 of that dependency, even if the program version updates are a little risky and change some behavior. In a distribution that advertises stability and has no way of contacting people using it to warn them or coordinate changes, things aren't so flexible.)

The tradeoffs of having an internal unauthenticated SMTP server

By: cks
31 January 2025 at 04:08

One of the reactions I saw to my story of being hit by an alarmingly well prepared phish spammer was surprise that we had an unauthenticated SMTP server, even if it was only available to our internal networks. Part of the reason we have such a server is historical, but I also feel that the tradeoffs involved are not as clear cut as you might think.

One fundamental problem is that people (actual humans) aren't the only thing that needs to be able to send email. Unless you enjoy building your own notification system for system problems from scratch, a whole lot of things will try to send you email to tell you about problems. Cron jobs will email you output, you may want to get similar email about systemd units, both Linux software RAID and smartd will want to use email to tell you about failures, you may have home-grown management systems, and so on. In addition to these programs on your servers, you may have inconvenient devices like networked multi-function photocopiers that have scan to email functionality (and the people who bought them and need to use them have feelings about being able to do so). In a university environment such as ours, some of the machines involved will be run by research groups, graduate students, and so on, not your core system administrators (and it's a very good idea if these machines can tell their owners about failed disks and the like).

Most of these programs will submit their email through the local mailer facilities (whatever they are), and most local mail systems ('MTAs') can be configured to use authentication when they talk to whatever SMTP gateway you point them at. So in theory you could insist on authenticated SMTP for everything. However, this gives you a different problem, because now you must manage this authentication. Do you give each machine its own authentication identity and password, or have some degree of shared authentication? How do you distribute and update this authentication information? How much manual work are you going to need to do as research groups add and remove machines (and as your servers come and go)? Are you going to try to build a system that restricts where a given authentication identity can be used from, so that someone can't make off with the photocopier's SMTP authorization and reuse it from their desktop?

(If you instead authorize IP addresses without requiring SMTP authentication, you've simply removed the requirement for handling and distributing passwords; you're still going to be updating some form of access list. Also, this has issues if people can use your servers.)
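
To make the authenticated alternative concrete, here is a minimal sketch of what each machine's local MTA would need if you went this way (Postfix syntax, with a hypothetical smarthost name and credentials file; other MTAs have their own equivalents):

relayhost = [smtp.example.org]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt

# /etc/postfix/sasl_passwd, rebuilt with 'postmap' after changes:
# [smtp.example.org]:587    machine-name:machine-password

None of those lines are hard to write; the hard part is generating, distributing, rotating, and revoking the contents of that password file across every machine involved, which is exactly the management overhead discussed above.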

You can solve all of these problems if you want to. But there is no current general, easily deployed solution for them, partly because we don't currently have any general system of secure machine and service identity that programs like MTAs can sit on top of. So we system administrators have to build such things ourselves to let one MTA prove to another MTA who and what it is.

(There are various ways to do this other than SMTP authentication and some of them are generally used in some environments; I understand that mutual TLS is common in some places. And I believe that in theory Kerberos could solve this, if everything used it.)

Every custom piece of software or piece of your environment that you build is an overhead; it has to be developed, maintained, updated, documented, and so on. It's not wrong to look at the amount of work it would require in your environment to have only authenticated SMTP and conclude that the practical risks of having unauthenticated SMTP are low enough that you'll just do that.

PS: requiring explicit authentication or authorization for notifications is itself a risk, because it means that a machine that's in a sufficiently bad or surprising state can't necessarily tell you about it. Your emergency notification system should ideally fail open, not fail closed.

PPS: In general, there are ways to make an unauthenticated SMTP server less risky, depending on what you need it to do. For example, in many environments there's no need to directly send such system notification email to arbitrary addresses outside the organization, so you could restrict what destinations the server accepts, and maybe what sending addresses can be used with it.
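
As a sketch of the 'restrict destinations' idea, again in Postfix terms and with made-up networks and domains (and sitting on top of your normal relay controls):

mynetworks = 127.0.0.0/8, 192.168.0.0/16
smtpd_recipient_restrictions =
    check_recipient_access hash:/etc/postfix/internal_domains,
    reject

# /etc/postfix/internal_domains lists 'example.org  OK' and so on,
# rebuilt with 'postmap' after changes.

With something like this, a compromised internal machine (or a mail loop) can at most spray email at your own domains rather than the entire Internet.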

Sometimes you need to (or have to) run old binaries of programs

By: cks
24 January 2025 at 03:52

Something that is probably not news to system administrators who've been doing this long enough is that sometimes, you need to or have to run old binaries of programs. I don't mean that you need to run old versions of things (although since the program binaries are old, they will be old versions); I mean that you literally need to run old binaries, ones that were built years ago.

The obvious situation where this can happen is if you have commercial software and the vendor either goes out of business or stops providing updates for the software. In some situations this can result in you needing to keep extremely old systems alive simply to run this old software, and there are lots of stories about 'business critical' software in this situation.

(One possibly apocryphal local story is that the central IT people had to keep a SPARC Solaris machine running for more than a decade past its feasible end of life because it was the only environment that ran a very special printer driver that was used to print payroll checks.)

However, you can get into this situation with open source software too. Increasingly, rebuilding complex open source software projects is not for the faint of heart and requires complex build environments. Not infrequently, these build environments are 'fragile', in the sense that in practice they depend on and require specific versions of tools, supporting language interpreters and compilers, and so on. If you're trying to (re)build them on a modern version of the OS, you may find some issues (also). You can try to get and run the version of the tools they need, but this can rapidly send you down a difficult rabbit hole.

(If you go back far enough, you can run into 32-bit versus 64-bit issues. This isn't just compilation problems, where code isn't 64-bit safe; you can also have code that produces different results when built as a 64-bit binary.)

This can create two problems. First, historically, it complicates moving between CPU architectures. For a couple of decades that's been a non-issue for most Unix environments, because x86 was so dominant, but now ARM systems are starting to become more and more available and even attractive, and they generally don't run old x86 binaries very well. Second, there are some operating systems that don't promise long term binary compatibility to older versions of themselves; they will update system ABIs, removing the old version of the ABI after a while, and require you to rebuild software to use the new ABIs if you want to run it on the current version of the OS. If you have to use old binaries you're stuck with old versions of the OS and generally no security updates.

(If you think that this is absurd and no one would possibly do that, I will point you to OpenBSD, which does it regularly to help maintain and improve the security of the system. OpenBSD is neither wrong nor right to take their approach; they're making a different set of tradeoffs than, say, Linux, because they have different priorities.)

Some ways to restrict who can log in via OpenSSH and how they authenticate

By: cks
19 January 2025 at 04:20

In yesterday's entry on allowing password authentication from the Internet for SSH, I mentioned that there were ways to restrict who this was enabled for or who could log in through SSH. Today I want to cover some of them, using settings in /etc/ssh/sshd_config.

The simplest way is to globally restrict logins with AllowUsers, listing only the specific accounts that you want to be accessible over SSH. If there are too many such accounts or they change too often, you can switch to AllowGroups and allow only people in a specific group that you maintain, call it 'sshlogins'.
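
In sshd_config terms this is a single line either way (with whatever account and group names you actually use):

AllowUsers cks
# or, with a group you maintain:
AllowGroups sshlogins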

If you want to allow logins generally but restrict, say, password based authentication to only people that you expect, what you want is a Match block and setting AuthenticationMethods within it. You would set it up something like this:

AuthenticationMethods publickey
Match User cks
  AuthenticationMethods any

If you want to be able to log in using password from your local networks but not remotely, you could extend this with an additional Match directive that looked at the origin IP address:

Match Address 127.0.0.0/8,<your networks here>
  AuthenticationMethods any

In general, Match directives are your tool for doing relatively complex restrictions. You could, for example, arrange that accounts in a certain Unix group can only log in from the local network, never remotely. Or reverse this so that only accounts in some Unix group can log in remotely, and everyone else is only allowed to use SSH within the local network.

However, any time you're doing complex things with Match blocks, you should make sure to test your configuration to make sure it's working the way you want. OpenSSH's sshd_config is a configuration file with some additional capabilities, not a programming language, and there are undoubtedly some subtle interactions and traps you can fall into.

(This is one reason I'm not giving a lot of examples here; I'd have to carefully test them.)

Sidebar: Restricting root logins via OpenSSH

If you permit root logins via OpenSSH at all, one fun thing to do is to restrict where you'll accept them from:

PermitRootLogin no
Match Address 127.0.0.0/8,<your networks here>
  PermitRootLogin prohibit-password
  # or 'yes' for some places

A lot of Internet SSH probers direct most of their effort against the root account. With this setting you're assured that all of them will fail no matter what.

(This has come up before but I feel like repeating it.)

Thoughts on having SSH allow password authentication from the Internet

By: cks
18 January 2025 at 03:42

On the Fediverse, I recently saw a poll about whether people left SSH generally accessible on its normal port or if they moved it; one of the replies was that the person left SSH on the normal port but disallowed password based authentication and only allowed public key authentication. This almost led to me posting a hot take, but then I decided that things were a bit more nuanced than my first reaction.

As everyone with an Internet-exposed SSH daemon knows, attackers are constantly attempting password guesses against various accounts. But if you're using a strong password, the odds of an attacker guessing it are extremely low, since doing 'password cracking via SSH' has an extremely low guesses per second number (enforced by your SSH daemon). In this sense, not accepting passwords over the Internet is at most a tiny practical increase in security (with some potential downsides in unusual situations).

Not accepting passwords from the Internet protects you against three other risks, two relatively obvious and one subtle one. First, it stops an attacker that can steal and then crack your encrypted passwords; this risk should be very low if you use strong passwords. Second, you're not exposed if your SSH server turns out to have a general vulnerability in password authentication that can be remotely exploited before a successful authentication. This might not be an authentication bypass; it might be some sort of corruption that leads to memory leaks, code execution, or the like. In practice, (OpenSSH) password authentication is a complex piece of code that interacts with things like your system's random set of PAM modules.

The third risk is that some piece of software will create a generic account with a predictable login name and known default password. These seem to be not uncommon, based on the fact that attackers probe incessantly for them, checking login names like 'ubuntu', 'debian', 'admin', 'testftp', 'mongodb', 'gitlab', and so on. Of course software shouldn't do this, but if something does, not allowing password authenticated SSH from the Internet will block access to these bad accounts. You can mitigate this risk by only accepting password authentication for specific, known accounts, for example only your own account.

The potential downside of only accepting keypair authentication for access to your account is that you might need to log in to your account in a situation where you don't have your keypair available (or can't use it). This is something that I probably care about more than most people, because as a system administrator I want to be able to log in to my desktop even in quite unusual situations. As long as I can use password authentication, I can use anything trustworthy that has a keyboard. Most people probably will only log in to their desktops (or servers) from other machines that they own and control, like laptops, tablets, or phones.

(You can opt to completely disallow password authentication from all other machines, even local ones. This is an even stronger and potentially more limiting restriction, since now you can't even log in from another one of your machines unless that machine has a suitable keypair set up. As a sysadmin, I'd never do that on my work desktop, since I very much want to be able to log in to my regular account from the console of one of our servers if I need to.)

My bug reports are mostly done for work these days

By: cks
15 January 2025 at 03:33

These days, I almost entirely report bugs in open source software as part of my work. A significant part of this is that most of what I stumble over bugs in are things that work uses (such as Ubuntu or OpenBSD), or at least things that I mostly use as part of work. There are some consequences of this that I feel like noting today.

The first is that I do bug investigation and bug reporting on work time during work hours, and I don't work on "work bugs" outside of that, on evenings, weekends, and holidays. This sometimes meshes awkwardly with the time open source projects have available for dealing with bugs (which is often in people's personal time outside of work hours), so sometimes I will reply to things and do additional followup investigation out of hours to keep a bug report moving along, but I mostly avoid it. Certainly the initial investigation and filing of a work bug is a working hours activity.

(I'm not always successful in keeping it to that because there is always the temptation to spend a few more minutes digging a bit more into the problem. This is especially acute when working from home.)

The second thing is that bug filing work is merely one of the claims on my work time. I have a finite amount of work time and a variety of things to get done with varying urgency, and filing and updating bugs is not always at the top of the list. And just like any other work activity, filing a particular bug has to convince me that it's worth spending some of my limited work time on it. Work does not pay me to file bugs and make open source better; they pay me to make our stuff work. Sometimes filing a bug is a good way to do this but some of the time it's not, for example because the organization in question doesn't respond to most bug reports.

(Even when it's useful in general to file a bug report because it will result in the issue being fixed at some point in the future, we generally need to deal with the problem today, so filing the bug report may take a back seat to things like developing workarounds.)

Another consequence is that it's much easier for me to make informal Fediverse posts about bugs (often as I discover more and more disconcerting things) or write Wandering Thoughts posts about work bugs than it is to make an actual bug report. Writing for Wandering Thoughts is a personal thing that I do outside of work hours, although I write about stuff from work (and I can often use something to write about, so interesting work bugs are good grist).

(There is also that making bug reports is not necessarily pleasant, and making bad bug reports can be bad. This interacts unpleasantly with the open source valorization of public work. To be blunt, I'm more willing to do unpleasant things when work is paying me than when it's not, although often the bug reports that are unpleasant to make are also the ones that aren't very useful to make.)

PS: All of this leads to a surprisingly common pattern where I'll spend much of a work day running down a bug to the point where I feel I understand it reasonably well, come home after work, write the bug up as a Wandering Thoughts entry (often clarifying my understanding of the bug in the process), and then file a bug report at work the next work day.

IMAP clients can vary in their reactions to IMAP errors

By: cks
12 January 2025 at 03:55

For reasons outside of the scope of this entry, we recently modified our IMAP server so that it would only return 20,000 results from an IMAP LIST command (technically 20,001 results). In our environment, an IMAP LIST operation only generates this many results when one of the people who can hit this has run into our IMAP server backward compatibility problem. When we made this change, we had a choice about what would happen when the limit was hit, specifically whether to claim that the IMAP LIST operation had succeeded or had failed. In the end we decided it was better to report that the IMAP LIST operation had failed, which also allowed us to include a text message explaining what had happened (in IMAP these are relatively free form).

(The specifics of the situation are that the IMAP LIST command will report a stream of IMAP folders back to the client and then end the stream after 20,001 entries, with either an 'ok' result or an error result with text. So in the latter case, the IMAP client gets 20,001 folder entries and an error at the end.)
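
In wire terms, a client that hits the limit sees something roughly like this (a sketch; the tag, hierarchy delimiter, flags, and the exact error text will all vary):

* LIST (\HasNoChildren) "/" "archive/folder-20000"
* LIST (\HasNoChildren) "/" "archive/folder-20001"
A42 NO LIST results truncated at 20001 folders

(with the other roughly 20,000 untagged LIST responses having streamed past before these.)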

Unsurprisingly, after deploying this change we've seen that IMAP clients (both mail readers and things like server webmail code) vary in their behavior when this limit is hit. The behavior we'd like to see is that the client considers itself to have a partial result and uses it as much as possible, while also telling the person using it that something went wrong. I'm not sure any IMAP client actually does this. One webmail system that we use reports the entire output from the IMAP LIST command as an 'error' (or tries to); since the error message is the last part of the output, this means it's never visible. One mail client appears to throw away all of the LIST results and not report an error to the person using it, which in practice means that all of your folders disappear (apart from your inbox).

(Other mail clients appear to ignore the error and probably show the partial results they've received.)

Since the IMAP server streams the folder list from IMAP LIST to the client as it traverses the folders (ie, Unix directories), we don't immediately know if there are going to be too many results; we only find that out after we've already reported those 20,000 folders. But in hindsight, what we could have done is reported a final synthetic folder with a prominent explanatory name and then claimed that the command succeeded (and stopped). In practice this seems more likely to show something to the person using the mail client, since actually reporting the error text we provide is apparently not anywhere near as common as we might hope.

Using tcpdump to see only incoming or outgoing traffic

By: cks
9 January 2025 at 03:13

In the normal course of events, implementations of 'tcpdump' report on packets going in both directions, which is to say it reports both packets received and packets sent. Normally this isn't confusing and you can readily tell one from the other, but sometimes situations aren't normal and you want to see only incoming packets or only outgoing packets (this has come up before). Modern versions of tcpdump can do this, but you have to know where to look.

If you're monitoring regular network interfaces on Linux, FreeBSD, or OpenBSD, this behavior is controlled by a tcpdump command line switch. On modern Linux and on FreeBSD, this is '-Q in' or '-Q out', as covered in the Linux manpage and the FreeBSD manpage. On OpenBSD, you use a different command line switch, '-D in' or '-D out', per the OpenBSD manpage.

(The Linux and FreeBSD tcpdump use '-D' to mean 'list all interfaces'.)
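
Putting this together, typical invocations look something like the following (the interface names and the filter expression are just placeholders):

tcpdump -n -i eth0 -Q in  'tcp port 22'     # Linux, FreeBSD: received packets only
tcpdump -n -i eth0 -Q out 'tcp port 22'     # sent packets only
tcpdump -n -i em0  -D in  'tcp port 22'     # the OpenBSD equivalent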

There are network types where the in or out direction can be matched by tcpdump pcap filter rules, but plain Ethernet is not one of them. This implies that you can't write a pcap filter rule that matches some packets only inbound and some packets only outbound at the same time; instead you have to run two tcpdumps.

If you have a (software) bridge interface or bridged collection of interfaces, as far as I know on both OpenBSD and FreeBSD the 'in' and 'out' directions on the underlying physical interfaces work the way you expect. Which is to say, if you have ix0 and ix1 bridged together as bridge0, 'tcpdump -Q in -i ix0' shows packets that ix0 is receiving from the physical network and doesn't include packets forwarded out through ix0 by the bridge interface (which in some sense you could say are 'sent' to ix0 by the bridge).

The PF packet filter system on both OpenBSD and FreeBSD can log packets to a special network interface, normally 'pflog0'. When you tcpdump this interface, both OpenBSD and FreeBSD accept an 'on <interface>' (which these days is a synonym for 'ifname <interface>') clause in pcap filters, which I believe means that the packet was received on the specific interface (per my entry on various filtering options for OpenBSD). Both also have 'inbound' and 'outbound', which I believe match based on whether the particular PF rule that caused them to match was an 'in' or an 'out' rule.

(See the OpenBSD pcap-filter and the FreeBSD pcap-filter manual pages.)

I'm firmly attached to a mouse and (overlapping) windows

By: cks
31 December 2024 at 04:45

In the tech circles I follow, there are a number of people who are firmly in what I could call a 'text mode' camp (eg, also). Over on the Fediverse, I said something in an aside about my personal tastes:

(Having used Unix through serial terminals or modems+emulators thereof back in the days, I am not personally interested in going back to a single text console/window experience, but it is certainly an option for simplicity.)

(Although I didn't put it in my Fediverse post, my experience with this 'single text console' environment extends beyond Unix. Similarly, I've lived without a mouse and now I want one (although I have particular tastes in mice).)

On the surface I might seem like someone who is a good candidate for the single pane of text experience, since I do much of my work in text windows, either terminals or environments (like GNU Emacs) that ape them, and I routinely do odd things like read email from the command line. But under the surface, I'm very much not. I very much like having multiple separate blocks of text around, being able to organize these blocks spatially, having a core area where I mostly work from with peripheral areas for additional things, and being able to overlap these blocks and apply a stacking order to control what is completely visible and what's partly visible.

In one view, you could say that this works partly because I have enough screen space. In another view, it would be better to say that I've organized my computing environment to have this screen space (and the other aspects). I've chosen to use desktop computers instead of portable ones, partly for increased screen space, and I've consistently opted for relatively large screens when I could reasonably get them, steadily moving up in screen size (both physical and resolution wise) over time.

(Over the years I've gone out of my way to have this sort of environment, including using unusual window systems.)

The core reason I reach for windows and a mouse is simple: I find the pure text alternative to be too confining. I can work in it if I have to but I don't like to. Using finer grained graphical windows instead of text based ones (in a text windowing environment, which exist), and being able to use a mouse to manipulate things instead of always having to use keyboard commands, is nicer for me. This extends beyond shell sessions to other things as well; for example, generally I would rather start new (X) windows for additional Emacs or vim activities rather than try to do everything through the text based multi-window features that each has. Similarly, I almost never use screen (or tmux) within my graphical desktop; the only time I reach for either is when I'm doing something critical that I might be disconnected from.

(This doesn't mean that I use a standard Unix desktop environment for my main desktops; I have a quite different desktop environment. I've also written a number of tools to make various aspects of this multi-window environment be easy to use in a work environment that involves routine access to and use of a bunch of different machines.)

If I liked tiling based window environments, it would be easier to switch to a text (console) based environment with text based tiling of 'windows', and I would probably be less strongly attached to the mouse (although it's hard to beat the mouse for selecting text). However, tiling window environments don't appeal to me (also), either in graphical or in text form. I'll use tiling in environments where it's the natural choice (for example, in vim and emacs), but I consider it merely okay.

The TLS certificate multi-file problem (for automatic updates)

By: cks
25 December 2024 at 03:25

In a recent entry on short lived TLS certificates and graceful certificate rollover in web servers, I mentioned that one issue with software automatically reloading TLS certificates was that TLS certificates are almost always stored in multiple files. Typically this is either two files (the TLS certificate's key and a 'fullchain' file with the TLS certificate and intermediate certificates together) or three files (the key, the signed certificate, and a third file with the intermediate chain). The core problem this creates is the same one you have any time information is split across multiple files, namely making 'atomic' changes to the set of files, so that software never sees an inconsistent state with some updated files and some not.

With TLS certificates, a mismatch between the key and the signed certificate will cause the server to be unable to properly prove that it controls the private key for the TLS certificate it presented. Either it will load the new key and the old certificate or the old key and the new certificate, and in either case it won't be able to generate the correct proof (assuming the secure case where your TLS certificate software generates a new key for each TLS certificate renewal, which you want to do since you want to guard against your private key having been compromised).

The potential for a mismatch is obvious if the file with the TLS key and the file with the TLS certificate are updated separately (or a new version is written out and swapped into place separately). At this point your mind might turn to clever tricks like writing all of the new files to a new directory and somehow swapping the whole directory in at once (this is certainly where mine went). Unfortunately, even this isn't good enough because the program has to open the two (or three) files separately, and the time gap between the opens creates an opportunity for a mismatch more or less no matter what we do.

(If the low level TLS software operates by, for example, first loading and parsing the TLS certificate, then loading the private key to verify that it matches, the time window may be bigger than you expect because the parsing may take a bit of time. The minimal time window comes about if you open the two files as close to each other as possible and defer all loading and processing until after both are opened.)

The only completely sure way to get around this is to put everything in one file (and then use an appropriate way to update the file atomically). Short of that, I believe that software could try to compensate by checking that the private key and the TLS certificate match after they're automatically reloaded, and if they don't, it should reload both.
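
If you want to express that match check outside the server software, for example in whatever deploys renewed certificates, one way is to compare the public keys derived from each file (a sketch with hypothetical file names):

# both commands should print the same hash if the key and certificate match
openssl x509 -in fullchain.pem -noout -pubkey | sha256sum
openssl pkey -in privkey.pem -pubout | sha256sum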

(If you control both the software that will use the TLS certificates and the renewal software, you can do other things. For example, you can always update the files in a specific order and then make the server software trigger an automatic reload only when the timestamp changes on the last file to be updated. That way you know the update is 'done' by the time you're loading anything.)

Remembering to make my local changes emit log messages when they act

By: cks
21 December 2024 at 03:48

Over on the Fediverse, I said something:

Current status: respinning an Ubuntu package build (... painfully) because I forgot the golden rule that when I add a hack to something, I should always make it log when my hack was triggered. Even if I can observe the side effects in testing, we'll want to know it happened in production.

(Okay, this isn't applicable to all hacks, but.)

Every so often we change or augment some standard piece of software or standard part of the system to do something special under specific circumstances. A rule I keep forgetting and then either re-learning or reminding myself of is that even if the effects of my change triggering are visible to the person using the system, I want to make it log as well. There are at least two reasons for this.

The first reason is that my change may wind up causing some problem for people, even if we don't think it's going to. Should it cause such problems, it's very useful to have a log message (perhaps shortly before the problem happens) to the effect of 'I did this new thing'. This can save a bunch of troubleshooting, both at the time when we deploy this change and long afterward.

The second reason is that we may turn out to be wrong about how often our change triggers, which is to say how common the specific circumstances are. This can go either way. Our change can trigger a lot more than we expected, which may mean that it's overly aggressive and is affecting people more than we want, and cause us to look for other options. Or this could be because the issue we're trying to deal with could be more significant than we expect and justifies us doing even more. Alternately, our change can trigger a lot less than we expect, which may mean we want to take the change out rather than have to maintain a local modification that doesn't actually do much (one that almost invariably makes the system more complex and harder to understand).

In the log message itself, I want to be clear and specific, although probably not as verbose as I would be for an infrequent error message. Especially for things I expect to trigger relatively infrequently, I should probably put as many details about the special circumstances as possible into the log message, because the log message is what me and my co-workers may have to work from in six months when we've forgotten the details.

PCIe cards we use and have used in our servers

By: cks
8 December 2024 at 03:00

In a comment on my entry on how common (desktop) motherboards are supporting more M.2 NVMe slots but fewer PCIe cards, jmassey was curious about what PCIe cards we needed and used. This is a good and interesting question, especially since some number of our 'servers' are actually built using desktop motherboards for various reasons (for example, a certain number of the GPU nodes in our SLURM cluster, and some of our older compute servers, which we put together ourselves using early generation AMD Threadrippers and desktop motherboards for them).

Today, we have three dominant patterns of PCIe cards. Our SLURM GPU nodes obviously have a GPU card (x16 PCIe lanes) and we've added a single port 10G-T card (which I believe are all PCIe x4) so they can pull data from our fileservers as fast as possible. Most of our firewalls have an extra dual-port 10G card (mostly 10G-T but a few use SFPs). And a number of machines have dual-port 1G cards because they need to be on more networks; our current stock of these cards are physically x4 PCIe, although I haven't looked to see if they use all the lanes.

(We also have single-port 1G cards lying around that sometimes get used in various machines; these are x1 cards. The dual-port 10G cards are probably some mix of x4 and x8, since online checks say they come in both varieties. We have and use a few quad-port 1G cards for semi-exotic situations, but I'm not sure how many PCIe lanes they want, physically or otherwise. In theory they could reasonably be x4, since a single 1G is fine at x1.)

In the past, one generation of our fileserver setup had some machines that needed to use PCIe SAS controllers in order to be able to talk to all of the drives in their chassis, and I believe these cards were PCIe x8; these machines also used a dual 10G-T card. The current generation handles all of their drives through motherboard controllers, but we might need to move back to cards in future hardware configurations (depending on what the available server motherboards handle on the motherboard). The good news, for fileservers, is that modern server motherboards increasingly have at least one onboard 10G port. But in a worst case situation, a large fileserver might need two SAS controller cards and a 10G card.

It's possible that we'll want to add NVMe drives to some servers (parts of our backup system may be limited by SATA write and read speeds today). Since I don't believe any of our current servers support PCIe bifurcation, this would require one or two PCIe x4 cards and slots (two if we want to mirror this fast storage, one if we decide we don't care). Such a server would likely also want 10G; if it didn't have a motherboard 10G port, that would require another x4 card (or possibly a dual-port 10G card at x8).

The good news for us is that servers tend to make all of their available slots be physically large (generally large enough for x8 cards, and maybe even x16 these days), so you can fit in all these cards even if some of them don't get all the PCIe lanes they'd like. And modern server CPUs are also coming with more and more PCIe lanes, so probably we can actually drive many of those slots at their full width.

(I was going to say that modern server motherboards mostly don't design in M.2 slots that reduce the available PCIe lanes, but that seems to depend on what vendor you look at. A random sampling of Supermicro server motherboards suggests that two M.2 slots are not uncommon, while our Dell R350s have none.)

The modern world of server serial ports, BMCs, and IPMI Serial over LAN

By: cks
4 December 2024 at 04:30

Once upon a time, life was relatively simple in the x86 world. Most x86 compatible PCs theoretically had one or two UARTs, which were called COM1 and COM2 by MS-DOS and Windows, ttyS0 and ttyS1 by Linux, 'ttyu0' and 'ttyu1' by FreeBSD, and so on, based on standard x86 IO port addresses for them. Servers had a physical serial port on the back and wired the connector to COM1 (some servers might have two connectors). Then life became more complicated when servers implemented BMCs (Baseboard management controllers) and the IPMI specification added Serial over LAN, to let you talk to your server through what the server believed was a serial port but was actually a connection through the BMC, coming over your management network.

Early BMCs could take very brute force approaches to making this work. The circa 2008 era Sunfire X2200s we used in our first ZFS fileservers wired the motherboard serial port to the BMC and connected the BMC to the physical serial port on the back of the server. When you talked to the serial port after the machine powered on, you were actually talking to the BMC; to get to the server serial port, you had to log in to the BMC and do an arcane sequence to 'connect' to the server serial port. The BMC didn't save or buffer up server serial output from before you connected; such output was just lost.

(Given our long standing console server, we had feelings about having to manually do things to get the real server serial console to show up so we could start logging kernel console output.)

Modern servers and their BMCs are quite intertwined, so I suspect that often both server serial ports are basically implemented by the BMC (cf), or at least are wired to it. The BMC passes one serial port through to the physical connector (if your server has one) and handles the other itself to implement Serial over LAN. There are variants on this design possible; for example, we have one set of Supermicro hardware with no external physical serial connector, just one serial header on the motherboard and a BMC Serial over LAN port. To be unhelpful, the motherboard serial header is ttyS0 and the BMC SOL port is ttyS1.
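
Whichever serial port winds up as the Serial over LAN one is the port you point the server OS's serial console at, and then you talk to it through the BMC. As a sketch (the baud rate and BMC details are whatever your environment uses), on that Supermicro hardware the Linux kernel command line gets something like 'console=tty0 console=ttyS1,115200n8', and from the management network you connect with:

ipmitool -I lanplus -H <bmc address> -U <user> -P <password> sol activate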

When the BMC handles both server serial ports and passes one of them through to the physical serial port, it can decide which one to pass through and which one to use as the Serial over LAN port. Being able to change this in the BMC is convenient if you want to have a common server operating system configuration but use a physical serial port on some machines and use Serial over LAN on others. With the BMC switching which server serial port comes out on the external serial connector, you can tell all of the server OS installs to use 'ttyS0' as their serial console, then connect ttyS0 to either Serial over LAN or the physical serial port as you need.

Some BMCs (I'm looking at you, Dell) go to an extra level of indirection. In these, the BMC has an idea of 'serial device 1' and 'serial device 2', with you controlling which of the server's ttyS0 and ttyS1 maps to which 'serial device', and then it has a separate setting for which 'serial device' is mapped to the physical serial connector on the back. This helpfully requires you to look at two separate settings to know if your ttyS0 will be appearing on the physical connector or as a Serial over LAN console (and gives you two settings that can be wrong).

In theory a BMC could share a single server serial port between the physical serial connector and an IPMI Serial over LAN connection, sending output to both and accepting input from each. In practice I don't think most BMCs do this and there are obvious issues of two people interfering with each other that BMCs may not want to get involved in.

PS: I expect more and more servers to drop external serial ports over time, retaining at most an internal serial header on the motherboard. That might simplify BMC and BIOS settings.

My life has been improved by my quiet Prometheus alert status monitor

By: cks
29 November 2024 at 04:48

I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.

(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)

After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I actually get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).

Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).

My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.

(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)

My new solution for quiet monitoring of our Prometheus alerts

By: cks
23 November 2024 at 03:25

Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email because the core system necessary to deliver email wasn't doing so. Today, I built myself a little X based system to get around that, using the same approach as my non-interrupting notification of new email.

At a high level, what I now have is an xlbiff based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of what hosts have what alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.

(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)

The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.

Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.

To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).

(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)

To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:

ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS

I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
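
Stripped of the filtering of flaky pending alerts and all of the text formatting, the state number side of my check program looks something like this (a sketch; the Prometheus URL is a placeholder and error handling is left out):

import requests

PROMETHEUS = "http://prometheus.example.org:9090"   # placeholder URL
QUERY = ("ALERTS_FOR_STATE * ignoring(alertstate) "
         "group_left(alertstate) ALERTS")
EPOCH = 1_600_000_000    # arbitrary starting epoch, to keep the sum small

def alert_state_number():
    # Each result's value is the alert's start time (ALERTS itself is 1).
    r = requests.get(PROMETHEUS + "/api/v1/query", params={"query": QUERY})
    total = 0
    for res in r.json()["data"]["result"]:
        start = int(float(res["value"][1]))
        if res["metric"].get("alertstate") == "firing":
            start += 1   # a pending alert turning firing changes the state
        total += start - EPOCH
    return total

The real program is somewhat bigger, since it also has to produce the text summary that xlbiff displays.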

PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).

Our Prometheus alerting problem if our central mail server isn't working

By: cks
22 November 2024 at 04:04

Over on the Fediverse, I said something:

Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?

(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)

There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.

The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problems even without receiving any alert notification. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooting, I noticed almost immediately because I was in the office and some of the windows on my office desktop disappeared.

(In that old version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)

One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email focused; we don't currently use Slack, Teams, or other online chatting systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email to SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access and we'd build some sort of system on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).

For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).

(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)

IPv6 networks do apparently get probed (and implications for address assignment)

By: cks
16 November 2024 at 03:30

For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.

On my old /64, I used ::1/64 and ::2/64 for static IP addresses, of which the first was in DNS, and the other IPv6 addresses in it were the usual SLAAC assignments. The first thing I discovered in my tcpdump was a surprisingly large number of cloud-based IPv6 addresses that were pinging my ::1 address. Once I excluded that traffic, I was left with enough volume of port probes that I could easily see them in a casual tcpdump.

The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.

(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)

As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.

One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.

PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.

Your options for displaying status over time in Grafana 11

By: cks
15 November 2024 at 03:41

A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.

The generally best panel type you want to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.

You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.

(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)

As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the function of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).

Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.

If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.

Finding a good use for keep_firing_for in our Prometheus alerts

By: cks
13 November 2024 at 04:06

A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.

Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clean out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).

The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.

Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another useful short number to de-bounce our large scale alert trigger conditions.
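As a sketch of the shape of this change (the alert name, metric, and numbers here are invented for illustration, not our actual rules):

groups:
  - name: largescale
    rules:
      - alert: LargeScaleIssue
        # Old approach: latch the trigger condition with a range subquery,
        # which keeps even a brief blip 'true' for the full 20 minutes:
        #   expr: max_over_time((count(probe_success == 0) > 10)[20m:])
        # New approach: de-bounce with 'for:' and linger with keep_firing_for.
        expr: count(probe_success == 0) > 10
        for: 1m
        keep_firing_for: 20m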

(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)

I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).

Prometheus makes it annoyingly difficult to add more information to alerts

By: cks
12 November 2024 at 03:58

Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.

While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.

I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.

A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to get at a label, I think you need to use a more complex template with a 'with' or 'range' action over the query results; on the positive side, this may let you format a message fragment with multiple labels and even the value.
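To make this concrete, here is roughly the shape of an alert rule annotation that queries Prometheus during template expansion (the queries and annotation names are invented; 'query', 'first', 'value', and 'humanize' are standard Prometheus template functions):

annotations:
  summary: 'Large scale issue: {{ $value }} checks failing'
  downcount: '{{ with query "count(probe_success == 0)" }}{{ . | first | value | humanize }}{{ end }}'
  downhosts: '{{ range query "topk(5, probe_success == 0)" }}{{ .Labels.instance }} {{ end }}'

Wrapping the query in 'with' (or iterating over it with 'range') is what avoids the error log complaints when the query comes back empty.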

More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alert notifications between your Prometheus alert rules and Alertmanager. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information back out with Go template functions.

What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.

(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels, and so on.)

Doing general address matching against varying address lists in Exim

By: cks
30 October 2024 at 02:23

In various Exim setups, you sometimes want to match an email address against a file (or in general a list) of addresses and some sort of address patterns; for example, you might have a file of addresses and so on that you will never accept as sender addresses. Exim has two different mechanisms for doing this, address lists and nwildlsearch lookups in files that are performed through the '${lookup}' string expansion item. Generally it's better to use address lists, because they have a wildcard syntax that's specifically focused on email addresses, instead of the less useful nwildlsearch lookup wildcarding.
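For illustration, a file used as an address list might look something like this (the addresses are made up; the '*' and '^' items are standard address list patterns, and '#' lines are comments):

# specific addresses we never want to see as senders
spammer@example.com
noreply@example.com
# everything in a domain
*@example.org
# a regular expression, matched against the whole address
^bounce-[0-9]+@example\.net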

Exim has specific features for matching address lists (including in file form) against certain addresses associated with the email message; for example, both ACLs and routers can match against the envelope sender address (the SMTP MAIL FROM) using 'senders = ...'. If you want to match against message addresses that are not available this way, you must use a generic 'condition =' operation and either '${lookup}' or '${if match_address {..}{...}}', depending on whether you want to use a nwildlsearch lookup or an actual address list (likely in a file). As mentioned, normally you'd prefer to use an actual address list.

Now suppose that your file of addresses is, for example, per-user. In a straight 'senders =' match this is no problem, you can just write 'senders = /some/where/$local_part_data/addrs'. Life is not as easy if you want to match a message address that is not directly supported, for example the email address of the 'From:' header. If you have the user (or whatever other varying thing) in $acl_m0_var, you would like to write:

condition = ${if match_address {${address:$h_from:}} {/a/dir/$acl_m0_var/fromaddrs} }

However, match_address (and its friends) have a deliberate limitation, which is that in common Exim build configurations they don't perform string expansion on their second argument.

The way around this turns out to be to use an explicitly defined and named 'addresslist' that has the string expansion:

addresslist badfromaddrs = /a/dir/$acl_m0_var/fromaddrs
[...]
  condition = ${if match_address {${address:$h_from:}} {+badfromaddrs} }

This looks weird, since at the point we're setting up badfromaddrs the $acl_m0_var is not even vaguely defined, but it works. The important thing that makes this go is a little sentence at the start of the Exim documentation's Expansion of lists:

Each list is expanded as a single string before it is used. [...]

Although the second argument of match_address is not string-expanded when the condition is evaluated, if it specifies a named address list, that address list is itself string-expanded when it's used, and so our $acl_m0_var variable is substituted in and everything works.

Speaking from personal experience, it's easy to miss this sentence and its importance, especially if you normally use address lists (and domain lists and so on) without any string expansion, with fixed arguments.

(Probably the only reason I found it was that I was in the process of writing a question to the Exim mailing list, which of course got me to look really closely at the documentation to make sure I wasn't asking a stupid question.)

Having rate-limits on failed authentication attempts is reassuring

By: cks
23 October 2024 at 03:24

A while back I added rate-limits to failed SMTP authentication attempts. Mostly I did it because I was irritated at seeing all of the failed (SMTP) authentication attempts in logs and activity summaries; I didn't think we were in any actual danger from the usual brute force mass password guessing attacks we see on the Internet. To my surprise, having this rate-limit in place has been quite reassuring, to the point where I no longer even bother looking at the overall rate of SMTP authentication failures or their sources. Attackers are unlikely to make much headway or have much of an impact on the system.

Similarly, we recently updated an OpenBSD machine that has its SSH port open to the Internet from OpenBSD 7.5 to OpenBSD 7.6. One of the things that OpenBSD 7.6 brings with it is the latest version of OpenSSH, 9.8, which has per-source authentication rate limits (although they're not quite described that way and the feature is more general). This was also a reassuring change. Attackers wouldn't be getting into the machine in any case, but I have seen the machine use an awful lot of CPU at times when attackers were pounding away, and now they're not going to be able to do that.

(We've long had firewall rate limits on connections, but they have to be set high for various reasons including that the firewall can't tell connections that fail to authenticate apart from brief ones that did.)

I can wave my hands about why it feels reassuring (and nice) to know that we have rate-limits in place for (some) commonly targeted authentication vectors. I know it doesn't outright eliminate the potential exposure, but I also know that it helps reduce various risks. Overall, I think of it as making things quieter, and in some sense we're no longer getting constantly attacked as much.

(It's also nice to hope that we're frustrating attackers and wasting their time. They do sort of have limits on how much time they have and how many machines they can use and so on, so our rate limits make attacking us more 'costly' and less useful, especially if they trigger our rate limits.)

PS: At the same time, this shows my irrationality, because for a long time I didn't even think about how many SSH or SMTP authentication attempts were being made against us. It was only after I put together some dashboards about this in our metrics system that I started thinking about it (and seeing temporary changes in SSH patterns and interesting SMTP and IMAP patterns). Had I never looked, I would have never thought about it.

Our various different types of Ubuntu installs

By: cks
17 October 2024 at 02:15

In my entry on how we have lots of local customizations I mentioned that the amount of customization we do to any particular Ubuntu server depends on what class or type of machine they are. That's a little abstract, so let's talk about how our various machines are split up by type.

Our general install framework has two pivotal questions that categorize machines. The first question is what degree of NFS mounting the machine will do. The choices are: all of the NFS filesystems from our fileservers (more or less); NFS mounting just our central administrative filesystem, either with our full set of accounts or with just staff accounts; rsync'ing that central administrative filesystem (which implies only staff accounts); or being a completely isolated machine that doesn't have even the central administrative filesystem.

Servers that people will use have to have all of our NFS filesystems mounted, as do things like our Samba and IMAP servers. Our fileservers don't cross-mount NFS filesystems from each other, but they do need a replicated copy of our central administrative filesystem and they have to have our full collection of logins and groups for NFS reasons. Many of our more stand-alone, special purpose servers only need our central administrative filesystem, and will either NFS mount it or rsync it depending on how fast we want updates to propagate. For example, our local DNS resolvers don't particularly need fast updates, but our external mail gateway needs to be up to date on what email addresses exist, which is propagated through our central administrative filesystem.

On machines that have all of our NFS mounts, we have a further type choice; we can install them either as a general login server (called an 'apps' server for historical reasons), as a 'comps' compute server (which includes our SLURM nodes), or only install a smaller 'base' set of packages on them (which is not all that small; we used to try to have a 'core' package set and a larger 'base' package set but over time we found we never installed machines with only the 'core' set). These days the only difference between general login servers and compute servers is some system settings, but in the past they used to have somewhat different package sets.

The general login servers and compute servers are mostly not further customized (there are a few exceptions, and SLURM nodes need a bit of additional setup). Almost all machines that get only the base package set are further customized with additional packages and specific configuration for their purpose, because the base package set by itself doesn't make the machine do anything much or be particularly useful. These further customizations mostly aren't scripted (or otherwise automated) for various reasons. The one big exception is installing our NFS fileservers, which we decided was both large enough and we had enough of that we wanted to script it so that everything came out the same.

As a practical matter, the choice between NFS mounting our central administrative filesystem (with only staff accounts) and rsync'ing it makes almost no difference to the resulting install. We tend to think of the two types of servers it creates as almost equivalent and mostly lump them together. So as far as operating our machines goes, we mostly have 'all NFS mounts' machines and 'only the administrative filesystem' machines, with a few rare machines that don't have anything (and our NFS fileservers, which are special in their own way).

(In the modern Linux world of systemd, much of our customizations aren't Ubuntu specific, or even specific to Debian and derived systems that use apt-get. We could probably switch to Debian relatively easily with only modest changes, and to an RPM based distribution with more work.)

We have lots of local customizations (and how we keep track of them)

By: cks
15 October 2024 at 03:02

In a comment on my entry on forgetting some of our local changes to our Ubuntu installs, pk left an interesting and useful comment on how they manage changes so that the changes are readily visible in one place. This is a very good idea and we do something similar to it, but a general limitation of all such approaches is that it's still hard to remember all of your changes off the top of your head once you've made enough of them. Once you're changing enough things, you generally can't put them all in one directory that you can simply 'ls' to be reminded of everything you change; at best, you're looking at a list of directories where you change things.

Our system for customizing Ubuntu stores the master version of customizations in our central administrative filesystem, although split across several places for convenience. We broadly have one directory hierarchy for Ubuntu release specific files (or at least ones that are potentially version specific; in practice a lot are the same between different Ubuntu releases), a second hierarchy (or two) for files that are generic across Ubuntu versions (or should be), and then a per-machine hierarchy for things specific to a single machine. Each hierarchy mirrors the final filesystem location, so that our systemd unit files will be in, for example, <hierarchy root>/etc/systemd/system.

Our current setup embeds the knowledge of what files will or won't be installed on any particular class of machines into the Ubuntu release specific 'postinstall' script that we run to customize machines, in the form of a whole bunch of shell commands to copy each of the files (or collections of files). This gives us straightforward handling of files that aren't always installed (or that vary between types of machines), at the cost of making it a little unclear whether a particular file in the master hierarchy will actually be installed. We could try to do something more clever, but it would be less obvious than the current straightforward approach, where the postinstall script has a lot of 'cp -a <src>/etc/<file> /etc/<file>' commands and it's easy to see what you need to do to add a file or handle one specially.
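A hypothetical fragment of such a postinstall script might look like the following (all of the paths, file names, and variables here are invented for illustration):

VERS=/adm/ubuntu/24.04
GEN=/adm/ubuntu/generic

# files everyone gets
cp -a $GEN/etc/rsyslog.d/90-local.conf /etc/rsyslog.d/
cp -a $VERS/etc/systemd/system/local-something.service /etc/systemd/system/
systemctl daemon-reload

# only machines with full NFS mounts get this one
if [ "$nfsmounts" = "full" ]; then
    cp -a $VERS/etc/auto.master /etc/auto.master
fi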

(The obvious alternate approach would be to have a master file that listed all of the files to be installed on each type of machine. However, one advantage of the current approach is that it's easy to have various commentary about the files being installed and why, and it's also easy to run commands, install packages, and so on in between installing various files. We don't install them all at once.)

Based on some brute force approximation, it appears that we install around 100 customization files on a typical Ubuntu machine (we install more on some types of machines than on other types, depending on whether the machine will have all of our NFS mounts and whether or not it's a machine regular people will log in to). Specific machines can be significantly customized beyond this; for example, our ZFS fileservers get an additional scripted customization pass.

PS: The reason we have this stuff scripted and stored in a central filesystem is that we have over a hundred servers and a lot of them are basically identical to each other (most obviously, our SLURM nodes). In aggregate, we install and reinstall a fair number of machines and almost all of them have this common core.

Our local changes to standard (Ubuntu) installs are easy to forget

By: cks
14 October 2024 at 03:08

We have been progressively replacing a number of old one-off Linux machines with up to date replacements that run Ubuntu and so are based on our standard Ubuntu install. One of those machines has a special feature where a group of people are allowed to use passworded sudo to gain access to a common holding account. After we deployed the updated machine, these people got in touch with us to report that something had gone wrong with the sudo system. This was weird to me, because I'd made sure to faithfully replicate the old system's sudo customizations to the new one. When I did some testing, things got weirder; I discovered that sudo was demanding the root password instead of my password. This was definitely not how things were supposed to work for this sudo access (especially since the people with sudo access don't know the root password for the machine).

Whether or not sudo does this is controlled by the setting of 'rootpw' in sudoers or one of the files it includes (at least with Ubuntu's standard sudo configuration). The stock Ubuntu sudoers doesn't set 'rootpw', and of course this machine's sudoers customizations didn't set it either. But when I looked around, I discovered that we had long ago set up an /etc/sudoers.d customization file to set 'rootpw' and made it part of our standard Ubuntu install. When I rebuilt this machine based on our standard Ubuntu setup, the standard install process had installed this sudo customization. Since we'd long ago completely forgotten about its existence, I hadn't remembered it while customizing the machine to its new purpose, so it had stayed.
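The forgotten customization amounts to a tiny sudoers drop-in along these lines ('rootpw' is a standard sudoers Defaults flag; the file name here is just an example, not our actual one):

# /etc/sudoers.d/local-rootpw
# Make sudo ask for root's password, not the invoking user's password.
Defaults rootpw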

(We don't normally use passworded sudo, and we definitely want access to root to require someone to know the special root password, not just the password to a sysadmin's account.)

There are probably a lot of things that we've added to our standard install over the years that are like this sudo customization. They exist to make things work (or not work), and as long as they keep quietly doing their jobs it's very easy to forget them and their effects. Then we do something exceptional on a machine and they crop up, whether it's preventing sudo from working like we want it to or almost giving us a recursive syslog server.

(I don't have any particular lesson to draw from this, except that it's surprisingly difficult to de-customize a machine. One might think the answer is to set up the machine from scratch outside our standard install framework, but the reality is that there's a lot from the standard framework that we still want on such machines. Even with issues like this, it's probably easier to install them normally and then fix the issues than do a completely stock Ubuntu server install.)

Some thoughts on why 'inetd activation' didn't catch on

By: cks
13 October 2024 at 02:06

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them; it dates from the era of 4.3 BSD. In theory inetd can act as a service manager of sorts for daemons like the BSD r* commands, saving them from having to implement things like daemonization, and in fact it turns out that one version of this is how these daemons were run in 4.3 BSD. However, running daemons under inetd never really caught on (even in 4.3 BSD some important daemons ran outside of inetd), and these days it's basically dead. You could ask why, and I have some thoughts on that.

The initial version of inetd only officially supported running TCP services in a mode where each connection ran a new instance of the program (call this the CGI model). On the machines of the 1980s and 1990s, this wasn't a particularly attractive way to run anything but relatively small and simple programs (and ones that didn't have to do much work on startup). In theory you could possibly run TCP services in a mode where they were passed the server socket and then accepted new connections themselves for a while; in practice, no one seems to have really written daemons that supported this. Daemons that supported an 'inetd mode' generally meant the 'run a copy of the program for each connection' mode.

(Possibly some of them supported both modes of inetd operation, but system administrators would pretty much assume that if a daemon's documentation said just 'inetd mode' that it meant the CGI model.)
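For illustration, classic per-connection ('nowait') inetd.conf entries look roughly like this (the fields are from inetd.conf(5); the exact daemons and paths varied by system):

# service  socket  proto  wait?   user  program               arguments
ftp        stream  tcp    nowait  root  /usr/libexec/ftpd     ftpd -l
shell      stream  tcp    nowait  root  /usr/libexec/rshd     rshd

A 'wait' stream service is the other mode, where the program is handed the listening socket and accepts connections itself for a while.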

Another issue is that inetd is not a service manager. It will start things for you, but that's it; it won't shut down things for you (although you can get it to stop listening on a port), and it won't tell you what's running (you get to inspect the process list). On Unixes with a System V init system or something like it, running your daemons as standalone things gave you 'start', 'stop', 'restart', 'status', and other service management operations that might even work (depending on the quality of the init.d scripts involved). Since daemons had better usability when run as standalone services, system administrators and others had relatively little reason to push for inetd support, especially in the second mode.

In general, running any important daemon under inetd has many of the same downsides as systemd socket activation of services. As a practical matter, system administrators like to know that important daemons are up and running right away, and that they don't have some hidden issue that will cause them to fail to start just when you want them. The normal CGI-like inetd mode also means that any changes to configuration files and the like take effect right away, which may not be what you want; system administrators tend to like controlling when daemons restart with new configurations.

All of this is likely tied to what we could call 'cultural factors'. I suspect that authors of daemons perceived running standalone as the more serious and prestigious option, the one for serious daemons like named and sendmail, and inetd activation to be at most a secondary feature. If you wrote a daemon that only worked with inetd activation, you'd practically be proclaiming that you saw your program as a low importance thing. This obviously reinforces itself, to the degree that I'm surprised sshd even has an option to run under inetd.

(While some Linuxes are now using systemd socket activation for sshd, they aren't doing it via its '-i' option.)

PS: There are some services that do still generally run under inetd (or xinetd, often the modern replacement, cf). For example, I'm not sure if the Amanda backup system even has an option to run its daemons as standalone things.

Brief notes on making Prometheus's SNMP exporter use additional SNMP MIB(s)

By: cks
30 September 2024 at 03:13

Suppose, not entirely hypothetically, that you have a DSL modem that exposes information about the state of your DSL link through SNMP, and you would like to get that information into Prometheus so that you could track it over time (for reasons). You could scrape this information by 'hand' using scripts, but Prometheus has an officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only"; how you do things with it if its stock configuration doesn't meet your needs is what I would call rather underdocumented.

The first thing you'll need to do is find out what generally known and unknown SNMP attributes ('OIDs') your device exposes. You can do this using tools like snmpwalk, and see also some general information on reading things over SNMP. Once you've found out what OIDs your device supports, you need to find out if there are public MIBs for them. In my case, my DSL modem exposed information about network interfaces in the standard and widely available 'IF-MIB', and ADSL information in the standard but not widely available 'ADSL-LINE-MIB'. For the rest of this entry I'll assume that you've managed to fetch the ADSL-LINE-MIB and everything it depends on and put them in a directory, /tmp/adsl-mibs.

The SNMP exporter effectively has two configuration files (as I wrote about recently); a compiled ('generated') configuration file (or set of them) that lists in exhausting detail all of the SNMP OIDs to be collected, and an input file to a separate tool, the generator, that creates the compiled main file. To collect information from a new MIB, you need to set up a new SNMP exporter 'module' for it, and specify the root OID or OIDs involved to walk. This looks like:

---
modules:
  # The ADSL-LINE-MIB MIB
  adsl_line_mib:
    walk:
      - 1.3.6.1.2.1.10.94
      # or:
      #- adslMIB

Here adsl_line_mib is the name of the new SNMP exporter module, and we give it the starting OID of the MIB. You can't specify the name of the MIB itself as the OID to walk, although this is how 'snmpwalk' will present it. Instead you have to use the MIB's 'MODULE-IDENTITY' line, such as 'adslMIB'. Alternately, perusal of your MIB and snmpwalk results may suggest alternate names to use, such as 'adslLineMib'. Using the top level OID is probably easier.

The name of your new module is arbitrary, but it's conventional to use the name of the MIB in this form. You can do other things in your module; reading the existing generator.yml is probably the most useful documentation. As various existing modules show, you can walk multiple OIDs in one module.

This configuration file leaves out the 'auths:' section from the main generator.yml, because we only need one of them, and what we're doing is generating an additional configuration file for snmp_exporter that we'll use along with the stock snmp.yml. To actually generate our new snmp-adsl.yml, we do:

cd snmp_exporter/generator
go build
make # fetches the MIBs it needs into ./mibs
./generator generate \
   -m ./mibs \
   -m /tmp/adsl-mibs \
   -g generator-adsl.yml \
   -o /tmp/snmp-adsl.yml

We give the generator both its base set of MIBs, which will define various common things, and the directory with our ADSL-LINE-MIB and all of the MIBs it may depend on. Although the input is small, the snmp-adsl.yml will generally be quite big; in my case, over 2,000 lines.
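To actually use the result, you run the exporter with both configuration files (I believe recent snmp_exporter versions accept '--config.file' more than once) and point a Prometheus scrape job at the new module. Something along these lines, with the modem's address made up and 'public_v2' being the stock SNMP v2c 'public' community auth:

./snmp_exporter --config.file=snmp.yml --config.file=snmp-adsl.yml

and on the Prometheus side, following the standard snmp_exporter scrape pattern:

scrape_configs:
  - job_name: 'dsl-modem'
    metrics_path: /snmp
    params:
      auth: [public_v2]
      module: [adsl_line_mib]
    static_configs:
      - targets: ['192.168.1.254']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116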

As I mentioned the other day, you may find that some of the SNMP OIDs actually returned by your device don't conform to the SNMP MIB. When this happens, your scrape results will not be a success but instead a HTTP 500 error with text that says things like:

An error has occurred while serving metrics:
error collecting metric Desc{fqName: "snmp_error", help: "BITS type was not a BISTRING on the wire.", constLabels: {}, variableLabels: {}}: error for metric adslAturCurrStatus with labels [1]: <nil>

This says that the actual OID(s) for adslAturCurrStatus from my device didn't match what the MIB claimed. In this case, my raw snmpwalk output for this OID is:

.1.3.6.1.2.1.10.94.1.1.3.1.6.1 = BITS: 00 00 00 01 31

(I don't understand what this means, since I'm not anywhere near an SNMP expert.)

If the information is sufficiently important, you'll need to figure out how to modify either the MIB or the generated snmp-adsl.yml to get the information without snmp_exporter errors. Doing so is far beyond the scope of this entry. If the information is not that important, the simple way is to exclude it with a generator override:

---
modules:
  adsl_line_mib:
    walk:
      # ADSL-LINE-MIB
      #- 1.3.6.1.2.1.10.94
      - adslMIB
    overrides:
      # My SmartRG SR505N produces values for this metric
      # that make the SNMP exporter unhappy.
      adslAturCurrStatus:
        ignore: true

You can at least get the attribute name you need to ignore from the SNMP exporter's error message. Unfortunately this error message is normally visible only in scrape output, and you'll only see it if you scrape manually with something like 'curl'.

Brief notes on how the Prometheus SNMP exporter's configurations work

By: cks
28 September 2024 at 03:19

A variety of devices (including DSL modems) expose interesting information via SNMP (which is not simple, despite its name). If you have a Prometheus environment, it would be nice to get (some of) this information from your SNMP capable devices into Prometheus. You could do this by hand with scripts and commands like 'snmpget', but there is also the officially supported SNMP exporter. Unfortunately, in practice the Prometheus SNMP exporter pretty much has a sign on the front door that says "no user serviceable parts, developer access only". Understanding how to do things even a bit out of standard with it is, well, a bit tricky. So here are some notes.

The SNMP exporter ships with a 'snmp.yml' configuration file that's what the actual 'snmp_exporter' program uses at runtime (possibly augmented by additional files you provide). As you'll read when you look at the file, this file is machine generated. As far as I can tell, the primary purpose of this file is to tell the exporter what SNMP OIDs it could try to read from devices, what metrics generated from them should be called, and how to interpret the various sorts of values it gets back over SNMP (for instance, network interfaces have a 'ifType' that in raw format is a number, but where the various values correspond to different types of physical network types). These SNMP OIDs are grouped into 'modules', with each module roughly corresponding to a SNMP MIB (the correspondence isn't necessarily exact). When you ask the SNMP exporter to query a SNMP device, you normally tell the exporter what modules to use, which determines what OIDs will be retrieved and what metrics you'll get back.

The generated file is very verbose, which is why it's generated, and its format is pretty underdocumented, which certainly does help contribute to the "no user serviceable parts" feeling. There is very little support for directly writing a new snmp.yml module (which you can at least put in a separate 'snmp-me.yml' file) if you happen to have a few SNMP OIDs that you know directly, don't have a MIB for, and want to scrape and format specifically. Possibly the answer is to try to write a MIB yourself and generate a snmp-me.yml from it, but I haven't had to do this so I have no opinions on which way is better.

The generated file and its modules are created from various known MIBs by a separate program, the generator. The generator has its own configuration file to describe what modules to generate, what OIDs go into each module, and so on. This means that reading generator.yml is the best way to find out what MIBs the SNMP exporter already supports. As far as I know, although generator.yml doesn't necessarily specify OIDs by name, the generator requires MIBs for everything you want to be in the generated snmp.yml file and generate metrics for.

The generator program and its associated data isn't available as part of the pre-built binary SNMP exporter packages. If you need anything beyond the limited selection of MIBs that are compiled into the stock snmp.yml, you need to clone the repository, go to the 'generator' subdirectory, build the generator with 'go build' (currently), run 'make' to fetch and process the MIBs it expects, get (or write) MIBs for your additional metrics, and then write yourself a minimal generator-me.yml of your own to add one or more (new) modules for your new MIBs. You probably don't want to regenerate the main snmp.yml; you might as well build a 'snmp-me.yml' that just has your new modules in it, and run the SNMP exporter with snmp-me.yml as an additional configuration file.

As a practical matter, you may find that your SNMP capable device doesn't necessarily conform to the MIB that theoretically describes it, including OIDs with different data formats (or data) than expected. In the simple case, you can exclude OIDs or named attributes from being fetched so that the non-conformance doesn't cause the SNMP exporter to throw errors:

modules:
  adsl_line_mib:
[...]
    overrides:
      adslAturCurrStatus:
        ignore: true

More complex mis-matches between the MIB and your device will have you reading whatever you can find for the available options for generator.yml or even for snmp.yml itself. Or you can change your mind and scrape through scripts or programs in other languages instead of the SNMP exporter (it's what we do for some of our machine room temperature sensors).

(I guess another option is editing the MIB so that it corresponds to what your device returns, which should make the generator produce a snmp-me.yml that matches what the SNMP exporter sees from the device.)

PS: A peculiarity of the SNMP exporter is that the SNMP metrics it generates are all named after their SNMP MIB names, which produce metric names that are not at all like conventional Prometheus metric names. It's possible to put a common prefix, such as 'snmp_metric_', on all SNMP metrics to make them at least a little bit better. Technically this is a peculiarity of snmp.yml, but changing it is functionally impossible unless you hand-edit your own version.

The impact of the September 2024 CUPS CVEs depends on your size

By: cks
27 September 2024 at 03:16

The recent information security news is that there are a series of potentially serious issues in CUPS (via), but on the other hand a lot of people think that this isn't an exploit with a serious impact because, based on current disclosures, someone has to print something to a maliciously added new 'printer' (for example). My opinion is that how potentially serious this issue is for you depends on the size and scope of your environment.

Based on what we know, the vulnerability requires the CUPS server to also be running 'cups-browsed'. One of the things that cups-browsed does is allow remote printers to register themselves on the CUPS server; you set up your new printer, point it at your local CUPS print server, and everyone can now use it. As part of this registration, the collection of CUPS issues allows a malicious 'printer' to set up server side data (a CUPS PPD) that contains things that will run commands on the print server when a print job is sent to this malicious 'printer'. In order to get anything to happen, an attacker needs to get someone to do this.

In a personal environment or a small organization, this is probably unlikely. Either you know all the printers that are supposed to be there and a new one showing up is alarming, or at the very least you'll probably assume that the new printer is someone's weird experiment or local printer or whatever, and printing to it won't make either you or the owner very happy. You'll take your print jobs off to the printers you know about, and ignore the new one.

(Of course, an attacker with local knowledge could target their new printer name to try to sidestep this; for example, calling it 'Replacement <some existing printer>' or the like.)

In a larger organization, such as ours, people don't normally know all of the printers that are around and don't generally know when new printers show up. In such an environment, it's perfectly reasonable for people to call up a 'what printer do you want to use' dialog, see a new to them printer with an attractive name, and use it (perhaps thinking 'I didn't know they'd put a printer in that room, that's conveniently close'). And since printer names that include locations are perpetually misleading or wrong, most of the time people won't be particularly alarmed if they go to the location where they expect the printer (and their print job) to be and find nothing. They'll shrug, go back, and re-print their job to a regular printer they know.

(There are rare occasions here where people get very concerned when print output can't be found, but in most cases the output isn't sensitive and people don't care if there's an extra printed copy of a technical paper or the like floating around.)

Larger scale environments, possibly with an actual CUPS print server, are also the kind of environment where you might deliberately run cups-browsed. This could be to enable easy addition of new printers to your print server or to allow people's desktops to pick up what printers were available out there without you needing to even have a central print server.

My view is that this set of CVEs shows that you probably can't trust cups-browsed in general and need to stop running it, unless you're very confident that your environment is entirely secure and will never have a malicious attacker able to send packets to cups-browsed.
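On a typical systemd based Linux, checking for and turning off cups-browsed is quick (the unit name is standard, but check what your distribution calls it):

systemctl status cups-browsed
sudo systemctl disable --now cups-browsed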

(I said versions of this on the Fediverse (1, 2), so I might as well elaborate on it here.)

Our broad reasons for and approach to mirroring disks

By: cks
21 September 2024 at 02:51

When I talked about our recent interest in FreeBSD, I mentioned the issue of disk mirroring. One of the questions this raises is what we use disk mirroring for, and how we approach it in general. The simple answer is that we mirror disks for extra redundancy, not for performance, but we don't go too far to get extra redundancy.

The extremely thorough way to do disk mirroring for redundancy is to mirror with different makes and ages of disks on each side of the mirror, to try to avoid both age related failures and model or maker related issues (either firmware or where you find out that the company used some common problematic component). We don't go this far; we generally buy a block of whatever SSD is considered good at the moment, then use them for a while, in pairs, either fresh in newly deployed servers or re-using a pair in a server being re-deployed. One reason we tend to do this is that we generally get 'consumer' drives, and finding decent consumer drives is hard enough at the best of times without having to find two different vendors of them.

(We do have some HDD mirrors, for example on our Prometheus server, but these are also almost always paired disks of the same model, bought at the same time.)

Because we have backups, our redundancy goals are primarily to keep servers operating despite having one disk fail. This means that it's important that the system keep running after a disk failure, that it can still reboot after a disk failure (including of its first, primary disk), and that the disk can be replaced and put into service without downtime (provided that the hardware supports hot swapping the drive). The less this is true, the less useful any system's disk mirroring is to us (including 'hardware' mirroring, which might make you take a trip through the BIOS to trigger a rebuild after a disk replacement, which means downtime). It's also vital that the system be able to tell us when a disk has failed. Not being able to reliably tell us this is how you wind up with systems running on a single drive until that single drive then fails too.
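With Linux software RAID mirrors, for example (mdadm is purely for illustration here), checking state and getting notified is straightforward:

cat /proc/mdstat                      # '[UU]' means both halves of a mirror are in service
mdadm --detail /dev/md0               # per-array state, including failed devices
grep MAILADDR /etc/mdadm/mdadm.conf   # where mdadm's monitor emails failure reports
                                      # (the path is /etc/mdadm.conf on some systems)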

On our ZFS fileservers it would be quite undesirable to have to restore from backups, so we have an elaborate spares system that uses extra disk space on the fileservers (cf) and a monitoring system to rapidly replace failed disks. On our regular servers we don't (currently) bother with this, even on servers where we could add a third disk as a spare to the two system disks.

(We temporarily moved to three way mirrors for system disks on some critical servers back in 2020, for relatively obvious reasons. Since we're now in the office regularly, we've moved back to two way mirrors.)

Our experience so far with both HDDs and SSDs is that we don't really seem to have clear age related or model related failures that take out multiple disks at once. In particular, we've yet to lose both disks of a mirror before one could be replaced, despite our habit of using SSDs and HDDs in basically identical pairs. We have had a modest number of disk failures over the years, but they've happened by themselves.

(It's possible that at some point we'll run a given set of SSDs for long enough that they start hitting lifetime limits. But we tend to grab new SSDs when re-deploying important servers. We also have a certain amount of server generation turnover for important servers, and when we use the latest hardware it also gets brand new SSDs.)

Why we're interested in FreeBSD lately (and how it relates to OpenBSD here)

By: cks
16 September 2024 at 03:09

We have a long and generally happy history of using OpenBSD and PF for firewalls. To condense a long story, we're very happy with the PF part of our firewalls, but we're increasingly not as happy with the OpenBSD part (outside of PF). Part of our lack of cheer is the state of OpenBSD's 10G Ethernet support when combined with PF, but there are other aspects as well; we never got OpenBSD disk mirroring to be really useful and eventually gave up on it.

We wound up looking at FreeBSD after another incident with OpenBSD doing weird and unhelpful hardware things, because we're a little tired of the whole area. Our perception (which may not be reality) is that FreeBSD likely has better driver support for modern hardware, including 10G cards, and has gone further on SMP support for networking, hopefully including PF. The last time we looked at this, OpenBSD PF was more or less limited by single-'core' CPU performance, especially when used in bridging mode (which is what our most important firewall uses). We've seen fairly large bandwidth rates through our OpenBSD PF firewalls (in the 800 MBytes/sec range), but never full 10G wire bandwidth, so we've wound up suspecting that our network speed is partly being limited by OpenBSD's performance.

(To get to this good performance we had to buy servers that focused on single-core CPU performance. This created hassles in our environment, since these special single-core performance servers had to be specially reserved for OpenBSD firewalls. And single-core performance isn't going up all that fast.)

FreeBSD has a version of PF that's close enough to OpenBSD's older versions to accept much or all of the syntax of our pf.conf files (we're not exactly up to the minute on our use of PF features and syntax). We also perceive FreeBSD as likely more normal to operate than OpenBSD has been, making it easier to integrate into our environment (although we'd have to actually operate it for a while to see if that was actually the case). If FreeBSD has great 10G performance on our current generation commodity servers, without needing to buy special servers for it, and fixes other issues we have with OpenBSD, that makes it potentially fairly attractive.

(To be clear, I think that OpenBSD is (still) a great operating system if you're interested in what it has to offer for security and so on. But OpenBSD is necessarily opinionated, since it has a specific focus, and we're not really using OpenBSD for that focus. Our firewalls don't run additional services and don't let people log in, and some of them can only be accessed over a special, unrouted 'firewall' subnet.)

Getting maximum 10G Ethernet bandwidth still seems tricky

By: cks
15 September 2024 at 02:51

For reasons outside the scope of this entry, I've recently been trying to see how FreeBSD performs on 10G Ethernet when acting as a router or a bridge (both with and without PF turned on). This pretty much requires at least two more 10G test machines, so that the FreeBSD server can be put between them. When I set up these test machines, I didn't think much about them so I just grabbed two old servers that were handy (well, reasonably handy), stuck a 10G card into each, and set them up. Then I actually started testing their network performance.

I'm used to 1G Ethernet, where long ago it became trivial to achieve full wire bandwidth, even bidirectional full bandwidth (with test programs; there are many things that can cause real programs to not get this). 10G Ethernet does not seem to be like this today; the best I could do was around 950 MBytes a second in one direction (which is not 10G's top speed). With the right circumstances, bidirectional traffic could total just over 1 GByte a second, which is of course nothing like what we'd like to see.

(This isn't a new problem with 10G Ethernet, but I was hoping this had been solved in the past decade or so.)
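To give a concrete sense of the sort of measurement involved, here it is with iperf3 as an example test program (the tool choice is for illustration, and '--bidir' needs a reasonably recent iperf3):

# on one test machine
iperf3 -s

# on the other machine: one stream, several parallel streams, and bidirectional
iperf3 -c testhost -t 30
iperf3 -c testhost -t 30 -P 4
iperf3 -c testhost -t 30 --bidir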

There's a lot of things that could be contributing to this, like the speed of the CPU (and perhaps RAM), the specific 10G hardware I was using (including if it lacked performance increasing features that more expensive hardware would have had), and Linux kernel or driver issues (although this was Ubuntu 24.04, so I would hope that they were sorted out). I'm especially wondering about CPU limitations, because the kernel's CPU usage did seem to be quite high during my tests and, as mentioned, they're old servers with old CPUs (different old CPUs, even, one of which seemed to perform a bit better than the other).

(For the curious, one was a Celeron G530 in a Dell R210 II and the other a Pentium G6950 in a Dell R310, both of which date from before 2016 and are something like four generations back from our latest servers (we've moved on slightly since 2022).)

Mostly this is something I'm going to have to remember about 10G Ethernet in the future. If I'm doing anything involving testing its performance, I'll want to use relatively modern test machines, possibly several of them to create aggregate traffic, and then I'll want to start out by measuring the raw performance those machines can give me under the best circumstances. Someday perhaps 10G Ethernet will be like 1G Ethernet for this, but that's clearly not the case today (in our environment).

What admin access researchers have to their machines here

By: cks
13 September 2024 at 03:31

Recently on the Fediverse, Stephen Checkoway asked what level of access fellow academics had to 'their' computers to do things like install software (via). This is an issue very relevant to where I work, so I put a short-ish answer in the Fediverse thread and now I'm going to elaborate it at more length. Locally (within the research side of the department) we have a hierarchy of machines for this sort of thing.

At the most restricted end are the shared core machines my group operates in our now-unusual environment, such as the mail server, the IMAP server, the main Unix login server, our SLURM cluster and general compute servers, our general purpose web server, and of course the NFS fileservers that sit behind all of this. For obvious reasons, only core staff have any sort of administrative access to these machines. However, since we operate a general Unix environment, people can install whatever they want to in their own space, and they can request that we install standard Ubuntu packages, which we mostly do (there are some sorts of packages that we'll decline to install). We do have some relatively standard Ubuntu features turned off for security reasons, such as "user namespaces", which somewhat limits what people can do without system privileges. Only our core machines live on our networks with public IPs; all other machines have to go on separate private "sandbox" networks.

The second most restricted are researcher owned machines that want to NFS mount filesystems from our NFS fileservers. By policy, these must be run by the researcher's Point of Contact, operated securely, and only the Point of Contact can have root on those machines. Beyond that, researchers can and do ask their Point of Contact to install all sorts of things on their machines (the Point of Contact effectively works for the researcher or the research group). As mentioned, these machines live on "sandbox" networks. Most often they're servers that the researcher has bought with grant funding, and there are some groups that operate more and better servers than we (the core group) do.

Next are non-NFS machines that people put on research group "sandbox" networks (including networks where some machines have NFS access); people do this with both servers and desktops (and sometimes laptops as well). The policies on who has what power over these machines are up to the research group and what they (and their Point of Contact) feel comfortable with. There are some groups where I believe the Point of Contact runs everything on their sandbox network, and other groups where their sandbox network is wide open with all sorts of people running their own machines, both servers and desktops. Usually if a researcher buys servers, the obvious person to have run them is their Point of Contact, unless the research work being done on the servers is such that other people need root access (or it's easier for the Point of Contact to hand the entire server over to a graduate student and have them run it as they need it).

Finally there are generic laptops and desktops, which normally go on our port-isolated 'laptop' network (called the 'red' network after the colour of network cables we use for it, so that it's clearly distinct from other networks). We (the central group) have no involvement in these machines and I believe they're almost always administered by the person who owns or at least uses them, possibly with help from that person's Point of Contact. These days, some number of laptops (and probably even desktops) don't bother with wired networking and use our wireless network instead, where similar 'it's yours' policies apply.

People who want access to their files from their self-managed desktop or laptop aren't left out in the cold, since we have a SMB (CIFS) server. People who use Unix and want their (NFS, central) home directory mounted can use the 'cifs' (aka 'smb3') filesystem to access it through our SMB server, or even use sshfs if they want to. Mounting via cifs or sshfs is in some cases superior to using NFS, because they can give you access to important shared filesystems that we can't NFS export to machines outside our direct control.
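As a rough sketch of the self-managed Unix case (the server names and paths here are invented):

# mount your home directory through our SMB server
sudo mount -t cifs //smb.example.org/homes/yourlogin /mnt/home \
     -o username=yourlogin,uid=$(id -u),gid=$(id -g)

# or reach it over SSH via a login server with sshfs
sshfs yourlogin@login.example.org:/h/yourlogin /mnt/home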

Rate-limiting failed SMTP authentication attempts in Exim 4.95

By: cks
12 September 2024 at 03:01

Much like with SSH servers, if you have a SMTP server exposed to the Internet that supports SMTP authentication, you'll get a whole lot of attackers showing up to do brute force password guessing. It would be nice to slow these attackers down by rate-limiting their attempts. If you're using Exim, as we are, then this is possible to some degree. If you're using Exim 4.95 on Ubuntu 22.04 (instead of a more recent Exim), it's trickier than it looks.

One of Exim's ACLs, the ACL specified by acl_smtp_auth, is consulted just before Exim accepts a SMTP 'AUTH <something>' command. If this ACL winds up returning a 'reject' or a 'defer' result, Exim will defer or reject the AUTH command and the SMTP client will not be able to try authenticating. So obviously you need to put your ratelimit statement in this ACL, but there are two complications. First, this ACL doesn't have access to the login name the client is trying to authenticate (this information is only sent after Exim accepts the 'AUTH <whatever>' command), so all you can ratelimit is the source IP (or a network area derived from it). Second, this ACL happens before you know what the authentication result is, so you don't want to actually update your ratelimit in it, just check what the ratelimit is.

This leads to the basic SMTP AUTH ACL of:

acl_smtp_auth = acl_check_auth
begin acl
acl_check_auth:
  # We'll cover what this is for later
  warn
    set acl_c_auth = true

  deny
    ratelimit = 10 / 10m / per_cmd / readonly / $sender_host_address
    delay = 10s
    message = You are failing too many authentication attempts.
    # you might also want:
    # log_message = ....

  # don't forget this or you will be sad
  # (because no one will be able to authenticate)
  accept

(The 'delay = 10s' usefully slows down our brute force SMTP authentication attackers because they seem to wait for the reply to their SMTP AUTH command rather than giving up and terminating the session after a couple of seconds.)

This ratelimit is read-only because we don't want to update it unless the SMTP authentication fails; otherwise, you will wind up (harshly) rate-limiting legitimate people who repeatedly connect to you, authenticate, perhaps send an email message, and then disconnect. Since we can't update the ratelimit in the SMTP AUTH ACL, we need to somehow recognize when authentication has failed and update the ratelimit in that place.

In Exim 4.97 and later, there's a convenient and direct way to do this through the events system and the 'auth:fail' event that is raised by an Exim server when SMTP authentication fails. As I understand it, the basic trick is that you make the auth:fail event invoke a special ACL, and have the user ACL update the ratelimit. Unfortunately Ubuntu 22.04 has Exim 4.95, so we must be more clever and indirect, and as a result somewhat imperfect in what we're doing.

To increase the ratelimit when SMTP authentication has failed, we add an ACL that is run at the end of the connection and increases the ratelimit if an authentication was attempted but did not succeed, which we detect by the lack of authentication information. Exim has two possible 'end of session' ACL settings, one that is used if the session is ended with a SMTP QUIT command and one that is ended if the SMTP session is just ended without a QUIT.

So our ACL setup to update our ratelimit looks like this:

[...]
acl_smtp_quit = acl_count_failed_auth
acl_smtp_notquit = acl_count_failed_auth

begin acl
[...]

acl_count_failed_auth:
  warn
    condition = ${if bool{$acl_c_auth} }
    !authenticated = *
    ratelimit = 10 / 10m / per_cmd / strict / $sender_host_address

  accept

Our $acl_c_auth SMTP connection ACL variable tells us whether or not the connection attempted to authenticate (sometimes legitimate people simply connect and don't do anything before disconnecting), and then we also require that the connection not be authenticated now to screen out people who succeeded in their SMTP authentication. The settings for the two 'ratelimit =' settings have to match or I believe you'll get weird results.

(The '10 failures in 10 minutes' setting works for us but may not work for you. If you change the 'deny' to 'warn' in acl_check_auth and comment out the 'message =' bit, you can watch your logs to see what rates real people and your attackers actually use.)

The limitation on this is that we're actually increasing the ratelimit based not on the number of (failed) SMTP authentication attempts but on the number of connections that tried but failed SMTP authentication. If an attacker connects and repeatedly tries to do SMTP AUTH in the session, failing each time, we wind up only counting it as a single 'event' for ratelimiting because we only increase the ratelimit (by one) when the session ends. For the brute force SMTP authentication attackers we see, this doesn't seem to be an issue; as far as I can tell, they disconnect their session when they get a SMTP authentication failure.

I should probably reboot BMCs any time they behave oddly

By: cks
9 September 2024 at 03:13

Today on the Fediverse I said:

It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.

(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)

I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.

We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.

Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
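
For the record, the ipmitool commands involved are straightforward; this is only an illustration, with placeholder network arguments and password handling (yours will vary):

# check and control power over the network
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> chassis power status
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> chassis power on
# cold reset the BMC itself, which is what finally worked here
ipmitool -I lanplus -H <bmc-ip> -U <user> -f <password-file> mc reset cold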

As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.

(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)

Using rsync to create a limited ability to write remote files

By: cks
5 September 2024 at 02:56

Suppose that you have an isolated high security machine and you want to back up some of its data on another machine, which is also sensitive in its own way and which doesn't really want to have to trust the high security machine very much. Given the source machine's high security, you need to push the data to the backup host instead of pulling it. Because of the limited trust relationship, you don't want to give the source host very much power on the backup host, just in case. And you'd like to do this with standard tools that you understand.

I will cut to the chase: as far as I can tell, the easiest way to do this is to use rsync's daemon mode on the backup host combined with SSH (to authenticate either end and encrypt the traffic in transit). It appears that another option is rrsync, but I only just discovered it, and we already have prior experience with rsync's daemon mode from read-only replication.

Rsync's daemon mode is controlled by a configuration file that can restrict what it allows the client (your isolated high security source host) to do, particularly where the client can write, and can even chroot if you run things as root. So the first ingredient we need is a suitable rsyncd.conf, which will have at least one 'module' that defines parameters:

[backup-host1]
comment = Backup module for host1
# This will normally have restricted
# directory permissions, such as 0700.
path = /backups/host1
hosts allow = <host1 IP>
# Let's assume we're started out as root
use chroot = yes
uid = <something>
gid = <something>

The rsyncd.conf 'hosts allow' module parameter works even over SSH; rsync will correctly pull out the client IP from the environment variables the SSH daemon sets.

The next ingredient is a shell script that forces the use of this rsyncd.conf:

#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .

As with the read-only replication, this script completely ignores command line arguments that the client may try to use. Very cautious people could inspect the client's command line to look for unexpected things, but we don't bother.

Finally you need a SSH keypair and a .ssh/authorized_keys entry on the backup machine for that keypair that forces using your script:

from="<host1 IP>",command="/backups/host1-script",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]

(Since we're already restricting the rsync module by IP, we definitely want to restrict the key usage as well.)

On the high security host, you transfer files to the backup host with:

rsync -a --rsh="/usr/bin/ssh -i /client/identity" yourfile LOGIN@SERVER::backup-host1/

Depending on what you're backing up and how you want to do things, you might want to set the rsyncd.conf module parameters 'write only = true' and perhaps 'refuse options = delete', if you're sure you don't want the high security machine to be able to retrieve its files once it has put them there. On the other hand, if the high security machine is supposed to be able to routinely retrieve its backups (perhaps to check that they're good), you don't want this.

(If the high security machine is only supposed to read back files very rarely, you can set 'write only = true' until it needs to retrieve a file.)
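
As an illustration, the more locked down version of the module would add something like the following parameters (untested as shown; check rsyncd.conf(5) for whether you also want to refuse the various --delete-* options):

[backup-host1]
# ... the same parameters as before, plus:
write only = true
refuse options = delete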

There are various alternative approaches, but this one is relatively easy to set up, especially if you already have a related rsync daemon setup for read-only replication.

(On the one hand it feels annoying that there isn't a better way to do this sort of thing by now. On the other hand, the problems involved are not trivial. You need encryption, authentication of both ends, a confined transfer protocol, and so on. Here, SSH provides the encryption and authentication and rsync provides the confined transfer protocol, at the cost of having to give access to a Unix account and trust rsync's daemon mode code.)

Some reasons why we mostly collect IPMI sensor data locally

By: cks
28 August 2024 at 02:40

Most servers these days support IPMI and can report various sensor readings through it, which you often want to use. In general, you can collect IPMI sensor readings either on the host itself through the host OS or over the network using standard IPMI networking protocols (there are several generations of them). We have almost always collected this information locally (and then fed it into our Prometheus based monitoring system), for an assortment of reasons, some of them general and some of them specific to us.

When we collect IPMI sensor data locally, we export it through the standard Prometheus host agent, which has a feature where you can give it text files of additional metrics (cf). Although there is a 'standard' third party network IPMI metrics exporter, we ended up rolling our own for various reasons (through a Prometheus exporter that can run scripts for us). So we could collect IPMI sensor data either way, but we almost entirely collect the data locally.

(These days it is a standard part of our general Ubuntu customizations to set up sensor data collection from the IPMI if the machine has one.)
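
As a concrete illustration of the textfile approach, a minimal sketch might look something like this (the metric name, the output path, and the field handling are all assumptions for illustration, not our actual script):

#!/bin/sh
# Turn 'ipmitool sensor' output ('|' separated: name, value, unit, ...)
# into Prometheus textfile collector metrics.
OUT=/var/lib/node_exporter/textfile/ipmi.prom
ipmitool sensor 2>/dev/null |
  awk -F'|' '$3 ~ /degrees C/ && $2 !~ /na/ {
      name = $1; gsub(/^ +| +$/, "", name); gsub(/ /, "_", name)
      printf "ipmi_temperature_celsius{sensor=\"%s\"} %s\n", name, $2 + 0
  }' >"$OUT.new" && mv "$OUT.new" "$OUT"

(The output directory is whatever you've pointed the host agent's --collector.textfile.directory at.)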

The generic reasons for not collecting IPMI sensor data over the network are that your server BMCs might not be on the network at all (perhaps they don't have a dedicated BMC network interface), or you've sensibly put them on a secured network and your monitoring system doesn't have access to it. We have two additional reasons for preferring local IPMI sensor data collection.

First, even when our servers have dedicated management network ports, we don't always bother to wire them up; it's often just extra work for relatively little return (and it exposes the BMC to the network, which is not always a good thing). Second, when we collect IPMI sensor data through the host, we automatically start and stop collecting sensor data for the host when we start or stop monitoring the host in general (and we know for sure that the IPMI sensor data really matches that host). We almost never care about IPMI data when either the host isn't otherwise being monitored or the host is off.

Our system for collecting IPMI sensor data over the network actually dates from when this wasn't true, because we once had some (donated) blade servers that periodically mysteriously locked up under some conditions that seemed related to load (so much so that we built a system to automatically power cycle them via IPMI when they got hung). One of the things we were very interested in was if these blade servers were hitting temperature or fan limits when they hung. Since the machines had hung we couldn't collect IPMI information through their host agent; getting it from the IPMI over the network was our only option.

(This history has created a peculiarity, which is that our script for collecting network IPMI sensor data used what was at the time the existing IPMI user that was already set up to remotely power cycle the C6220 blades. So now anything we want to remotely collect IPMI sensor data from has a weird 'reboot' user, which these days doesn't necessarily have enough IPMI privileges to actually reset the machine.)

PS: We currently haven't built a local IPMI sensor data collection system for our OpenBSD machines, although OpenBSD can certainly talk to a local IPMI, so we collect data from a few of those machines over the network.

JSON is usually the least bad option for machine-readable output formats

By: cks
25 August 2024 at 02:28

Over on the Fediverse, I said something:

In re JSON causing problems, I would rather deal with JSON than yet another bespoke 'simpler' format. I have plenty of tools that can deal with JSON in generally straightforward ways and approximately none that work on your specific new simpler format. Awk may let me build a tool, depending on what your format is, and Python definitely will, but I don't want to.

This is re: <Royce Williams Fediverse post>

This is my view as a system administrator, because as a system administrator I deal with a lot of tools that could each have their own distinct output format, each of which I have to parse separately (for example, smartctl's bespoke output, although that output format sort of gets a pass because it was intended for people, not further processing).

JSON is not my ideal output format. But it has the same virtue as gofmt does; as Rob Pike has said, "gofmt's style is no one's favorite, yet gofmt is everyone's favorite" (source, also), because gofmt is universal and settles the arguments. Everything has to have some output format, so having a single one that is broadly used and supported is better than having N of them. And jq shows the benefit of this universality, because if something outputs JSON, jq can do useful things with it.

(In turn, the existence of jq makes JSON much more attractive to system administrators than it otherwise would be. If I had no ready way to process JSON output, I'd be much less happy about it and it would stop being the easy output format to deal with.)
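
As a small illustration, modern versions of smartctl can emit JSON directly, at which point jq makes ad-hoc extraction easy (the exact field names here are from memory and may vary by drive and smartmontools version):

smartctl -j -a /dev/sda | jq -r '.temperature.current'
smartctl -j -a /dev/sda | jq -r '"\(.model_name): \(.serial_number)"'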

I don't have any particular objection to programs that want to output in their own format (perhaps a simpler one). But I want them to give me an option for JSON too, and most of the time I'm going to go with JSON. I've already written enough ad-hoc text processing things in awk, and a few too many heavy duty text parsing things in Python. I don't really want to write another one just for you. If your program does use only a custom output format, I want there to be a really good reason why you did it, not just that you don't like the aesthetics of JSON. As Rob Pike says, no one likes gofmt's style, but we all like that everyone uses it.

(It's my view that JSON's increased verbosity over alternatives isn't a compelling reason unless there's either a really large amount of data or you have to fit into very constrained space, bandwidth, or other things. In most environments, disk space and bandwidth are much cheaper than people's time and the liability of yet another custom tool that has to be maintained.)

PS: All of this is for output formats that are intended to be further processed. JSON is a terrible format for people to read directly, so terrible that my usual reaction to having to view raw JSON is to feed it through 'jq . | less'. But your tool should almost always also have an option for some machine readable format (trust me, someday system administrators will want to process the information your tool generates).

Some brief notes on 'numfmt' from GNU Coreutils

By: cks
21 August 2024 at 03:20

Many years ago I learned about numfmt (also) from GNU Coreutils (see the comments on this entry and then this entry). An additional source of information is PΓ‘draig Brady's numfmt - A number reformatting utility. Today I was faced with a situation where I wanted to compute and print multi-day, cumulative Amanda dump total sizes for filesystems in a readable way, and the range went from under a GByte to several TBytes, so I didn't want to just convert everything to TBytes (or GBytes) and be done with it. I was doing the summing up in awk and briefly considered doing this 'humanization' in awk (again, I've done it before) before I remembered numfmt and decided to give it a try.

The basic pattern for using numfmt here was:

cat <amanda logs> | awk '...' | sort -nr | numfmt --to iec

This printed out '<size> <what ...>', and then numfmt turned the first field into humanized IEC values. As I did here, it's better to sort before numfmt, using the full precision raw number, rather than after numfmt (with 'sort -h'), with its rounded (printed) values.

Although Amanda records dump sizes in KBytes, I had my awk print them out in bytes. It turns out that I could have kept them in KBytes and had numfmt do the conversion, with 'numfmt --from-unit 1024 --to iec'.

(As far as I can tell, the difference between --from-unit and --to-unit is that the former multiplies the number and the latter divides it, which is probably not going to be useful with IEC units. However, I can see it being useful if you wanted to mass-convert times in sub-second units to seconds, or convert seconds to a larger unit such as hours. Unfortunately numfmt currently has no unit options for time, so you can only do pure numeric shifts.)
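
For example, here is the sort of pure numeric shift I mean:

echo 7200 | numfmt --to-unit 3600     # seconds to hours; prints 2
echo 3 | numfmt --from-unit 1024      # KBytes to bytes; prints 3072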

If left to do its own formatting, numfmt has two issues (at least when doing conversions to IEC units). First, it will print some values with one decimal place and others with no decimal place. This will generally give you a result that can be hard to skim because not everything lines up, like this:

 3.3T [...]
 581G [...]
 532G [...]
 [...]
  11G [...]
 9.8G [...]
 [...]
 1.1G [...]
 540M [...]

I prefer all of the numbers to line up, which means explicitly specifying the number of decimal places that everything gets. I tend to use one decimal place for everything, but none ('.0') is a perfectly okay choice. This is done with the --format argument:

 ... | numfmt --format '%.1f' --to iec

The second issue is that in the process of reformatting your numbers, numfmt will by and large remove any nice initial formatting you may have tried to do in your awk. Depending on how much (re)formatting you want to do, you may want another 'awk' step after the numfmt to pretty-print everything, or you can perhaps get away with --format:

... | numfmt --format '%10.1f  ' --to iec

Here I'm specifying a field width for enough white space and also putting some spaces after the number.

Even with the need to fiddle around with formatting afterward, using numfmt was very much the easiest and fastest way to humanize numbers in this script. Now that I've gone through this initial experience with numfmt, I'll probably use it more in the future.

Workarounds are often forever (unless you work to make them otherwise)

By: cks
16 August 2024 at 02:37

Back in 2018, ZFS on Linux had a bug that could panic the system if you NFS-exported ZFS snapshots. We were setting up ZFS based NFS fileservers and we knew about this bug, so at the time we set things so that only filesystems themselves were NFS exported and available on our servers. Any ZFS snapshots on filesystems were only visible if you directly logged in to the fileservers, which was (and is) something that only core system staff could do. This is somewhat inconvenient; we have to get involved any time people want to get stuff back from snapshots.

It is now 2024. ZFS on Linux became OpenZFS (in 2020) and has long since fixed that issue and released versions with the fix. If I'm retracing Git logs correctly, the fix was in 0.8.0, so it was included (among many others) in Ubuntu 22.04's ZFS 2.1.5 (what our fileservers are currently running) and Ubuntu 24.04's ZFS 2.2.2 (what our new fileservers will run).

When we upgraded the fileservers from 18.04 to 22.04, did we go back to change our special system for generating NFS export entries to allow NFS clients to access ZFS snapshots? You already know the answer to that. We did not, because we had completely forgotten about it. Nor did we go back to do it as we were preparing the 24.04 setup of our ZFS fileservers. It was only today that it came up, as we were dealing with restoring a file from those ZFS snapshots. Since it's come up, we're probably going to test the change and then do it for our future 24.04 fileservers, since it will make things a bit more convenient for some people.

(The good news is that I left comments to myself in one program about why we weren't using the relevant NFS export option, so I could tell for sure that it was this long since fixed bug that had caused us to leave it out.)

It's a trite observation that there's nothing so permanent as a temporary solution, but just because it's trite doesn't mean that it's wrong. A temporary workaround that code comments say we thought we might revert later in the life of our 18.04 fileservers has lasted about six years, despite being unnecessary since no later than when our fileservers moved to Ubuntu 22.04 (admittedly, this wasn't all that long ago).

One moral I take from this is that if I want us to ever remove a 'temporary' workaround, I need to somehow explicitly schedule us reconsidering the workaround. If we don't explicitly schedule things, we probably won't remember (unless it's something sufficiently painful that it keeps poking us until we can get rid of it). The purpose of the schedule isn't necessarily to make us do the thing, it's to remind us that the thing exists and maybe it shouldn't.

(As a corollary, the schedule entry should include pointers to a lot of detail, because when it goes off in a year or two we won't really remember what it's talking about. That's why we have to schedule a reminder.)

Traceroute, firewalls, and the modern Internet: a horrible realization

By: cks
15 August 2024 at 03:11

The venerable traceroute command sort of reports the hops your packets take to reach a host, and in the process can reveal where your packets are getting dropped or diverted. The traditional default way that traceroute works is by sending UDP packets to a series of high UDP ports with increasing IP TTLs, and seeing where each reply comes from. If the TTL runs out on the way, traceroute gets one reply; if the packet reaches the host, traceroute gets another one (assuming that nothing is listening on the particular UDP port on the host, which usually it isn't). Most versions of traceroute can also use ICMP based probes, while some of them can also use TCP based ones.

While writing my entry on using traceroute with a fixed target port, I had a horrible realization: traceroute's UDP probes mostly won't make it through firewalls. Traceroute's UDP probes are made to a series of high UDP ports (often starting at port 33434 and counting up). Most firewalls are set to block unsolicited incoming UDP traffic by default; you normally specifically configure them to pass only some UDP traffic through to limited ports (such as port 53 for DNS queries to your DNS servers). When traceroute's UDP packets, sent to effectively random high ports, arrive at such a firewall, the firewall will discard or reject them and your traceroute will go no further.

(If you're extremely confident no one will ever run something that listens on the UDP port range, you can make your firewall friendly to traceroute by allowing through UDP ports 33434 to 33498 or so. But I wouldn't want to take that risk.)

The best way around this is probably to use ICMP for traceroute (using a fixed UDP port is more variable and not always possible). Most Unix traceroute implementations support '-I' to do this.

This matters in two situations. First, if you're asking outside people to run traceroutes to your machines and send you the results, and you have a firewall; without having them use ICMP, their traceroutes will all look like they fail to reach your machines (although you may be able to tell whether or not their packets reach your firewall). Second, if you're running traceroute against some outside machine that is (probably) behind a firewall, especially if the firewall isn't directly in front of it. In that case, your traceroute will always stop at or just before the firewall.

A note to myself about using traceroute to check for port reachability

By: cks
15 August 2024 at 03:08

Once upon a time, the Internet was a simple place; if you could ping some remote IP, you could probably reach it with anything. The Internet is no longer such a simple place, or rather I should say that various people's networks no longer are. These days there are a profusion of firewalls, IDS/IDR/IPS systems, and so on out there in the world, and some of them may decide to block access only to specific ports (and only some of the time). In this much more complicated world, you can want to check not just whether a machine is responding to pings, but if a machine responds to a specific port and if it doesn't, where your traffic stops.

The general question of 'where does your traffic stop' is mostly answered by the venerable traceroute. If you think there's some sort of general block, you traceroute to the target and then blame whatever is just beyond the last reported hop (assuming that you can traceroute to another IP at the same destination to determine this). I knew that traceroute normally works by sending UDP packets to 'random' ports (with manipulated (IP) TTLs, and the ports are not actually picked randomly) and then looking at what comes back, and I superstitiously remembered that you could fix the target port with the '-p' argument. This is, it turns out, not actually correct (and these days that matters).

There are several common versions of (Unix) traceroute out there; Linux, FreeBSD, and OpenBSD all use somewhat different versions. In all of them, what '-p port' actually does by itself is set the starting port, which is then incremented by one for each additional hop. So if you do 'traceroute -p 53 target', only the first hop will be probed with a UDP packet to port 53.

In Linux traceroute, you get a fixed UDP port by using the additional argument '-U'; -U by itself defaults to using port 53. Linux traceroute can also do TCP traceroutes with -T, and when you do TCP traceroutes the port is always fixed.

In OpenBSD traceroute, as far as I can see you just can't get a fixed UDP port. OpenBSD traceroute also doesn't do TCP traceroutes. On today's Internet, this is actually a potentially significant limitation, so I suspect that you most often want to try ICMP probes ('traceroute -I').

In FreeBSD traceroute, you get a fixed UDP port by turning on 'firewall evasion mode' with the '-e' argument. FreeBSD traceroute sort of supports a TCP traceroute with '-P tcp', but as the manual page says you need to see the BUGS section; it's going to be most useful if you believe your packets are getting filtered well before their destination. Using the TCP mode doesn't automatically turn on fixed port numbers, so in practice you probably want to use, for example, 'traceroute -P tcp -e -p 22 <host>' (with the port number depending on what you care about).
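
To summarize the variations as a cheat sheet (the port numbers are just examples):

# Linux: fixed UDP port, or TCP probes (TCP always uses a fixed port)
traceroute -U -p 53 <host>
traceroute -T -p 22 <host>
# FreeBSD: 'firewall evasion mode' fixes the UDP port; TCP needs -P tcp
traceroute -e -p 53 <host>
traceroute -P tcp -e -p 22 <host>
# OpenBSD: no fixed UDP port and no TCP mode, so fall back to ICMP
traceroute -I <host>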

Having written all of this down, hopefully I will remember it for the next time it comes up (or I can look it up here, to save me reading through manual pages).

Some thoughts on OpenSSH 9.8's PerSourcePenalties feature

By: cks
14 August 2024 at 03:06

One of the features added in OpenSSH 9.8 is a new SSH server security feature to slow down certain sorts of attacks. To quote the release notes:

[T]he server will now block client addresses that repeatedly fail authentication, repeatedly connect without ever completing authentication or that crash the server. [...]

This is the PerSourcePenalties configuration setting and its defaults, and also see PerSourcePenaltyExemptList and PerSourceNetBlockSize. OpenSSH 9.8 isn't yet in anything we can use at work, but it will be in the next OpenBSD release (and then I'll get it on Fedora).
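
We haven't run OpenSSH 9.8 yet, but based on my reading of the sshd_config(5) manual page, tuning this would look something like the following sketch (the non-default durations and the exempt netblock are made up for illustration):

# per-cause penalty durations, plus the minimum accumulation before blocking
PerSourcePenalties authfail:5s noauth:1s min:15s max:10m
# never penalize our own monitoring and management networks
PerSourcePenaltyExemptList 192.0.2.0/24
# aggregate penalties per /32 for IPv4 and /128 for IPv6 (the defaults)
PerSourceNetBlockSize 32:128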

On the one hand, this new option is exciting to me because for the first time it lets us block only rapidly repeating SSH sources that fail to authenticate, as opposed to rapidly repeating SSH sources that are successfully logging in to do a whole succession of tiny little commands. Right now our perimeter firewall is blind to whether a brief SSH connection was successful or not, so all it can do is block on total volume, and this means we need to be conservative in its settings. This is a single machine block (instead of the global block our perimeter firewall can do), but a lot of SSH attackers do seem to target single machines with their attacks (for a single external source IP, at least).

(It's also going to be a standard OpenSSH feature that won't require any configuration, firewall or otherwise, and will slow down rapid attackers.)

On the other hand, this is potentially an issue for anything that makes health checks like 'is this machine responding with a SSH banner' (used in our Prometheus setup) or 'does this machine have the SSH host key we expect' (used in our NFS mount authentication system). Both of these cases will stop before authentication and so fall into the 'noauth' category of PerSourcePenalties. The good news is that the default refusal duration for this penalty is only one second, which is short enough that you're probably not going to run into it in health checks. The exception is if you're trying to verify multiple types of SSH host keys for a server, because you can only verify one host key in a given connection, so if you need to verify both a RSA host key and an Ed25519 host key, you need two connections.

(Even then, the OpenSSH 9.8 default is that you only get blocked once you've built up 15 seconds of penalties. At the default settings, this would be hard with even repeated host key checks, unless the server has multiple IPs and you're checking all of them.)

It's going to be interesting to read practical experience reports with this feature as OpenSSH 9.8 rolls out to more and more people. And on that note I'm certainly going to wait for people's reports before doing things like increasing the 'authfail' penalty duration, as tempting as it is right now (especially since it's not clear from the current documentation how unenforced penalty times accumulate).

Uncertainties and issues in using IPMI temperature data

By: cks
13 August 2024 at 03:24

In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views about how useful it is in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).

The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.

(Out of the 85 of our servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its inlet ambient is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)

The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.

(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)

We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.

(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
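
Concretely, the two ways of getting temperature readings look roughly like this (the network version's host and password handling are placeholders; the 'reboot' user is the historical quirk mentioned above):

# locally, through the host OS (needs the IPMI kernel drivers loaded)
ipmitool sdr type Temperature
# over the network, directly from the BMC
ipmitool -I lanplus -H <bmc-host> -U reboot -f <password-file> sdr type Temperature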

Allocating disk space (and all resources) is ultimately a political decision

By: cks
11 August 2024 at 02:52

In a multi-person or multi-group environment with shared resources, like a common set of fileservers, you often need to allocate resources like disk space between different uses. There are many different technical ways to do this, and you can also often avoid explicitly allocating at all by just shoving everyone into one big pile. Sometimes, you might be tempted to debate the technical merits of any particular approach, and while the technical merits of different ways potentially matter, in the end resource allocation is a political decision (although what is technically possible or feasible does limit the political options).

(Note that not specifically allocating resources is also a political decision; it is the decision to let resources like disk space be allocated on a first come, first served basis instead of anything else.)

In general, "political" is not a bad word. Politics, in the large, is about mediating social disagreements and, hopefully, making people feel okay about the results. Allocating limited resources is an area where there is no perfect answer and any answer that you choose will have unsatisfactory aspects. Weighing those tradeoffs and choosing a set of them is a (hard) social problem, which must be dealt with through a political decision.

Because resource allocation is a political decision, the specific decisions reached in your organization may well constrain your technical choices and, for example, complicate a storage migration (because you've chosen to allocate disk space in a specific way). Over the course of my career, I've come to understand that this isn't bad as such; it's just that social problems are more important and higher level than technical ones. It's more important to solve the social problems than it is to have an ideal technical world, because ultimately the technology exists to help the people.

One aspect of constraining your technical choices is that you may wind up not doing perfectly sensible and useful technical things because they go against the political decisions and goals around resource allocation. These decisions aren't irrational or wrong, exactly, although they can be hard to explain without explaining the political background.

(This doesn't mean that every design or operations decision that affects resource allocation has to be made at the political level in your organization, and in fact they generally can't be; you have to make some of them, even if it's to not specifically allocate resources and let them be used on a first come, first serve basis (or an 'everyone gets whatever portion they can right now'). But even if you make the decision and do so based on technical factors, it's best to remember that you're making a decision with political effects, and perhaps to think about who will be affected and how.)

PS: This aspect of why things work as they do being hard to explain isn't confined to technology; there are aspects of how the recreational bike club I'm part of operates that people have sometimes asked me about (sometimes in the form of 'why doesn't the club do <sensible seeming thing X>') and I've found hard to explain, especially concisely. Part of the answer is that the club has made a social ('political') decision to operate in a certain way.

A surprise with the temperature distribution in our machine room

By: cks
4 August 2024 at 02:37

Our primary machine room is quite old and is set up in an old fashioned way, so that we don't really have separate 'hot aisles' and 'cold aisles'; the closest we come is one aisle where both sides are the front of servers. We have some long standing temperature monitoring in this machine room, and recently (for reasons outside the scope of this entry) we put a second (trustworthy) temperature monitoring unit into the room. The first temperature sensor is relatively near the room's AC unit, while the second unit is about as far away from it as you can get (by our rack of fileservers, not entirely coincidentally).

Before we set up the second temperature unit and started to get readings from it, I would have confidently predicted that it would report a higher temperature than the first unit, given that it was all the way diagonally across the room from the AC unit, and that row of racks sort of backs on to one of the room's walls (with space left for access and air circulation). Instead, it consistently reads lower than the first unit; how much lower depends on where the room is in the AC's cycle, because the second unit sees lower temperature swings than the first one.

(At their farthest apart, the two readings can be over 2 degrees Celsius different; at their closest, they can be only 0.2 C apart. Generally they're closest when the AC is on and the room temperature is at its coolest, and furthest apart when the room is at its warmest and the AC is about to come up for another cycle. Our temperature graphs also suggest that the cold air from the AC being on takes a bit longer to reach the far unit than the near unit.)

Temperature sensors can be fickle things, but this is an industrial unit with a good reputation (and an external sensor on a wire), so I believe the absolute numbers shown by its readings. So one of the lessons I take from this is that I can't predict the temperature distributions of our machine room (or more generally, any of our machine rooms and wiring closets). If we ever need to know where the hot and cold spots are, we can't guess based on factors like the distance from the AC units; we'll need to actively measure with something appropriate.

(I'm not sure what we'd use for relatively rapid temperature readings of the local ambient air temperature, but there are probably things that can be used for this.)

On not automatically reconnecting to IPMI Serial-over-LAN consoles

By: cks
31 July 2024 at 02:25

One of the things that the IPMI (network) protocol supports is Serial over LAN, which can be used to expose a server's serial console over your BMC's management network. These days, servers are starting to drop physical serial ports, making IPMI SOL your only way of getting console serial ports. The conserver serial console management software supports IPMI SOL (if built with the appropriate libraries), and you can directly access SOL serial consoles with IPMI programs. However, as I mentioned in passing in yesterday's entry, IPMI SOL access has a potential problem, which is that only one SOL connection is allowed at a time and if someone makes a new SOL connection, any old one is automatically disconnected. This disconnection is invisible to the IPMI SOL client until (and unless) it attempts to send something to the SOL console, at which point it apparently gets a timeout. This is bad for a program like conserver, which in many situations will only read SOL console output in order to log it, not send any input to the SOL console.

(This BMC behavior may not be universal, based on some comments in FreeIPMI.)

Conserver uses FreeIPMI for IPMI SOL access, which supports a special 'serial keepalive' option (which you can configure in libipmiconsole.conf) to detect and remedy this. As covered in comments in ipmiconsole.h, this option (normally) works by periodically sending a NUL character to the SOL console, which will make the BMC eventually tell you that the serial connection has been broken, at which point you re-create your IPMI SOL session so that you get serial output again.

When I first read about this option I was enthused about putting it into our configuration, so that conserver would automatically re-establish stolen SOL connections. Then I thought about it a bit more and decided that this probably wasn't a good idea. The problem is that there's no way to tell if another IPMI SOL session is active at the moment or not (at least with this option); all we can do is unconditionally take the SOL console back. If one of us has made a SOL connection, done some stuff, and disconnected again, this is fine. If one of us is in the process of using a live SOL connection right now, this is bad.

This is especially so because about the only time when we'd resort to using a direct IPMI SOL connection instead of logging in to the console server and using conserver is when either we can't get to the console server or the console server can't get to the BMC of the machine we want to connect to. These are stressful situations when something is already wrong, so the last thing we want is to compound our problems by having a serial console connection stolen in the middle of our work.

Not configuring FreeIPMI with serial keepalives doesn't completely eliminate this problem; it could still happen if the console server machine is (re)booted or conserver is restarted. Both of these will cause conserver to start up, make a bunch of IPMI SOL connections, and steal any current by-hand SOL connections away from us. But at least it's less likely.

Handling (or not) the serial console of our serial console server

By: cks
30 July 2024 at 02:29

We've had a central serial console server for a long time. It has two purposes; it logs all of the (serial) console output from servers and various other pieces of hardware (which on Linux machines includes things like kernel messages, cf), and it allows us to log in to machines over their serial console. For a long time this server was a hand built one-off machine, but recently we've been rebuilding it on our standard Ubuntu framework (much like our central syslog server). Our standard Ubuntu framework includes setting up a (kernel) serial console, which made me ask myself what we were going to do with the console server's serial console.

We have a matrix of options. We can direct the serial console to either a physical serial port or to the BMC's Serial over LAN system. Once the serial console is somewhere, we can ignore it except when we want to manually use it, connect it to the console server's regular conserver instance, or connect it to a new conserver instance on some other machine (which would have to be using either IPMI Serial-over-LAN or a USB serial port, depending on which serial console we pick).

Connecting the console server's serial console to its own conserver instance would let us log routine serial console output in the same place that we put all of the other serial console output. However, it wouldn't let us capture kernel logs if the machine crashed for some reason (one valuable thing that our current serial console setup gives us), or log in through the serial console if the console server fell off the network. Setting up a backup, single-host conserver on another machine would allow us to do both, at the cost of having a second conserver machine to think about.

Using Serial-over-LAN would allow us to log in to the console server over its serial console from any other machine that had access to what has become our IPMI/BMC network, which is a number of them (it's that way for emergency access purposes). However it requires that the BMC network be up, which is to say that all of the relevant switches are working. A direct (USB) serial connection would only require the other machine to be up and reachable.

Of course we can split the difference. We could have the Linux kernel serial console on the physical serial port and also have logins enabled on the Serial-over-LAN serial port. In a lot of situations this would still give us remote access to the console server, although we wouldn't be able to trigger things like Magic SysRq over the SoL connection since it's not a kernel console.

(Unfortunately you can only have one kernel serial console.)
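
A sketch of this 'split the difference' setup, assuming the physical serial port is ttyS0 and the Serial-over-LAN port shows up as ttyS1 (which varies from server to server):

# kernel command line: kernel console messages go to the physical port
#   console=tty0 console=ttyS0,115200n8
# plus a login getty on the Serial-over-LAN port
systemctl enable --now serial-getty@ttyS1.service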

My current view is that the easiest thing to start with is to set the serial console to the Serial-over-LAN port and then not have anything collecting kernel messages from it. If we decide we want to change that, we can harvest SoL serial console messages from either the console server itself or from another machine. In an emergency, a SoL port can be accessed from any machine with BMC network access, not just from its conserver machine, unlike a physical serial port (which would have to be accessed from the other machine connected to it).

(In our current conserver setup, you don't really want to access the SoL port from another machine if you can avoid it. Doing so will quietly break the connection from conserver on the console server until you restart conserver. It's possible we could work around this with libipmiconsole.conf settings.)

Our slowly growing Unix monoculture

By: cks
29 July 2024 at 02:53

Once upon a time, we ran Ubuntu Linux machines, OpenBSD machines, x86 Solaris machines, and machines running what was then RHEL (in the days of our first generation ZFS fileservers). Over time, Solaris changed to OmniOS (and RHEL to CentOS), but even at the time it was clear that both of those hadn't caught on here and after a while we replaced the OmniOS fileservers and CentOS iSCSI backends with our third generation Ubuntu-based fileservers. Then recently, the final pieces of CentOS have been getting removed, such as our central syslog servers because CentOS as it originally was is dead (the current 'CentOS Stream' doesn't meet our needs).

Our OpenBSD usage has also been dwindling. Originally we used OpenBSD for firewalls, most DNS service, a DHCP server, and several VPN servers (for different VPN protocols). Our internal DNS resolvers now run Bind on Ubuntu and we've been expecting to some day have to move our VPN servers away from OpenBSD in order to get more up to date versions of the various VPNs (although this hasn't happened yet). The OpenBSD DHCP server is fine so far, but we have three DHCP servers and two of them are Ubuntu machines, so I wouldn't be surprised if we switch the third to Ubuntu as well when we next rebuild it.

(There's basically no prospect of us switching away from OpenBSD on the firewalls, but the firewalls are effectively appliances.)

It's probably been plural decades since our users logged in to anything other than x86 Ubuntu machines, and at least a decade since any of them were 32-bit x86 instead of 64-bit x86. It seems unlikely that we'll get ARM-based machines, especially ones that we expose to people to log in to and use. I expect we'll have to switch away from Ubuntu someday, but that will be a switch, not a long term plan of running Ubuntu as well as something else, and the most likely candidate (Debian) won't look particularly different to most people.

The old multi-Unix, multi-architecture days had their significant drawbacks, but sometimes I wonder what we're losing by increasingly becoming a monoculture that runs Ubuntu Linux and (almost) nothing else. I feel that as system administrators, there's something we gain by having exposure to different Unixes that make different choices and have different tools than Ubuntu Linux. To put it one way, I think we get a wider perspective and wind up with more ideas and approaches in our mental toolkit. We have that today because of our history, so hopefully it won't atrophy too badly when we really narrow down to being a monoculture.

How I almost set up a recursive syslog server

By: cks
26 July 2024 at 02:48

Over on the Fediverse, I mentioned an experience I had today:

Today I experienced that when you tell a syslog server to forward syslog to another server, it forwards everything. Including anything it was sent by other servers. And to confuse you, those forwarded messages will often be logged with the original host names, so you can wonder what these weird servers are that are sending you unexpected traffic.

At least I caught this before we had the central syslog server forward to itself. That probably would have been funβ„’.

You might wonder how on earth you do this to yourself without noticing, and the answer is the (dangerous) power of standardized installs.

We've had a central syslog server for a long time, along with another syslog server that we run for machines run by Points of Contact that are on internal sandbox networks. For much of this time, these syslog servers have been completely custom-installed machines; for example, they ran RHEL and then CentOS when we'd switched to Ubuntu for the rest of our machines. The current hardware and OS setup on these machines has been aging, so we've been working on replacing them. This time around, rather than doing a custom install, we decided to make these machines one variant of our standard Ubuntu install, supplemented by a small per-machine customization process. There are some potential downsides to this, since the machines have somewhat less security isolation, but we felt the advantages were worth it (for example, now they'll be part of our standard update system).

Part of our standard Ubuntu install configures the installed machine's syslog daemon to forward a copy of all syslog messages to our central syslog server; specifically this is part of the standard scripts that are run on a machine to give it our general baseline setup. This is standard and so basically invisible, so I didn't think of this syslog forwarding when putting together the post-install customization instructions for these syslog servers. Fortunately, the first syslog server we rebuilt and put into production was the additional syslog server for other people's logs, not the central server for our own logs. It was fortunate that today I had a reason to look at one set of logs on our central syslog server that had low enough log volume that I could spot out of place entries immediately, and then start trying to track them down.
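
With rsyslog (what Ubuntu uses), the difference is roughly the following hypothetical sketch (the central server name is made up, and this isn't our actual configuration):

# what our standard install does: forward everything, including anything
# this machine received in its role as a syslog server
*.*  @@central-syslog.example.org
# what a syslog server that also forwards should probably do instead:
# only forward locally generated messages (rsyslog sets $fromhost-ip to
# 127.0.0.1 for messages from the local socket)
if $fromhost-ip == '127.0.0.1' then @@central-syslog.example.org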

This sort of thing is fairly closely related to the general large environment issue where you have recursive dependencies or recursive relationships between services, often without realizing it. You can even get direct self-dependencies, for example if you don't remember to change your DHCP server away from getting its network configuration by DHCP, although in that sort of case you're probably going to notice the first time you reboot the machine in production (assuming you don't have redundant DHCP servers; if you do, you might not find this out until you're cold-starting your entire environment).

(Some self-usage is harmless and even a good thing. For example, you probably want your internal DNS resolvers to do any necessary DNS lookups through themselves, instead of trying to find some other DNS resolver for them.)

Our giant login server: solving resource problems with brute force

By: cks
22 July 2024 at 03:05

One of the moderately peculiar aspects of our environment is that we still have general Unix multiuser systems that people with accounts can log in to and do stuff on. As part of this we have some general purpose login servers, and in particular we have one that's always been the most popular, partly because it was what you got when you did 'ssh cs.toronto.edu'. For years and years we had a succession of load and usage issues on this server, where someone would log in and start doing something that was CPU or memory intensive, hammering the machine for everyone on it (which was generally a lot of people, and so this could be pretty visible). We spent a non-trivial amount of time keeping an eye on the machine's load, sending email to people, terminating people's heavy-duty processes, and in a few cases having to block logins from specific people until they paid attention to their email.

Then a few years ago we had a chunk of spare money and decided to spend it on getting rid of the problem once and for all. We did this by buying a ridiculously overpowered server to become the new version of our primary login server, with 512 GB of RAM and 112 CPUs (AMD Epyc 7453s); in fact we bought two at once and put the other one into our SLURM cluster, where it was at the time one of the most powerful compute machines there (back in 2022).

By itself this wouldn't be sufficient to protect us from having to care about what people were doing on the machine, because (some) modern software can eat any amount of CPUs and RAM that's available (due to things like auto-sizing how many things it does in parallel based on the available CPU count). So we set up per-user CPU and memory resource limits for all users. Because this server is so big, we can actually give people quite large limits; our current settings are 30 GBytes of RAM and 8 CPUs, which is effectively a reasonable desktop machine (we figure people can't really complain at that point).

(In completely unsurprising news, people do manage to run into the memory limit from time to time and have their giant processes killed.)
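
One way to implement this sort of per-user limit on a modern Ubuntu machine is a systemd drop-in that applies to every user slice; this is only a sketch of the general idea (with the numbers from above), not necessarily how we actually do it:

# /etc/systemd/system/user-.slice.d/90-limits.conf
[Slice]
MemoryMax=30G
# 800% is eight CPUs worth of CPU time
CPUQuota=800%

(You need a 'systemctl daemon-reload' afterward, and the limits apply per user, not per login session.)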

These limits don't completely guarantee avoiding problems, since enough different people doing enough at once could still overload the machine. But this hasn't happened yet, so in practice we've been able to basically stop caring about what people run on our primary login server, and with it we've stopped watching things like its load average and free memory. For people using our primary login server, the benefit is that they can do a lot more than they could before without problems and they don't get affected by what other people are doing.

My home wireless network and convenience versus security

By: cks
21 July 2024 at 02:46

The (more) secure way to do a home wireless network (or networks) is relatively clear. Your wireless network (or networks) should exist on its own network segment, generally cut off from any wired networking you have and definitely cut off from direct access to your means of Internet connectivity. To get out of the network it should always have to go through a secure gateway that firewalls your home infrastructure from the random wireless devices you have to give wifi access to and their random traffic. One of the things that this implies is that you should implement your wireless with a dedicated wireless access point, not with the wifi capabilities of some all in one device.

When I set up my wireless network, I didn't do it this way, and I've kept not doing it this way ever since. My internet connection uses VDSL and when I upgraded to VDSL you couldn't get things that were just a 'VDSL modem'; the best you could do was an all in one router that could have the router bit turned off. My VDSL 'modem' also could be a wifi AP, so when I wanted a wireless network all of a sudden I just turned that on and then set up my home desktop to be a DHCP server, NAT gateway, and so on. This put wifi clients on the same network segment as the VDSL modem, and in fact I lazily used the same subnet rather than running two subnets over the same physical network segment.

(Because all Internet access runs through my desktop, there's always been some security there. I only NAT'd specific IPs that I'd configured, not anything that happened to randomly show up on the network.)

Every so often since then I've thought about changing this situation. I could get a dedicated wifi AP (and it might well have better performance and reach more areas than the current VDSL modem AP does; the VDSL modem doesn't even have an external wifi antenna), and add another network interface to my desktop to segment wifi traffic to the new wifi AP network. It would get its own subnet and client devices wouldn't be able to talk directly to the VDSL modem or potentially snoop (PPPoE) traffic between my desktop and the VDSL modem.

However, much as with other tradeoffs of security versus convenience, in practice I've come down on the side of convenience. Even though it's a bit messy and not as secure as it could be, my current setup works well enough and hasn't caused problems. By sticking with the current situation, I avoid the annoyance of trying to find and buy a decent wifi AP, reorganizing things physically, changing various system configurations, and so on.

(This also avoids adding another little device I'd want to keep powered from my UPS during a power outage. I'm always going to power the VDSL modem, and I'd want to power the wifi AP too because otherwise things like my phone stop being able to use my local Internet connection and have to fall back to potentially congested or unavailable cellular signal.)

SSH has become our universal (Unix) external access protocol

By: cks
18 July 2024 at 02:59

When I noted that brute force attackers seem to go away rapidly if you block them, one reaction was to suggest that SSH shouldn't be exposed to the Internet. While this is viable in some places and arguably broadly sensible (since SSH has a large attack surface, as we've seen recently in CVE-2024-6387), it's not possible for us. Here at a university, SSH has become our universal external access protocol.

One of the peculiarities of universities is that people travel widely, and during that travel they need access to our systems so they can continue working. In general there are a lot of ways to give people external access to things; you can set up VPN servers, you can arrange WireGuard peer to peer connections, and so on. Unfortunately, two issues often surface: our people have widely assorted devices that they want to work from, with widely varying capabilities and ease of using VPN and VPN like things, and their (remote) network environments may or may not like any particular VPN protocol (and they probably don't want to route their entire Internet traffic the long way around through us).

The biggest advantage of SSH is that pretty much everything can do SSH, especially because it's already a requirement for working with our Unix systems when you're on campus and connecting from within the department's networks; this is not necessarily so true of the zoo of different VPN options out there. Because SSH is so pervasive, it's also become a lowest common denominator remote access protocol, one that almost everyone allows people to use to talk to other places. There are a few places where you can't use SSH, but most of them are going to block VPNs too.

In most organizations, even if you use SSH (and IMAP, our other universal external access protocol), you're probably operating with a lot less travel and external access in general, and hopefully a rather more controlled set of client setups. In such an environment you can centralize on a single VPN that works on all of your supported client setups (and meets your security requirements), and then tell everyone that if they need to SSH to something, first they bring up their VPN connection. There's no need to expose SSH to the world, or even let the world know about the existence of specific servers.

(And in a personal environment, the answer today is probably WireGuard, since there are WireGuard clients on most modern things and it's simple enough to only expose SSH on your machines over WireGuard. WireGuard has less exposed attack surface and doesn't suffer from the sort of brute force attacks that SSH does.)

Brute force attackers seem to switch targets rapidly if you block them

By: cks
12 July 2024 at 02:27

Like everyone else, we have a constant stream of attackers trying brute force password guessing against us using SSH or authenticated SMTP, from a variety of source IPs. Some of the source IPs attack us at a low rate (although there can be bursts when a lot of them are trying), but some of them do so at a relatively high rate, high enough to be annoying. When I notice such IPs (ones making hundreds of attempts an hour, for example), I tend to put them in our firewall blocks. After recently starting to pay attention to what happens next, what I've discovered is that at least currently, most such high volume IPs give up almost immediately. Within a few minutes of being blocked their activity typically drops to nothing.

Once I thought about it, this behavior feels like an obvious thing for attackers to do. Attackers clearly have a roster of hosts they've obtained access to and a whole collection of target machines to try brute force attacks against, with very low expectations of success for any particular attack or target machine; to make up for the low success rate, they need to do as much as possible. Wasting resources on unresponsive machines cuts down the number of useful attacks they can make, so over time attackers have likely had a lot of motivation to move on rapidly when their target stops responding. If the target machine comes back some day, well, they have a big list, they'll get around to trying it again sometime.

The useful thing about this attacker behavior is that if attackers are going to entirely stop using an IP to attack you (at least for a reasonable amount of time) within a few minutes of it being blocked, you only need to block attacker IPs for those few minutes. After five or ten or twenty minutes, you can remove the IP block again. Since the attackers use a lot of IPs and their IPs may get reused later for innocent purposes, this is useful for keeping the size of firewall blocks down and limiting the potential impact of a mis-block.

(A traditional problem with putting IPs in your firewall blocks is that often you don't have a procedure to re-assess them periodically and remove them again. So once you block an IP, it can remain blocked for years, even after it gets turned over to someone completely different. This is especially the case with cloud provider IPs, which are both commonly used for attacks and then commonly turn over. Fast and essentially automated expiry helps a lot here.)
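To illustrate the sort of automatic expiry I mean, here's a minimal sketch using nftables timed sets (this isn't what our firewalls actually run; the table name, set name, and IP are made up for the example):

table inet filter {
  set brute_force {
    type ipv4_addr
    flags timeout
  }
  chain input {
    type filter hook input priority 0; policy accept;
    # drop SSH and authenticated SMTP traffic from currently blocked IPs
    ip saddr @brute_force tcp dport { 22, 587 } drop
  }
}

Blocking an attacking IP for twenty minutes is then a single command, and the entry quietly disappears on its own afterward:

nft add element inet filter brute_force '{ 192.0.2.10 timeout 20m }'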

"Out of band" network management is not trivial

By: cks
7 July 2024 at 02:24

One of the recent Canadian news items is that a summary of the official report on the 2022 Rogers Internet and phone outage has been released (see also the CBC summary of the summary, and the Wikipedia page on the outage). This was an extremely major outage that took down both Internet and phone service for a lot of people for roughly a day and caused a series of failures in services and systems that turned out to rely on Rogers for (enough of) their phone and Internet connectivity. In the wake of the report, some people are (correctly) pointing to Rogers not having any "Out of Band" network management capability as one of the major contributing factors. Some people have gone so far as to suggest that out of band network management is an obvious thing that everyone should have. As it happens I have some opinions on this, and the capsule summary is that out of band network management is non-trivial.

(While the outage 'only' cut off an estimated 12 million people, the total population of Canada is about 40 million people, so it directly affected more than one in four Canadians.)

Obviously, doing out of band network management means that you need a dedicated set of physical hardware for your OOB network; separate switches, routers, local network cabling, and long distance fiber runs between locations (whether that is nearby university buildings or different cities). If you're serious, you probably want your OOB fiber runs to have different physical paths than your regular network fiber, so one backhoe accident can't cut both of them. This separate network infrastructure has to run to everything you want to manage and also to everywhere you want to manage your network from. This is potentially a lot of physical hardware and networking, and as they say it can get worse.

(This out of band network also absolutely has to be secure, because it's a back door to your entire network.)

When you set up OOB network management, you have a choice to make: is your OOB network the only way to manage equipment, or can equipment be managed either 'in-band' through your regular network or through the out of band network? If your OOB network is your only way of managing things, you not only have to build a separate network, you have to make sure it is fully redundant, because otherwise you've created a single point of failure for (some) management. If your OOB network is a backup, you don't necessarily need as much redundancy (although you may want some), but now you need to actively monitor and verify that both access paths work. You also have two access paths to keep secure, instead of just one.

Security, or rather access authentication, is another complication for out of band management networks. If you need your OOB network, you have to assume that all other networks aren't working, which means that everything your network routers, switches, and so on need to authenticate your access has to be accessible through the OOB management network (possibly in addition to through your regular networks, if you also have in-band management). This may not be trivial to arrange, depending on what sort of authentication system you're using. You also need to make sure that your overall authentication flow can complete using only OOB network information and services (so, for example, your authentication server can't reach out to a third party provider's MFA service to send push notifications to authentication apps on people's phones).

Locally, we have what I would describe as a discount out of band management network. It has a completely separate set of switches, cabling, and building to building fiber runs, and some things have their management interfaces on it. It doesn't have any redundancy, which is acceptable in our particular environment. Unfortunately, because it's a completely isolated network, it can be a bit awkward to use, especially if you want to put a device on it that would appreciate modern conveniences like the ability to send alert emails if something happens (or even send syslog messages to a remote server; currently our central syslog server isn't on this network, although we should probably fix that).

In many cases I think you're better off having redundant and hardened in-band management, especially with smaller networks. Running an out of band network is effectively having two separate networks to look after instead of just one; if you have limited resources (including time and attention), I think you're further ahead focusing on making a single network solid and redundant rather than splitting your efforts.

Structured log formats are not really "plaintext" logs

By: cks
5 July 2024 at 02:25

As sort of a follow on to how plaintext is not a great format for logs, I said something on the Fediverse:

A hill that I will at least fight on is that text based structured log formats are not 'plain text logs' as people understand them, unless perhaps you have very little metadata attached to your log messages and don't adopt one of the unambiguous encoding formats. Sure you can read them with 'less', sort of, but not really well (much less skim them rapidly).

"Plaintext" logs are a different thing than log formats that are stored using only printable and theoretically readable text. JSON is printable text, but if you dump a sequence of JSON objects into a file and call it a 'plaintext log', I think everyone will disagree with you. For system administrators, a "plaintext log" is something that we can readily view and follow using basic Unix text tools. If we can't really read through log messages with 'less' or follow the log file live with 'tail -f' or similar things, you don't have a plaintext log, you have a text encoded log.

Unfortunately, structured log formats may produce text output but often not plaintext output. Consider, for example:

ts=<...> caller=main.go:190 module=dns_amazonca target=8.8.8.8:53 level=info msg="Beginning probe" probe=dns timeout_seconds=30
ts=<...> caller=dns.go:200 module=dns_amazonca target=8.8.8.8:53 level=info msg="Resolving target address" target=8.8.8.8 ip_protocol=ip4
[...]
ts=<...> caller=dns.go:302 module=dns_amazonca target=8.8.8.8:53 level=info msg="Validating RR" rr="amazon.ca.\t17\tIN\tA\t54.239.18.172"

This is all text. You can sort of read it (especially since I've left out the relatively large timestamps). But trying to read through all of these messages with 'less' at any volume would be painful, especially if you care about the specific values of those 'rr=' things, which you're going to have to mentally decode to see through the '\t's (and other characters that may be quoted in strings).

There are text structured log formats that are somewhat better than this, for example ones that put a series of metadata labels and their values at the front then end the log line with the main log message. At least there you can look at the end of the line in things like 'tail' and 'less' to see the message, although it may not be in a consistent column. But the more labels there are, the more the message text gets pushed aside.

One of the most common examples of a plaintext log format is the traditional syslog format:

Jul  1 17:58:53 HOST sshd[PID]: error: beginning MaxStartups throttling
Jul  1 17:58:53 HOST sshd[PID]: drop connection #10 from [SOMEIP]:36039 on [MYIP]:22 past MaxStartups

This is almost entirely the message with relatively little metadata (and a minimal timestamp that doesn't even include the year). This is what you need to maximize human readability with 'less', 'tail', and so on.

At this point people will note that the information added by structured logging is potentially important and it's useful to represent it relatively unambiguously. Some other people might ask if traditional Apache common log format, or Exim's log format, are 'plaintext logs'. My answer to both is that this illustrates why plaintext is not a great format for logs. True maximally readable plaintext logs are highly constrained and wind up leaving lots of information out or being ambiguous and hard to process or both. The more additional information you include in a clearly structured format, the more potentially useful it is but the less straightforwardly readable the result is and the less you have plaintext logs.

If you want to use a structured log format, where you sit on the spectrum between plaintext logs and JSON blobs appended to something depends on how you expect your logs to be used and consumed (and stored). If people are only ever going to consume them through special tools, you might as well go full JSON or the equivalent. If people will sometimes read your logs in raw format with 'less' or 'tail' or whatever, or your logs will be commingled with logs from other programs in random line-focused formats, you should probably choose a format that's more readable by eye, perhaps some version of logfmt.
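To illustrate the kind of decoding that even logfmt pushes you into, here's a rough sketch of pulling just the timestamp and message text out of lines like the Blackbox ones above with standard Unix tools (the file name is made up, and this will mishandle messages that contain escaped quotes, which is part of the problem):

tail -f probe.log | sed -n 's/^ts=\([^ ]*\) .*msg="\([^"]*\)".*/\1 \2/p'

Once you're doing this routinely, you're decoding a text-encoded log rather than reading a plaintext one.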

Plaintext is not a great format for (system) logs

By: cks
30 June 2024 at 02:32

Recently I saw some grumpiness on the Fediverse about systemd's journal not using 'plain text' for storing logs. I have various feelings here, but one of the probably controversial ones is that in general, plain text is not a great format for logs, especially system logs. This is independent of systemd's journal or of anything else, and in fact looking back I can see signs of this in my own experiences long before the systemd journal showed up (for instance, it's part of giving up on syslog priorities).

The core problem is that log messages themselves almost invariably come with additional metadata, often fairly rich metadata, but if you store things in plain text it's difficult to handle that metadata. You have more or less three things you can do with any particular piece of metadata:

  • You can augment the log message with the metadata in some (text) format. For example, the traditional syslog 'plain text' format augments the basic syslog message with the timestamp, the host name, the program, and possibly the process ID. The downside of this is that it makes log messages themselves harder to pick out and process; the more metadata you add, the more the log message itself becomes obscured.

    (One can see this in syslog messages from certain sorts of modern programs, which augment their log messages with a bunch of internal metadata that they put in the syslog log message as a series of 'key=value' text.)

  • You can store the metadata by implication, for example by writing log messages to separate files based on the metadata. For example, syslog is often configured to use metadata (such as the syslog facility and the log level) to control which files a log message is written to. One of the drawbacks of storing metadata by implication is that it separates out log messages, making it harder to get a global picture of what was going on. Another drawback is that it's hard to store very many different pieces of metadata this way.

  • You can discard the metadata. Once again, the traditional syslog log format is an example, because it normally discards the syslog facility and the syslog log level (unless they're stored by implication).

The more metadata you have, the worse this problem is. Perhaps unsurprisingly, modern systems can often attach rich metadata to log messages, and this metadata can be quite useful for searching and monitoring. But if you write your logs out in plain text, either you get clutter and complexity or you lose metadata.

Of course if you have standard formats for attaching metadata to log messages, you can write tools that strip or manipulate this metadata in order to give you (just) the log messages. But the more you do this and rely on it, the less your logs are really plain text instead of 'structured logs stored in a somewhat readable text format'.

(The ultimate version of this is giving up on readability in the raw and writing everything out as JSON. This is technically just text, but it's not usefully plain text.)
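As a concrete sketch of that, if your logs were JSON objects with hypothetical 'ts', 'level', and 'msg' fields, you could get something skimmable back with:

tail -f app.log | jq -r '[.ts, .level, .msg] | @tsv'

But at that point you're running a decoder over your logs instead of reading them with basic Unix text tools.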

Is blocking outgoing traffic by default a good firewall choice now?

By: cks
28 June 2024 at 02:54

A few years ago I wrote about how HTTP/3 needed us (and other people) to make firewall changes to allow outgoing UDP port 443 traffic. Recently this entry got discussed on lobste.rs, and the discussion made me think about whether our (sort of) default of blocking outgoing traffic is a good idea these days, at least in an environment where we don't know what's happening on our networks.

(If you do know exactly what should be on your networks and what it should be talking to, then blocking everything else is a solid security precaution against various sorts of surprises.)

I say that we 'sort of' block outgoing traffic by default because the composite of our firewall rules (on the firewalls for internal 'sandbox' networks and the perimeter firewall between our overall networks and the university's general network) already default to allowing a lot of things. In practice, mostly we default to blocking access to 'privileged' TCP ports; most or all UDP traffic and most TCP traffic to ports above 1023 is just allowed through. Then of course there is a variegated list of TCP ports that we just always allow through, some of them clearly mostly for historical reasons (we allow gopher (port 70) and finger (port 79), for example).

(Our general allowance for TCP ports above 1023 may have been partly due to FTP, back in the days. Our firewalls and their rules have been there for a long time.)

Historically, ports under 1024 were where interesting services hung out, and so you could block outgoing access to them both to be a good network neighbor and to stop your people from accidentally doing things like using insecure protocols across the Internet (but then, we still allow telnet). These days this logic still sort of applies, but there are a lot of unencrypted and potentially insecure protocols that are found on high TCP ports and so could be accessed fine by people here. And outgoing access to UDP based things (including HTTP/3) is surprisingly open for most of our internal networks (it varies somewhat by network).

There are definitely outgoing low TCP ports that you don't want to let people connect to; the obvious candidate is the constellation of TCP ports associated with Microsoft CIFS (aka 'Samba'). But beyond a few known candidates I'm not sure there's a strong reason to block access to low-numbered TCP ports if we're already allowing access to high ones.
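For illustration, the 'block a few known bad low ports, allow the rest' version of this is simple enough to sketch in nftables syntax (which isn't necessarily what our firewalls actually use; the interface name and subnet are placeholders):

table inet border {
  chain forward {
    type filter hook forward priority 0; policy accept;
    # block outgoing CIFS/SMB and NetBIOS from our networks
    oifname "ext0" ip saddr 198.51.100.0/24 tcp dport { 139, 445 } drop
    oifname "ext0" ip saddr 198.51.100.0/24 udp dport { 137, 138 } drop
  }
}

Real rules would also have to consider things like 135/tcp, logging, and established connections, but the general shape is a short deny list rather than a default deny.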

(Pragmatically we're probably not going to change our firewalls at this point. They work as it is and people aren't complaining. Of course we're making a little contribution to an environment where very few people bother trying to get a low numbered port assigned for their new system, because it often wouldn't do them much good. Instead they'll run it over HTTPS.)

A Prometheus Blackbox gotcha: (UDP) DNS replies have a low size limit

By: cks
23 June 2024 at 01:52

For reasons beyond the scope of this entry, we use our Prometheus setup to monitor if we can resolve certain external host names, by doing Blackbox probes to various DNS servers, both our internal resolvers and external ones. Ever since we added this check, we've had weird issues where one of our internal resolvers would periodically fail the check for one particular host name for tens of minutes. This host name involves a long chain of CNAME records and ends with some A records. According to more detailed Blackbox information, the query wasn't failing, it was just not returning all of the information, omitting the A records that we needed. We came up with all sorts of theories about why our DNS server might not be able to fully resolve the CNAME chain, but couldn't find a smoking gun or a firm fix.

Then the other day I was looking at debug output and noticed this:

[...] level=info msg="Got response" response=";; [...] \n;; flags: qr tc rd ra; [...]

(This is in a very long line that puts 'dig' style output for the entire answer in the message, and this whole collection of diagnostic log information is not normally logged as such, merely visible for a while in the Blackbox web interface.)

Did you notice that 'tc' in the flags? That's the flag that is set to indicate a DNS response that has been truncated because it doesn't fit within the size limit. This truncation is what was actually going wrong in our DNS check. This particular DNS name has a chain of CNAMEs, and the providers involved change the CNAMEs relatively rapidly, and some of the time the CNAMEs used were long enough that they pushed the A records our module was looking for out of the truncated DNS reply from our internal DNS resolvers.

As of Blackbox 0.25.0, the Blackbox DNS prober defaults to using UDP, doesn't set any EDNS options to increase the allowed reply size, and doesn't fall back to retrying queries over TCP if a UDP query is truncated. This means Blackbox has the old default UDP DNS reply size limit of 512 bytes, which can easily be exceeded with a large enough CNAME chain, among other things. Unfortunately, there is currently no probe metric that will tell you this has happened.

(If you are sure you know how many answer, authority, and additional DNS RRs will be returned by the query, you can check those metrics, but that won't distinguish between a truncated reply and the DNS server doing something odd.)

The current Blackbox workaround is to change your Blackbox module to use TCP instead of UDP, which doesn't have this sort of size limit. Unfortunately not all DNS servers we care about accept TCP connections (they're not ours, don't ask), so in practice we had to duplicate our Blackbox module to get a TCP version of it, and then switch our internal DNS servers to using the new TCP query module.
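For illustration, the TCP variant of such a Blackbox DNS module looks something like this (the module name, query, and expected answer here are made up for the example, so check the details against your Blackbox version):

modules:
  dns_amazonca_tcp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: "tcp"
      preferred_ip_protocol: "ip4"
      query_name: "amazon.ca"
      query_type: "A"
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".*\tIN\tA\t.*"

Since the only change needed is the transport_protocol setting, duplicating an existing UDP module this way is straightforward, if a bit annoying.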

I think this behavior has some uses, for example you may want to know if your DNS replies are now too big for non-EDNS UDP clients. However, I think that Blackbox should definitely let you find out if the DNS reply was truncated (ie, had the 'tc' flag set). I also wouldn't mind if a more friendly and modern DNS query process was the Blackbox default, and you had to specifically request a limited version. I suspect that there are various people using Blackbox who don't know just how minimal their DNS probes currently are.

(All of this behavior comes about not directly through Blackbox but through Blackbox doing its DNS queries with github.com/miekg/dns, which documents its behavior in Client.Exchange(). I've filed Blackbox issue #1258 and issue #1259 about this overall situation, so maybe someday we'll be able to see the truncation status in probe metrics and set the EDNS option for a larger message size.)

The IMAP LIST command as it interacts with client prefixes in Dovecot

By: cks
22 June 2024 at 03:53

Good IMAP clients will support the notion of a prefix the client puts on IMAP paths under some name. When I've written about this in the past (cf) I've been abstract about how it worked at the level of IMAP commands, and in particular at the level of the IMAP LIST command, which lists (some of) your folders and mailboxes. The IMAP LIST command is special here because it has two basic arguments, which in RFC 9051 are vaguely described as 'the reference name' and 'the mailbox name with possible wildcards'. This probably makes sense to IMAP experts who understand the difference, but for the rest of us it gets a bit confusing.

(The reason it's confusing is that other IMAP commands involving mailboxes (such as 'SELECT') only have a single argument, and also LIST reports folders and mailboxes as single arguments. This makes it a bit mysterious why LIST has two basic arguments, and how you're supposed to ask for things like 'please list all folders under X'. Or, if you're a system administrator reading Dovecot event logs, it makes it hard to know if a client is behaving badly or is just unusual.)

In particular, you might ask where an IMAP client puts its client prefix in a LIST command. The answer is that it depends on the IMAP client, as we discovered recently during some testing. If the client prefix is 'IMail/' and the client wants to know all of your folders and mailboxes, some clients will send:

x LIST "IMail/" "*"

Other IMAP clients will send:

x LIST "" "IMail/*"

As I have experimentally verified, Dovecot 2.3.16 will accept wildcards in either LIST argument, although this may not be something RFC 9051 requires. In a basic LIST syntax (the two-argument form I've given above), a second argument of "" has a special purpose and needs to be avoided (unless you want the information it provides). On Dovecot, you can write 'LIST "IMail/*" "%"' if you want to put all the heavy wildcard lifting into the first argument.

(Dovecot will let you play special tricks this way. Suppose that you want to find a mailbox called 'Barney' and you know it's somewhere in your mail hierarchy; then you can ask for 'LIST "*" "Barney"' and get just it. I suspect no actual IMAP clients attempt to do this.)

In our logs, we see recursively wildcarded IMAP LIST commands (ie, ones using the '*' wildcard operator) with first arguments both with and without a trailing slash; that is, we see clients generate both 'LIST "mail" "*"' and 'LIST "mail/" "*"'. For the recursive wildcard I think this gets you the same result, but there is a difference between them when you use '%':

a LIST "Imap-List-Test" "%"
* LIST (\Noselect \HasChildren) "/" Imap-List-Test
a OK List completed (0.001 + 0.000 secs).

a LIST "Imap-List-Test/" "%"
* LIST (\Noselect \HasChildren) "/" Imap-List-Test/One
a OK List completed (0.001 + 0.000 secs).

Given this, if I was writing an IMAP client program that used LIST and had a non-blank first argument, I would always put the trailing slash on (technically I think this is the folder separator, which doesn't have to be '/' but always is for us).

(I'm not going to write an actual mail reader, but I do sometimes need to write IMAP tests and similar automated things.)

PS: I'm sure that this is all familiar ground to IMAP mavens but we dip into this area only occasionally and I want to write this down for future use before it falls out of my head again.

(We're currently looking at IMAP client prefixes and how clients handle them for reasons well outside the scope of this entry.)

We don't know what's happening on our networks

By: cks
16 June 2024 at 03:37

In some organizations, a foundational principle of their network security (both internal and external) is that you should know about everything that is happening on the network. No program, no network service, no system should be accepting or sending unknown network traffic, and you should be able to completely inventory your expected traffic patterns. In some environments, this will include not just protocol level knowledge but also things like what DNS names should be being looked up. This detailed knowledge is obviously great for network security and for detecting intrusions; unexpected network traffic can be used to trigger investigations and maybe alerts.

(I suspect that this is often an aspirational goal that is not necessarily achieved.)

This is completely impossible in our (network) environment, which I can broadly describe as providing general networking to the research side of a university department. There are two aspects to this. The first aspect is that in our general network environment, there are plenty of desktops, laptops, phones, and other such devices on various pieces of our network, many of them personally owned by people. All of these devices are often running random software that phones home to random places at random times, doing all sorts of random outbound traffic (and no doubt pulling in some amount of inbound traffic in the process). Often the owner of the device has no idea that this traffic is happening, never mind where it's going to, since modern software feels free to talk to wherever it wants without telling you (and of course, the details change all the time).

The second aspect is that we don't quiz people here on what they're doing or demand that they tell us what they're up to before they do it. More broadly, our entire environment doesn't run neatly contained 'services', which can be inventoried before they're deployed, given security reviews, and so on. Instead, we provide an environment to people and they are free to use it as they like to get their (research) work done. If their work or their software needs to talk to something and our firewalls allow it, then they can just do it without having to slow down to talk to us. So even for servers (either ours or those run by people here), we can't predict the network traffic because it depends on what people are doing with them.

(If our firewalls don't allow some needed traffic, we'll generally change that once we know about the issue. In practice our outbound firewalls are relatively porous so a lot of internally-initiated activity will just work.)

But all of this leads to a broad issue, which is that in a university environment, it is not our business what people are doing, on the network or otherwise. If you want an analogy, we are in effect an ISP with some additional services, like printing (still surprisingly popular), (inbound) network security, email, web hosting, and general purpose computation. To have good knowledge of what was happening on our networks we'd have to be gatekeepers or panopticon observers (or both), and we are neither.

(In addition, many of the people using our environment are not employees of the university.)

Fundamentally we don't operate a tightly controlled network environment. Trying to operate as if we did (or should) to any significant degree would be a great way to cause all sorts of problems and get in the way of people doing a wide variety of reasonable things.

Using prime numbers for our Prometheus scrape intervals

By: cks
14 June 2024 at 03:06

When I wrote about the current size of our Prometheus setup I mentioned that some of our Prometheus scrape intervals (how often metrics are collected from a metrics source) were unusual looking numbers like 59 seconds and 89 seconds, instead of conventional ones like 15, 30, or 60 seconds. These intervals are prime numbers, and we use them deliberately so that our metrics collection and checks can't become synchronized to some regular process that happens, for example, once a minute.

Prometheus already scatters the start times of metrics collection within their interval, so synchronization isn't necessarily very likely, but using prime numbers adds an extra level of insurance. At the same time, using prime numbers that are very close to exact times like '60 seconds' or '90 seconds' means that we have relatively good odds of periodically doing our check at exactly the start of a minute, or a 30 second interval, or the like, so that if there is something that happens at :00 or :30 or the like we'll probably observe it sooner or later (although we may not understand what we're seeing).
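In the Prometheus configuration this is nothing more exotic than unusual-looking scrape_interval values. Here's a minimal sketch, with made-up job names, hosts, and ports (real Blackbox jobs would also need the usual /probe relabeling, which I've left out):

scrape_configs:
  # cumulative metrics can use a conventional interval
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['server1.example.org:9100']

  # point in time checks get a prime interval so they can't
  # synchronize with something that runs exactly once a minute
  - job_name: 'reachability-checks'
    scrape_interval: 89s
    static_configs:
      - targets: ['checker.example.org:9115']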

My feeling is that this irregularity is less important in things that provide cumulative metrics (like most of the metrics from the Prometheus host agent) and more important for 'point in time' metrics of what the current state is, which generally includes Blackbox checks. Cumulative metrics will capture both spikes and quiet periods, but point in time metrics may be distorted by only being collected at busy times (or only at quiet times).

However, our current Prometheus configuration is certainly not being particularly systematic about what has a regular collection interval (like 15, 30, or 60 seconds) and what doesn't. I should probably go back through every collection target, figure out if it falls more into the 'cumulative' category or the 'point in time' category, and set its collection interval to match. This will probably wind up moving some things from being checked every 30 seconds to being checked every 29 (and maybe some from 60 to 59).

(All of this is probably not very important in practice, since the odds of synchronization are relatively low to start with.)

The size of our Prometheus setup as of June 2024

By: cks
12 June 2024 at 03:15

At this point we've been running our Prometheus setup since November 21st 2018, and have still not expired any metrics, so we have full resolution metrics data right back to the beginning. Three years ago, I wrote how big our setup was as of May 2021, and since someone on the Prometheus mailing list was recently asking how big a Prometheus setup you could run, I'm going to do an update on our numbers.

Our core Prometheus server is still a Dell 1U server, with 64 GB of RAM because we could put that much in and it's cheap insurance against high memory usage. The Prometheus time series database (TSDB) is in a mirrored pair of 20 TB HDDs (in 2021 we used 4 TB HDDs, but since then we ran out of space and moved). At the moment we have what 'du -h' says is 6.3 TB of disk space used. The disk space usage has been rising steadily over time; in 2019, 20 days of metrics took 35 GB, and these days they take 104 GB.

(In these two 20-day chunk directories I'm looking at, in 2019 we had 50073266852 samples for 465988 series, and in 2024 we had 130974619588 samples for 1460523 series, which we can broadly approximate as about triple.)

These days, we're running at an ingestion rate of about 73,000 samples a second, scraped from 947 different sample sources; the largest single source of things continues to be Blackbox probes. Our largest single source of samples is no longer Pushgateway (it is now way down) but instead the ZFS exporter we use to get highly detailed ZFS metrics; our most chatty ZFS fileserver generates 95,000 samples from it. Apart from that, the most chatty sample sources are the Prometheus host agents on some of our machines, which can generate up to 19,000 metrics, primarily due to some of our servers having a lot of CPUs. About 700 of our scrape sources generate less than 50 samples per scrape.

(Our scrape rates vary. Host agents and the Cloudflare eBPF exporter are scraped every 15 seconds, we ping most machines every 30 seconds, the ZFS exporter is scraped every 30 seconds, most other Blackbox checks happen every 89 seconds, and a bunch of other scrape targets are every 60 seconds or every 59 seconds (and I should probably regularize that).)

At the moment we're pulling host agent information from 143 machines, doing Blackbox ping checks for 232 different targets, performing 375 assorted Blackbox checks other than pings (a lot of them SSH checks), and running assorted other Prometheus exporters in smaller quantities that we scrape for various things. Every server with a host agent also gets at least two Blackbox checks (ICMP ping and a SSH connection), but we ping and check other things too as you can see from the numbers.

We've grown to 158 alert rules, all running on the default 15 second rule evaluation rate. Evaluation time of all of these alert rules appears to be relatively trivial.

The Prometheus host server has six CPUs and typically runs about 3% user CPU usage. Average inbound bandwidth is about 800 Kbytes/sec. Somewhat to my surprise, this CPU usage does include some amount of Prometheus queries (outside of rules evaluation), because it looks like some people do routinely look at Grafana dashboards and thus trigger Prometheus queries (although I believe it's all for recent data and queries for historical data are relatively rare).

None of this is necessarily a guide to what anyone else could do with Prometheus, or how much resources it would take to handle a particular environment. One of the things that may make our environment unusual is that since we use physical hardware, we don't have hosts coming and going on a regular basis and churning labels like 'instance'. Using Prometheus in the cloud, with a churn of cloud host instances, might have different resource needs.

(But I do feel it's an indication that you don't need a heavy duty server to handle a reasonable Prometheus environment.)

OpenSSH can choose (or force) the 'shell' used for a specific SSH key

By: cks
10 June 2024 at 02:56

One of the perhaps less known and under-utilized features of OpenSSH is that you can make connections using specific authorized SSH keys use specific 'shells', although actually using this may be a little bit tricky. The basic ingredient to do this is a command= setting on the specific key in your .ssh/authorized_keys file, but of course there are some wrinkles and you may not be happy if you just set this to a shell-like program.

The first wrinkle is that sshd runs this command using your regular /etc/passwd shell, as '$SHELL -c <whatever>'. This is presumably done so that you can't evade a restricted shell by writing yourself an authorized_keys file with a more liberal command= command, such as 'command="/bin/bash" ssh-ed25519 ...'. The second wrinkle is that this command is always run with no arguments regardless of how you ran 'ssh', and it's up to the command to work out what you want to do.

If you ran 'ssh u@h echo hi', then the command will be run with a $SSH_ORIGINAL_COMMAND that contains 'echo hi'. If you just ran 'ssh u@h', there will be no $SSH_ORIGINAL_COMMAND in the command's environment. This means that if you simply use a shell as your 'command=', it will only half work. You can do 'ssh u@h' and get a shell environment, but it won't be a login shell, while 'ssh u@h echo hi' won't work at all (it will typically hang). And in both cases, '$SHELL' will be your /etc/passwd login shell. To use this to selectively change your login shell based on the SSH key you use to authenticate, you'll need a cover script.

(Why might you want such a thing? Well, suppose you like to use an alternate shell with unusual behavior, and you also want to use remote access systems, such as Emacs' TRAMP, that are tightly connected to using a standard shell on the remote end. Changing your shell via 'command=' shifts the problem to getting the remote access system to use a specific SSH key, which may be much easier than any alternatives.)

The most straightforward thing to do with a 'command=' script is to narrowly restrict what the particular SSH key can do; we use this in our rsync replication setup. As covered in both the sshd manual page and that entry, you'll need to add additional restrictions to the key to make things solid.
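For illustration, a fully locked down key line can look something like this (the command path, network, and key are placeholders):

command="/opt/sysadmin/rsync-wrapper",restrict,from="192.0.2.0/24" ssh-ed25519 AAAA... replication-key

Here 'restrict' turns off PTY allocation, port and agent forwarding, and the other extras in one option, and 'from=' means the key can only even be tried from the networks you expect.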

A more complex thing is to use the cover script to do additional access control, authorization checks, and logging before you run what you were asked to. For example, if you have a special 'break glass' system access SSH key, you might want to have it forced to a command= that loudly logs its use, perhaps complains (and errors out) if it wasn't used in exactly the way you expected (for instance, you might never intend to use the 'break glass' key to run remote commands, just to log in), and maybe even test that your regular authentication methods are down. If all of the tests pass, you can then invoke a regular shell (probably just '$SHELL', unless you have a reason to want another one). Especially energetic people could run the entire 'break glass' shell session within script(1), so you hopefully have a record of absolutely everything in the session.

(You'd want the record not so much for security as so that you can later reconstruct what you did in that frantic session when you were entirely focused on putting things back together.)

Sidebar: A plausible alternate shell cover script for command=

The following is only lightly tested and it assumes that your shell supports '-l', '-c <string>' and '-i' options, but since I bothered to work it out and test it a bit I'm going to put it down here.

#!/bin/sh
SHELL=<whatever alternate shell you want>
export SHELL
if [ -n "$SSH_ORIGINAL_COMMAND" ]; then
    exec $SHELL -c "$SSH_ORIGINAL_COMMAND"
elif [ -t 0 ]; then
    exec $SHELL -l
else
    exec $SHELL -i
fi

This handles the two most likely cases (running a command and logging in), and defaults to an interactive shell session if you're not on a PTY and didn't supply a command to run.

Operating services versus operating an "adequate environment"

By: cks
9 June 2024 at 02:50

A while back I wrote about how metrics have many different uses, not all of them actionable ones, and used network bandwidth as an example of a non-actionable metric. In a comment, it was suggested that network bandwidth was sort of actionable in that if we reached capacity limits, that should cause us to add more capacity in one of several ways. My first reaction was that this was non-actionable for us because it's mostly something we're not in a position to do. My second reaction was to think about why this is so. My current and not entirely fully baked view is that it comes down to a difference between what we do and what most people are doing. To put it briefly, many people operate services, while we operate an (adequate or good enough) environment.

If you operate services and people have a bad time with them, that is a direct problem. You can probably put some sort of cost figure on the effects of that bad time, and assuming that the services are important (or the numbers large enough), you can probably get funding to fix the problem. If you are running out of bandwidth, you go out and get more bandwidth.

If you operate an environment, you're generally providing the best environment you can with the funding and support you have (and the priorities for where to direct that funding, for example to prioritize network speed over the amount of storage). If this environment is not good enough for some people's tastes, for example because they're running into bandwidth limits, your reaction is probably to shrug sadly. Those people need to talk to the powers that be, and if the powers that be want you to improve some aspect of the environment, they need to either provide more funding or reduce what features you support so that you can focus more money on less surface area.

If you're used to working in a services situation, the environment situation is probably enraging; here you are, having a bad time and no one cares enough to do anything about it. If you're used to operating an adequate environment, the services mindset feels weird. Get more network bandwidth when you run into performance limits? How, and who is going to fund it?

(Generally an environment is adequate if enough people can get enough work done. And I think that in practice there's a spectrum between these two positions, and a mix of situations within a single organization where there are some 'services' in an environment style organization and some 'this environment is good enough' things in a generally service-focused place (I suspect that these are likely to be internal things).)

Some notes on Grafana Loki's new "structured metadata" (as of 3.0.x)

By: cks
28 May 2024 at 02:28

Grafana Loki somewhat bills itself as "Prometheus for logs", and so it's unsurprising that it started with a data model much like Prometheus. Log lines in Loki are stored with some amount of metadata in labels, just as Prometheus metrics values have labels (including the name of the metric, which is sort of a label). Unfortunately Loki made implementation choices that caused this data model to be relatively catastrophic for system logs (either syslog logs or the systemd journal). Unlike Prometheus, Loki stores each set of label values separately and it never compacts its log storage. Your choices are to either throw away a great deal of valuable system log metadata in order to keep label cardinality down, contaminate your log lines with metadata, making them hard to use, or run Loki in a way that causes periodic explosions and is in general very far outside the configurations that Grafana Inc develops Loki for.

Eventually Grafana Inc worked out that this was less than ideal and sort of did something about it, by introducing "structured metadata". The simple way to describe Loki's structured metadata is that it is labels (and label values) that are not used to separate out log line storage. In theory this is just what I've wanted for some time, but in practice as of Loki 3.0.0, structured metadata is undercooked and not something we can use. However, you probably want to use it in a new greenfield development to ingest system logs (via promtail's somewhat underdocumented support for it), although I can't recommend that you use Loki at all, at least in simple configurations.

The first problem is that structured metadata labels are not actual labels as Loki treats them. If you have a structured metadata label 'fred', you cannot write a LogQL query of '{...,fred="value"}'. Instead you must write this as '{....} | fred="value"'. This means that all of your queries care deeply about whether a particular thing is a Loki label or merely a structured metadata label. I feel strongly that your queries should not depend on the details of your database schema, partly because it makes changing your database schema harder. Loki tools are inconsistent about this distinction; for example 'logcli query' will mostly print structured metadata labels as if they were real labels.

Speaking of changing your database schema, that is the other large piece of bad news about structured metadata. If you have an existing Loki environment from before structured metadata, complete with lots of real labels because that's how you had to capture log metadata, there is no obvious way to switch over to using structured metadata for that log metadata. There are some interesting ways to fail to do so, because the current Loki will accept a client submitting 'structured metadata' that the Loki server thinks should be actual labels. If you add some new, higher cardinality structured metadata alongside the labels you'd like to convert, I've seen this add that high cardinality structured metadata as actual labels (the result wasn't pretty). If you want to switch, the easiest way is to stop Loki, delete all of your existing log data, and start from scratch with all clients sending all of the log metadata you care about as structured metadata instead of labels.

I haven't tested what happens in a greenfield configuration if most clients send some client-side labels as structured metadata but one client fumbles things and sends them as labels. I would like to think that Loki rejects this, rather than accepting it and silently converting the structured metadata labels from other clients into real labels (possibly high cardinality ones). Unfortunately this isn't a theoretical mistake, because of an implementation choice in the current (3.0.x) version of Promtail. In Promtail, in order to send syslog or systemd journal fields as structured metadata, you must first materialize them as regular labels (via relabeling) and then convert them to structured metadata in the structured metadata stage. If you do the first but not the second, your Promtail configuration will send that metadata to Loki as actual labels, possibly to your deep regret.
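As a rough sketch of the shape this takes for a systemd journal field (the field and label names here are just examples, and you should verify the stage's exact behavior against your Promtail version and its documentation):

scrape_configs:
  - job_name: journal
    journal:
      labels:
        job: systemd-journal
    relabel_configs:
      # first materialize the journal field as a regular label ...
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
    pipeline_stages:
      # ... and then convert that label to structured metadata
      - structured_metadata:
          unit:

Leave out the pipeline stage and 'unit' goes to Loki as a real (and potentially high cardinality) label, which is exactly the mistake described above.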

I was initially hopeful that structured metadata would let us change our Loki configuration to something closer to a mainstream one. Unfortunately, my investigation has ruled this out for now; we would need to change too many existing queries and there are too many uncertainties over whether we would be able to do it without deleting all of our existing log data (and then living in fear of a cardinality explosion due to an outdated or mis-configured client). Maybe in Loki 4.0.

Flaky alerts are telling you something

By: cks
27 May 2024 at 02:42

Sometimes, monitoring and alerting systems have flaky alerts, either in the form of flapping alerts (where the alert will repeatedly trigger and then go away) or alerts that go off when there is no problem. Broadly speaking, these flaky alerts aren't just noise; they're telling you something.

To put it one way, flaky monitoring system alerts are like flaky tests in programming. Each of these is telling you that your understanding of things is incorrect or that something odd and unusual is going on, and sometimes both. This comes about because you don't generally create either alerts or tests intending them to be flaky; you intend for them to work (or sometimes for tests, to reliably fail before you fix things). If the result of your work is flaky, either you didn't correctly understand how your system (or your code) behaves when you did your work, such that you aren't actually testing what you think you're testing, or there is something going on that genuinely causes unexpected sporadic failures.

(For example, our discovery of OpenSSH sshd's 'MaxStartups' setting came from investigating a 'flaky' alert.)

In both flaky alerts and flaky tests, you can deal with the noise by either disabling the alert or test, or by making it 'try harder' in some way (for alerts this is often 'make this condition have to be true for longer than before'). However, this doesn't change the underlying reality of what is happening, nor does it improve your understanding of the system (at least, not beyond a superficial level of 'I was wrong that this is a reliable signal of ...'). There are obvious drawbacks to this non-approach to the underlying issues.

This doesn't mean that every flaky alert deserves a deep investigation. Sometimes the range of things that might be misunderstood or going wrong is not important enough to justify an investigation. And even if you plan an investigation, it's perfectly reasonable to remove the alert until then, or de-flake it with various 'try harder' brute force mechanisms. For that matter, it's okay to remove a flaky alert if you simply have higher priorities right now. If the flaky alert is trying to tell you about something serious, sooner or later it will probably escalate to obvious, non-flaky symptoms.

(This isn't necessarily how programmers should deal with flaky tests, but system administration is in part an art of tradeoffs. We can never do everything, so we need to pick the important somethings.)

There are multiple uses for metrics (and collecting metrics)

By: cks
24 May 2024 at 02:56

In a comment on my entry on the overhead of the Prometheus host agent's 'perf' collector, a commentator asked a reasonable question:

Not to be annoying, but: is any of the 'perf data' you collect here honestly 'actionable data' ? [...] In my not so humble opinion, you should only collect the type of data that you can actually act on.

It's true that the perf data I might collect isn't actionable data (and thus not actionable metrics), but in my view this is far from the only reason to collect metrics. I can readily see at least three or four different reasons to collect metrics.

The first and obvious purpose is actionable metrics, things that will get you to do things, often by triggering alerts. This can be the metric by itself, such as free disk space on the root of a server (or the expiry time of a TLS certificate), or the metric in combination with other data, such as detecting that the DNS SOA record serial number for one of your DNS zones doesn't match across all of your official DNS servers.

The second reason is to use the metrics to help understand how your systems are behaving; here your systems might be either physical (or at least virtual) servers, or software systems. Often a big reason to look at this information is because something mysterious happened and you want to look at relatively detailed information on what was going on at the time. While you could collect this data only when you're trying to better understand ongoing issues, my view is that you also want to collect it when things are normal so that you have a baseline to compare against.

(And since sometimes things go bad slowly, you want to have a long baseline. We experienced this with our machine room temperatures.)

Sometimes, having 'understanding' metrics available will allow you to head off problems before hand, because metrics that you thought were only going to be for understanding problems as and after they happened can be turned into warning signs of a problem so you can mitigate it. This happened to us when server memory usage information allowed us to recognize and then mitigate a kernel memory leak (there was also a case with SMART drive data).

The third reason is to understand how (and how much) your systems are being used and how that usage is changing over time. This is often most interesting when you look at relatively high level metrics instead of what are effectively low-level metrics from the innards of your systems. One popular sub-field of this is projecting future resource needs, both hardware level things like CPU, RAM, and disk space and larger scale things like the likely future volume of requests and other actions your (software) systems may be called on to handle.

(Both of these two reasons can combine together in exploring casual questions about your systems that are enabled by having metrics available.)

A fourth semi-reason to collect metrics is as an experiment, to see if they're useful or not. You can usually tell what are actionable metrics in advance, but you can't always tell what will be useful for understanding your various systems or understanding how they're used. Sometimes metrics turn out to be uninformative and boring, and sometimes metrics turn out to reveal surprises.

My impression of the modern metrics movement is that the general wisdom is to collect everything that isn't too expensive (either to collect or to store), because more data is better than less data and you're usually not sure in advance what's going to be meaningful and useful. You create alerts carefully and to a limited extent (and in modern practice, often focusing on things that people using your services will notice), but for the underlying metrics, the more the better, potentially.

The trade-offs in not using WireGuard to talk to our cloud server

By: cks
18 May 2024 at 03:42

We recently set up our first cloud server in order to check the external reachability of some of our services, where the cloud server runs a Prometheus Blackbox instance and our Prometheus server talks to it to have it do checks and return the results. Originally, I was planning for there to be a WireGuard tunnel between our Prometheus server and the cloud VM, which Prometheus would use to talk to Blackbox. In the actual realized setup, there's no WireGuard and we use restrictive firewall rules to restrict potentially dangerous access to Blackbox to the Prometheus server.

I had expected to use WireGuard for a combination of access control to Blackbox and to deal with the cloud server having a potentially variable public IP. In practice, this cloud provider gives us a persistent public IP (as far as I can tell from their documentation) and required us to set up firewall rules either way (by default all inbound traffic is blocked), so not doing WireGuard meant a somewhat simpler configuration. Especially, it meant not needing to set up WireGuard on the Prometheus server.

(My plan for WireGuard and the public IP problem was to have the cloud server periodically ping the Prometheus server over WireGuard. This would automatically teach the Prometheus server's WireGuard the current public IP, while the WireGuard internal IP of the cloud server would stay constant. The cloud server's Blackbox would listen only on its internal WireGuard IP, not anything else.)
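For what it's worth, the cloud VM side of that WireGuard setup would have looked roughly like the following wg-quick style sketch (the addresses, hostname, and port are placeholders); a persistent keepalive is one easy way to generate the periodic traffic instead of explicit pings:

[Interface]
PrivateKey = <the cloud VM's private key>
Address = 172.16.99.2/32

[Peer]
# the Prometheus server, which has a stable public endpoint
PublicKey = <the Prometheus server's public key>
Endpoint = prometheus.example.org:51820
AllowedIPs = 172.16.99.1/32
# regular authenticated traffic so the other end always knows our current public IP
PersistentKeepalive = 25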

In some ways the result of relying on a firewall instead of WireGuard is more secure, in that an attacker would have to steal our IP address instead of stealing our WireGuard peer private key. In practice neither are worth worrying about, since all an attacker would get is our Blackbox configuration (and the ability to make assorted Blackbox probes from our cloud VM, which has no special permissions).

The one clear thing we lose in not using WireGuard is that the Prometheus server is now querying Blackbox using unencrypted HTTP over the open Internet. If there is some Intrusion Prevention System (IPS) in the path between us and the cloud server, it may someday decide that it is unhappy with this HTTP traffic (perhaps it trips some detection rule) and that it should block said traffic. An encrypted WireGuard tunnel would hide all of our Prometheus HTTP query traffic (and responses) from any in-path IPS.

(Of course we have alerts that would tell us that we can't talk to the cloud server's Blackbox. But it's better not to have our queries blocked at all.)

There are various ways to work around this, but they all give us a more complicated configuration on at least the cloud server so we aren't doing any of them (yet). And of course we can switch to the WireGuard approach when (if) we have this sort of problem.

Thoughts on (not) automating the setup of our first cloud server

By: cks
17 May 2024 at 02:52

I recently set up our first cloud server, in a flailing way that's probably familiar to anyone who still remembers their first cloud VM (complete with a later discovery of cloud provider 'upsell'). The background for this cloud server is that we want to check external reachability of some of our systems, in addition to the internal reachability already checked by our metrics and monitoring system. The actual implementation of this is quite simple; the cloud server runs an instance of the Prometheus Blackbox agent for service checks, and our Prometheus server performs a subset of our Blackbox service checks through it (in addition to the full set of service checks that are done through our local Blackbox instance).

(Access to the cloud server's Blackbox instance is guarded with firewall rules, because giving access to Blackbox is somewhat risky.)

The proper modern way to set up cloud servers is with some automated provisioning system, so that you wind up with 'cattle' instead of 'pets' (partly because every so often the cloud provider is going to abruptly terminate your server and maybe lose its data). We don't use such an automation system for our existing physical servers, so I opted not to try to learn both a cloud provider's way of doing things and a cloud server automation system at the same time, and set up this cloud server by hand. The good news for us is that the actual setup process for this server is quite simple, since it does so little and reuses our existing Blackbox setup from our main Prometheus server (all of which is stored in our central collection of configuration files and other stuff).

(As a result, this cloud server is installed in a way fairly similar to our other machine build instructions. Since it lives in the cloud and is completely detached from our infrastructure, it doesn't have our standard local setup and customizations.)

In a way this is also the bad news. If this server and its operating environment were more complicated to set up, we would have more motivation to pick one of the cloud server automation systems, learn it, and build our cloud server's configuration in it so we could have, for example, a command line 'rebuild this machine and tell me its new IP' script that we could run as needed. Since rebuilding the machine as needed is so simple and fast, it's probably never going to motivate us into learning a cloud server automation system (at least not by itself; if we had a whole collection of simple cloud VMs we might feel differently, but that's unlikely for various reasons).

Although setting up a new instance of this cloud server is simple enough, it's also not trivial. Doing it by hand means dealing with the cloud vendor's website and going through a bunch of clicking on things to set various settings and options we need. If we had a cloud automation system we knew and already had all set up, it would be better to use it. If we're going to do much more with cloud stuff, I suspect we'll soon want to automate things, both to make us less annoyed at working through websites and to keep everything consistent and visible.

(Also, cloud automation feels like something that I should be learning sooner or later, and now I have a cloud environment I can experiment with. Possibly my very first step should be exploring whatever basic command line tools exist for the particular cloud vendor we're using, since that would save dealing with the web interface in all its annoyance.)

Where NS records show up in DNS replies depends on who you ask

By: cks
12 May 2024 at 02:08

Suppose, not hypothetically, that you're trying to check the NS records for a bunch of subdomains to see if one particular DNS server is listed (because it shouldn't be). In DNS, there are two places that have NS records for a subdomain: the nameservers for the subdomain itself (which list the NS records as part of the zone's full data), and the nameservers for the parent domain, which have to tell resolvers what the authoritative DNS servers for the subdomain are. Today I discovered that these two sorts of DNS servers can return NS records in different parts of the DNS reply.

(These parent domain NS records are technically not glue records, although I think they may commonly be called that and DNS people will most likely understand what you mean if you call them 'NS glue records' or the like.)

A DNS server's answer to your query generally has three sections, although not all of them may be present in any particular reply. The answer section contains the 'resource records' that directly answer your query, the 'authority' section contains NS records of the DNS servers for the domain, and the 'additional' section contains potentially helpful additional data, such as the addresses of some of the DNS servers in the authority section. Now, suppose that you ask a DNS server (one that has the data) for the NS records for a (sub)domain.

If you send your NS record query to either a DNS resolver (a DNS server that will make recursive queries of its own to answer your question) or to an authoritative DNS server for the domain, the NS records will show up in the answer section. You asked a (DNS) question and you got an answer, so this is exactly what you'd expect. However, if you send your NS record query to an authoritative server for the parent domain, its reply may not have any NS records in the answer section (in fact the answer section can be empty); instead, the NS records show up in the authority section. This can be surprising if you're only printing the answer section, for example because you're using 'dig +noall +answer' to get compact, grep'able output.

(If the server you send your query to is authoritative for both the parent domain and the subdomain, I believe you get NS records in the answer section and they come from the subdomain's zone records, not any NS records explicitly listed in the parent.)

This makes a certain amount of sense in the DNS mindset once you (I) think about it. The DNS server is authoritative for the parent domain but not for the subdomain you're asking about, so it can't give you an 'answer'; it doesn't know the answer and isn't going to make a recursive query to the subdomain's listed DNS servers. And the parent domain's DNS server may well have a different list of NS records than the subdomain's authoritative DNS servers have. So all the parent domain's DNS server can do is fill in the authority section with the NS records it knows about and send this back to you.

So if you (I) are querying a parent domain authoritative DNS server for NS records, you (I) should remember to use 'dig +noall +authority +answer', not my handy 'cdig' script that does 'dig +noall +answer'. Using the latter will just lead to some head scratching about how the authoritative DNS server for the university's top level domain doesn't seem to want to tell me about its DNS subdomain delegation data.
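
For illustration, here's roughly how the two cases look with dig (a sketch using made-up names):

# Asking an authoritative server for the subdomain (or a resolver):
# the NS records are in the answer section.
dig +noall +answer NS sub.example.org @ns1.sub.example.org

# Asking an authoritative server for the parent zone: the answer section
# may be empty, so you need to print the authority section as well.
dig +noall +authority +answer NS sub.example.org @ns1.example.org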

All configuration files should support some form of file inclusion

By: cks
9 May 2024 at 03:17

Over on the Fediverse, I said something:

Every configuration file format should have a general 'include this file' feature, and it should support wildcards (for 'include subdir/*.conf'). Sooner or later people are going to need it, especially if your software gets popular.

It's unfortunate that standard YAML does not support this, although it's also sort of inevitable (YAML doesn't require files at all). This leaves everyone using YAML for their configuration file format to come up with various hacks.

(If this feature is hard-coded, it should use file extensions.)

There are a variety of reasons why people wind up wanting to split up a configuration file into multiple pieces. Obvious ones include that it's easier to coordinate multiple people or things wanting to add settings, a single giant file can be hard to read and deal with, and it's easy to write some parts by hand and automatically generate others. A somewhat less obvious reason is that this makes it easy to disable or re-enable an entire cluster of configuration settings; you can do it by simply renaming or moving around a file, instead of having to comment out a whole block in a giant file and then comment it back in later.

(All of these broadly have to do with operating the software in the large, possibly at scale, possibly packaged by one group of people and used by another. I think this is part of why file inclusion is often not an initial feature in configuration file formats.)

One of the great things about modern (Linux) systems and some modern software is the pervasive use of such 'drop-in' included configuration files (or sub-files, or whatever you want to call them). Pretty much everyone loves them and they've turned out to be very useful for eliminating whole classes of practical problems. Implementing them is not without issues, since you wind up having to decide what to do about clashing configuration directives (usually 'the last one read wins', and then you define it so that files are read in name-sorted order) and often you have to implement some sort of section merging (so that parts of some section can be specified in more than one file). But the benefits are worth it.

As mentioned, one subtle drawback of YAML as a configuration file format is that there's no general, direct YAML feature for 'include a file'. Programs that use YAML have to implement this themselves, by defining schemas that have elements with special file inclusion semantics, such as Prometheus's scrape_config_files: section in its configuration file, which lets you include files of scrape_config directives:

# Scrape config files specifies a list of globs.
# Scrape configs are read from all matching files
# and appended to the list of scrape configs.
scrape_config_files:
  [ - <filepath_glob> ... ]

That this only includes scrape_config directives and not anything else shows some of the limitations of this approach. And since it's not a general YAML feature, general YAML linters and so on won't know to look at these included files.
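
(In use this looks something like the following sketch; the glob and path are made up, and I believe each matching file then holds its own scrape configs, but check the current Prometheus documentation for the exact format it expects.)

# prometheus.yml (sketch): pull in additional scrape configs from drop-in files
scrape_config_files:
  - "conf.d/scrape-*.yml"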

However, this sort of inclusion is still much better than not having any sort of inclusion at all. Every YAML based configuration file format should support something like it, at least for any configuration section that gets large (for example, because it can have lots of repeated elements).

Some thoughts on when you can and can't lower OpenSSH's 'LoginGraceTime'

By: cks
8 May 2024 at 01:48

In a comment on my entry on sshd's 'MaxStartups' setting, Etienne Dechamps mentioned that they lowered LoginGraceTime, which defaults to two minutes (which is rather long). At first I was enthusiastic about making a similar change to lower it here, but then I started thinking it through and now I don't think it's so simple. Instead, I think there are three broad situations for how much time you give people connecting to your SSH server to log in.

The best case for a quite short login grace time is if everyone connecting authenticates through an already unlocked and ready SSH keypair. If this is the case, the only thing slowing down logins is the need to bounce a certain number of packets back and forth between the client and you, possibly on slow networks. You're never waiting for people to do something, just for computers to do some calculations and for the traffic to get back and forth. Etienne Dechamps' 20 seconds ought to be long enough for this even under unfavourable network situations and in the face of host load.

(If you do only use keypairs, you can cut off a lot of SSH probes right away by configuring sshd to not even offer password authentication as an option.)
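
A minimal sshd_config sketch for this sort of keypair-only, short grace time setup might look like the following (assuming a reasonably modern OpenSSH; adjust to taste):

# Only public key authentication, so a short grace time is (probably) safe.
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
LoginGraceTime 20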

The intermediate case is if people have to unlock their keypair or hardware token, touch their hardware token to confirm key usage, say yes to an SSH agent prompt, or otherwise take manual action that is normally short. In addition to the network and host delays you had with unlocked and ready keypairs, now you have to give fallible people time to notice the need for action and respond to carry it out accurately. Even if 20 seconds is often enough for this, it feels rushed to me and I think you're likely to see some number of people failing to log in; you really want something longer, although I don't know how much longer.

The worst case is if people authenticate with passwords. Here you have fallible humans carefully typing in their password, getting it wrong (because they have N passwords they've memorized and have to pick the right one, among other things), trying again, and so on. Sometimes this will be a reasonably fast process, much like in the intermediate case, but some of the time it will not be. Setting a mere 20 second timeout on this will definitely cut people off at the knees some of the time. Plus, my view is that you don't want people entering their passwords to feel that they're in a somewhat desperate race against time; that feels like it's going to cause various sorts of mistakes.

For our sins, we have plenty of people who authenticate to us today using passwords. As a result I think we're not in a good position to lower sshd's LoginGraceTime by very much, and so it's probably simpler to leave it at two minutes. Two minutes is fine and generous for people, and it doesn't really cost us anything when dealing with SSH probes (well, once we increase MaxStartups).

What affects what server host key types OpenSSH will offer to you

By: cks
7 May 2024 at 03:17

Today, for reasons beyond the scope of this entry, I was checking the various SSH host keys that some of our servers were using, by connecting to them and trying to harvest their SSH keys. When I tried this with a CentOS 7 host, I discovered that while I could get it to offer its RSA host key, I could not get it to offer an Ed25519 key. At first I wrote this off as 'well, CentOS 7 is old', but then I noticed that this machine actually had an Ed25519 host key in /etc/ssh, and this evening I did some more digging to try to turn up the answer, which turned out to not be what I expected.

(CentOS 7 apparently didn't originally support Ed25519 keys, but it clearly got updated at some point with support for them.)

So, without further delay and as a note to myself, the SSH host key types a remote OpenSSH server will offer to you are controlled by the intersection of three (or four) things:

  • What host key algorithms your client finds acceptable. With modern versions of OpenSSH you can find out your general list with 'ssh -Q HostKeyAlgorithms', although this may not be the algorithms offered for any particular connection. You can see the offered algorithms with 'ssh -vv <host>', in the 'debug2: host key algorithms' line (well, the first line).

    (You may need to alter this, among other settings, to talk to sufficiently old SSH servers.)

  • What host key algorithms the OpenSSH server has been configured to offer in any 'HostKeyAlgorithms' lines in sshd_config, or some default host key algorithm list if you haven't set this. I think it's relatively uncommon to set this, but on some Linuxes this may be affected by things like system-wide cryptography policies that are somewhat opaque and hard to inspect.

  • What host keys on the server are configured in 'HostKey' directives in your sshd_config (et al). If you have no HostKey directives, a default set is used. Once you have any HostKey directive, only explicitly listed keys are ever used. Related to this is that the host key files must actually exist and have the proper permissions.

(I believe that you can see the union of the latter two with 'ssh -vv' in the second 'debug2: host key algorithms:' line. I wish ssh would put 'client' and 'server' into these lines.)
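
For reference, the sort of commands involved in checking all of this look like the following (hostnames are placeholders):

# What host key algorithms my client supports in general:
ssh -Q HostKeyAlgorithms

# The two 'debug2: host key algorithms' lines for a specific connection;
# the first is what the client offers, the second reflects the server side.
ssh -vv somehost 2>&1 | grep 'host key algorithms'

# What host keys a server will actually present, by key type:
ssh-keyscan -t rsa,ecdsa,ed25519 somehost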

This last issue was the problem with this particular CentOS 7 server. Somehow, it had wound up with an /etc/ssh/sshd_config that had explicit HostKey lines but didn't include its Ed25519 key file. It supported Ed25519 fine, but it couldn't offer an Ed25519 key because it didn't have one. Oops, as they say.

(It's possible that this is the result of CentOS 7's post-release addition of Ed25519 keys combined with us customizing this server's /etc/ssh/sshd_config before then, since this server has an sshd_config.rpmnew.)

This also illustrates that your system may generate keys (or have generated keys) for key algorithms it's not going to use. The mere presence of an Ed25519 key in /etc/ssh doesn't mean that it's actually going to get used, or at least used by the server.

Just to be confusing, what SSH key types the OpenSSH ssh program will offer for host-based authentication aren't necessarily the same as what will be offered by the server on the same machine. The OpenSSH ssh doesn't have a 'HostKey' directive and will use any host key it finds using a set of hard-coded names, provided that it's allowed by the client 'HostKeyAlgorithms' setting. So you can have your ssh client trying to use an Ed25519 or ECDSA host key that will never be offered by the OpenSSH server.

PS: Yes, we still have CentOS 7 machines running, although not for much longer. That was sort of why I was looking at the SSH host keys for this machine.

OpenSSH sshd's 'MaxStartups' setting and Internet-accessible machines

By: cks
6 May 2024 at 02:34

Last night, one of our compute servers briefly stopped accepting SSH connections, which set off an alert in our monitoring system. On compute servers, the usual cause for this is that some program (or set of them) has run the system out of memory, but on checking the logs I saw that this wasn't the case. Instead, sshd had logged the following (among other things):

sshd[649662]: error: beginning MaxStartups throttling
sshd[649662]: drop connection #11 from [...]:.. on [...]:22 past MaxStartups

I'm pretty sure I'd seen this error before, but this time I did some reading up on things.

MaxStartups is a sshd configuration setting that controls how many concurrent unauthenticated connections there can be. This can either be a flat number or a setup that triggers random dropping of such connections with a certain probability. According to the manual page (and to comments in the current Ubuntu 22.04 /etc/ssh/sshd_config), the default value is '10:30:100', which drops 30% of new connections if there are already 10 unauthenticated connections and all of them if there are 100 such connections (and a scaled drop probability between those two).

(OpenSSH sshd also can apply a per-'source' limit using PerSourceMaxStartups, where a source can be an individual IPv4 or IPv6 address or a netblock, based on PerSourceNetBlockSize.)

Normal systems probably don't have any issue with this setting and its default, but for our sins some of our systems are exposed to the Internet for SSH logins, and attackers probe them (and these attackers are back in action these days after a pause we noticed in February). Apparently enough attackers were making enough attempts early this morning to trigger this limit. Unfortunately this limit is a global setting, with no way to give internal IPs a higher limit than external ones (MaxStartups is not one of the directives that can be included in Match blocks).

Now that I've looked into this, I think that we may want to increase this setting in our environment. Ten unauthenticated connections is not all that many for an Internet-exposed system that's under constant SSH probes, and our Internet-accessible systems aren't short of resources; they could likely afford a lot more such connections. Our logs suggest we see this periodically across a number of systems, which is more or less what I'd expect if they come from attackers randomly hitting our systems. Probably we want to keep the random drop bit instead of creating a hard wall, but increase the starting point of the random drops to 20 or 30 or so.
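
In sshd_config terms, that might look like the following sketch (the specific numbers are illustrative, and PerSourceMaxStartups needs a reasonably recent OpenSSH):

# Start randomly dropping unauthenticated connections at 30 instead of 10,
# and still refuse everything at 100.
MaxStartups 30:30:100

# Optionally also cap how many unauthenticated connections any single
# source can have at once.
PerSourceMaxStartups 5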

(Unfortunately I don't think sshd reports how many concurrent unauthenticated connections it has until it starts dropping them, so you can't see how often you're coming close to the edge.)

We have our first significant batch of servers that only have UEFI booting

By: cks
5 May 2024 at 02:48

UEFI has been the official future of x86 PC firmware for a very long time, and for much of that time your machine's UEFI firmware has still been willing to boot your systems the traditional way x86 PCs booted before UEFI, with 'BIOS MBR' (generally using UEFI CSM booting). Some people have no doubt switched to booting their servers with UEFI (booting) years ago, but for various reasons we have long preferred BIOS (MBR) booting and almost always configured our servers that way if given a choice. Over the years we've wound up with a modest number of servers which only supported UEFI booting, but the majority of our servers and especially our generic 1U utility servers all supported BIOS MBR booting.

Well, those days are over now. We're refreshing our stock of generic 1U utility servers and the new generation are UEFI booting only. This is probably not surprising to anyone, as Intel has been making noises about getting rid of UEFI CSM booting for some time, and was apparently targeting 'by 2024' for server platforms. Well, it is 2024 and here we are with new Intel based server hardware without what Intel calls 'legacy boot support'.

(I'm aware we're late to this party, and it's quite possible that server vendors dropped legacy boot mode a year or two ago. We don't buy generic 1U servers very often; we tend to buy them in batches when we have the money and this doesn't happen regularly.)

To be honest, I don't expect UEFI booting to make much of a visible difference in our lives, and it may improve them in some ways (for example if our Linux kernels use UEFI to store crash information). I think we were right to completely avoid the early implementations of UEFI booting, but it ought to work fine by now if server vendors are accepting Intel shoving legacy boot support overboard. There will be new things we'll have to do on servers with mirrored system disks when we replace a failed disk, but Ubuntu's multi-disk UEFI boot story is in decent shape these days and our system disks don't fail that often.
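
(As an aside, it's easy to check from a running Linux system whether it actually booted via UEFI, and to look at the firmware's boot entries; the latter needs efibootmgr installed and root.)

# /sys/firmware/efi only exists when the kernel was booted through UEFI.
test -d /sys/firmware/efi && echo "UEFI boot" || echo "legacy BIOS boot"

# List (and, carefully, manipulate) the firmware's boot entries.
efibootmgr -v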

(However, UEFI booting does introduce some new failure modes. We probably won't run into corrupted EFI System Partitions, since their contents don't get changed very often these days.)

Having a machine room can mean having things in your machine room

By: cks
2 May 2024 at 02:09

Today we discovered something:

Apparently our (university) machine room now comes with the bonus of a visiting raccoon. I have nothing against Toronto's charming trash pandas, but I do have a strong preference for them to be outdoors and maybe a bit distant.

(There are so far no signs that the raccoon has decided to be a resident of the machine room. Hopefully it is too cool in the room for it to be interested in that.)

Naturally there is a story here. This past Monday morning (what is now two days ago), we discovered that over the weekend, one of the keyboards we keep sitting around our machine room had been fairly thoroughly smashed, with keycaps knocked off and some scattered some distance around the rack. This was especially alarming because the keyboard (and its associated display) were in our rack of fileservers, which are some of our most critical servers. The keyboard had definitely not been smashed up last Friday, and nothing else seemed to have been disturbed or moved, not even the wires dangling near the keyboard.

Initially we suspected that some contractor had been in the room over the weekend to do work on the air conditioning, wire and fiber runs that go through it (and are partially managed by other people in entirely other groups), or something of that nature, had dropped something on the keyboard, and had decided not to mention it to anyone. Today people poked around the assorted bits of clutter in the corners of the room and discovered increasingly clear evidence of animal presence near our rack of fileservers. The fileserver rack (and the cluttered corner where further evidence was uncovered) are right by a vertical wiring conduit that runs up through the ceiling to higher floors. One speculation is that our (presumed) raccoon was jumping into our fileserver rack in order to climb up to get back into the wiring conduit.

Probably not coincidentally, we had recently had some optical fiber runs between floors suddenly go bad after years of service and with no activity near them that we knew of. One cause we had already been speculating about was animals either directly damaging a fiber strand or bending it enough to cause transmission problems. And in the process of investigating this, last week we'd found out that there was believed to be some degree of animal presence up in the false ceiling of the floor our machine room is on.

We haven't actually seen the miscreant in question, and I hope we don't (trapping it is the job of specialists that the university has already called in). My hope is that the raccoon has decided that our machine room is entirely boring and not worth coming back to, because a raccoon that felt like playing around with the blinking lights and noise-making things could probably do an alarming amount of damage.

(I've always expected that we periodically have mice under the raised floor of our machine room, but the thought of a raccoon is a new one. I'll just consider it a charm of having physical servers in our own modest machine room.)

How I (used to) handle keeping track of how I configured software

By: cks
29 April 2024 at 03:26

Once upon a time, back a reasonable while ago, I used to routinely configure (in the './configure' sense) and build a fair amount of software myself, software that periodically got updates and so needed me to rebuild it. If you've ever done this, you know that one of the annoying things about this process is keeping track of just what configuration options you built the software with, so you can re-run the configuration process as necessary (which may be on new releases of the software, but also when you do things like upgrade your system to a new version of your OS). Since I'm that kind of person, I naturally built a system to handle this for me.

How the system worked was that the actual configuration for each program or package was done by a little shell script snippet that I stored in a directory under '$HOME/lib'. Generally the file name of the snippet was the base name of the source directory I would be building in, so for example 'fvwm-cvs'. Also in this directory was a 'MAPPINGS' file that mapped from full or partial paths of the source directory to the snippet to use for that particular thing. To actually configure a program, I ran a script, inventively called 'doconfig'. Doconfig searched the MAPPINGS file for, well, let me just quote from comments in the script:

Algorithm: we have a file called MAPPINGS.
We search for first the full path of the current directory and then it with successive things sawn off the front; if we get a match, we use the filename named.
Otherwise, we try to use the basename of the directory as a file. Otherwise we error out.
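
A sketch of how a script along these lines can work (this is a reconstruction of the idea for illustration, not my actual doconfig):

#!/bin/sh
# Sketch only: snippets and MAPPINGS live under $HOME/lib/doconfig.
cdir="$HOME/lib/doconfig"
here="$(pwd)"

# Look for the full path in MAPPINGS, then the path with successive
# leading components sawn off the front.
m=""
p="$here"
while [ -z "$m" ]; do
    m="$(awk -v p="$p" '$1 == p { print $2; exit }' "$cdir/MAPPINGS" 2>/dev/null)"
    case "$p" in
        */*) p="${p#*/}" ;;
        *)   break ;;
    esac
done

# Fall back to the basename of the current directory, then give up.
snippet="$cdir/${m:-$(basename "$here")}"
if [ -r "$snippet" ]; then
    . "$snippet"
else
    echo "doconfig: no configuration snippet for $here" 1>&2
    exit 1
fi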

There's nothing particularly special about my script and my system for keeping track of how I built software. There probably are tons of versions and variations of it that people have created for themselves over the years. This is just the sort of thing you want to do when you get tired of trying to re-read 'config.log' files or whatever, and realize that you forgot how you built the software the last time around, and so on.

(Having written this up I've realized that I should still be using it, because these days I'm building or re-building a number of things and I've slid back to the old silly ways of trying to do it all by hand.)

PS: At work we don't have any particular system for keeping track of software build instructions. Generally, if we have to build something from source, we put the relevant command lines and other information in our build instructions.

Autoconf and configure features that people find valuable

By: cks
28 April 2024 at 02:29

In the wake of the XZ Utils backdoor, which involved GNU Autoconf, it has been popular to suggest that Autoconf should go away. Some of the people suggesting this have also been proposing that the replacement for Autoconf and the 'configure' scripts it generates be something simpler. As a system administrator who interacts with configure scripts (and autoconf) and who deals with building projects such as OpenZFS, it is my view that people proposing simpler replacements may not be seeing the features that people like me find valuable in practice.

(For this I'm setting aside the (wasteful) cost of replacing Autoconf.)

Projects such as OpenZFS and others rely on their configuration system to detect various aspects of the system they're being built on that can't simply be assumed. For OpenZFS, this includes various aspects of the (internal) kernel 'API'; for other projects, such as conserver, this covers things like whether or not the system has IPMI libraries available. As a system administrator building these projects, I want them to automatically detect all of this rather than forcing me to do it by hand to set build options (or demanding that I install all of the libraries and so on that they might possibly want to use).

As a system administrator, one large thing that I find valuable about configure is that it doesn't require me to change anything shipped with the software in order to configure the software. I can configure the software using a command line, which means that I can use various means to save and recall that command line, ranging from 'how to build this here' documentation to automated scripts.

Normal configure scripts also let me and other people set the install location for the software. This is a relatively critical feature for programs that may be installed as a Linux distribution package, as a *BSD third party package, by the local system administrator, or by an individual user putting them somewhere in their own home directory, since all four of these typically need different install locations. If a replacement configure system does not accept at least a '--prefix' argument or the equivalent, it becomes much less useful in practice.

Many GNU configure scripts also let the person configuring the software set various options for what features it will include, how it will behave by default, and so on. How much these are used varies significantly between programs (and between people building the program), but some of the time they're critical for selecting defaults and enabling (or disabling) features that not everyone wants. A replacement configure system that doesn't support build options like these is less useful for anyone who wants to build such software with non-standard options, and it may force software to drop build options entirely.

(There are some people who would say that software should not have build options any more than it should have runtime configuration settings, but this is not exactly a popular position.)
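
To make this concrete, what I want to be able to record in build documentation and re-run later boils down to a single command line along these lines (the install location and the option names here are made up):

./configure --prefix=/opt/somepackage \
    --enable-some-feature \
    --without-some-library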

This is my list, so other people may well value other features that are supported by Autoconf and configure (for example, the ability to set C compiler flags, or that it's well supported for building RPMs).

I wish projects would reliably use their release announcements mechanisms

By: cks
27 April 2024 at 03:19

Today, not for the first time, I discovered that one project that we use locally had made a new release (of one component) by updating my local copy of their Git repository and noticing that 'git pull' had fetched a new tag. Like various other projects, this project has an official channel to announce new releases of their various components; in this case, a mailing list. Sadly, the new release had not been announced on that mailing list, although other releases have been in the past.

This isn't the only project that does things like this and as a busy system administrator, I wish that they wouldn't. In some ways it's more frustrating to have an official channel for announcements and then to not use it consistently than to have no such channel and force me to rely on things like how Github based projects have a RSS feed of releases. With no channel (or a channel that never gets used), at least I know that I can't rely on it and I'm on my own. An erratic announcement channel makes me miss things.

(It may also cause me to use a release before it is completely ready. There are projects that publish tags and releases in their VCS repositories before they consider the releases to be officially released and something you should use. If I have to go to the VCS repository to find out about (some) new releases, I'm potentially going to be jumping the gun some of the time. Over the years I've built up a set of heuristics for various projects where I know that, for example, a new major release will almost always be officially announced somehow so I should wait to see that, but a point release may not get anything beyond a VCS tag.)

In today's modern Internet world, some of the projects that do this may have a different (and not well communicated) view of what their normal announcement mechanism actually is. If a project has an announcements mailing list and an online discussion forum, for example, perhaps their online forum is where they expect people to go for this sort of thing and there's a de facto policy that only major releases are sent to the mailing list. I tend not to look at such forums, so I'd be missing this sort of thing.

(Some projects may also have under-documented policies on what is worth 'bothering' people about through their documented announcements mechanism and what isn't. I wish they would announce everything, but perhaps other people disagree.)

Pruning some things out with (GNU) find options

By: cks
25 April 2024 at 02:32

Suppose that you need to scan your filesystems and pass some files with specific names, ownerships, or whatever, except that you want to exclude scanning under /tmp and /var/tmp (as illustrative examples). Perhaps also you're feeding the file names to a shell script, especially in a pipeline, which means that you'd like to screen out directory and file names that have (common) problem characters in them, like spaces.

(If you can use Bash for your shell script, the latter problem can be dealt with because you can get Bash to read NUL-terminated lines that can be produced by 'find ... -print0'.)

Excluding things from 'find' results is done with find's -prune action, which is a little bit tricky to use when you want to exclude absolute paths (well okay it's a little bit tricky in general; see this SO question and answers). To start with, you're going to want to generate a list of filesystems and then scan them by absolute path:

FSES="$(... something ...)"
for fs in $FSES; do
    find "$fs" -xdev [... magic ...]
done

Starting with an absolute path to the filesystem (instead of cd'ing into the root of the filesystem and doing 'find . -xdev [...]') means that we can now use absolute paths in find's -path argument instead of ones relative to the filesystem root:

find "$fs" -xdev '(' -path /tmp -o -path /var/tmp ')' -prune -o ....

With absolute paths, we don't have to worry about what if /var or /tmp (or /var/tmp) are separate filesystems, instead of being directories on the root filesystem. Although it's hard to work out without experimentation, -xdev and -prune combine the way we want.

(If we're running 'find' on a filesystem that doesn't contain either /tmp or /var/tmp, we'll waste a bit of CPU time having 'find' evaluate those -path arguments all the time despite it never being possible for them to match. This is unimportant when compared to having a simpler, less error prone script.)

If we want to exclude paths with spaces in them, this is easily done with '-name "* *"'. If we want to get all whitespace, we need GNU Find and its '-regex' argument, documented best in "Regular Expressions" in the info documentation. Because we want to use a character class to match whitespace, we need to use one of the regular expression types that include this, so:

find "$fs" -regextype grep ... -regex '.*[[:space:]].*' ...

On the whole, 'find' is an awkward tool to use for this sort of filtering. Unfortunately it's sometimes what we turn to because our other options involve things like writing programs that consume and filter NUL-terminated file paths.

(And having 'find' skip entire directory trees is more efficient than letting it descend into them, print all their file paths, and then filtering the file paths out later.)

PS: One of the little annoyances of Unix for system administrators is that so many things in a stock Unix environment fall apart the moment people start putting odd characters in file names, unless you take extreme care and use unusual tools. This often affects sysadmins because we frequently have to deal with other people's almost arbitrary choices of file and directory names, and we may be dealing with actively malicious attackers for extra concern.

Sidebar: Reading null-terminated lines in Bash

Bash's version of the 'read' builtin supports a '-d' argument that can be used to read NUL-terminated lines:

while IFS= read -r -d '' line; do
  [ ... use "$line" ... ]
done

You still have to properly quote "$line" in every single use, especially as you're doing this because you expect your lines (or filenames) to sometimes contain troublesome characters. You should definitely use Shellcheck and pay close attention to its warnings (they're good for you).
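
One wrinkle to remember is that in Bash, piping 'find ... -print0' into this loop runs the loop in a subshell, so any variables you set inside it vanish afterward. Feeding the loop with process substitution avoids that (a sketch):

while IFS= read -r -d '' line; do
    printf 'saw: %s\n' "$line"
done < <(find "$fs" -xdev -type f -print0)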

IPMI connections have privilege levels, not just IPMI users

By: cks
17 April 2024 at 02:56

If you want to connect to a server's IPMI over the network, you normally need to authenticate as some IPMI user. When you set that IPMI user up, you'll give it one of three or four privilege levels; ADMINISTRATOR, OPERATOR, USER, or what I believe is rarely used, CALLBACK. For years, when I tried to set up IPMIs for things like reading sensors over the network, remote power cycling, or Serial over LAN console access, I'd make a special IPMI user for the purpose and try to give it a low privilege level, but the low privilege level basically never worked so I'd give up, grumble, and make yet another ADMINISTRATOR user. Recently I discovered that I had misunderstood what was going on, which is that both IPMI users and IPMI connections have a privilege level.

When you make an IPMI connection with, for example, ipmitool, it will ask for that connection to be at some privilege level. Generally the default privilege level that things ask for is 'ADMINISTRATOR', and it's honestly hard to blame them. As far as I know there is no standard for what operations require what privilege level; instead it's up to the server or BMC vendor to decide what level they want to require for any particular IPMI command. But everyone agrees that 'ADMINISTRATOR' is the highest level, so it's the safest to ask for as the connection privilege level; if the BMC doesn't let you do it at ADMINISTRATOR, you probably can't do it at all.

The flaw in this is that an IPMI user's privilege level constrains what privilege level you can ask for when you authenticate as that user. If you make a 'USER' privileged IPMI user, connect as it, and ask for ADMINISTRATOR privileges, the BMC is going to tell you no. Since ipmitool and other tools were always asking for ADMINISTRATOR by default, they would get errors unless I made my IPMI users have that privilege level. Once I discovered and realized this, I could explicitly tell ipmitool and other things to ask for less privilege and then work out exactly what privilege level I needed for a particular operation on a particular BMC.
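
With ipmitool, asking for a lower privilege level is done with '-L' (host and user names here are placeholders, plus however you normally supply the IPMI password, for example -E or prompting with -a):

# Read sensor data at USER privilege:
ipmitool -I lanplus -H some-bmc -U sensoruser -L USER sdr list

# Serial over LAN at OPERATOR privilege (on BMCs that allow it):
ipmitool -I lanplus -H some-bmc -U soluser -L OPERATOR sol activate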

(It is probably safe to assume that a 'USER' privileged IPMI user (well, connection) can read sensor data. Experimentally, at least one vendor's BMC will do Serial over LAN at 'OPERATOR' privilege, but I wouldn't be surprised if some require 'ADMINISTRATOR' for that, since serial console access is often the keys to the server itself. Hopefully power cycling the server is an 'OPERATOR' level thing, but again perhaps not on some BMCs.)

PS: If there's a way to have ipmitool and other things ask for 'whatever the (maximum) privilege level this user has', it's not obvious to me in things like the ipmitool manual page.

NAT'ing on the firewall versus host routes for public IPs

By: cks
8 April 2024 at 02:38

In a comment on my entry on solving the hairpin NAT problem with policy based routing, Arnaud Gomes suggested an alternative approach:

Since you are adding an IP address to the server anyway, why not simply add the public address to a loopback interface, add a route on the firewall and forgo the DNAT completely? In most situations this leads to a much simpler configuration.

This got me to thinking about using this approach as a general way to expose internal servers on internal networks, as an alternative to NAT'ing them on our external firewall. This approach has some conceptual advantages, including that it doesn't require NAT, but unfortunately it's probably significantly more complex in our network environment and so much less attractive than NAT'ing on the external firewall.

There are two disadvantages of the routing approach in an environment like ours. The first disadvantage is that it probably only works easily for inbound connections. If such an exposed server wants to make outgoing connections that will appear to come from its public IP, it needs to explicitly set the source IP for those connections instead of allowing the system to choose the default. Potentially you can solve this on the external firewall by NAT'ing outgoing connections to its public IP, but then things are getting complicated, since you can have two machines generating traffic with the same IP.
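
For illustration, the machine-side setup plus one way to pin the outgoing source address might look like this sketch (all addresses and interface names are made up, and applications can instead bind to the public IP explicitly):

# On the internal server: accept traffic for the public IP.
ip addr add 203.0.113.10/32 dev lo

# Hint that outgoing traffic should prefer the public IP as its source.
ip route change default via 192.168.100.1 dev eth0 src 203.0.113.10

# On the firewall (and anything else that needs it): a host route pointing
# the public IP at the server's internal address.
ip route add 203.0.113.10/32 via 192.168.100.20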

The second disadvantage is that we'd have to establish and maintain a collection of host source routes in multiple places. Our core router would need the routes, the routing firewall each such machine was behind would need to have the route, and probably we'd want other machines and firewalls to also have these host routes. And every time we added, removed, or changed such a machine we'd have to update these routes. We especially don't like to frequently update our core router precisely because it is our core router.

The advantage of doing bidirectional NAT on our external firewall for these machines is the reverse of these issues. There's only one place in our entire network that really has to know about which internal machine is which public IP. Of course this leaves us with the hairpin NAT problem and needing split horizon DNS, but those are broadly considered solved problems, unlike maintaining a set of host routes.

On the other hand, if we already had a full infrastructure for maintaining and updating routing tables, the non-NAT approach might be easy and natural. I can imagine an environment where you propagate route announcements through your network so that everyone can automatically track and know where certain public IPs are. We'd still need firewall rules to allow only certain sorts of traffic in, though.

An issue with Alertmanager inhibitions and resolved alerts

By: cks
3 April 2024 at 03:02

Prometheus Alertmanager has a feature called inhibitions, where one alert can inhibit other alerts. We use this in a number of situations, such as our special 'there is a large scale problem' alert inhibiting other alerts and some others. Recently I realized that there is a complication in how inhibitions interact with being notified about resolved alerts (due to this mailing list thread).

Suppose that you have an inhibition rule to the effect that alert A ('this host is down') inhibits alert B ('this special host daemon is down'), and you send notifications on resolved alerts. With alert A in effect, every time Alertmanager goes to send out a notification for the alert group that alert B is part of, Alertmanager will see that alert B is inhibited and filter it out (as far as I can tell this is the basic effect of Alertmanager silences, inhibitions, and mutes). Such notifications will (potentially) happen on every group_interval tick.
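
For concreteness, such an inhibition rule might look like this in alertmanager.yml (the alert and label names are made up):

inhibit_rules:
  - source_matchers:
      - alertname = "HostDown"
    target_matchers:
      - alertname = "HostDaemonDown"
    equal: ['host']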

Now suppose that both alert A and alert B resolve at more or less the same time (because the host is back up along with its special daemon). Alertmanager doesn't immediately send notifications for resolved alerts; instead, just like all other alert group re-notifications, they wait for the next group_interval tick. When this tick happens, alert B will be a resolved alert that you should normally be notified about, and alert A will no longer be active and so no longer inhibiting it. You'll receive a potentially surprising notification about the now-resolved alert B, even though it was previously inhibited while it was active (and so you may never have received an initial notification that it was active).

(Although I described it as both alerts resolving around the same time, it doesn't have to be that way; alert A might have ended later than B, with some hand-waving and uncertainty. The necessary condition is for alert A and its inhibition to no longer be in effect when Alertmanager is about to process a notification that includes alert B's resolution.)

The consequence of this is that if you want inhibitions to reliably suppress notification about resolved alerts, you need the inhibiting alert to be active at least one group_interval longer than the alerts it's inhibiting. In some cases this is easy to arrange, but in other cases it may be troublesome and so you may want to simply live with the extra notifications about resolved alerts.

(The longer your 'group_interval' setting is, the worse this gets, but there are a number of reasons you probably want group_interval to be relatively short, including prompt notifications about resolved alerts under normal circumstances.)

What Prometheus Alertmanager's group_interval setting means

By: cks
3 April 2024 at 00:43

One of the configuration settings in Prometheus Alertmanager for 'routes' is the alert group interval, the 'group_interval' setting. The Alertmanager configuration describes the setting this way:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent.

As has come up before more than once, this is not actually accurate. The group interval is not a (minimum) delay; it is instead a timer that ticks every so often (a ticker). If you have group_interval set to five minutes, Alertmanager will potentially send another notification only at every five minute interval after the first notification (what I'll call a tick). If the initial notification happened at 12:10, the first re-notification might happen at 12:15, and then at 12:20, and then at 12:25, and so on.

(The timing of these ticks is based purely on when the first notification for an alert group is sent, so usually they will not be so neatly lined up with the clock.)
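
For reference, this is the setting in the Alertmanager routing configuration that I'm talking about (the receiver name and the specific values are just illustrative):

route:
  receiver: default-email
  group_by: ['alertname', 'host']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h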

If a new alert (or a resolved alert) misses the group_interval tick by even a second, a notification including it won't go out until the next tick. If the initial alert group notification happened at 12:10 and then nothing changed until a new alert was raised at 12:31, Alertmanager will not send another notification until the group_interval tick at 12:35, even though it's been much more than five minutes since the last notification.

This gives you an unfortunate tradeoff between prompt notification of additional alerts in an alert group (or of alerts being resolved) and not receiving a horde of notifications. If you want to receive a prompt notification, you need a short group_interval, but then you can receive a stream of notifications as alert after alert after alert pops up one by one. It would be nicer if Alertmanager didn't have this group_interval tick behavior but would instead treat it as a minimum delay between successive notifications, but I don't expect Alertmanager to change at this point.

(I've written all of this down before in various entries, so this is mostly to have a single entry I can link to in the future when group_interval comes up.)

The power of being able to query your servers for unpredictable things

By: cks
2 April 2024 at 03:04

Today, for reasons beyond the scope of this entry, we wanted to find out how much disk space /var/log/amanda was using on all of our servers. We have a quite capable metrics system that captures the amount of space filesystems are using (among many other things), but /var/log/amanda wasn't covered by this because it wasn't a separate filesystem; instead it was just one directory tree in either the root filesystem (on most servers) or the /var filesystem (on a few fileservers that have a separate /var). Fortunately we don't have too many servers in our fleet and we have a set of tools to run commands across all of them, so answering our question was pretty simple.
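
Stripped of any special tooling, this sort of question comes down to a loop along these lines (a sketch; the host list file is a placeholder for however you enumerate your servers):

for h in $(cat ourhosts.txt); do
    printf '%s: ' "$h"
    ssh -n "$h" 'du -sh /var/log/amanda 2>/dev/null || echo missing'
done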

This isn't the first time we've wanted to know some random thing about some or all of our servers, and it won't be the last time. The reality of life is that routine monitoring can't possibly capture every fact you'll ever want to know, and you shouldn't even try to make it do so (among other issues, you'd be collecting far too much information). Sooner or later you're going to need to get nearly arbitrary information from your servers, using some mechanism.

This mechanism doesn't necessarily need to be SSH, and it doesn't even need to involve connecting to servers, depending in part on how many of them you have. Perhaps you'll normally do it by peering inside one of your immutable system images to answer questions about it. But on a moderate scale my feeling is that 'run a command on some or all of our machines and give me the output' is the basic primitive you're going to wind up wanting, partly because it's so flexible.

(One advantage of using SSH for this is that SSH has a mature, well understood and thoroughly hardened authentication and access control system. Other methods of what are fundamentally remote command or code execution may not be so solid and trustworthy. And if you want to, you can aggressively constrain what a SSH connection can do through additional measures like forcing it to run in a captive environment that only permits certain things.)

PS: The direct answer is that on everything except our Amanda backup servers, /var/log/amanda is at most 20 Mbytes or so, and often a lot less. After the Amanda servers, our fileservers have the largest amount of data there. In our environment, this directory tree is only used for what are basically debugging logs, and I believe that on clients, the amount of debugging logs you wind up with scales with the number of filesystems you're dealing with.

The Prometheus scrape interval mistake people keep making

By: cks
31 March 2024 at 02:22

Prometheus gathers metrics by scraping metrics exporters every so often, which means that it has a concept of the scrape interval, how frequently it should scrape a metrics source (a target). Prometheus also has recording rules and alerting rules, both of which have to be evaluated every so often; these also have an evaluation interval. Every so often, someone shows up on the Prometheus mailing list to say, more or less, 'I have a source of metrics that only updates every half hour, so I set my scrape interval to half an hour and everything went mysteriously wrong'.

The reason everything goes wrong if you set a long scrape interval (or a rule evaluation interval) is that Prometheus has an under-documented idea that metric samples go stale after a while. Or to put it another way, when you make a Prometheus query, it only looks back so far to find 'the current value of a metric'. This period is five minutes by default, and changing it is not at all obvious. If you scrape a metric too slowly, the most recent sample will routinely go stale and stop being visible to your queries and alerts. If you scrape something only every half an hour, your metrics from that scrape will be good for five minutes and then stale (and invisible) for the next 25 (more or less). This is unlikely to be what you want.

(Because recording rules and alerting rules create metrics, their evaluation intervals are also subject to this issue. This is pretty clear with recording rules, since their whole purpose is to create new metrics, but isn't as obvious with alerting rules.)

Unfortunately, Prometheus does nothing to stop you from configuring this by accident or ignorance, and people routinely do. You can set a scrape interval of ten minutes, or a half an hour, or an hour, and get not so much as a vague warning from Prometheus when it checks your configuration and starts up. Nor is there so much as a caution about this in the configuration documentation, at least currently.

(The usual safe recommendation is that your scrape interval be no longer than about two minutes, so that you can miss one scrape without metrics going stale.)

If you have a source of metrics that both change infrequently and are expensive to generate, the usual recommendation is that you generate them under your own control and then publish them somewhere, for example in Pushgateway or as text files that are collected through the Prometheus host agent's 'textfiles' collector. If the metrics merely change infrequently but are cheap to collect, Prometheus is quite efficient about storing unchanged metrics so you might as well scrape frequently.
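
A common pattern for the 'generate under your own control' case is a cron job that writes metrics into the host agent's textfile collector directory and renames the file into place, so Prometheus never scrapes a half-written file (the directory and the generator script here are made up; the directory is whatever you gave the host agent's --collector.textfile.directory option):

#!/bin/sh
# Sketch: regenerate expensive metrics from cron every so often.
dir=/var/lib/node_exporter/textfile
tmp="$dir/expensive.prom.$$"
our-expensive-metrics-generator > "$tmp" && mv "$tmp" "$dir/expensive.prom"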

PS: The way you change this staleness interval is the command line Prometheus switch '--query.lookback-delta', although making it larger will likely have various effects that increase resource usage. I also suspect that Prometheus is not tested very much with larger settings for this, especially ones substantially longer than the default.

The effects of silences (et al) in Prometheus Alertmanager

By: cks
29 March 2024 at 03:11

Prometheus Alertmanager has various features that make it 'silence' alerts. Alerts can be inhibited by other alerts, they can be explicitly silenced, and a route can be muted at certain times or only active at certain times. The Alertmanager documentation generally describes all of these as "suppressing notifications" or causing a route to "not send any notifications". However, this limited description is what I would call under-specified, because there are some questions to ask about exactly what happens when you 'silence' alerts. As of Alertmanager 0.27.0, its actual behavior is somewhat complex and definitely hard to understand.

There are two pieces of behavior that seem straightforward:

  • if an alert starts within the silence and is still in effect at the end, its alert group will receive a new notification at its next group_interval point; this notification will include the new alert (or alerts).

  • if an alert group (of one or more alerts) is created within the silence and all of its alerts end sufficiently before the end of the silence, you will get no notification about the alert group.

The area with big question marks is notifications about resolved alerts (if you have Alertmanager set to send notifications on them at all). If the alert resolves sufficiently early, well before the end of the silence, you appear to get no notification for it. If the alert resolves close enough to the end of the silence and its alert group still has active alerts, you will sometimes get an alert group notification that includes the resolved alert. Sometimes this notification will come immediately, and sometimes it seems to only come if the alert group experiences another change in alert status sufficiently soon after the silence has ended.

(There are a lot of variables here and I haven't experimented extensively. Generally I think the sooner that Alertmanager has some reason to send a notification for the alert group, the higher your chances of hearing about resolved alerts are. One source of such a notification is if there are active alerts that started within the silence.)

What I believe is happening is that Alertmanager is keeping track of what alerts have had notifications delivered about them (through a specific receiver), so that Alertmanager can tell if there are new alerts in an alert group that would cause it to send a notification at the next group_interval point. When a silence, mute, or inhibition is in effect, no affected alerts are marked as 'delivered (to receiver X)'. When the silence ends, any such unmarked alerts that still exist are (once again) considered to be undelivered new alerts and will prompt an alert group notification at the alert group's next group_interval point, just as if they had suddenly shown up after the silence ended.

The complication is resolved alerts, because I believe that resolved alerts only linger in Alertmanager for a certain amount of time. After that time they are quietly removed. If an alert is resolved sufficiently early before the end of the silence, this linger time will end before the silence does and the resolved alert will disappear before its new status could trigger any notifications. If the alert is resolved sufficiently close to the end of the silence, it will still be in Alertmanager when notifications start happening again. I'm pretty sure this explanation is incomplete, but it at least gives me a starting point.

PS: Since all of this is under-documented, Alertmanager's behavior could change in the future, either deliberately or accidentally.

(This somewhat elaborates on some things I said on the Fediverse.)

Some questions to ask about what silencing alerts means

By: cks
28 March 2024 at 03:25

A common desired feature for an alert notification system is that you can silence (some) alert notifications for a while. You might silence alerts about things that are under planned maintenance, or do it generally in the dead of night for things that aren't important enough to wake someone. This sounds straightforward but in practice my simple description here is under-specified and raises some questions about how things behave (or should behave).

The simplest implementation of silencing alert notifications is for the alerting system to go through all of its normal process for sending notifications but not actually deliver the notifications; the notifications are discarded, diverted to /dev/null, or whatever. In the view of the overall system, the alert notifications were successfully delivered, while in your view you didn't get emailed, paged, notified in some chat channel, or whatever.

However, there are a number of situations where you may not want to discard alert notifications this way, but instead defer them until after the silence has ended. Here are some cases:

  • If an alert starts during the silence and is still in effect when the silence ends, many people will want to get an alert notification about it at (or soon after) the end of the silence. Otherwise, you have to remember to look at dashboards or other sources of alert information to see what current problems you have.

  • If an alert started before the silence and ends (resolves) during the silence, some people will want to get an alert notification about the alert having been resolved at the end of the silence. Otherwise you're once again left to look at your dashboards to notice that some things cleaned up during the silence.

    (This assumes you normally send notifications about resolved alerts, which not everyone does.)

  • If an alert both starts and ends during the silence, most people will say that you shouldn't get an alert notification about it afterward. Otherwise silences would simply defer alert notifications about things like planned maintenance, not eliminate them. However, some people would like to get some sort of summary or general notification about alerts that came up and got resolved during the silence.

    (This is perhaps especially likely for the 'silence in the depths of the night' or 'silence over the weekend' sorts of schedule-based silencing. You may still want to know that things happened, just not bother people with them on the spot.)

Whether you want post-silence alert notifications in some or all of these situations will depend in part on what you use alert notifications for (or how the designers of your system expect this to work). In some environments, an alert notification is in effect a message that says 'go look at your dashboards', so you don't need this at the end of a planned maintenance since you're probably already doing that. In other environments, the alert notification is either the primary signal that something is wrong or the primary source of information for what to do about it (by carrying links to runbooks, suggested remediations, relevant dashboards, and so on). Getting an alert notification for 'new' alerts is then vital because that's primarily how you know you have to do something and maybe know what to do.

(And in some environments, getting alert notifications about resolved alerts is the primary method people use to track outstanding alerts, making those important.)
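
To connect the schedule-based case to concrete configuration: in Alertmanager (which is what we use), 'dead of night' or 'over the weekend' silencing is normally expressed with time intervals and a route's mute_time_intervals. A minimal sketch, with made up names and times; exactly what you hear about once the interval ends is the sort of under-specified behavior discussed earlier:

time_intervals:
  - name: overnight
    time_intervals:
      - times:
          - start_time: '00:00'
            end_time: '07:00'

route:
  receiver: team-email
  routes:
    # Alerts matching this route don't get notified about overnight.
    - matchers:
        - severity = "low"
      receiver: team-email
      mute_time_intervals:
        - overnight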

How I would automate monitoring DNS queries in basic Prometheus

By: cks
27 March 2024 at 03:06

Recently I wrote about the problem of using basic Prometheus to monitor DNS query results, which comes about primarily because the Blackbox exporter requires a configuration stanza (a module) for every DNS query you want to make and doesn't expose any labels for what the query type and name are. In a comment, Mike Kohne asked if I'd considered using a script to generate the various configurations needed for this, where you want to check N DNS queries across M different DNS servers. I hadn't really thought about it and we're unlikely to do it, but here is how I would if we did.

The input for the generation system is a list of DNS queries we want to confirm work, which is at least a name and a DNS query type (A, MX, SOA, etc), possibly along with an expected result, and a list of the DNS servers that we want to make these queries against. A full blown system would allow multiple groups of queries and DNS servers, so that you can query your internal DNS servers for internal names as well as external names you want to always be resolvable.

First, I'd run a completely separate Blackbox instance for this purpose, so that its configuration can be entirely script-generated. For each DNS query to be made, the script will work out the Blackbox module's name and then put together the formulaic stanza, for example:

autodns_a_utoronto_something:
  prober: dns
  dns:
    query_name: "utoronto.example.com"
    query_type: "A"
    validate_answer_rrs:
      fail_if_none_matches_regexp:
        - ".*\t[0-9]*\tIN\tA\t.*"

Then your generation program combines all of these stanzas together with some stock front matter and you have this Blackbox instance's configuration file. It only needs to change if you add a new DNS name to query.

The other thing the script generates is a list of scrape targets and labels for them in the format that Prometheus file discovery expects. Since we're automatically generating this file we might as well put all of the smart stuff into labels, including specifying the Blackbox module. This would give us one block for each module that lists all of the DNS servers that will be queried for that module, and the labels necessary. This could be JSON or YAML, and in YAML form it would look like (for one module):

- labels:
    # Hopefully you can directly set __param_module in
    # a label like this.
    __param_module: autodns_a_utoronto_something
    query_name: utoronto.example.com
    query_type: A
    [... additional labels based on local needs ...]
  targets:
  - dns1.example.org:53
  - dns2.example.org:53
  - 8.8.8.8:53
  - 1.1.1.1:53
  [...]

(If we're starting with data in a program it's probably better to generate JSON. Pretty much every language can create JSON by now, and it's a more forgiving format than trying to auto-generate YAML even if the result is less readable. But if I was going to put the result in a version control repository, I'd generate YAML.)

More elaborate breakdowns are possible, for example to separate external DNS servers from internal ones, and other people's DNS names from your DNS names. You'll get an awful lot of stanzas with various mixes of labels, but the whole thing is being generated automatically and you don't have to look at it. In our local configuration we'd wind up with at least a few extra labels and a more complicated set of combinations.

We need the query name and query type available as labels because we're going to write one generic alert rule for all of these Blackbox modules, something like:

- alert: DNSGeneric
  expr: probe_success{probe=~"autodns_.*"} == 0
  for: 5m
  annotations:
    summary: "We cannot get the {{$labels.query_type}} record for {{$labels.query_name}} from the DNS server ..."

(If Blackbox put these labels in DNS probe metrics we could skip adding them in the scrape target configuration. We'd also be able to fold a number of our existing DNS alerts into more generic ones.)
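
As a sketch of the scrape configuration side of this (the job name, file path, and the address of the separate Blackbox instance are all made up), the relabeling below is what ties things together; it passes the target to Blackbox, keeps the DNS server as the instance label, and copies the module name into a plain 'probe' label so the generic alert rule can match on it:

- job_name: autodns
  metrics_path: /probe
  file_sd_configs:
    - files:
        - /etc/prometheus/autodns-targets.yaml
  relabel_configs:
    # Hand the scrape target (the DNS server) to Blackbox as the
    # 'target' URL parameter.
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the DNS server visible as the instance label.
    - source_labels: [__param_target]
      target_label: instance
    # Copy the module name (set in the file_sd labels) into an
    # ordinary 'probe' label for alert rules to use.
    - source_labels: [__param_module]
      target_label: probe
    # Actually scrape the separate Blackbox instance.
    - target_label: __address__
      replacement: 127.0.0.1:9116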

If you go the extra distance to have some DNS lookups require specific results (instead of just 'some A record' or 'some MX record'), then you might need additional labels to let you write a more specific alert rule.

For us, both generated files would be relatively static. As a practical matter we don't add extra names to check or DNS servers to test against very often.

We could certainly write such a configuration file generation system and get more comprehensive coverage of our DNS zones and various nameservers than we currently have. However, my current view is that the extra complexity almost certainly wouldn't be worth it in terms of detecting problems and maintaining the system. We'd make more queries against more DNS servers if it was easier, as it would be with such a generation system, but those queries would almost never detect anything we didn't already know.

Options for diverting alerts in Prometheus

By: cks
26 March 2024 at 02:58

Suppose, not hypothetically, that you have a collection of machines and some machines are less important than others or are of interest only to a particular person. Alerts about normal machines should go to everyone; alerts about the special machines should go elsewhere. There are a number of options to set this up in Prometheus and Alertmanager, so today I want to run down a collection of them for my own future use.

First, you have to decide the approach you'll use in Alertmanager. One option is to specifically configure an early Alertmanager route that knows the names of these machines. This is the most self-contained option, but it has the drawback that Alertmanager routes can often intertwine in complicated ways that are hard to keep track of. For instance, you need to keep your separate notification routes for these machines in sync.

(I should write down in one place the ordering requirements for routes in our Alertmanager configuration, because several times I've made changes that didn't have the effect I wanted because I had the route in the wrong spot.)

The other Alertmanager option is to set up general label-based markers for alerts that should be diverted and rely on Prometheus to get the necessary label on to the alerts about these special machines. My view is that you're going to want to have such 'testing' alerts in general, so sooner or later you're going to wind up with this in your Alertmanager configuration.
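
In Alertmanager, the label-based version looks roughly like the following sketch; the 'send' label, its 'testing' value, and the receiver names are just hypothetical conventions here, not anything standard:

route:
  receiver: everyone
  routes:
    # Divert anything explicitly marked for testing before the
    # normal routes see it.
    - matchers:
        - send = "testing"
      receiver: testing-only

receivers:
  - name: everyone
    email_configs:
      - to: 'sysadmins@example.org'
  - name: testing-only
    email_configs:
      - to: 'one-person@example.org'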

Once Prometheus is responsible for labeling the specific alerts that should be diverted, you have some options:

  • The Prometheus alert rule can specifically add the appropriate label. This works great if it's a testing alert rule that you always want to divert, but less well if it's a general alert that you only want to divert some of the time.

  • You can arrange for metrics from the specific machines to have the special label values necessary. This has three problems. First, it creates additional metrics series if you change how a machine's alerts are handled. Second, it may require ugly contortions to pull some scrape targets out to different sections of a static file, so you can put different labels on them. And lastly, it's error-prone, because you have to make sure all of the scrape targets for the machine have the label on them.

    (You might even be doing special things in your alert rules to create alerts for the machine out of metrics that don't come from scraping it, which can require extra work to add labels to them.)

  • You can add the special label marker in Prometheus alert relabeling, by matching against your 'host' label and creating a new label. This will be something like:

    - source_labels: [host]
      regex: vmhost1
      target_label: send
      replacement: testing
    

    You'll likely want to do this at the end, or at least after any other alert label canonicalization you're doing to clean up host names, map service names to hosts, and so on.

Now that I've sat down and thought about all of these options, the one I think I like the best is alert relabeling. Alert relabeling in Prometheus puts this configuration in one central place, instead of spreading it out over scrape targets and alert rules, and it does so in a setting that doesn't have quite as many complex ordering issues as Alertmanager routes do.
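
In the Prometheus configuration file, this sort of alert relabeling lives in one spot under the 'alerting' section, next to where you point Prometheus at Alertmanager. A minimal sketch (the Alertmanager address is made up):

alerting:
  alert_relabel_configs:
    # Mark all alerts about vmhost1 so that Alertmanager diverts
    # them to the 'testing' destination.
    - source_labels: [host]
      regex: vmhost1
      target_label: send
      replacement: testing
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093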

(Adding labels in alert rules is still the right answer if the alert itself is in testing, in my view.)

The problem of using basic Prometheus to monitor DNS query results

By: cks
16 March 2024 at 02:37

Suppose that you want to make sure that your DNS servers are working correctly, for both your own zones and for outside DNS names that are important to you. If you have your own zones you may also care that outside people can properly resolve them, perhaps both within the organization and genuine outsiders using public DNS servers. The traditional answer to this is the Blackbox exporter, which can send the DNS queries of your choice to the DNS servers of your choice and validate the result. Well, more or less.

What you specifically do with the Blackbox exporter is that you configure some modules and then you provide those modules with targets to check (through your Prometheus configuration). When you're probing DNS, the module's configuration specifies all of the parameters of the DNS query and its validation. This means that if you are checking N different DNS names to see if they give you a SOA record (or an A record or an MX record), you need N different modules. Quite reasonably, the metrics Blackbox generates when you check a target don't (currently) include the DNS name being queried or the query type being made. Why this matters is that it makes it difficult to write a generic alert that will create a specific message that says 'asking for the X type of record for host Y failed'.

You can somewhat get around this by encoding this information into the names of your Blackbox modules and then doing various creative things in your Prometheus configuration. However, you still have to write all of the modules out, even though many of them may be basically cut and paste versions of each other with only the DNS names changed. This has a number of issues, including that it's a disincentive to doing relatively comprehensive cross checks. (I speak from experience with our Prometheus setup.)
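
To make the cut and paste nature concrete, here is a sketch of what two such modules wind up looking like (the module names and DNS names are made up); they differ only in the name being queried:

dns_soa_example_com:
  prober: dns
  dns:
    query_name: "example.com"
    query_type: "SOA"

dns_soa_example_org:
  prober: dns
  dns:
    query_name: "example.org"
    query_type: "SOA"

Multiply this by every name, query type, and validation variant you care about and the module list gets long fast.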

There is a third party dns_exporter that can be set up in a more flexible way where all parts of the DNS check can be provided by Prometheus (although it exposes some metrics that risk label cardinality explosions). However this still leaves you to list in your Prometheus configuration a cross-matrix of every DNS name you want to query and every DNS server you want to query against. What you'll avoid is needing to configure a bunch of Blackbox modules (although what you lose is the ability to verify that the queries returned specific results).

To do better, I think we'd need to write a custom program (perhaps run through the script exporter) that contained at least some of this knowledge, such as what DNS servers to check. Then our Prometheus configuration could just say 'check this DNS name against the usual servers' and the script would know the rest. Unfortunately you probably can't reuse any of the current Blackbox code for this, even if you wrote the core of this script in Go.

(You could make such a program relatively generic by having it take the list of DNS servers to query from a configuration file. You might want to make it support multiple lists of DNS servers, each of them named, and perhaps set various flags on each server, and you can get quite elaborate here if you want to.)

(This elaborates on a Fediverse post of mine.)

You might want to think about if your system serial numbers are sensitive

By: cks
15 March 2024 at 03:03

Recently, a commentator on my entry about what's lost when running the Prometheus host agent as a non-root user on Linux pointed out that if you do this, one of the things omitted (that I hadn't noticed) is part of the system DMI information. Specifically, you lose various serial numbers and the 'product UUID', which is potentially another unique identifier for the system, because Linux makes the /sys/class/dmi/id files with these readable only by root (this appears to have been the case since support for these was added to /sys in 2007). This got me thinking about whether serial numbers are something we should consider sensitive in general.

My tentative conclusion is that for us, serial numbers probably aren't sensitive enough to do anything special about. I don't think any of our system or component serial numbers can be used to issue one time license keys or the like, and while people could probably do some mischief with some of them, this is likely a low risk thing in our academic environment.

(Broadly we don't consider any metrics to be deeply sensitive, or to put it another way we wouldn't want to collect any metrics that are because in our environment it would take a lot of work to protect them. And we do collect DMI information and put it into our metrics system.)

This doesn't mean that serial numbers have no sensitivity even for us; I definitely do consider them something that I generally wouldn't (and don't) put in entries here, for example. Depending on the vendor, revealing serial numbers to the public may let the public do things like see your exact system configuration, when it was delivered, and other potentially somewhat sensitive information. There's also more of a risk that bored Internet people will engage in even minor mischief.

However, your situation is not necessarily like ours. There are probably plenty of environments where serial numbers are potentially more sensitive or more dangerous if exposed (especially if exposed widely). And in some environments, people run semi-hostile software that would love to get its hands on a permanent and unique identifier for the machine. Before you gather or expose serial number information (for systems or for things like disks), you might want to think about this.

At the same time, having relatively detailed hardware configuration information can be important, as in the war story that inspired me to start collecting this information in our metrics system. And serial numbers are a great way to disambiguate exactly which piece of hardware was being used for what, when. We deliberately collect disk drive serial number information from SMART, for example, and put it into our metrics system (sometimes with amusing results).

Why we should care about usage data for our internal services

By: cks
12 March 2024 at 02:47

I recently wrote about some practical-focused thoughts on usage data for your services. But there's a broader issue about usage data for services and having or not having it. My sense is that for a lot of sysadmins, building things to collect usage data feels like accounting work and likely to lead to unpleasant and damaging things, like internal chargebacks (which can create various problems, and also). However, I think we should strongly consider routinely gathering this data anyway, for fundamentally the same reasons as you should collect information on what TLS protocols and ciphers are being used by your people and software.

We periodically face decisions both obvious and subtle about what to do about services and the things they run on. Do we spend the money to buy new hardware, do we spend the time to upgrade the operating system or the version of the third party software, do we need to closely monitor this system or service, does it need to be optimized or be given better hardware, and so on. Conversely, maybe this is now a little-used service that can be scaled down, dropped, or simplified. In general, the big question is do we need to care about this service, and if so how much. High level usage data is what gives you most of the real answers.

(In some environments one fate for narrowly used services is to be made the responsibility of the people or groups who are the service's big users, instead of something that is provided on a larger and higher level.)

Your system and application metrics can provide you some basic information, like whether your systems are using CPU and memory and disk space, and perhaps how that usage is changing over a relatively long time base (if you keep metrics data long enough). But they can't really tell you why that is happening or not happening, or who is using your services, and deriving usage information from things like CPU utilization requires either knowing things about how your systems perform or assuming them (eg, assuming you can estimate service usage from CPU usage because you're sure it uses a visible amount of CPU time). Deliberately collecting actual usage gives you direct answers.

Knowing who is using your services and who is not also gives you the opportunity to talk to both groups about what they like about your current services, what they'd like you to add, what pieces of your service they care about, what they need, and perhaps what's keeping them from using some of your services. If you don't have usage data and don't actually ask people, you're flying relatively blind on all of these questions.

Of course collecting usage data has its traps. One of them is that what usage data you collect is often driven by what sort of usage you think matters, and in turn this can be driven by how you expect people to use your services and what you think they care about. Or to put it another way, you're measuring what you assume matters and you're assuming what you don't measure doesn't matter. You may be wrong about that, which is one reason why talking to people periodically is useful.

PS: In theory, gathering usage data is separate from the question of whether you should pay attention to it, where the answer may well be that you should ignore that shiny new data. In practice, well, people are bad at staying away from shiny things. Perhaps it's not a bad thing to have your usage data require some effort to assemble.

(This is partly written to persuade myself of this, because maybe we want to routinely collect and track more usage data than we currently do.)

Some thoughts on usage data for your systems and services

By: cks
10 March 2024 at 03:10

Some day, you may be called on by decision makers (including yourself) to provide some sort of usage information for things you operate so that you can make decisions about them. I'm not talking about system metrics such as how much CPU is being used (although for some systems that may be part of higher level usage information, for example for our SLURM cluster); this is more on the level of how much things are being used, by who, and perhaps for what. In the very old days we might have called this 'accounting data' (and perhaps disdained collecting it unless we were forced to by things like chargeback policies).

In an ideal world, you will already be generating and retaining the sort of usage information that can be used to make decisions about services. But internal services aren't necessarily automatically instrumented the way revenue generating things are, so you may not have this sort of thing built in from the start. In this case, you'll generally wind up hunting around for creative ways to generate higher level usage information from low level metrics and logs that you do have. When you do this, my first suggestion is write down how you generated your usage information. This probably won't be the last time you need to generate usage information, and also if decision makers (including you in the future) have questions about exactly what your numbers mean, you can go back to look at exactly how you generated them to provide answers.

(Of course, your systems may have changed around by the next time you need to generate usage information, so your old ways don't work or aren't applicable. But at least you'll have something.)

My second suggestion is to look around today to see if there's data you can easily collect and retain now that will let you provide better usage information in the future. This is obviously related to keeping your logs longer, but it also includes making sure that things make it to your logs (or at least to your retained logs, which may mean setting things to send their log data to syslog instead of keeping their own log files). At this point I will sing the praises of things like 'end of session' summary log records that put all of the information about a session in a single place instead of forcing you to put the information together from multiple log lines.

(When you've just been through the exercise of generating usage data is an especially good time to do this, because you'll be familiar with all of the bits that were troublesome or where you could only provide limited data.)

Of course there are privacy implications of retaining lots of logs and usage data. This may be a good time to ask around to get advance agreement on what sort of usage information you want to be able to provide and what sort you definitely don't want to have available for people to ask for. This is also another use for arranging to log your own 'end of session' summary records, because if you're doing it yourself you can arrange to include only the usage information you've decided is okay.

Options for your Grafana panels when your metrics change names

By: cks
2 March 2024 at 04:33

In an ideal world, your metrics never change their names; once you put them into a Grafana dashboard panel, they keep the same name and meaning forever. In the real world, sometimes a change in metric name is forced on you, for example because you might have to move from collecting a metric through one Prometheus exporter to collecting it with another exporter which naturally gives it a different name. And sometimes a metric will be renamed by its source.

In a Prometheus environment, the very brute force way to deal with this is either a recording rule (creating a duplicate metric with the old name) or renaming the metric during ingestion. However I feel that this is generally a mistake. Almost always, your Prometheus metrics should record the true state of affairs, warts and all, and it should be on other things to sort out the results.
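
For completeness, the recording rule version of this brute force approach is only a few lines (the metric names here are made up); the ingestion-time rename is similar in spirit but done with metric relabeling in the scrape configuration:

groups:
  - name: metric-renames
    rules:
      # Recreate the metric under its old name from the new name
      # so that existing dashboard queries keep getting data.
      - record: old_metric_name
        expr: new_metric_name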

(As part of this, I feel that Prometheus metric names should always be honest about where they come from. There's a convention that the name of the exporter is at the start of the metric name, and so you shouldn't generate your own metrics with someone else's name on them. If a metric name starts with 'node_*', it should come from the Prometheus host agent.)

So if your Prometheus metrics get renamed, you need to fix this in your Grafana panels (which can be a pain but is better in the long run). There are at least three approaches I know of. First, you can simply change the name of the metric in all of the panels. This keeps things simple but means that your historical data stops being visible on the dashboards. If you don't keep historical data for very long (or don't care about it much), this is fine; pretty soon the metric's new name will be the only one in your metrics database. In our case, we keep years of data and do want to be able to look back, so this isn't good enough.

The second option is to write your queries in Grafana as basically 'old_name or new_name'. If your queries involve rate() and avg() and other functions, this can be a lot of (manual) repetition, but if you're careful and lucky you can arrange for the old and the new query results to have the same labels as Grafana sees them, so your panel graphs will be continuous over the metrics name boundary.
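
As a sketch of what such a Grafana panel query looks like (the metric and job names are made up), you write both versions and glue them together with 'or':

rate(old_metric_name{job="myapp"}[5m])
  or
rate(new_metric_name{job="myapp"}[5m])

If you manage to get both sides to come out with the same label set (here rate() drops the differing metric names), the graph stays continuous across the rename.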

The third option is to duplicate the query and then change the name of the metric (or the metrics) in the new copy of the query. This is usually straightforward and easy, but it definitely gives you graphs that aren't continuous around the name change boundary. The graphs will have one line for the old metric and then a new second line for your new metric. One advantage of separate queries is that you can someday turn the old query off in Grafana without having to delete it.

Detecting absent Prometheus metrics without knowing their labels

By: cks
29 February 2024 at 03:18

When you have a Prometheus setup, one of the things you sooner or later worry about is important metrics quietly going missing because they're not being reported any more. There can be many reasons for metrics disappearing on you; for example, a network interface you expect to be at 10G speeds may not be there at all any more, because it got renamed at some point, so now you're not making sure the new name is at 10G.

(This happened to us with one machine's network interface, although I'm not sure exactly how except that it involves the depths of PCIe enumeration.)

The standard Prometheus feature for this is the 'absent()' function, or sometimes absent_over_time(). However, both of these have the problem that because of Prometheus's data model, you need to know at least some unique labels that your metrics are supposed to have. Without labels, all you can detect is the total disappearance of the metric, when nothing at all is reporting it. If you want to be alerted when some machine stops reporting a metric, you need to list all of the sources that should have the metric (following a pattern we've seen before):

absent(metric{host="a", device="em0"}) or
 absent(metric{host="b", device="eno1"}) or
 absent(metric{host="c", device="eth2"})

Sometimes you don't know all of the label values that your metric will be present with (or it's tedious to list all of them and keep them up to date), and it's good enough to get a notification if a metric disappears when it was previously there (for a particular set of labels). For example, you might have an assortment of scripts that report their success results somewhere and you don't want to have to keep a list of all of the scripts, but you do want to detect when a script stops reporting its metrics. In this case we can use 'offset' to check current metrics against old metrics. The simplest pattern is:

your_metric offset 1h
  unless your_metric

If the metric was there an hour ago and isn't there now, this will generate the metric as it was an hour ago (with the labels it had then), and you can use that to drive an alert (or at least a notification). If there are labels that might naturally change over time in your_metric, you can exclude them with 'unless ignoring (...)' or use 'unless on (...)' for a very focused result.
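
For example (the 'app_version' label here is a made up stand-in for whatever label churns in your environment):

your_metric offset 1h
  unless ignoring (app_version) your_metric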

As written this has the drawback that it only looks at what versions of the metric were there exactly an hour ago. We can do better by using an *_over_time() function, for example:

max_over_time( your_metric[4h] ) offset 1h
  unless your_metric

Now if your metric existed (with some labels) at any point between five hours ago and one hour ago, and doesn't exist now, this expression will give you a result and you can alert on that. Since we're using *_over_time(), you can also leave off the 'offset 1h' and just extend the time range, and then maybe extend the other time range too:

max_over_time( your_metric[12h] )
  unless max_over_time( your_metric[20m] )

This expression will give you a result if your_metric has been present (with a given set of labels) at some point in the last 12 hours but has not been present within the last 20 minutes.

(You'd pick the particular *_over_time() function to use depending on what use, if any, you have for the value of the metric in your alert. If you have no particular use for the value (or you expect the value to be a constant), either max or min are efficient for Prometheus to compute.)
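
Wrapped up as an alert rule, this looks something like the following sketch (the alert name and annotation wording are made up):

- alert: MetricWentMissing
  expr: max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
  annotations:
    summary: "your_metric from {{ $labels.instance }} has not been reported for at least 20 minutes"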

All of these clever versions have a drawback, which is that after enough time has gone by they shut off on their own. Once the metric has been missing for at least an hour or five hours or 12 hours or however long, even the first part of the expression has nothing and you get no results and no alert. So this is more of a 'notification' than a persistent 'alert'. That's unfortunately the best you can really do. If you need a persistent alert that will last until you take it out of your alert rules, you need to use absent() and explicitly specify the labels you expect and require.

Our probably-typical (lack of) machine inventory situation

By: cks
28 February 2024 at 04:03

As part of thinking about how we configure machines to monitor and what to monitor on them, I mentioned in passing that we don't generate this information from some central machine inventory because we don't have a single source of truth for a machine inventory. This isn't to say that we don't have any inventory of our machines; instead, the problem is that we have too many inventories, each serving somewhat different purposes.

The core reason that we have wound up with many different lists of machines is that we use many different tools and systems that need to have lists of machines, and each of them has a different input format and input sources. It's technically possible to generate all of these different lists of machines for different programs and tools from some single master source, but by and large you get to build, manage, and maintain both the software for the master source and the software to extract and reformat all of the machine lists for the various programs that need them. In many cases (certainly in ours), this adds extra work over just maintaining N lists of machines for N programs and subsystems.

(It also generally means maintaining a bespoke custom system for your environment, which is a constant ongoing expense in various ways.)

So we have all sorts of lists of machines, for a broad view of what a machine is. Here's an incomplete list:

  • DNS entries (all of our servers have static IPs), but not all DNS entries still exist as hardware, much less hardware that is turned on. In addition, we have DNS entries for various IP aliases and other things that aren't unique machines.

    (We'd have more confusion if we used virtual machines, but all of our production machines are on physical hardware.)

  • NFS export permissions for hosts that can do NFS mounts from our fileservers, but not all of our active machines can do this and there are some listed host names that are no longer turned on or perhaps even still in DNS.

    (NFS export permissions aren't uniform between hosts; some have extra privileges.)

  • Hosts that we have established SSH host keys for. This includes hosts that aren't currently in service and may never be in service again.

  • Ubuntu machines that are updated by our bulk updates system, which is driven by another 'list of machines' file that is also used for some other bulk operations. But this data file omits various machines we don't manage that way (or at best only belatedly includes them), and while it tracks some machine characteristics it doesn't have all of them.

    (And sometimes we forget to add machines to this data file, which we at least get a notification about. Well, for Ubuntu machines.)

  • Unix machines that we monitor in various ways in our Prometheus system. These machines may be ping'd, have their SSH port checked to see if it answers, run the Prometheus host agent, and run additional agents to export things like GPU metrics, depending on what the machine is.

    Not all turned-on machines are monitored by Prometheus for various reasons, including that they are test or experimental machines. And temporarily turned off machines tend to be temporarily removed to reduce alert and dashboard noise.

  • Our console server has a whole configuration file of what machines have a serial console and how they're configured and connected up. Turned-off machines that are still connected to the console server remain in this configuration file, and they can then linger even after being de-cabled.

  • We mostly use 'smart' PDUs that can selectively turn outlets off, which means that we track what machine is on what PDU port. This is tracked both in a master file and in the PDU configurations (they have menus that give text labels to ports).

  • A 'server inventory' of where servers are physically located and other basic information about the server hardware, generally including a serial number. Not all racked physical servers are powered on, and not all powered on servers are in production.

  • Some degree of network maps, to track what servers are connected to what switches for troubleshooting purposes.

  • Various forms of server purchase records with details about the physical hardware, including serial numbers, which we have to keep in order to be able to get rid of the hardware later. This doesn't include the current host name (if any) that the hardware is currently being used for, or where the hardware is (currently) located.

If we assigned IPs to servers through DHCP, we'd also have DHCP configuration files. These would have to track servers by another identity, their Ethernet address, which would in turn depend on what networking the server was using. If we switched a server from 1G networking to 10G networking by putting a 10G card in it, we'd have to change the DHCP MAC information for the server but nothing else about it would change.

There's also confusion over what exactly 'a machine' is, partly because different pieces care about different aspects. We assign DNS host names to roles, not to physical hardware, but the role is implemented in some chunk of physical hardware and sometimes the details of that hardware matter. This leads to more potential confusion in physical hardware inventories, because sometimes we want to track that a particular piece of hardware was 'the old <X>' in case we have to fall back to that older OS for some reason.

(And sometimes we have pre-racked spare hardware for some important role and so what hardware is live in that role and what is the spare can swap around.)

We could put all of this information in a single database (probably in multiple tables) and then try to derive all of the various configuration files from it. But it clearly wouldn't be simple (and some of it would always have to be manually maintained, such as the physical location of hardware). If there is off the shelf open source software that will do a good job of handling this, it's quite likely that setting it up (and setting up our inventory schema) would be fairly complex.

Instead, the natural thing to do in our environment when you need a new list of machines for some purpose (for example, when you're setting up a new monitoring system) is to set up a new configuration file for it, possibly deriving the list of machines from another, existing source. This is especially natural if the tool you're working with already has its own configuration file format.

(If our lists of machines had to change a lot it might be tempting to automatically derive some of the configuration files from 'upstream' data. But generally they don't, which means that manual handling is less work because you don't have to build an entire system to handle errors, special exceptions, and so on.)

A recent abrupt change in Internet SSH brute force attacks against us

By: cks
23 February 2024 at 04:00

It's general wisdom in the sysadmin community that if you expose a SSH port to the Internet, people will show up to poke at it, and by 'people' I mean 'attackers that are probably mostly automated'. For several years, the pattern to this that I've noticed was an apparent combination of two activities. There was a constant background pitter-patter of various IPs each making a probe once a minute or less (but for tens of minutes or longer), and then periodic bursts where a single IP would be more active, sometimes significantly so.

(Although I can't be sure, I think the rate of both the background probes and the periodic bursts was significantly up compared to how it was a couple of years ago. Unfortunately making direct comparisons is a bit difficult due to Grafana Loki issues.)

Then there came this past Tuesday, and I noticed something that I reported on the Fediverse:

This is my system administrator's "what is wrong" face when Internet ssh authentication probes against our systems seem to have fallen off a cliff, as reported by system logs. We shouldn't be seeing only two in the last hour.

(The nose dive seems to have started at 6:30 am Eastern and hit 'basically nothing' by 9:30 am.)

After looking at this longer, the pattern I'm now seeing on our systems is basically that the background low-volume probes seem to have gone away. Every so often some attacker will fire up a serious bulk probe, making (for example) 400 attempts over half an hour (often for a random assortment of nonexistent logins); rarely there will be a burst where a dozen IPs will each make an attempt or two and then stop (there are some signs that a lot of the IPs are Tor exit nodes). But for a lot of the time, there's nothing. We can go an hour or three with absolutely no probes at all, which never used to happen; previously a typical baseline rate of probes was around a hundred an hour.

Since the higher-rate SSH probes get through fine, this doesn't seem to be anything in our firewalls or local configurations (I initially wondered about things like a change in logging that came in with an Ubuntu package update). Instead it seems to be a change in attacker behavior, and since it took about two hours to take full effect on Tuesday morning, I wonder if it was something getting progressively shut down or reoriented.
