
Public Figures Keep Leaving Their Venmo Accounts Public

By: Nick Heer
27 March 2025 at 04:00

The high-test idiocy of a senior U.S. politician inviting a journalist to an off-the-record chat planning an attack on Yemen, killing over thirty people and continuing a decade of war, seems to have popularized a genre of journalism dedicated to the administration’s poor digital security hygiene. Some of these articles feel less substantial; others suggest greater crimes. One story feels like deja vu.

Dhruv Mehrotra and Tim Marchman, Wired:

The Venmo account under [Mike] Waltz’s name includes a 328-person friend list. Among them are accounts sharing the names of people closely associated with Waltz, such as [Walker] Barrett, formerly Waltz’s deputy chief of staff when Waltz was a member of the House of Representatives, and Micah Thomas Ketchel, former chief of staff to Waltz and currently a senior adviser to Waltz and President Donald Trump.

[…]

One of the most notable appears to belong to [Susie] Wiles, one of Trump’s most trusted political advisers. That account’s 182-person friend list includes accounts sharing the names of influential figures like Pam Bondi, the US attorney general, and Hope Hicks, Trump’s former White House communications director.

In 2021, reporters for BuzzFeed News found Joe Biden’s Venmo account and his contacts. Last summer, the same Wired reporters plus Andrew Couts found J.D. Vance’s and, in February, reporters for the American Prospect found Pete Hegseth’s. It remains a mystery to me why one of the most popular U.S. payment apps is this public.


In universities, sometimes simple questions aren't simple

By: cks
29 March 2025 at 02:13

Over on the Fediverse I shared a recent learning experience:

Me, an innocent: "So, how many professors are there in our university department?"
Admin person with a thousand-yard stare: "Well, it depends on what you mean by 'professor', 'in', and 'department'." <unfolds large and complicated chart>

In many companies and other organizations, the status of people is usually straightforward. In a university, things are quite often not so clear, and in my department all three words in my joke are in fact not a joke (although you could argue that two overlap).

For 'professor', there are a whole collection of potential statuses beyond 'tenured or tenure stream'. Professors may be officially retired but still dropping by to some degree ('emeritus'), appointed only for a limited period (but doing research, not just teaching), hired as sessional instructors for teaching, given a 'status-only' appointment, and other possible situations.

(In my university, there's such a thing as teaching stream faculty, who are entirely distinct from sessional instructors. In other universities, all professors are what we here would call 'research stream' professors and do research work as well as teaching.)

For 'in', even once you have a regular full time tenure stream professor, there's a wide range of possibilities for a professor to be cross appointed (also) between departments (or sometimes 'partially appointed' by two departments). These sort of multi-department appointments are done for many reasons, including to enable a professor in one department to supervise graduate students in another one. How much of the professor's salary each department pays varies, as does where the professor actually does their research and what facilities they use in each department.

(Sometimes a multi-department professor will be quite active in both departments because their core research is cross-disciplinary, for example.)

For 'department', this is a local peculiarity in my university. We have three campuses, and professors are normally associated with a specific campus. Depending on how you define 'the department', you might or might not consider Computer Science professors at the satellite campuses to be part of the (main campus) department. Sometimes it depends on what the professors opt to do, for example whether or not they will use our main research computing facilities, or whether they'll be supervising graduate students located at our main campus.

Which answers you want for all of these depends on what you're going to use the resulting number (or numbers) for. There is no singular and correct answer for 'how many professors are there in the department'. The corollary to this is that any time we're asked how many professors are in our department, we have to quiz the people asking about what parts matter to them (or guess, or give complicated and conditional answers, or all of the above).

(Asking 'how many professor FTEs do we have' isn't any better.)

PS: If you think this complicates the life of any computer IAM system that's trying to be a comprehensive source of answers, you would be correct. Locally, my group doesn't even attempt to track these complexities and instead has a much simpler view of things that works well enough for our purposes (mostly managing Unix accounts).

OIDC claim scopes and their interactions with OIDC token authentication

By: cks
17 March 2025 at 02:31

When I wrote about how SAML and OIDC differed in sharing information, where SAML shares every SAML 'attribute' by default and OIDC has 'scopes' for its 'claims', I said that the SAML approach was probably easier within an organization, where you already have trust in the clients. It turns out that there's an important exception to this I didn't realize at the time, and that's when programs (like mail clients) are using tokens to authenticate to servers (like IMAP servers).

In OIDC/OAuth2 (and probably in SAML as well), programs that obtain tokens can open them up and see all of the information that they contain, either inspecting them directly or using a public OIDC endpoint that allows them to 'introspect' the token for additional information (this is the same endpoint that will be used by your IMAP server or whatever). Unless you enjoy making a bespoke collection of (for example) IMAP clients, the information that programs need to obtain tokens is going to be more or less public within your organization and will probably (or even necessarily) leak outside of it.
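When the token is in JWT form, this inspection requires no special access at all; the payload is just base64url-encoded JSON. A minimal sketch (no library calls, no signature verification, which is exactly the point — anyone holding the token can do this):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the payload of a JWT-format token without verifying its
    signature.  The middle dot-separated segment is base64url-encoded
    JSON, so every claim in it is readable by whoever holds the token."""
    payload_b64 = token.split(".")[1]
    # base64url strips padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

(If your IdP issues opaque tokens instead of JWTs, the same information comes back from the introspection endpoint mentioned above.)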

(For example, you can readily discover all of the OIDC client IDs used by Thunderbird for the various large providers it supports. There's nothing stopping you from using those client IDs and client secrets yourself, although large providers may require your target to have specifically approved using Thunderbird with your target's accounts.)

This means that anyone who can persuade your people to authenticate through a program's usual flow can probably extract all of the information available in the token. They can do this either on the person's computer (capturing the token locally) or by persuading people that they need to 'authenticate to this service with IMAP OAuth2' or the like and then extracting the information from the token.

In the SAML world, this will by default be all of the information contained in the token. In the OIDC world, you can restrict the information made available through tokens issued through programs by restricting the scopes that you allow programs to ask for (and possibly different scopes for different programs, although this is a bit fragile; attackers may get to choose which program's client ID and so on they use).

(Realizing this is going to change what scopes we allow in our OIDC IdP for program client registrations. So far I had reflexively been giving them access to everything, just like our internal websites; now I think I'm going to narrow it down to almost nothing.)

Sidebar: How your token-consuming server knows what created them

When your server verifies OAuth2/OIDC tokens presented to it, the minimum thing you want to know is that they come from the expected OIDC identity provider, which is normally achieved automatically because you'll ask that OIDC IdP to verify that the token is good. However, you may also want to know that the token was specifically issued for use with your server, or through a program that's expected to be used for your server. The normal way to do this is through the 'aud' OIDC claim, which has at least the client ID (and in theory your OIDC IdP could add additional entries). If your OIDC IdP can issue tokens through multiple identities (perhaps to multiple parties, such as the major IdPs of, for example, Google and Microsoft), you may also want to verify the 'iss' (issuer) field instead or in addition to 'aud'.
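After signature verification, the claim checks themselves are simple. A sketch, with hypothetical issuer and client ID values (the 'aud' claim may legitimately be either a single string or a list, so both shapes need handling):

```python
# Hypothetical values for illustration; in reality these come from your
# IdP's configuration and your server's client registration.
EXPECTED_ISSUER = "https://idp.example.org"
EXPECTED_AUDIENCE = "imap-server-client-id"

def check_token_claims(claims: dict) -> bool:
    """Check that an already signature-verified token was issued by our
    IdP ('iss') and was intended for this server ('aud')."""
    if claims.get("iss") != EXPECTED_ISSUER:
        return False
    aud = claims.get("aud")
    if isinstance(aud, str):
        aud = [aud]
    return EXPECTED_AUDIENCE in (aud or [])
```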

Some notes on the OpenID Connect (OIDC) 'redirect uri'

By: cks
16 March 2025 at 02:57

The normal authentication process for OIDC is web-based and involves a series of HTTP redirects, interspersed with web pages that you interact with. Something that wants to authenticate you will redirect you to the OIDC identity server's website, which will ask you for your login and password and maybe MFA authentication, check them, and then HTTP redirect you back to a 'callback' or 'redirect' URL that will transfer a magic code from the OIDC server to the OIDC client (generally as a URL query parameter). All of this happens in your browser, which means that the OIDC client and server don't need to be able to directly talk to each other, allowing you to use an external cloud/SaaS OIDC IdP to authenticate to a high-security internal website that isn't reachable from the outside world and maybe isn't allowed to make random outgoing HTTP connections.

(The magic code transferred in the final HTTP redirect is apparently often not the authentication token itself but instead something the client can use for a short time to obtain the real authentication token. This does require the client to be able to make an outgoing HTTP connection, which is usually okay.)
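That exchange is a form-encoded POST to the IdP's token endpoint. A sketch of building the request body (the parameter names come from the OAuth2 spec; the actual endpoint URL comes from your IdP's discovery document):

```python
from urllib.parse import urlencode

def token_request_body(code: str, client_id: str, client_secret: str,
                       redirect_uri: str) -> str:
    """Form-encoded body a client POSTs to the IdP's token endpoint to
    trade the short-lived authorization code for the real token."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        # The redirect uri is repeated here so the IdP can cross-check
        # it against the one used in the original authorization request.
        "redirect_uri": redirect_uri,
    })
```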

When the OIDC client initiates the HTTP redirection to the OIDC IdP server, one of the parameters it passes along is the 'redirect uri' it wants the OIDC server to use to pass the magic code back to it. A malicious client (or something that's gotten a client's ID and secret) could do some mischief by manipulating this redirect URL, so the standard specifically requires that OIDC IdP have a list of allowed redirect uris for each registered client. The standard also says that in theory, the client's provided redirect uri and the configured redirect uris are compared as literal string values. So, for example, 'https://example.org/callback' doesn't match 'https://example.org/callback/'.
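In code, this literal matching is as blunt as it sounds; there's no URL normalization, so a trailing slash or a different port is simply a different string:

```python
# The per-client allowed list configured in the IdP.
ALLOWED_REDIRECT_URIS = {"https://example.org/callback"}

def redirect_uri_allowed(uri: str) -> bool:
    """Standard-compliant check: literal string equality against the
    registered redirect uris, with no normalization whatsoever."""
    return uri in ALLOWED_REDIRECT_URIS
```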

This is straightforward when it comes to websites as OIDC clients, since they should have well defined callback urls that you can configure directly into your OIDC IdP when you set up each of them. It gets more hairy when what you're dealing with is programs as OIDC clients, where they are (for example) trying to get an OIDC token so they can authenticate to your IMAP server with OAuth2, since these programs don't normally have a website. Historically, there are several approaches that people have taken for programs (or seem to have, based on my reading so far).

Very early on in OAuth2's history, people apparently defined the special redirect uri value 'urn:ietf:wg:oauth:2.0:oob' (which is now hard to find or identify documentation on). An OAuth2 IdP that saw this redirect uri (and maybe had it allowed for the client) was supposed to not redirect you but instead show you an HTML page with the magic OIDC code displayed on it, so you could copy and paste the code into your local program. This value is now obsolete but it may still be accepted by some IdPs (you can find it listed for Google in mutt_oauth2.py, and I spotted an OIDC IdP server that handles it).

Another option is that the IdP can provide an actual website that does the same thing; if you get HTTP redirected to it with a valid code, it will show you the code on an HTML page and you can copy and paste it. Based on mutt_oauth2.py again, it appears that Microsoft may have at one point done this, using https://login.microsoftonline.com/common/oauth2/nativeclient as the page. You can do this too with your own IdP (or your own website in general), although it's not recommended for all sorts of reasons.

The final broad approach is to use 'localhost' as the target host for the redirect. There are several ways to make this work, and one of them runs into complications with the IdP's redirect uri handling.

The obvious general approach is for your program to run a little HTTP server that listens on some port on localhost, and capture the code when the (local) browser gets the HTTP redirect to localhost and visits the server. The problem here is that you can't necessarily listen on port 80, so your redirect uri needs to include the port you're listening on (e.g. 'http://localhost:7000'), and if your OIDC IdP is following the standard it must be configured not just with 'http://localhost' as the allowed redirect uri but the specific port you'll use. Also, because of string matching, if the OIDC IdP lists 'http://localhost:7000', you can't send 'http://localhost:7000/' despite them being the same URL.
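A sketch of that little localhost server, using only the standard library (port 7000 here is an arbitrary choice; whatever you pick must match the redirect uri registered with the IdP):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional
from urllib.parse import parse_qs, urlparse

def extract_code(path: str) -> Optional[str]:
    """Pull the 'code' query parameter out of the redirect request path."""
    params = parse_qs(urlparse(path).query)
    return params.get("code", [None])[0]

class CallbackHandler(BaseHTTPRequestHandler):
    """One-shot handler for the IdP's redirect to http://localhost:PORT."""
    def do_GET(self):
        self.server.oidc_code = extract_code(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"You can close this window now.\n")

def wait_for_code(port: int = 7000) -> Optional[str]:
    # The redirect uri registered with the IdP must then be exactly
    # 'http://localhost:7000' (this port, no trailing slash).
    with HTTPServer(("localhost", port), CallbackHandler) as srv:
        srv.handle_request()   # block for the single redirect, then stop
        return getattr(srv, "oidc_code", None)
```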

(And your program has to use 'localhost', not '127.0.0.1' or the IPv6 loopback address; although they all have the same effect, they're obviously not string-identical.)

Based on experimental evidence from OIDC/OAuth2 client configurations, I strongly suspect that some large IdP providers have non-standard, relaxed handling of 'localhost' redirect uris such that their client configuration lists 'http://localhost' and the IdP will accept some random port glued on in the actual redirect uri (or maybe this behavior has been standardized now). I suspect that the IdPs may also accept the trailing slash case. Honestly, it's hard to see how you get out of this if you want to handle real client programs out in the wild.

(Some OIDC IdP software definitely does the standard compliant string comparison. The one I know of for sure is SimpleSAMLphp's OIDC module. Meanwhile, based on reading the source code, Dex uses a relaxed matching for localhost in its matching function, provided that there are no redirect uris registered for the client. Dex also still accepts the urn:ietf:wg:oauth:2.0:oob redirect uri, so I suspect that there are still uses of it out in the field.)

If the program has its own embedded web browser that it's in full control of, it can do what Thunderbird appears to do (based on reading its source code). As far as I can tell, Thunderbird doesn't run a local listening server; instead it intercepts the HTTP redirection to 'http://localhost' itself. When the IdP sends the final HTTP redirect to localhost with the code embedded in the URL, Thunderbird effectively just grabs the code from the redirect URL in the HTTP reply and never actually issues a HTTP request to the redirect target.

The final option is to not run a localhost HTTP server and to tell people running your program that when their browser gives them an 'unable to connect' error at the end of the OIDC authentication process, they need to go to the URL bar and copy the 'code' query parameter into the program (or if you're being friendly, let them copy and paste the entire URL and you extract the code parameter). This allows your program to use a fixed redirect uri, including just 'http://localhost', because it doesn't have to be able to listen on it or on any fixed port.

(This is effectively a more secure but less user friendly version of the old 'copy a code that the website displayed' OAuth2 approach, and that approach wasn't all that user friendly to start with.)

PS: An OIDC redirect uri apparently allows things other than http:// and https:// URLs; there is, for example, the 'openid-credential-offer' scheme. I believe that the OIDC IdP doesn't particularly do anything with those redirect uris other than accept them and issue a HTTP redirect to them with the appropriate code attached. It's up to your local program or system to intercept HTTP requests for those schemes and react appropriately, much like Thunderbird does, but perhaps easier because you can probably register the program as handling all 'whatever-special://' URLs so the redirect is automatically handed off to it.

(I suspect that there are more complexities in the whole OIDC and OAuth2 redirect uri area, since I'm new to the whole thing.)

The commodification of desktop GUI behavior

By: cks
13 March 2025 at 03:08

Over on the Fediverse, I tried out a thesis:

Thesis: most desktop GUIs are not opinionated about how you interact with things, and this is why there are so many GUI toolkits and they make so little difference to programs, and also why the browser is a perfectly good cross-platform GUI (and why cross-platform GUIs in general).

Some GUIs are quite opinionated (eg Plan 9's Acme) but most are basically the same. Which isn't necessarily a bad thing but it creates a sameness.

(Custom GUIs are good for frequent users, bad for occasional ones.)

Desktop GUIs differ in how they look and to some extent in how you do certain things and how you expect 'native' programs to behave; I'm sure the fans of any particular platform can tell you all about little behaviors that they expect from native applications that imported ones lack. But I think we've pretty much converged on a set of fundamental behaviors for how to interact with GUI programs, or at least how to deal with basic ones, so in a lot of cases the question about GUIs is how things look, not how you do things at all.

(Complex programs have for some time been coming up with their own bespoke alternatives to, for example, huge cascades of menus. If these are successful they tend to get more broadly adopted by programs facing the same problems; consider the 'ribbon', which got what could be called a somewhat mixed reaction on its modern introduction.)

On the desktop, changing the GUI toolkit that a program uses (either on the same platform or on a different one) may require changing the structure of your code (in addition to ordinary code changes), but it probably won't change how your program operates. Things will look a bit different, maybe some standard platform features will appear or disappear, but it's not a completely different experience. This often includes moving your application from the desktop into the browser (a popular and useful 'cross-platform' environment in itself).

This is less true on mobile platforms, where my sense is that the two dominant platforms have evolved somewhat different idioms for how you interact with applications. A proper 'native' application behaves differently on the two platforms even if it's using mostly the same code base.

GUIs such as Plan 9's Acme show that this doesn't have to be the case; for that matter, so does GNU Emacs. GNU Emacs has a vague shell of a standard looking GUI but it's a thin layer over a much different and stranger vastness, and I believe that experienced Emacs people do very little interaction with it.

How SAML and OIDC differ in sharing information, and perhaps why

By: cks
9 March 2025 at 04:39

In practice, SAML and OIDC are two ways of doing third party web-based authentication (and thus a Single Sign On (SSO)) system; the web site you want to use sends you off to a SAML or OIDC server to authenticate, and then the server sends authentication information back to the 'client' web site. Both protocols send additional information about you along with the bare fact of an authentication, but they differ in how they do this.

In SAML, the SAML server sends a collection of 'attributes' back to the SAML client. There are some standard SAML attributes that client websites will expect, but the server is free to throw in any other attributes it feels like, and I believe that servers do things like turn every LDAP attribute they get from a LDAP user lookup into a SAML attribute (certainly SimpleSAMLphp does this). As far as I know, any filtering of what SAML attributes are provided by the server to any particular client is a server side feature, and SAML clients don't necessarily have any way of telling the SAML server what attributes they want or don't want.

In OIDC, the equivalent way of returning information is 'claims', which are grouped into 'scopes', along with basic claims that you get without asking for a scope. The expectation in OIDC is that clients that want more than the basic claims will request specific scopes and then get back (only) the claims for those scopes. There are standard scopes with standard claims (not all of which are necessarily returned by any given OIDC server). If you want to add additional information in the form of more claims, I believe that it's generally expected that you'll create one or more custom scopes for those claims and then have your OIDC clients request them (although not all OIDC clients are willing and able to handle custom scopes).
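The scope request happens in the client's initial authorization redirect; the 'scope' parameter is a space-separated list, and 'openid' itself is mandatory. A sketch with a hypothetical IdP endpoint and a made-up custom scope:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real client gets this from the IdP's
# discovery document.
AUTH_ENDPOINT = "https://idp.example.org/authorize"

def authorization_url(client_id: str, redirect_uri: str, scopes) -> str:
    """Build an OIDC authorization request asking for specific scopes;
    the IdP will return (only) the claims belonging to granted scopes."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        # 'openid' is required; extra scopes (standard or custom, like a
        # hypothetical 'our-ldap') are space-separated after it.
        "scope": " ".join(["openid"] + list(scopes)),
    })
    return AUTH_ENDPOINT + "?" + query
```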

(I think in theory an OIDC server may be free to shove whatever claims it wants to into information for clients regardless of what scopes the client requested, but an OIDC client may ignore any information it didn't request and doesn't understand rather than pass it through to other software.)

The SAML approach is more convenient for server and client administrators who are working within the same organization. The server administrator can add whatever information to SAML responses that's useful and convenient, and SAML clients will generally automatically pick it up and often make it available to other software. The OIDC approach is less convenient, since you need to create one or more additional scopes on the server and define what claims go in them, and then get your OIDC clients to request the new scopes; if an OIDC client doesn't update, it doesn't get the new information. However, the OIDC approach makes it easier for both clients and servers to be more selective and thus potentially for people to control how much information they give to whom. An OIDC client can ask for only minimal information by only asking for a basic scope (such as 'email') and then the OIDC server can tell the person exactly what information they're approving being passed to the client, without the OIDC server administrators having to get involved to add client-specific attribute filtering.

(In practice, OIDC probably also encourages giving less information to even trusted clients in general since you have to go through these extra steps, so you're less likely to do things like expose all LDAP information as OIDC claims in some new 'our-ldap' scope or the like.)

My guess is that OIDC was deliberately designed this way partly in order to make it better for use with third party clients. Within an organization, SAML's broad sharing of information may make sense, but it makes much less sense in a cross-organization context, where you may be using OIDC-based 'sign in with <large provider>' on some unrelated website. In that sort of case, you certainly don't want that website to get every scrap of information that the large provider has on you, but instead only ask for (and get) what it needs, and for it to not get much by default.

The OpenID Connect (OIDC) 'sub' claim is surprisingly load-bearing

By: cks
8 March 2025 at 04:24

OIDC (OpenID Connect) is today's better or best regarded standard for (web-based) authentication. When a website (or something) authenticates you through an OpenID (identity) Provider (OP), one of the things it gets back is a bunch of 'claims', which is to say information about the authenticated person. One of the core claims is 'sub', which is vaguely described as a string that is 'subject - identifier for the end-user at the issuer'. As I discovered today, this claim is what I could call 'load bearing' in a surprising way or two.

In theory, 'sub' has no meaning beyond identifying the user in some opaque way. The first way it's load bearing is that some OIDC client software (a 'Relying Party (RP)') will assume that the 'sub' claim has a human useful meaning. For example, the Apache OpenIDC module defaults to putting the 'sub' claim into Apache's REMOTE_USER environment variable. This is fine if your OIDC IdP software puts, say, a login name into it; it is less fine if your OIDC IdP software wants to create 'sub' claims that look like 'YXVzZXIxMi5zb21laWRw'. These claims mean something to your server software but not necessarily to you and the software you want to use on (or behind) OIDC RPs.

The second and more surprising way that the 'sub' claim is load bearing involves how external consumers of your OIDC IdP keep track of your people. In common situations your people will be identified and authorized by their email address (using some additional protocols), which they enter into the outside OIDC RP that's authenticating against your OIDC IdP, and this looks like the identifier that RP uses to keep track of them. However, at least one such OIDC RP assumes that the 'sub' claim for a given email address will never change, and I suspect that there are more people who either quietly use the 'sub' claim as the master key for accounts or who require 'sub' and the email address to be locked together this way.

This second issue makes the details of how your OIDC IdP software generates its 'sub' claim values quite important. You want it to be able to generate those 'sub' values in a clear and documented way that other OIDC IdP software can readily duplicate to create the same 'sub' values, and that won't change if you change some aspect of the OIDC IdP configuration for your current software. Otherwise you're at least stuck with your current OIDC IdP software, and perhaps with its exact current configuration (for authentication sources, internal names of things, and so on).
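One way to meet that requirement is to derive 'sub' from a permanent internal identifier using a fixed, documented scheme. This is a sketch, not any particular IdP's actual behavior; the prefix and scheme here are invented for illustration:

```python
import hashlib

# Invented versioned prefix; documenting it (and the scheme below) is
# what lets future IdP software reproduce identical 'sub' values.
SUB_PREFIX = "example-idp-v1"

def stable_sub(internal_id: str) -> str:
    """Derive a stable, opaque 'sub' value from a person's permanent
    internal identifier.  The same input always yields the same 'sub',
    regardless of which IdP software computes it."""
    digest = hashlib.sha256(
        (SUB_PREFIX + ":" + internal_id).encode("utf-8")
    ).hexdigest()
    return digest[:32]  # truncated for readability; still effectively unique
```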

(If you have to change 'sub' values, for example because you have to migrate to different OIDC IdP software, this could go as far as the outside OIDC RP basically deleting all of their local account data for your people and requiring all of it to be entered back from scratch. But hopefully those outside parties have a better procedure than this.)

The problem facing MFA-enabled IMAP at the moment (in early 2025)

By: cks
7 March 2025 at 04:32

Suppose that you have an IMAP server and you would like to add MFA (Multi-Factor Authentication) protection to it. I believe that in theory the IMAP protocol supports multi-step 'challenge and response' style authentication, so again in theory you could implement MFA this way, but in practice this is unworkable because people would be constantly facing challenges. Modern IMAP clients (and servers) expect to be able to open and close connections more or less on demand, rather than opening one connection, holding it open, and doing everything over it. To make IMAP MFA practical, you need to do it with some kind of 'Single Sign On' (SSO) system. The current approach for this uses an OIDC identity provider for the SSO part and SASL OAUTHBEARER authentication between the IMAP client and the IMAP server, using information from the OIDC IdP.

So in theory, your IMAP client talks to your OIDC IdP to get a magic bearer token, provides this token to the IMAP server, the IMAP server verifies that it comes from a configured and trusted IdP, and everything is good. You only have to go through authenticating to your OIDC IdP SSO system every so often (based on whatever timeout it's configured with); the rest of the time the aggregate system does any necessary token refreshes behind the scenes. And because OIDC has a discovery process that can more or less start from your email address (as I found out), it looks like IMAP clients like Thunderbird could let you more or less automatically use any OIDC IdP if people had set up the right web server information.
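The token presentation itself is mechanical. A sketch of the OAUTHBEARER initial client response, following the layout in RFC 7628 (a GS2 header, then control-character-separated key=value pairs), base64-encoded as IMAP carries SASL data:

```python
import base64

def oauthbearer_initial_response(user: str, host: str, port: int,
                                 token: str) -> bytes:
    """Build the SASL OAUTHBEARER initial client response: GS2 header,
    then \\x01-separated key=value fields, terminated by \\x01\\x01."""
    raw = "n,a=%s,\x01host=%s\x01port=%d\x01auth=Bearer %s\x01\x01" % (
        user, host, port, token)
    # IMAP transmits SASL exchanges base64-encoded.
    return base64.b64encode(raw.encode("ascii"))
```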

If you actually try this right now, you'll find that Thunderbird, apparently along with basically all significant IMAP client programs, will only let you use a few large identity providers; here is Thunderbird's list (via). If you read through that Thunderbird source file, you'll find one reason for this limitation, which is that each provider has one or two magic values (the 'client ID' and usually the 'client secret', which is obviously not so secret here), in addition to URLs that Thunderbird could theoretically autodiscover if everyone supported the current OIDC autodiscovery protocols (my understanding is that not everyone does). In most current OIDC identity provider software, these magic values are either given to the IdP software or generated by it when you set up a given OIDC client program (a 'Relying Party (RP)' in the OIDC jargon).

This means that in order for Thunderbird (or any other IMAP client) to work with your own local OIDC IdP, there would have to be some process where people could load this information into Thunderbird. Alternately, Thunderbird could publish default values for these and anyone who wanted their OIDC IdP to work with Thunderbird would have to add these values to it. To date, creators of IMAP client software have mostly not supported either option and instead hard code a list of big providers who they've arranged more or less explicit OIDC support with.

(Honestly it's not hard to see why IMAP client authors have chosen this approach. Unless you're targeting a very technically inclined audience, walking people through the process of either setting this up in the IMAP client or verifying if a given OIDC IdP supports the client is daunting. I believe some IMAP clients can be configured for OIDC IdPs through 'enterprise policy' systems, but there the people provisioning the policies are supposed to be fairly technical.)

PS: Potential additional references on this mess include David North's article and this FOSDEM 2024 presentation (which I haven't yet watched, I only just stumbled into this mess).

Always sync your log or journal files when you open them

By: cks
1 March 2025 at 03:10

Today I learned of a new way to accidentally lose data 'written' to disk, courtesy of this Fediverse post summarizing a longer article about CouchDB and this issue. Because the issue was so nifty and startling when I encountered it, yet so simple, I'm going to re-explain it in my own words and explain how it leads to the title of this entry.

Suppose that you have a program that makes data it writes to disk durable through some form of journal, write ahead log (WAL), or the like. As we all know, data that you simply write() to the operating system isn't yet on disk; the operating system is likely buffering the data in memory before writing it out at the OS's own convenience. To make the data durable, you must explicitly flush it to disk (well, ask the OS to), for example with fsync(). Your program is a good program, so of course it does this; when it updates the WAL, it write()s then fsync()s.

Now suppose that your program is terminated after the write but before the fsync. At this point you have a theoretically incomplete and improperly written journal or WAL, since it hasn't been fsync'd. However, when your program restarts and goes through its crash recovery process, it has no way to discover this. Since the data was written (into the OS's disk cache), the OS will happily give the data back to you even though it's not yet on disk. Now assume that your program takes further actions (such as updating its main files) based on the belief that the WAL is fully intact, and then the system crashes, losing that buffered and not yet written WAL data. Oops. You (potentially) have a problem.

(These days, programs can get terminated for all sorts of reasons other than a program bug that causes a crash. If you're operating in a modern containerized environment, your management system can decide that your program or its entire container ought to shut down abruptly right now. Or something else might have run the entire system out of memory and now some OOM handler is killing your program.)

To avoid the possibility of this problem, you need to always force a disk flush when you open your journal, WAL, or whatever; on Unix, you'd immediately fsync() it. If there's no unwritten data, this will generally be more or less instant. If there is unwritten data because you're restarting after the program was terminated by surprise, this might take a bit of time but ensures that the on-disk state matches the state that you're about to observe through the OS.
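As a sketch of the rule in Python using the standard library's os module: records are fsync()'d after every append, and the recovery path fsync()s the WAL immediately on open, before trusting anything it reads back. The function names and file layout here are just illustrative, not any particular program's.

```python
import os

def append_record(wal_fd, data):
    # Write the record and then force it to disk; only after the
    # fsync() returns is the record actually durable.
    os.write(wal_fd, data)
    os.fsync(wal_fd)

def open_wal_for_recovery(path):
    # fsync() the WAL immediately on open, before reading anything.
    # If a previous run wrote data but was killed before its fsync(),
    # this makes the on-disk state match the state we're about to
    # observe through the OS's disk cache.
    fd = os.open(path, os.O_RDWR)
    os.fsync(fd)
    return fd
```

With no unwritten data the extra fsync() is essentially free, so there's little reason not to do it unconditionally.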

(CouchDB's article points to another article, Justin Jaffray’s NULL BITMAP Builds a Database #2: Enter the Memtable, which has a somewhat different way for this failure to bite you. I'm not going to try to summarize it here but you might find the article interesting reading.)

Institutions care about their security threats, not your security threats

By: cks
23 February 2025 at 03:45

Recently I was part of a conversation on the Fediverse that sparked an obvious in retrospect realization about computer security and how we look at and talk about security measures. To put it succinctly, your institution cares about threats to it, not about threats to you. It cares about threats to you only so far as they're threats to it through you. Some of the security threats and sensible responses to them overlap between you and your institution, but some of them don't.

One of the areas where I think this especially shows up is in issues around MFA (Multi-Factor Authentication). For example, it's a not infrequently observed thing that if all of your factors live on a single device, such as your phone, then you actually have single factor authentication (this can happen with many of the different ways to do MFA). But for many organizations, this is relatively fine (for them). Their largest risk is that Internet attackers are constantly trying to (remotely) phish their people, often in moderately sophisticated ways that involve some prior research (which is worth it for the attackers because they can target many people with the same research). Ignoring MFA alert fatigue for a moment, even a single factor physical device will cut off all of this, because Internet attackers don't have people's smartphones.

For individual people, of course, this is potentially a problem. If someone can gain access to your phone, they get everything, and probably across all of the online services you use. If you care about security as an individual person, you want attackers to need more than one thing to get all of your accounts. Conversely, for organizations, compromising all of their systems at once is sort of a given, because that's what it means to have a Single Sign On system and global authentication. Only a few organizational systems will be separated from the general SSO (and organizations have to hope that their people cooperate by using different access passwords).

Organizations also have obvious solutions to things like MFA account recovery. They can establish and confirm the identities of people associated with them, and a process to establish MFA in the first place, so if you lose whatever lets you do MFA (perhaps your work phone's battery has gotten spicy), they can just run you through the enrollment process again. Maybe there will be a delay, but if so, the organization has broadly decided to tolerate it.

(And I just recently wrote about the difference between 'internal' accounts and 'external' accounts, where people generally know who is in an organization and so has an account, so allowing this information to leak in your authentication isn't usually a serious problem.)

Another area where I think this difference in the view of threats shows up is in the tradeoffs involved in disk encryption on laptops and desktops used by people. For an organization, choosing non-disclosure over availability on employee devices makes a lot of sense. The biggest threat as the organization sees it isn't data loss on a laptop or desktop (especially if they write policies about backups and where data is supposed to be stored), it's an attacker making off with one and having the data disclosed, which is at least bad publicity and makes the executives unhappy. You may feel differently about your own data, depending on how your backups are.

'Internal' accounts and their difference from 'external' accounts

By: cks
14 February 2025 at 03:22

In the comments on my entry on how you should respond to authentication failures depends on the circumstances, sapphirepaw said something that triggered a belated realization in my mind:

Probably less of a concern for IMAP, but in a web app, one must take care to hide the information completely. I was recently at a site that wouldn't say whether the provided email was valid for password reset, but would reveal it was in use when trying to create a new account.

The realization this sparked is that we can divide accounts and systems into two sorts, which I will call internal and external, and how you want to treat things around these accounts is possibly quite different.

An internal account is one that's held by people within your organization, and generally is pretty universal. If you know that someone is a member of the organization you can predict that they have an account on the system, and not infrequently what the account name is. For example, if you know that someone is a graduate student here it's a fairly good bet that they have an account with us and you may even be able to find and work out their login name. The existence of these accounts and even specifics about who has what login name (mostly) isn't particularly secret or sensitive.

(Internal accounts don't have to be on systems that the organization runs; they could be, for example, 'enterprise' accounts on someone else's SaaS service. Once you know that the organization uses a particular SaaS offering or whatever, you're usually a lot of the way to identifying all of their accounts.)

An external account is one that's potentially held by people from all over, far outside the bounds of a single organization (including the one running the systems the account is used with). A lot of online accounts with websites are like this, because most websites are used by lots of people from all over. Who has such an account may be potentially sensitive information, depending on the website and the feelings of the people involved, and the account identity may be even more sensitive (it's one thing to know that a particular email address has a Fediverse account on mastodon.social, but it may be quite different to know which account that is, depending on various factors).

There's a spectrum of potential secrecy between these two categories. For example, the organization might not want to openly reveal which external SaaS products they use, what entity name the organization uses on them, and the specific names people use for authentication, all in the name of making it harder to break into their environment at the SaaS product. And some purely internal systems might have a very restricted access list that is kept at least somewhat secret so attackers don't know who to target. But I think the broad division between internal and external is useful because it does a lot to point out where any secrecy is.

When I wrote my entry, I was primarily thinking about internal accounts, because internal accounts are what we deal with (and what many internal system administration groups handle). As sapphirepaw noted, the concerns and thus the rules are quite different for external accounts.

(There may be better labels for these two sorts of accounts. I'm not great with naming.)

Why writes to disk generally wind up in your OS's disk read cache

By: cks
4 February 2025 at 03:44

Recently, someone was surprised to find out that ZFS puts disk writes in its version of a disk (read) cache, the ARC ('Adaptive Replacement Cache'). In fact this is quite common, as almost every operating system and filesystem puts ordinary writes to disk into its disk (read) cache. In thinking about the specific issue of the ZFS ARC and write data, I realized that there's a general broad reason for this and then a narrower technical one.

The broad reason that you'll most often hear about is that it's not uncommon for your system to read things back after you've written them to disk. It would be wasteful to have something in RAM, write it to disk, remove it from RAM, and then have to more or less immediately read it back from disk. If you're dealing with spinning HDDs, this is quite bad since HDDs can only do a relatively small amount of IO a second; in this day of high performance, low latency NVMe SSDs, it might not be so terrible any more, but it still costs you something. Of course you have to worry about writes flooding the disk cache and evicting more useful data, but this is also an issue with certain sorts of reads.

The narrower technical reason is dealing with issues that come up once you add write buffering to the picture. In practice a lot of ordinary writes to files aren't synchronously written out to disk on the spot; instead they're buffered in memory for some amount of time. This requires some pool of (OS) memory to hold these pending writes, which might as well be your regular disk (read) cache. Putting not yet written out data in the disk read cache also deals with the issue of coherence, where you want programs that are reading data to see the most recently written data even if it hasn't been flushed out to disk yet. Since reading data from the filesystem already looks in the disk cache, you'll automatically find the pending write data there (and you'll automatically replace an already cached version of the old data). If you put pending writes into a different pool of memory, you have to specifically manage it and tune its size, and you have to add extra code to potentially get data from it on reads.
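Here's a toy Python sketch of the coherence argument (the page and cache structures are invented for illustration, not any real kernel's): because writes land in the same cache that reads consult, a read automatically sees pending writes with no extra bookkeeping.

```python
class PageCache:
    """A toy unified cache where writes land in the same cache reads use."""

    def __init__(self, backing):
        self.backing = backing   # simulated 'disk': page number -> bytes
        self.cache = {}          # cached pages
        self.dirty = set()       # pages written but not yet flushed

    def read(self, page):
        # Reads look in the cache first, so they automatically see
        # buffered writes that haven't reached 'disk' yet.
        if page not in self.cache:
            self.cache[page] = self.backing[page]
        return self.cache[page]

    def write(self, page, data):
        # The write goes straight into the (read) cache, replacing any
        # cached old version, and is only marked as needing a flush.
        self.cache[page] = data
        self.dirty.add(page)

    def flush(self):
        # The OS's delayed writeback, done at its own convenience.
        for page in self.dirty:
            self.backing[page] = self.cache[page]
        self.dirty.clear()
```

With a separate pool of pending writes, read() would need an extra lookup path into that pool and you'd have to size and manage it separately.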

(I'm going to skip considering memory mapped IO in this picture because it only makes things even more complicated, and how OSes and filesystems handle it potentially varies a lot. For example, I'm not sure if Linux or ZFS normally directly use pages in the disk cache, or if even shared memory maps get copies of the disk cache pages.)

PS: Before I started thinking about the whole issue as a result of the person's surprise, I would have probably only given you the broad reason off the top of my head. I hadn't thought about the technical issues of not putting writes in the read cache before now.

Languages don't version themselves using semantic versioning

By: cks
25 January 2025 at 03:46

A number of modern languages have effectively a single official compiler or interpreter, and they version this toolchain with what looks like a semantic version (semver). So we have (C)Python 3.12.8, Go 1.23.5, Rust(c) 1.84.0, and so on, which certainly look like a semver major.minor.patchlevel triplet. In practice, this is not how languages think of their version numbers.

In practice, the version number triplets of things like Go, Rust, and CPython have a meaning that's more like '<dialect>.<release>.<patchlevel>'. The first number is the language dialect and it changes extremely infrequently, because it's a very big deal to significantly break backward compatibility or even to make major changes in language semantics that are sort of backward compatible. Python 1, Python 2, and Python 3 are all in effect different but closely related languages.

(Python 2 is much closer to Python 1 than Python 3 is to Python 2, which is part of why you don't read about a painful and protracted transition from Python 1 to Python 2.)

The second number is somewhere between a major and a minor version number. It's typically increased when the language or the toolchain (or both) do something significant, or when enough changes have built up since the last time the second number was increased and people want to get them out in the world. Languages can and do make major additions with only a change in the second number; Go added generics, CPython added and improved an asynchronous processing system, and Rust has stabilized a whole series of features and improvements, all in Go 1.x, CPython 3.x, and Rust 1.x.

The third number is a patchlevel (or if you prefer, a 'point release'). It's increased when a new version of an X.Y release must be made to fix bugs or security problems, and generally contains minimal code changes and no new language features. I think people would look at the language's developers funny if they landed new language features in a patchlevel instead of an actual release, and they'd definitely be unhappy if something was broken or removed in a patchlevel. It's supposed to be basically completely safe to upgrade to a new patchlevel of the language's toolchain.

Both Go and CPython will break, remove, or change things in new 'release' versions. CPython has deprecated a number of things over the course of the 3.x releases so far, and Go has changed how its toolchain behaves and turned off some old behavior (the toolchain's behavior is not covered by Go's language and standard library compatibility guarantee). In this regard these Go and CPython releases are closer to major releases than minor releases.

(Go uses the term 'major release' and 'minor release' for, eg, 'Go 1.23' and 'Go 1.23.3'; see here. Python often calls each '3.x' a 'series', and '3.x.y' a 'maintenance release' within that series, as seen in the Python 3.13.1 release note.)

The corollary of this is that you can't apply semver expectations about stability to language versioning. Languages with this sort of versioning are 'less stable' than they should be by semver standards, since they make significant and not necessarily backward compatible changes in what semver would call a 'minor' release. This isn't a violation of semver because these languages never claimed or promised to be following semver. Language versioning is different (and basically has to be).
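As a sketch of this reading of the version triplet (my own framing, not anything the languages themselves define), here's how you'd classify an upgrade in Python:

```python
def change_kind(old, new):
    """Classify a toolchain upgrade under the
    '<dialect>.<release>.<patchlevel>' reading of the version triple.
    Versions are (dialect, release, patchlevel) tuples."""
    old_dialect, old_release, _ = old
    new_dialect, new_release, _ = new
    if new_dialect != old_dialect:
        return "dialect"     # eg Python 2 -> 3: a different language
    if new_release != old_release:
        return "release"     # may add features and break some things
    return "patchlevel"      # bug/security fixes; safe to take


# Semver would call Go 1.22 -> 1.23 a 'minor' bump, but in practice
# it behaves more like a major release.
```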

(I've used CPython, Go, and Rust here because they're the three languages where I'm most familiar with the release versioning policies. I suspect that many other languages follow similar approaches.)

The problem with combining DNS CNAME records and anything else

By: cks
11 January 2025 at 03:55

A famous issue when setting up DNS records for domains is that you can't combine a CNAME record with any other type, such as a MX record or a SOA (which is required at the top level of a domain). One modern reason that you would want such a CNAME record is that you're hosting your domain's web site at some provider and the provider wants to be able to change what IP addresses it uses for this, so from the provider's perspective they want you to CNAME your 'web site' name to 'something.provider.com'.

The obvious reason for 'no CNAME and anything else' is 'because the RFCs say so', but this is unsatisfying. Recently I wondered why the RFCs couldn't have said that when a CNAME is combined with other records, you return the other records when asked for them but provide the CNAME otherwise (or maybe you return the CNAME only when asked for the IP address if there are other records). But when I thought about it more, I realized the answer, the short version of which is caching resolvers.

If you're the authoritative DNS server for a zone, you know for sure what DNS records are and aren't present. This means that if someone asks you for an MX record and the zone has a CNAME, a SOA, and an MX, you can give them the MX record, and if someone asks for the A record, you can give them the CNAME, and everything works fine. But a DNS server that is a caching resolver doesn't have this full knowledge of the zone; it only knows what's in its cache. If such a DNS server has a CNAME for a domain in its cache (perhaps because someone asked for the A record) and it's now asked for the MX records of that domain, what is it supposed to do? The correct answer could be either the CNAME record the DNS server has or the MX records it would have to query an authoritative server for. At a minimum combining CNAME plus other records this way would require caching resolvers to query the upstream DNS server and then remember that they got a CNAME answer for a specific query.

In theory this could have been written into DNS originally, at the cost of complicating caching DNS servers and causing them to make more queries to upstream DNS servers (which is to say, making their caching less effective). Once DNS existed with the CNAME behavior such that caching DNS resolvers could cache CNAME responses and serve them, the CNAME behavior was fixed.
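A toy Python sketch of the caching resolver's situation (the cache layout is invented for illustration): under today's rules, a cached CNAME definitively answers any query type, which is exactly the property that combining CNAME with other records would destroy.

```python
# A toy caching resolver's cache: name -> {record type: answer}.
cache = {"www.example.org": {"CNAME": "site.provider.example"}}

def answer(name, rtype):
    records = cache.get(name, {})
    if rtype in records:
        return records[rtype]
    if "CNAME" in records:
        # Today's DNS rule: a CNAME is the only record at a name, so
        # a cached CNAME definitively answers any query type.
        return records["CNAME"]
    return None  # cache miss: must ask an upstream server

# If CNAME plus MX were allowed to coexist, the CNAME branch above
# would be wrong: the cache couldn't distinguish 'the answer is the
# CNAME' from 'there are MX records I haven't fetched yet', so every
# such query would have to go upstream anyway.
```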

(This is probably obvious to experienced DNS people, but since I had to work it out in my head I'm going to write it down.)

Sidebar: The pseudo-CNAME behavior offered by some DNS providers

Some DNS providers and DNS servers offer an 'ANAME' or 'ALIAS' record type. This isn't really a DNS record; instead it's a processing instruction to the provider's DNS software that it should look up the A and AAAA records of the target name and insert them into your zone in place of the ANAME/ALIAS record (and redo the lookup every so often in case the target name's IP addresses change). In theory any changes in the A or AAAA records should trigger a change in the zone serial number; in practice I don't know if providers actually do this.

(If your DNS provider doesn't have ANAME/ALIAS 'records' but does have an API, you can build this functionality yourself.)
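If you do build this yourself, a minimal Python sketch might look like the following. Here update_a_records is a hypothetical stand-in for whatever your provider's API actually offers, and you'd run this periodically from cron or the like; only the A-record lookup uses a real (stdlib) interface.

```python
import socket

def resolve_ipv4(name):
    # The current A records of the ANAME/ALIAS target, via the stdlib.
    return sorted({ai[4][0] for ai in
                   socket.getaddrinfo(name, None, socket.AF_INET)})

def sync_alias(zone_name, target, current, update_a_records,
               resolve=resolve_ipv4):
    # 'update_a_records' is a hypothetical stand-in for your DNS
    # provider's API call; it should also bump the zone serial number
    # if your provider doesn't do that for you. 'current' is the set
    # of A records the zone has right now.
    wanted = sorted(resolve(target))
    if wanted != sorted(current):
        update_a_records(zone_name, wanted)
```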

There are different sorts of WireGuard setups with different difficulties

By: cks
5 January 2025 at 04:37

I've now set up WireGuard in a number of different ways, some of which were easy and some of which weren't. So here are my current views on WireGuard setups, starting with the easiest and going to the most challenging.

The easiest WireGuard setup is where the 'within WireGuard' internal IP address space is completely distinct from the outside space, with no overlap. This makes routing completely straightforward; internal IPs reachable over WireGuard aren't reachable in any other way, and external IPs aren't reachable over WireGuard. You can do this as a mesh or use the WireGuard 'router' pattern (or some mixture). If you allocate all internal IP addresses from the same network range, you can set a single route to your WireGuard interface and let AllowedIPs sort it out.

(An extreme version of this would be to configure the inside part of WireGuard with only link local IPv6 addresses, although this would probably be quite inconvenient in practice.)

A slightly more difficult setup is where some WireGuard endpoints are gateways to additional internal networks, networks that aren't otherwise reachable. This setup potentially requires more routing entries but it remains straightforward in that there's no conflict on how to route a given IP address.

The next most difficult setup is using different IP address types inside WireGuard than from outside it, where the inside IP address type isn't otherwise usable for at least one of the ends. For example, you have an IPv4 only machine that you're giving a public IPv6 address through an IPv6 tunnel. This is still not too difficult because the inside IP addresses associated with each WireGuard peer aren't otherwise reachable, so you never have a recursive routing problem.

The most difficult type of WireGuard setup I've had to do so far is a true 'VPN' setup, where some or many of the WireGuard endpoints you're talking to are reachable both outside WireGuard and through WireGuard (or at least there are routes that try to send traffic to those IPs through WireGuard, such as a VPN 'route all traffic through my WireGuard link' default route). Since your system could plausibly recursively route your encrypted WireGuard traffic over WireGuard, you need some sort of additional setup to solve this. On Linux, this will often be done using a fwmark (also) and some policy based routing rules.

One of the reasons I find it useful to explicitly think about these different types of setups is to better know what to expect and what I'll need to do when I'm planning a new WireGuard environment. Either I will be prepared for what I'm going to have to do, or I may rethink my design in order to move it up the hierarchy, for example deciding that we can configure services to talk to special internal IPs (over WireGuard) so that we don't have to set up fwmark-based routing on everything.

(Some services built on top of WireGuard handle this for you, for example Tailscale, although Tailscale can have routing challenges of its own depending on your configuration.)

My screens now have areas that are 'good' and 'bad' for me

By: cks
30 December 2024 at 04:23

Once upon a time, I'm sure that everywhere on my screen (because it would have been a single screen at that time) was equally 'good' for me; all spots were immediately visible, clearly readable, didn't require turning my head, and so on. As the number of screens I use has risen, as the size of the screens has increased (for example when I moved from 24" non-HiDPI 3:2 LCD panels to 27" HiDPI 16:9 panels), and as my eyes have gotten older, this has changed. More and more, there is a 'good' area that I've set up so that I'm looking straight at it, and then increasingly peripheral areas that are not as good.

(This good area is not necessarily the center of the screen; it depends on how I sit relative to the screen, the height of the monitor, and so on. If I adjust these I can change what the good spot is, and I sometimes will do so for particular purposes.)

Calling the peripheral areas 'bad' is a relative term. I can see them, but especially on my office desktop (which has dual 27" 16:9 displays), these days the worst spots can be so far off to the side that I don't really notice things there much of the time. If I want to really look, I have to turn my head, which means I have to have a reason to look over there at whatever I put there. Hopefully it's not too important.

For a long time I didn't really notice this change or think about its implications. As the physical area covered by my 'display surface' expanded, I carried over much the same desktop layout that I had used (in some form) for a long time. It didn't register that some things were effectively being exiled into the outskirts where I would never notice them, or that my actual usage was increasingly concentrated in one specific area of the screen. Now that I have consciously noticed this shift (which is a story for another entry), I may want to rethink some of how I lay things out on my office desktop (and maybe my home one too) and what I put where.

(One thing I've vaguely considered is if I should turn my office displays sideways, so the long axis is vertical, although I don't know if that's feasible with their current stands. I have what is in practice too much horizontal space today, so that would be one way to deal with it. But probably this would give me two screens that each are a bit too narrow to be comfortable for me. And sadly there are no ideal LCD panels these days; I would ideally like a HiDPI 24" or 25" 3:2 panel but vendors don't do those.)

x86 servers, ATX power supply control, and reboots, resets, and power cycles

By: cks
26 December 2024 at 04:15

I mentioned recently a case when power cycling an (x86) server wasn't enough to recover it, although perhaps I should have put quotes around "power cycling". The reason for the scare quotes is that I was doing this through the server's BMC, which means that what was actually happening was not clear because there are a variety of ways the BMC could be doing power control and the BMC may have done something different for what it described as a 'power cycle'. In fact, to make it less clear, this particular server's BMC offers both a "Power Cycle" and a "Power Reset" option.

(According to the BMC's manual, a "power cycle" turns the system off and then back on again, while a "power reset" performs a 'warm restart'. I may have done a 'power reset' instead of a 'power cycle', it's not clear from what logs we have.)

There is a spectrum of ways to restart an x86 server, and they (probably) vary in their effects on peripherals, PCIe devices, and motherboard components. The most straightforward looking is to ask the Linux kernel to reboot the system, although in practice I believe that actually getting the hardware to do the reboot is somewhat complex (and in the past Linux sometimes had problems where it couldn't persuade the hardware, so your 'reboot' would hang). Looking at the Linux kernel code suggests that there are multiple ways to invoke a reboot, involving ACPI, UEFI firmware, old fashioned BIOS firmware, a PCIe configuration register, via the keyboard, and so on (for a fun time, look at the 'reboot=' kernel parameter). In general, a reboot can only be initiated by the server's host OS, not by the BMC; if the host OS is hung you can't 'reboot' the server as such.

Your x86 desktop probably has a 'reset' button on the front panel. These days the wire from this is probably tied into the platform chipset (on Intel, the ICH, which came up for desktop motherboard power control) and is interpreted by it. Server platforms probably also have a (conceptual) wire and that wire may well be connected to the BMC, which can then control it to implement, for example, a 'reset' operation. I believe that a server reboot can also trigger the same platform chipset reset handling that the reset button does, although I'm not sure of this. If I'm reading Intel ICH chipset documentation correctly, triggering a reset this way will or may signal PCIe devices and so on that a reset has happened, although I don't think it cuts power to them; in theory anything getting this signal should reset its state.

(The CF9 PCI "Reset Control Register" (also) can be used to initiate a 'soft' or 'hard' CPU reset, or a full reset in which the (Intel) chipset will do various things to signals to peripherals, not just the CPU. I don't believe that Linux directly exposes these options to user space (partly because it may not be rebooting through direct use of PCI CF9 in the first place), although some of them can be controlled through kernel command line parameters. I think this may also control whether the 'reset' button and line do a CPU reset or a full reset. It seems possible that the warm restart of this server's BMC's "power reset" works by triggering the reset line and assuming that CF9 is left in its default state to make this a CPU reset instead of a full reset.)

Finally, the BMC can choose to actually cycle the power off and then back on again. As discussed, 'off' is probably not really off, because standby power and BMC power will remain available, but this should put both the CPU and the platform chipset through a full power-on sequence. However, it likely won't leave power off long enough for various lingering currents to dissipate and capacitors to drain. And nothing you do through the BMC can completely remove power from the system; as long as a server is connected to AC power, its power supply is providing standby power and BMC power. If you want a total reset, you must either disconnect its power cords or turn its outlet or outlets off in your remote controllable PDU (which may not work great if it's on a UPS). And as we've seen, sometimes a short power cycle isn't good enough and you need to give the server a time out.

(While the server's OS can ask for the server to be powered down instead of rebooted, I don't think it can ask for the server to be power cycled, not unless it talks to the BMC instead of doing a conventional reboot or power down.)

One of the things I've learned from this is that if I want to be really certain I understand what a BMC is doing, I probably shouldn't rely on any option to do a power cycle or power reset. Instead I should explicitly turn power off, wait until that's taken effect, and then turn power on. Asking a BMC to do a 'power cycle' is a bit optimistic, although it will probably work most of the time.
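A sketch of the explicit off/wait/on approach in Python, assuming the BMC speaks IPMI-over-LAN and you have ipmitool (its 'chassis power off/status/on' subcommands are real, but authentication and interface options are omitted here). The run parameter exists so the logic can be exercised without a BMC; everything else is illustrative.

```python
import subprocess
import time

def ipmi(bmc, *words, run=subprocess.run):
    # Credentials and '-I lanplus' and so on are left out of this
    # sketch; fill them in for your environment.
    cmd = ["ipmitool", "-H", bmc, "chassis", "power", *words]
    return run(cmd, capture_output=True, text=True, check=True).stdout

def hard_power_cycle(bmc, pause=120, run=subprocess.run):
    # Explicit off, wait until the BMC agrees the power is off, a
    # cool-down pause, then on. More predictable than trusting the
    # BMC's own 'power cycle' (and the pause lets things drain).
    ipmi(bmc, "off", run=run)
    while "off" not in ipmi(bmc, "status", run=run):
        time.sleep(5)
    time.sleep(pause)
    ipmi(bmc, "on", run=run)
```

Note that even this leaves standby power and BMC power available the whole time; for a truly total reset you still need the PDU or the power cords.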

(If there's another occurrence of our specific 'reset is not enough' hang, I will definitely make sure to use at least the BMC's 'power cycle' and perhaps the full brief off then on approach.)

When power cycling your (x86) server isn't enough to recover it

By: cks
22 December 2024 at 03:43

We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model), where something different has happened once.

Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server ran through the system firmware ('BIOS'), it started printing an apparently endless series of error dumps to its serial console (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:

!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
RIP  - 000000006DABA5A5, CS  - 0000000000000038, RFLAGS - 0000000000010087
RAX  - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX  - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI  - 0000000000000008, RDI - 000000006AB1B1B0
R8   - 000000005DCCF524, R9  - 000000005D29E850, R10 - 000000005D29E8E4
R11  - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14  - 0000000000000028, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!

(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)

Some of the register values varied between reports, others didn't after the first one (for example, from the second onward the RIP appears to have always been 6DAB14D1, which suggests maybe it's an exception handler).

In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, where neither of them have had problems since.

I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there may be a different mechanism in action). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool down period works.

(Since we weren't cutting external power to the entire system, this also left standby power (also) available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)

PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.

Common motherboards are supporting more and more M.2 NVMe drive slots

By: cks
7 December 2024 at 04:27

Back at the start of 2020, I wondered if common (x86 desktop) motherboards would ever have very many M.2 NVMe drive slots, where by 'very many' I meant four or so, which even back then was a common number of SATA ports for desktop motherboards to provide. At the time I thought the answer was probably no. As I recently discovered from investigating a related issue, I was wrong, and it's now fairly straightforward to find x86 desktop motherboards that have as many as four M.2 NVMe slots (although not all four may be able to run at x4 PCIe lanes, especially if you have things like a GPU).

For example, right now it's relatively easy to find a page full of AMD AM5-based motherboards that have four M.2 NVMe slots. Most of these seem to be based on the high end X series AMD chipsets (such as the X670 or the X870), but I found a few that were based on the B650 chipset. On the Intel side, should you still be interested in an Intel CPU in your desktop at this point, there are also a number of them based primarily on the Z790 chipset (and some on the older Z690). There's even a B760 based motherboard with four M.2 NVMe slots (although two of them are only x1 lanes and PCIe 3.0), and an H770 based one that manages to (theoretically) support all four M.2 slots at x4 lanes.

One of the things that I think has happened on the way to this large supply of M.2 slots is that these desktop motherboards have dropped most of their PCIe slots. These days, you seem to commonly get three slots in total on the kind of motherboard that has four M.2 slots. There's always one x16 slot, often two, and sometimes three (although that's physical x16; don't count on getting all 16 PCIe lanes in every slot). It's not uncommon to see the third PCIe slot be physically x4, or a little x1 slot tucked away at the bottom of the motherboard. It also isn't necessarily the case that lower end desktops have more PCIe slots to go with their fewer M.2 slots; they too seem to have mostly gone with two or three PCIe slots, generally with a limited number of lanes even if they're physically x16.

(I appreciate having physical x16 slots even if they're electrically only x1, because that means you can use any card that doesn't require PCIe bifurcation and it should work, although slowly.)

As noted by commentators on my entry on PCIe bifurcation and its uses for NVMe drives, a certain amount of what we used to need PCIe slots for can now be provided through high speed USB-C and similar things. And of course there are only so many PCIe lanes to go around from the CPU and the chipset, so those USB-C ports and other high-speed motherboard devices consume a certain amount of them; the more onboard devices the motherboard has the fewer PCIe lanes there are left for PCIe slots, whether or not you have any use for those onboard devices and connectors.

(Having four M.2 NVMe slots is useful for me because I use my drives in mirrored pairs, so four M.2 slots means I can run my full old pair in parallel with a full new pair, either in a four way mirror or doing some form of migration from one mirrored pair to the other. Three slots is okay, since that lets me add a new drive to a mirrored pair for gradual migration to a new pair of drives.)

Sorting out 'PCIe bifurcation' and how it interacts with NVMe drives

By: cks
5 December 2024 at 03:01

Suppose, not hypothetically, that you're switching from one mirrored set of M.2 NVMe drives to another mirrored set of M.2 NVMe drives, and so would like to have three or four NVMe drives in your desktop at the same time. Sadly, you already have one of your two NVMe drives on a PCIe card, so you'd like to get a single PCIe card that handles two or more NVMe drives. If you look around today, you'll find two sorts of cards for this; ones that are very expensive, and ones that are relatively inexpensive but require that your system supports a feature that is generally called PCIe bifurcation.

NVMe drives are PCIe devices, so a PCIe card that supports a single NVMe drive is a simple, more or less passive thing that wires four PCIe lanes and some other stuff through to the M.2 slot. I believe that in theory, a card could be built that only required x2 or even x1 PCIe lanes, but in practice I think all such single drive cards are physically PCIe x4 and so require a physical x4 or better PCIe slot, even if you'd be willing to (temporarily) run the drive much slower.

A PCIe card that supports more than one M.2 NVMe drive has two options. The expensive option is to put a PCIe bridge on the card, with the bridge (probably) providing a full set of PCIe lanes to the M.2 NVMe drives locally on one side and doing x4, x8, or x16 PCIe with the motherboard on the other. In theory, such a card will work even at x4 or x2 PCIe lanes, because PCIe cards are supposed to do that if the system says 'actually you only get this many lanes' (although obviously you can't drive four x4 NVMe drives at full speed through a single x4 or x2 PCIe connection).

The cheap option is to require that the system be able to split a single PCIe slot into multiple independent groups of PCIe lanes (I believe these are usually called links); this is PCIe bifurcation. In PCIe bifurcation, the system takes what is physically and PCIe-wise an x16 slot (for example) and splits it into four separate x4 links (I've seen this sometimes labeled as 'x4/x4/x4/x4'). This is cheap for the card because it can basically be four single M.2 NVMe PCIe cards jammed together, with each set of x4 lanes wired through to a single M.2 NVMe slot. A PCIe card for two M.2 NVMe drives will require an x8 PCIe slot bifurcated to two x4 links; if you stick this card in an x16 slot, the upper 8 PCIe lanes just get ignored (which means that you can still set your BIOS to x4/x4/x4/x4).

As covered in, for example, this Synopsys page, PCIe bifurcation isn't something that's negotiated as part of bringing up PCIe connections; a PCIe device can't ask for bifurcation and can't be asked whether or not it supports it. Instead, the decision is made as part of configuring the PCIe root device or bridge, which in practice means it's a firmware ('BIOS') decision. However, I believe that bifurcation may also require hardware support in the 'chipset' and perhaps the physical motherboard.

I put chipset into quotes because for quite some time now, some PCIe lanes have come directly from the CPU and only some others come through the chipset as such. For example, in desktop motherboards, the x16 GPU slot is almost always driven directly by CPU PCIe lanes, so it's up to the CPU to have support (or not have support) for PCIe bifurcation of that slot. I don't know if common desktop chipsets support bifurcation on the chipset PCIe slots and PCIe lanes, and of course you need chipset-driven PCIe slots that have enough lanes to be bifurcated in the first place. If the PCIe slots driven by the chipset are a mix of x4 and x1 slots, there's no really useful bifurcation that can be done (at least for NVMe drives).
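If you want to check what lane counts your devices actually negotiated (for example, after turning on bifurcation in the BIOS), Linux exposes per-device link widths in sysfs. Here's a small sketch; the sysfs root is parameterized only so it can be pointed at test data, and on a real system you'd use the default:

```python
from pathlib import Path


def pcie_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCIe device to its (current, maximum) link width in lanes.

    The Linux kernel exposes these as the 'current_link_width' and
    'max_link_width' sysfs attributes. An NVMe drive behind a properly
    bifurcated x4/x4/x4/x4 slot should report a current width of 4; a
    smaller number means the link trained at fewer lanes.
    """
    widths = {}
    for dev in sorted(Path(sysfs_root).iterdir()):
        cur = dev / "current_link_width"
        mx = dev / "max_link_width"
        if cur.is_file() and mx.is_file():
            widths[dev.name] = (int(cur.read_text()), int(mx.read_text()))
    return widths
```

The same information shows up as 'LnkSta' and 'LnkCap' in `lspci -vv` output, if you'd rather not read sysfs by hand.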

If you have a limited number of PCIe slots that can actually support x16 or x8 and you need a GPU card, you may not be able to use PCIe bifurcation in practice even if it's available for your system. If you have only one PCIe slot your GPU card can go in and it's the only slot that supports bifurcation, you're stuck; you can't have both a bifurcated set of NVMe drives and a GPU (at least not without a bifurcated PCIe riser card that you can use).

(This is where I would start exploring USB NVMe drive enclosures, although on old desktops you'll probably need one that doesn't require USB-C, and I don't know if a NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)

(This is one of the entries I write to get this straight in my head.)

Sidebar: Generic PCIe riser cards and other weird things

The traditional 'riser card' I'm used to is a special proprietary server 'card' (ie, a chunk of PCB with connectors and other bits) that plugs into a likely custom server motherboard connector and makes a right angle turn that lets it provide one or two horizontal PCIe slots (often half-height ones) in a 1U or 2U server case, which aren't tall enough to handle PCIe cards vertically. However, the existence of PCIe bifurcation opens up an exciting world of general, generic PCIe riser cards that bifurcate a single x16 GPU slot to, say, two x8 PCIe slots. These will work (in some sense) in any x16 PCIe slot that supports bifurcation, and of course you don't have to restrict yourself to x16 slots. I believe there are also PCIe riser cards that bifurcate an x8 slot into two x4 slots.

Now, you are perhaps thinking that such a riser card puts those bifurcated PCIe slots at right angles to the slots in your case, and probably leaves any cards inserted into them with at least their tops unsupported. If you have light PCIe cards, maybe this works out. If you don't have light PCIe cards, one option is another terrifying thing, a PCIe ribbon cable with a little PCB that is just a PCIe slot on one end (the other end plugs into your real PCIe slot, such as one of the slots on the riser card). Sometimes these are even called 'riser card extenders' (or perhaps those are a sub-type of the general PCIe extender ribbon cables).

Another PCIe adapter device you can get is an x1 to x16 slot extension adapter, which plugs into an x1 slot on your motherboard and has an x16 slot (with only one PCIe lane wired through, of course). This is less crazy than it sounds; you might only have an x1 slot available, want to plug in a x4, x8, or x16 card that's short enough, and be willing to settle for x1 speeds. In theory PCIe cards are supposed to still work when their lanes are choked down this way.

The general issue of terminal programs and the Alt key

By: cks
23 November 2024 at 23:26

When you're using a terminal program (something that provides a terminal window in a GUI environment, which is now the dominant form of 'terminals'), there's a fairly straightforward answer for what should happen when you hold down the Ctrl key while typing another key. For upper and lower case letters, the terminal program generates ASCII bytes 1 through 26, for Ctrl-[ you get byte 27 (ESC), and there are relatively standard versions of some other characters. For other characters, your specific terminal program may treat them as aliases for some of the ASCII control characters or ignore the Ctrl. All of this behavior is relatively standard from the days of serial terminals, and none of it helps terminal programs decide what should be generated when you hold down the Alt key while typing another key.

(A terminal program can hijack Alt-<key> to control its behavior, but people will generally find this hostile because they want to use Alt-<key> with things running inside the terminal program. In general, terminal programs are restricted to generating things at the character layer, where what they send has to fit in a sequence of bytes and be generally comprehensible to whatever is reading those bytes.)

Historically and even currently there have been three answers. The simplest answer is that Alt sets the 8th bit on what would otherwise be a seven-bit ASCII character. This behavior is basically a relic of the days when things actually were seven bit ASCII (at least in North America) and doing this wouldn't mangle things horribly (provided that the program inside the terminal understood this signal). As a result it's not too popular any more and I think it's basically died out.

The second answer is what I'll call the Emacs answer, which is that Alt plus another key generates ESC (Escape) and then the other key. This matches how Emacs handled its Meta key binding modifier (written 'M-...' in Emacs terminology) in the days of serial terminals; if an Emacs keybinding was M-a, you typed 'ESC a' to invoke it. Even today when we have real Alt keys and some programs could see a real Meta modifier (cf), basically every Emacs or Emacs-compatible system will accept ESC as the Meta prefix even if they're not running in a terminal.

(I started with Emacs sufficiently long ago that ESC-<key> is an ingrained reflex that I still sometimes use even though Alt is right there on my keyboard.)

The third answer is that Alt-<key> generates various accented or special characters in the terminal program's current locale (or in UTF-8, because that's increasingly hard-coded). Once upon a time this was the same as the first answer, because accented and special characters were whatever was found in the upper half of ASCII single-byte characters (bytes 128 to 255). These days, with people using UTF-8, it's generally different; for example, your Alt-a might generate 'á', but the actual UTF-8 representation of this single Unicode codepoint is two bytes, 0xc3 0xa1.
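The byte-level differences between these answers are easy to pin down; here's a short illustration of what each scheme transmits (plain Python, just computing the bytes):

```python
def ctrl(ch: str) -> bytes:
    """What a terminal sends for Ctrl plus a key: clear the top three
    bits, giving ASCII bytes 1-26 for letters and 27 (ESC) for '['."""
    return bytes([ord(ch.upper()) & 0x1F])


# The three answers for Alt-a:
ALT_A_8TH_BIT = bytes([ord("a") | 0x80])  # answer 1: set the 8th bit
ALT_A_ESC = b"\x1b" + b"a"                # answer 2: ESC then the key
ALT_A_UTF8 = "\u00e1".encode("utf-8")     # answer 3: 'a-acute' in UTF-8
```

Note that the 8th-bit byte (0xE1) is also Latin-1 'á', which is why the first and third answers used to coincide in single-byte locales.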

Some terminal programs still allow you to switch between the second and the third answers (Unix xterm is one such program and can even be switched on the fly, see the 'Meta sends Escape' option in the menu you get with Ctrl-<mouse button 1>). Others are hard-coded with the second answer, where Alt-<key> sends ESC <key>. My impression is that the second answer is basically the dominant one these days and only a few terminal programs even potentially support the third option.

PS: How xterm behaves can be host specific due to different default X resources settings on different hosts. Fedora makes xterm default to Alt-<key> sending ESC-<key>, while Ubuntu leaves it with the xterm code default of Alt creating accented characters.

A rough guess at how much IPv6 address space we might need

By: cks
10 November 2024 at 03:54

One of the reactions I saw to my entry on why NAT might be inevitable (at least for us) even with IPv6 was to ask if there really was a problem with being generous with IPv6 allocations, since they are (nominally) so large. Today I want to do some rough calculations on this, working backward from what we might reasonably assign to end user devices. There's a lot of hand-waving and assumptions here, and you can question a lot of them.

I'll start with the assumption that the minimum acceptable network size is a /64, for various reasons including SLAAC. As discussed, end devices presenting themselves on our network may need some number of /64s for internal use. Let's assume that we'll allocate sixteen /64s to each device, meaning that we give a /60 to each device on each of our subnets.

I think it's unlikely we'll want to ever have a subnet with more than 2048 devices on it (and even that's generous). That many /60s is a /49. However, some internal groups have more than one IPv4 subnet today, so for future expansion let's say that each group gets eight IPv6 subnets, so we give out /46s to research groups (or we could trim some of these sizes and give out /48s, which seems to be a semi-standard allocation size that various software may be more happy with).

We have a number of IPv4 subnets (and of research groups). If we want to allow for growth, various internal uses, and so on, we want some extra room, so I think we'd want space for at least 128 of these /46 allocations, which gets us to an overall allocation for our department of a /39 (a /38 if we want 256 just to be sure). The University of Toronto currently has a /32, so we actually have some allocation problems. For a start, the university has three campuses and it might reasonably want to split its /32 allocation into four and give one /34 to each campus. At a /34 for the campus, there are only 32 /39s, and the university has many more departments and groups than that.

If the university starts with a /32, splits it to /34s for campuses, and wants to have room for 1024 or 2048 allocations within a campus, each department or group can get only a /44 or a /45 and all of our sizes would have to shrink accordingly; we'd need to drop at least five or six bits somewhere (say four subnets per group, eight or even four /64s per device, maybe 1024 devices maximum per subnet, etc).
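The prefix arithmetic in these paragraphs is mechanical, which makes it easy to write down and check; this sketch covers the two computations involved (how big a parent prefix must be to hold N child allocations, and how many children fit in a given parent):

```python
def parent_prefix(child_prefix: int, count: int) -> int:
    """Prefix length needed to hold `count` allocations of size
    /child_prefix; each doubling of count costs one bit of prefix,
    so we give up ceil(log2(count)) bits."""
    return child_prefix - (count - 1).bit_length()


def allocations(parent: int, child_prefix: int) -> int:
    """How many /child_prefix allocations fit in a /parent."""
    return 2 ** (child_prefix - parent)


device = parent_prefix(64, 16)        # sixteen /64s per device: a /60
subnet = parent_prefix(device, 2048)  # 2048 devices per subnet: a /49
group = parent_prefix(subnet, 8)      # eight subnets per group: a /46
dept = parent_prefix(group, 128)      # 128 group allocations: a /39
```

And `allocations(34, 39)` confirms the problem at the end: a /34 campus holds only 32 /39 departments.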

If my understanding of how you're supposed to do IPv6 is correct, what makes all of this more painful in a purist IPv6 model is that you're not supposed to allocate multiple, completely separate IPv6 subnets to someone, unlike in the IPv4 world. Instead, everything is supposed to live under one IPv6 prefix. This means that the IPv6 prefix absolutely has to have enough room for future growth, because otherwise you have to go through a very painful renumbering to move to another prefix.

(For instance, today the department has multiple IPv4 /24s allocated to it, not all of them contiguous. We also work this way with our internal use of RFC 1918 address space, where we just allocate /16s as we need them.)

Being able to allocate multiple subnets of some size (possibly a not that large one) to departments and groups would make it easier to not over-allocate to deal with future growth. We might still have problems with the 'give every device eight /64s' plan, though.

(Of course we could do this multiple subnets allocation internally even if the university gives us only a single IPv6 prefix. Probably everything can deal with IPv6 used this way, and it would certainly reduce the number of bits we need to consume.)

The general problem of losing network based locks

By: cks
6 November 2024 at 03:38

There are many situations and protocols where you want to hold some sort of lock across a network between, generically, a client (who 'owns' the lock) and a server (who manages the locks on behalf of clients and maintains the locking rules). Because a network is involved, one of the broad problems that can happen in such a protocol is that the client can have a lock abruptly taken away from it by the server. This can happen because the server was instructed to break the lock, or the server restarted in some way and notified the clients that they had lost some or all of their locks, or perhaps there was a network partition that led to a lock timeout.

When the locking protocol and the overall environment is specifically designed with this in mind, you can try to require clients to specifically think about the possibility. For example, you can have an API that requires clients to register a callback for 'you lost a lock', or you can have specific error returns to signal this situation, or at the very least you can have a 'is this lock still valid' operation (or 'I'm doing this operation on something that I think I hold a lock for, give me an error if I'm wrong'). People writing clients can still ignore the possibility, just as they can ignore the possibility of other network errors, but at least you tried.

However, network locking is sometimes added to things that weren't originally designed for it. One example is (network) filesystems. The basic 'filesystem API' doesn't really contemplate locking, and it especially doesn't consider that you can suddenly have access to a 'file' taken away from you in mid-flight. If you add network locking, you don't have a natural answer for handling lost locks and there's no obvious point in the API to add one, especially if you want to pretend that your network filesystem is the same as a local filesystem. This makes it much easier for people writing programs to not even think about the possibility of losing a network lock during operation.

(If you're designing a purely networked filesystem-like API, you have more freedom; for example, you can make locking operations turn a regular 'file descriptor' into a special 'locked file descriptor' that you have to do subsequent IO through and that will generate errors if the lock is lost.)

One of the meta-problems with handling losing a network lock is that there's no single answer for what you should do about it. In some programs, you've violated an invariant and the only safe move for the program is to exit or crash. In some programs, you can pause operations until you can re-acquire the lock. In other programs you need to bail out to some sort of emergency handler that persists things in another way or logs what should have been done if you still held the lock. And when designing your API (or APIs) for losing locks, how likely you think each option is will influence what features you offer (and it will also influence how interested programs are in handling losing locks).
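As a concrete sketch of the first style of API (entirely hypothetical; the names here don't come from any real protocol), a client-side lock handle might look like this:

```python
from typing import Callable


class NetworkLock:
    """Hypothetical client-side handle for a server-managed lock."""

    def __init__(self, name: str, on_lost: Callable[[str], None]):
        self.name = name
        self._on_lost = on_lost  # the registered 'you lost a lock' callback
        self._valid = True

    def is_valid(self) -> bool:
        """The 'is this lock still valid' check."""
        return self._valid

    def guard(self) -> None:
        """Raise for 'I'm doing this operation on something that I
        think I hold a lock for, give me an error if I'm wrong'."""
        if not self._valid:
            raise RuntimeError(f"lock {self.name!r} has been lost")

    def _revoked_by_server(self) -> None:
        """Called by the protocol machinery when the server breaks the
        lock, restarts, or a lease times out after a partition."""
        self._valid = False
        self._on_lost(self.name)
```

What a client does inside `on_lost` is exactly the meta-problem above: exit, pause and re-acquire, or bail out to an emergency handler.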

PS: A contributing factor to programmers and programs not being interested in handling losing network locks is that they're generally somewhere between uncommon and rare. If lots of people are writing code to deal with your protocol and losing locks are uncommon enough, some amount of those people will just ignore the possibility, just like some amount of programmers ignore the possibility of IO errors.

I feel that NAT is inevitable even with IPv6

By: cks
3 November 2024 at 02:23

Over on the Fediverse, I said something unpopular about IPv6 and NAT:

Hot take: NAT is good even in IPv6, because otherwise you get into recursive routing and allocation problems that have been made quite thorny by the insistence of so many things that a /64 is the smallest block they will work with (SLAAC, I'm looking at you).

Consider someone's laptop running multiple VMs and/or containers on multiple virtual subnets, maybe playing around with (virtual) IPv6 routers too.

(Partly in re <other Fediverse post>.)

The basic problem is straightforward. Imagine that you're running a general use wired or wireless network, where people connect their devices. One day, someone shows up with a (beefy) laptop that they've got some virtual machines (or container images) with a local (IPv6) network that is 'inside' their laptop. What IPv6 network addresses do these virtual machines get when the laptop is connected to your network and how do you make this work?

In a world where IPv6 devices and software reliably worked on subnet sizes smaller than a /64, this would be sort of straightforward. Your overall subnet might be a /64, and you would give each device connecting to it a /96 via some form of prefix delegation. This would allow a large number of devices on your network and also for each device to sub-divide its own /96 for local needs, with lots of room for multiple internal subnets for virtual machines, containers, or whatever else.

(And if a device didn't signal a need for a prefix delegation, you could give it a single IPv6 address from the /64, which would probably be the common case.)
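For a sense of scale, Python's standard ipaddress module can do this sub-division directly (2001:db8::/64 is just the reserved documentation prefix, used here as an example):

```python
import ipaddress
from itertools import islice

subnet = ipaddress.ip_network("2001:db8::/64")

# A single /64 holds 2**32 /96 delegations, so in this scheme the
# scarce resource is software tolerance, not address space.
available_96s = 2 ** (96 - subnet.prefixlen)

# subnets() is a generator, so only materialize the few we look at;
# listing all four billion of them would take a very long time.
first_two = [str(n) for n in islice(subnet.subnets(new_prefix=96), 2)]
```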

In a world where lots of things insist on being on an IPv6 /64, this is extremely not trivial. Hosts will show up that want zero, one, or several /64s delegated to them, and both you and they may need those multiple /64s to fit into the same larger allocation of a /63, a /62, or so on. Worse, if more hosts than you expected show up asking for more delegations than you budgeted for, you'll need to expand the overall allocation to the entire network and everything under it, which at a minimum may be disruptive. Also, the IPv6 address space is large, but if you chop off half of it, it's not that large, especially when you need to consume large blocks of it for contiguous delegations and sub-delegations and sub-sub delegations and so on.

I've described this as a laptop but there are other scenarios that are also perfectly reasonable. For example, suppose that you're setting up a subnet for a university research group that currently operates zero containers, virtual machine hosts, and the like (each of which would require at least one /64). Considering that research groups can and do change their mind on what they're running, how many additional /64s should you budget for them eventually needing, and what do you do when it turns out that they want to operate more than that?

IPv6 NAT gets you out of all of this. You assign an IPv6 address on your subnet's /64 to that laptop or server (or it SLAAC's one for itself), and everything else is its problem, not yours. Its containers and virtual machines get IPv6 addresses from some address space that's not your problem, and the laptop (or server) NATs all of their traffic back and forth. You don't have to know or care about how many internal networks the laptop (or server) is hiding, if it's got some sort of internal routing hierarchy, or anything.

I expect this use of IPv6 NAT to primarily be driven by the people with these laptops and servers, not by the people in charge of IPv6 network design. If you're someone with a laptop that has some containers or VMs that you need to work with, and you plug into a network that isn't already specifically designed to accommodate you (for example it's just a /64), your practical choices are either IPv6 NAT or containers that can't talk to anything. The people running the network are pretty unlikely to redesign it for you (often their answer will be 'that's not supported on this network'), and if they do, the new network design is unlikely to be deployed immediately (or even very soon).

(I don't believe that delegating a single /64 to each machine is a particularly workable solution. It still leaves you with problems if any machine wants multiple internal IPv6 subnets, and it consumes your IPv6 address space at a prodigious rate if you're designing for a reasonable number of machines on each subnet. I'm also not sure how everyone on the subnet is supposed to know how to talk to each other, which is something that people often do on subnets.)

Two visions of 'software supply chain security'

By: cks
21 October 2024 at 03:04

Although the website that is insisting I use MFA if I want to use it to file bug reports doesn't use the words in its messages to me, we all know that the reason it is suddenly demanding I use MFA is what is broadly known as "software supply chain security" and the 'software supply chain' (which is a contentious name for deciding that you're going to rely on other people's open source code). In thinking about this, I feel that you can have (at least) two visions of "software supply chain security".

In one vision, software supply chain security is a collection of well intentioned moves and changes that are intended to make it harder for bad actors to compromise open source projects and their source code. For instance, all of the package repositories and other places where software is distributed try to get everyone to use multi-factor authentication, so people with the ability to publish new versions of packages can't get their (single) password compromised and have that password used by an attacker to publish a compromised version of their package. You might also expect to see people looking into heavily used, security critical projects to see if they have enough resources and then some moves to provide those resources.

In the other vision, software supply chain security is a way for corporations to avoid being blamed when there's a security issue in open source software that they've pulled into their products or their operations (or both). Corporations mostly don't really care about achieving actual security, especially since real security may not be legibly secure, but they are sensitive to blame, especially because it can result in lawsuits, fines, and other consequences. If a corporation can demonstrate that it was following convincing best practices to obtain secure (open source) software, maybe it can deflect the blame. And when doing this, it's useful if the 'best practices' are clearly legible and easy to assess, such as 'where we get open source software from insists on MFA'.

In the second vision, you might expect a big (corporate) push for visible but essentially performative 'security' steps, with relatively little difficult analysis of underlying root causes of various security risks, much less much of an attempt to address deep structural issues like sustainable open source maintenance.

(If you want an extremely crude measuring stick, you can simply ask "would this measure have prevented the XZ Utils backdoor". Generally the answer is 'no'.)

Forced MFA is effectively an annoying, harder to deal with second password

By: cks
20 October 2024 at 02:32

Suppose, not hypothetically, that some random web site you use is forcing you to enable MFA on your account, possibly an account that in practice you use only to do unimportant things like report issues on other people's open source software. I've written before how MFA is both 'simple' and non-trivial work, but that entry half assumed that you might actually care about the extra security benefits of MFA. If some random unimportant (to you) website is forcing you to get MFA, this goes out the window.

What the website is really doing is forcing you to enable a second password for your account, one that you must use in addition to your first password. Instead of using a password saved in your password manager of choice, you must now use the same saved password plus an additional password that is invariably slower and more work to produce. We understand today that websites that prevent you (or your password manager) from pasting in passwords and force you to type them out by hand are doing it wrong; well, that's what MFA is doing, except that often you're going to need a second device to get that password (whether that is a phone or a security key).

(For extra bonus points, losing the second 'password' alone may be enough to permanently lose your account on the website. At the very least, you're going to need to do a number of extra things to avoid this.)

My view is that if something unimportant is forcing MFA on you and you don't feel like giving up on the site entirely, you might as well use the simplest, easiest to use MFA approach that you can. If the website will never let you in with the second factor alone, then it's perfectly okay for it to be relatively or completely insecure, and in any case you don't need to make it any more secure than your existing password management. In fact you might as well put it in your existing password management if possible, although I suspect that there are no current password managers that will both hold your password for a site and (automatically) generate the related TOTP MFA codes to go with it.

(You can get this on the same device, when you log in from your smartphone using its saved passwords and whatever authenticator app you're using. Don't ask how this is actually 'multi-factor', since anyone with your unlocked phone can use both factors; almost everyone in the MFA space is basically ignoring the issue because it would be too inconvenient to take it seriously.)
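For the curious, the TOTP codes these authenticator apps produce aren't magic tied to a device; they're just HMAC-SHA1 over a shared secret and a 30-second time counter (RFC 6238), so anything holding the secret can generate them. A minimal sketch:

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP (RFC 4226) applied to unix_time // step."""
    counter = struct.pack(">Q", unix_time // step)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)
```

This is the entire trick; a password manager that stored the secret alongside the password could produce the same codes an authenticator app does, which is exactly why treating the second factor as a second password works.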

Will this defeat the website's security goals for forcing MFA down your throat? Yes, absolutely. But that's their problem, not yours. You are under no obligation to take any website (or your presence on it) as seriously as it takes itself. MFA that is not helping anything you care about is an obstacle, not a service.

Of course, sauce for the goose is sauce for the gander, so if you're implementing MFA for your good local security needs, you should be considering if the people who have to use it are going to think of your MFA in this way. Maybe they shouldn't, but remember, people don't actually care about security (and people matter because security is people).

A surprising IC in a LED light chain.

By: cpldcpu
25 November 2024 at 19:23

LED-based festive decorations are a fascinating subject for exploration of ingenuity in low-cost electronics. New products appear every year and often very surprising technology approaches are used to achieve some differentiation while adding minimal cost.

This year, there wasn't any fancy new controller, but I was surprised how much the cost of simple light strings was reduced. The LED string above includes a small box with batteries and came in a set of ten for less than $2 shipped, so <$0.20 each. While I may have benefitted from promotional pricing, it is also clear that quite some work went into making the product cheap.

The string is constructed in the same way as one I had analyzed earlier: it uses phosphor-converted blue LEDs that are soldered to two insulated wires and covered with an epoxy blob. In contrast to the earlier device, they seem to have switched from copper wire to cheaper steel wires.

The interesting part is in the control box. It comes with three button cells, a small PCB, and a tactile button that turns the string on and cycles through different modes of flashing and constant light.

Curiously, there is nothing on the PCB except the button and a device that looks like an LED. Also, note how some "redundant" joints have simply been left unsoldered.

Closer inspection reveals that the "LED" is actually a very small integrated circuit packaged in an LED package. The four pins are connected to the push button, the cathode of the LED string, and the power supply pins. I didn't measure the die size exactly, but I estimate that it is smaller than 0.3×0.2 mm = ~0.06 mm².

What is the purpose of packaging an IC in an LED package? Most likely, the company that made the light string is also packaging their own LEDs, and they saved costs by also packaging the IC themselves, in a package type they had available.

I characterized the current-voltage behavior of the IC supply pins with the LED string connected. The LED string started to emit light at around 2.7V, which is consistent with the forward voltage of blue LEDs. The current increased proportionally to the voltage, which suggests that there is no current limit or constant current sink in the IC – it's simply a switch with some series resistance.

Left: LED string in "constantly on" mode. Right: Flashing

Using an oscilloscope, I found that the string is modulated with an on-off ratio of 3:1 at a frequency of ~1.2 kHz. The image above shows the voltage at the cathode; the anode is connected to the positive supply. This is most likely done to limit the current.
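As a back-of-envelope check on what that modulation means for the LEDs, a quick hedged calculation (the 20 mA peak drive is an assumed illustrative figure, not a measurement from this string):

```go
package main

import "fmt"

func main() {
	// With 3:1 on:off modulation the string is lit 3 out of every
	// 4 time slices, so the average current is 75% of the peak.
	const on, off = 3.0, 1.0
	duty := on / (on + off)
	peak := 20.0 // mA, assumed peak drive current (illustrative only)
	fmt.Printf("duty cycle %.0f%%, average current %.0f mA\n", duty*100, duty*peak)
}
```

At ~1.2 kHz the flicker is far above what the eye can follow, so the string simply looks somewhat dimmer than it would at 100% duty.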

All in all, it is rather surprising to see an ASIC being used when it barely does more than flash the LED string. It would have been nice to see a constant current source to stabilize the light levels over the lifetime of the battery, and maybe more interesting light effects. But I guess that would have increased the cost of the ASIC too much, and then using an ultra-low cost microcontroller may have been cheaper. This almost calls for a transplant of an MCU into this device…

Keep the crap going

By: VM
6 December 2024 at 09:16

Have you seen the new ads for Google Gemini?

In one version, just as a young employee is grabbing her fast-food lunch, she notices her snooty boss get on an elevator. So she drops her sandwich, rushes to meet her just as the doors are about to close, and submits her proposal in the form of a thick dossier. The boss asks her for a 500-word summary to consume during her minute-long elevator ride. The employee turns to Google Gemini, which digests the report and spits out the gist, and which the employee regurgitates to the boss’s approval. The end.


Isn’t this unsettling? Google isn’t alone either. In May this year, Apple released a tactless ad for its new iPad Pro. From Variety:

The "Crush!" ad shows various creative and cultural objects – including a TV, record player, piano, trumpet, guitar, cameras, a typewriter, books, paint cans and tubes, and an arcade game machine – getting demolished in an industrial press. At the end of the spot, the new iPad Pro pops out, shiny and new, with a voiceover that says, "The most powerful iPad ever is also the thinnest."

After the backlash, Apple backtracked and apologised – and then produced two ads in November for its Apple Intelligence product showcasing how it could help thoughtless people continue to be thoughtless.



The second video is additionally weird because it seems to suggest reaching all the way for an AI tool makes more sense than setting a reminder in the calendar app that comes with all smartphones these days.

And they are now joined in spirit by Google, because bosses can now expect their subordinates to Geminify their way through what could otherwise have been tedious work or just impossible to do on punishingly short deadlines – without the bosses having to think about whether their attitudes towards what they believe is reasonable to ask of their teammates need to change. (This includes a dossier of details that ultimately won't be read.)

If AI is going to absorb the shock that comes of someone being crappy to you, will we continue to notice that crappiness and demand they change or – as Apple and Google now suggest – will we blame ourselves for not using AI to become crappy ourselves? To quote from a previous post:

When machines make decisions, the opportunity to consider the emotional input goes away. This is a recurring concern I'm hearing about from people working with or responding to AI in some way. … This is Anna Mae Duane, director of the University of Connecticut Humanities Institute, in The Conversation: "I fear how humans will be damaged by the moral vacuum created when their primary social contacts are designed solely to serve the emotional needs of the 'user'."

The applications of these AI tools have really blossomed and millions of people around the world are using them for all sorts of tasks. But even if the ads don't pigeonhole these tools, they reveal how their makers – Apple and Google – are thinking about what the tools bring to the table and what these tech companies believe to be their value. To Google's credit at least, its other ads in the same series are much better (see here and here for examples), but they do need to actively cut down on supporting or promoting the idea that crappy behaviour is okay.

Two views of what a TLS certificate verifies

By: cks
2 October 2024 at 01:58

One of the things that you could ask about TLS is what a validated TLS certificate means or is verifying. Today there is a clear answer, as specified by the CA/Browser Forum, and that answer is that when you successfully connect to https://microsoft.com/, you are talking to the "real" microsoft.com, not an impostor who is intercepting your traffic in some way. This is known as 'domain control' in the jargon; to get a TLS certificate for a domain, you must demonstrate that you have control over the domain. The CA/Browser Forum standards (and the browsers) don't require anything else.

Historically there has been a second answer, what TLS (then SSL) sort of started with. A TLS certificate was supposed to verify not just the domain but that you were talking to the real "Microsoft" (which is to say the large, worldwide corporation with its headquarters in Redmond WA, not any other "Microsoft" that might exist). More broadly, it was theoretically verifying that you were talking to a legitimate and trustworthy site that you could, for example, give your credit card number to over the Internet, which used to be a scary idea.

This second answer has a whole raft of problems in practice, which is why the CA/Browser Forum has adopted the first answer, but it started out and persists because it's much more useful to actual people. Most people care about talking to (the real) Google, not some domain name, and domain names are treacherous things as far as identity goes (consider IDN homograph attacks, or just 'facebook-auth.com'). We rather want this human version of identity and it would be very convenient if we could have it. But we can't. The history of TLS certificates has convincingly demonstrated that this version of identity has comprehensively failed for a collection of reasons including that it's hard, expensive, difficult or impossible to automate, and (quite) fallible.

(The 'domain control' version of what TLS certificates mean can be automated because it's completely contained within the Internet. The other version is not; in general you can't verify that sort of identity using only automated Internet resources.)

A corollary of this history is that no Internet protocol that's intended for widespread usage can assume a 'legitimate identity' model of participants. This includes any assumption that people can only have one 'identity' within your system; in practice, since Internet identity can only verify that you are something, not that you aren't something, an attacker can have as many identities as they want (including corporate identities).

PS: The history of commercial TLS certificates also demonstrates that you can't use costing money to verify legitimacy. It sounds obvious to say it, but all that charging someone money demonstrates is that they are willing and able to spend some money (perhaps because they have a pet cause), not that they're legitimate.

TLS certificates were (almost) never particularly well verified

By: cks
22 September 2024 at 02:32

Recently there was a little commotion in the TLS world, as discussed in We Spent $20 To Achieve RCE And Accidentally Became The Admins Of .MOBI. As part of this adventure, the authors of the article discovered that some TLS certificate authorities were using WHOIS information to validate who controlled a domain (so if you could take over a WHOIS server for a TLD, you could direct domain validation to wherever you wanted). This then got some people to realize that TLS Certificate Authorities were not actually doing very much to verify who owned and controlled a domain. I'm sure that there were also some people who yearned for a hypothetical old days when Certificate Authorities actually did that, as opposed to the modern days when they don't.

I'm afraid I have bad news for anyone with this yearning. Certificate Authorities have never done a particularly strong job of verifying who was asking for a TLS (then SSL) certificate. I will go further and be more controversial; we don't want them to be thorough about identity verification for TLS certificates.

There are a number of problems with identity verification in theory and in practice, but one of them is that it's expensive, and the more thorough and careful the identity verification, the more expensive it is. No Certificate Authority is in a position to absorb this expense, so a world where TLS certificates are carefully verified is also a world where they are expensive. It's also probably a world where they're difficult or impossible to obtain from a Certificate Authority that's not in your country, because the difficulty of identity verification goes up significantly in that case.

(One reason that thorough and careful verification is expensive is that it takes significant time from experienced, alert humans, and that time is not cheap.)

This isn't the world that we had even before Let's Encrypt created the ACME protocol for automated domain verifications. The pre-LE world might have started out with quite expensive TLS certificates, but it shifted fairly rapidly to ones that cost only $100 US or less, which is a price that doesn't cover very much human verification effort. And in that world, with minimal human involvement, WHOIS information is probably one of the better ways of doing such verification.

(Such a world was also one without a lot of top level domains, and most of the TLDs were country code TLDs. The turnover in WHOIS servers was probably a lot smaller back then.)

PS: The good news is that using WHOIS information for domain verification is probably on the way out, although how soon this will happen is an open question.

Threads, asynchronous IO, and cancellation

By: cks
14 September 2024 at 02:23

Recently I read Asynchronous IO: the next billion-dollar mistake? (via), and had a reaction to one bit of it. Then yesterday on the Fediverse I said something about IO in Go:

I really wish you could (easily) cancel io Reads (and Writes) in Go. I don't think there's any particularly straightforward way to do it today, since the io package was designed way before contexts were a thing.

(The underlying runtime infrastructure can often actually do this because it decouples 'check for IO being possible' from 'perform the IO', but stuff related to this is not actually exposed.)

Today this sparked a belated realization in my mind, which is that a model of threads performing blocking IO in each thread is simply a harder environment to have some sort of cancellation in than an asynchronous or 'event loop' environment. The core problem is that in their natural state, threads are opaque and therefore difficult to interrupt or stop safely (which is part of why Go's goroutines can't be terminated from the outside). This is the natural inverse of how threads handle state for you.

(This is made worse if the thread is blocked in the operating system itself, for example in a 'read()' system call, because now you have to use operating system facilities to either interrupt the system call so the thread can return to user level to notice your user level cancellation, or terminate the thread outright.)

Asynchronous IO generally lets you do better in a relatively clean way. Depending on the operating system facilities you're using, either there is a distinction between the OS telling you that IO is possible and your program doing IO, providing you a chance to not actually do the IO, or in an 'IO submission' environment you generally can tell the OS to cancel a submitted but not yet completed IO request. The latter is racy, but in many situations the IO is unlikely to become possible right as you want to cancel it. Both of these let you implement a relatively clean model of cancelling a conceptual IO operation, especially if you're doing the cancellation as the result of another IO operation.

Or to put it another way, event loops may make you manage state explicitly, but that also means that that state is visible and can be manipulated in relatively natural ways. The implicit state held in threads is easy to write code with but hard to reason about and work with from the outside.

Sidebar: My particular Go case

I have a Go program that at its core involves two goroutines, one reading from standard input and writing to a network connection, one reading from the network connection and writing to standard output. Under some circumstances, the goroutine reading from the network will want to close down the network connection and return to a top level, where another two way connection will be made. In the process, it needs to stop the 'read from stdin, write to the network' goroutine while it is parked in 'read from stdin', without closing stdin (because that will be reused for the next connection).

To deal with this cleanly, I think I would have to split the 'read from standard input, write to the network' goroutine into two that communicated through a channel. Then the 'write to the network' side could be replaced separately from the 'read from stdin' side, allowing me to cleanly substitute a new network connection.

(I could also use global variables to achieve the same substitution, but let's not.)

Ways ATX power supply control could work on server motherboards

By: cks
11 September 2024 at 03:02

Yesterday I talked about how ATX power supply control seems to work on desktop motherboards, which is relatively straightforward; as far as I can tell from various sources, it's handled in the chipset (on modern Intel chipsets, in the PCH), which is powered from standby power by the ATX power supply. How things work on servers is less clear. Here when I say 'server' I mean something with a BMC (Baseboard management controller), because allowing you to control the server's power supply is one of the purposes of a BMC, which means the BMC has to hook into this power management picture.

There appear to be a number of ways that the power control and management could or may be done and the BMC connected to it. People on the Fediverse replying to my initial question gave me a number of possible answers.

I found documentation for some of Intel's older Xeon server chipsets (with provisions for BMCs) and as of that generation, power management was still handled in the PCH and described in basically the same language as for desktops. I couldn't spot a mention of special PCH access for the BMC, so BMC control over server power might have been implemented with the 'BMC controls the power button wire' approach.

I can also imagine hybrid approaches. For example, you could in theory give the BMC control over the 'turn power on' wire to the power supplies, and route the chipset's version of that line to the BMC, in addition to routing the power button wire to the BMC. Then the BMC would be in a position to force a hard power off even if something went wrong in the chipset (or a hard power on, although if the chipset refuses to trigger a power on there might be a good reason for that).

(Server power supplies aren't necessarily 'ATX' power supplies as such, but I suspect that they all have similar standby power, 'turn power on', and 'is the PSU power stable' features as ATX PSUs do. Server PSUs often clearly aren't plain ATX units because they allow the BMC to obtain additional information on things like the PSU's state, temperature, current power draw, and so on.)

Our recent experience with BMCs that wouldn't let their servers power on when they should have suggests that on these servers (both Dell R340s), the BMC has some sort of master control or veto power over the normal 'return to last state' settings in the BIOS. At the same time, the 'what to do after AC power returns' setting is in the BIOS, not in the BMC, so it seems that the BMC is not the sole thing controlling power.

(I tried to take a look at how this was done in OpenBMC, but rapidly got lost in a twisty maze of things. I think at least some of the OpenBMC supported hardware does this through I2C commands, although what I2C device it's talking to is a good question. Some of the other hardware appears to have GPIO signal definitions for power related stuff, including power button definitions.)

How ATX power supply control seems to work on desktop motherboards

By: cks
10 September 2024 at 03:11

Somewhat famously, the power button on x86 PC desktop machines with ATX power supplies is not a 'hard' power switch that interrupts or enables power through the ATX PSU but a 'soft' button that is controlled by the overall system. The actual power delivery is at least somewhat under software control, both by the operating system (which enables modern OSes to actually power off the machine under software control) and by the 'BIOS', broadly defined, which will do things like signal the OS to do an orderly shutdown if you merely tap the power button instead of holding it down for a few seconds. Because they're useful, 'soft' power buttons and the associated things have also spread to laptops and servers, even if their PSUs are not necessarily 'ATX' as such. After recent events, I found myself curious about what actually handles the chassis power button and associated things. Asking on the Fediverse produced a bunch of fascinating answers, so today I'm starting with plain desktop motherboards, where the answer seems to be relatively straightforward.

(As I looked up once, physically the power button is normally a momentary-contact switch that is open (off) when not pressed. A power button that's stuck 'pressed' can have odd effects.)

At the direct electrical level, ATX PSUs are either on, providing their normal power, or "off", which is not really completely off but has the PSU providing +5V standby power (with a low current limit) on a dedicated pin (pin 9, the ATX cable normally uses a purple wire for this). To switch an ATX PSU from "off" to on, you ground the 'power on' pin and keep it grounded (pin 16; the green wire in normal cables, and ground is black wires). After a bit of stabilization time, the ATX PSU will signal that all is well on another pin (pin 8, the grey wire). The ATX PSU's standby power is used to power the RTC and associated things, to provide the power for features like wake-on-lan (which requires network ports to be powered up at least a bit), and to power whatever handles the chassis power button when the PSU is "off".

On conventional desktop motherboards, the actual power button handling appears to be in the PCH or its equivalent (per @rj's information on the ICH, and also see Whitequark's ICH/PCH documentation links). In the ICH/PCH, this is part of general power management, including things like 'suspend to RAM'. Inside the PCH, there's a setting (or maybe two or three) that controls what happens when external power is restored; the easiest to find one is called AFTERG3_EN, which is a single bit in one of the PCH registers. To preserve this register's settings over loss of external power, it's part of what the documentation calls the "RTC well", which is apparently a chunk of stuff that's kept powered as part of the RTC, either from standby power or from the RTC's battery (depending on whether or not there's external power available). The ICH/PCH appears to have a direct "PWRBTN#" input line, which is presumably eventually connected to the chassis power button, and it directly implements the logic for handling things like the 'press and hold for four seconds to force a power off' feature (which Intel describes as 'transitioning to S5', the "Soft-Off" state).

('G3' is the short Intel name for what Intel calls "Mechanical Off", the condition where there's no external power. This makes the AFTERG3_EN name a bit clearer.)

As far as I can tell there's no obvious and clear support for the modern BIOS setting of 'when external power comes back, go to your last state'. I assume that what actually happens is that the ICH/PCH register involved is carefully updated by something (perhaps ACPI) as the system is powered on and off. When the system is powered on, early in the sequence you'd set the PCH to 'go to S0 after power returns'; when the system is powered off, right at the end you'd set the PCH to 'stay in S5 after power returns'.
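That careful-updating idea can be modelled in a few lines (a toy model only: the field name is a stand-in, and the real AFTERG3_EN bit has its own polarity, register offsets, and access mechanics):

```go
package main

import "fmt"

// pch is a toy stand-in for the chipset's battery-backed register.
// Because software flips the bit on every power transition, the single
// bit always records what to do when external power next returns.
type pch struct{ wakeAfterPowerLoss bool }

func (p *pch) powerOn()  { p.wakeAfterPowerLoss = true }  // set early in power-on
func (p *pch) powerOff() { p.wakeAfterPowerLoss = false } // set late in power-off

func main() {
	var p pch
	p.powerOn() // machine was running when AC was lost
	fmt.Println("AC returns, power up:", p.wakeAfterPowerLoss)
	p.powerOff() // machine had been cleanly shut down before AC loss
	fmt.Println("AC returns, power up:", p.wakeAfterPowerLoss)
}
```

The point of the model is that 'return to last state' needs no extra hardware state beyond the one bit, as long as every transition updates it.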

(And apparently you can fiddle with this register yourself (via).)

All of the information I've dug up so far is for Intel ICH/PCH, but I suspect that AMD's chipsets work in a similar manner. Something has to do power management for suspend and sleep, and it seems that the chipset is the natural spot for it, and you might as well put the 'power off' handling into the same place. Whether AMD uses the same registers and the same bits is an open question, since I haven't turned up any chipset documentation so far.

Operating system threads are always going to be (more) expensive

By: cks
7 September 2024 at 04:01

Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:

Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]

I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.

A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain from regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires the operating system to go through non-trivial work in order to be safe from user level attacks.

(The whole speculative execution set of attacks has only made this worse.)

A great deal of the low level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest system call possible still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).

The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just for the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work to get around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO once you peer behind the scenes of the abstractions that programs are seeing.

(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)

(See also thinking about event loops versus threads.)

TLS Server Name Indications can be altered by helpful code

By: cks
4 September 2024 at 03:25

In TLS, the Server Name Indication is how (in the modern TLS world) you tell the TLS server what (server) TLS certificate you're looking for. A TLS server that has multiple TLS certificates available, such as a web server handling multiple websites, will normally use your SNI to decide what server TLS certificate to provide to you. If you provide an SNI that the TLS server doesn't know or don't provide an SNI at all, the TLS server can do a variety of things, but many will fall back to some default TLS certificate. Use of SNI is pervasive in web PKI but not always used elsewhere; for example, SMTP clients don't always send SNI when establishing TLS with a SMTP server.

The official specification for SNI is section 3 of RFC 6066, and it permits exactly one format of the SNI data, which is, let's quote:

"HostName" contains the fully qualified DNS hostname of the server, as understood by the client. The hostname is represented as a byte string using ASCII encoding without a trailing dot. [...]

Anything other than this is an incorrectly formatted SNI. In particular, sending a SNI using a DNS name with a dot at the end (the customary way of specifying a fully qualified name in the context of DNS) is explicitly not allowed under RFC 6066. RFC 6066 SNI names are always fully qualified and without the trailing dots.

So what happens if you provide a SNI with a trailing dot? That depends. In particular, if you're providing a name with a trailing dot to a client library or a client program that does TLS, the library may helpfully remove the trailing dot for you when it sends the SNI. Go's crypto/tls definitely behaves this way, and it seems that some other TLS libraries do too. Based on observing behavior on systems I have access to, I believe that OpenSSL does strip the trailing dot but GnuTLS doesn't, and probably Mozilla's NSS doesn't either (since Firefox appears not to do this).

(I don't know what a TLS server sees as the SNI if it uses these libraries, but it appears likely that OpenSSL doesn't strip the trailing dot but instead passes it through literally.)
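The client-side behavior amounts to something like this sketch (this is what dot-stripping libraries effectively do before sending the SNI; the real libraries do it deep inside their handshake code):

```go
package main

import (
	"fmt"
	"strings"
)

// sniHostname models the normalization described above: RFC 6066
// requires the SNI hostname to be fully qualified with no trailing
// dot, so a customary DNS-style trailing dot is silently removed.
func sniHostname(host string) string {
	return strings.TrimSuffix(host, ".")
}

func main() {
	fmt.Println(sniHostname("www.example.com.")) // dot stripped
	fmt.Println(sniHostname("www.example.com"))  // already conformant
}
```

Because the stripping is silent, a client using such a library cannot actually test how a server reacts to a dotted SNI without going below the library's API.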

This dot stripping behavior is generally silent, which can lead to confusion if you're trying to test the behavior of providing a trailing dot in the SNI (which can cause web servers to give you errors). At the same time it's probably sensible behavior for the client side of TLS libraries, since some of the time they will be deriving the SNI hostname from the host name the caller has given them to connect to, and the caller may want to indicate a fully qualified DNS name in the customary way.

PS: Because I looked it up, the Go crypto/tls client code strips a trailing dot while the server code rejects a TLS ClientHelo that includes a SNI with a trailing dot (which will cause the TLS connection to fail).

The status of putting a '.' at the end of domain names

By: cks
2 September 2024 at 02:29

A variety of things that interact with DNS interpret the host or domain name 'host.domain.' (with a '.' at the end) as the same as the fully qualified name 'host.domain'; for example this appears in web browsers and web servers. At this point one might wonder whether this is an official thing in DNS or merely a common convention and practice. The answer is somewhat mixed.

In the DNS wire protocol, initially described in RFC 1035, we can read this (in section 3.1):

Domain names in messages are expressed in terms of a sequence of labels. Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. [...]

DNS has a 'root', which all DNS queries (theoretically) start from, and a set of DNS servers, the root nameservers, that answer the initial queries that tell you what the DNS servers for a top level domain are (such as the '.edu' or the '.ca' DNS servers). In the wire format, this root is explicitly represented as a 'null label', with zero length (instead of being implicit). In the DNS wire format, all domain names are fully qualified (and aren't represented as plain text).

RFC 1035 also defines a textual format to represent DNS information, Master files. When processing these files there is usually an 'origin', and textual domain names may be relative to that origin or absolute. The RFC says:

[...] Domain names that end in a dot are called absolute, and are taken as complete. Domain names which do not end in a dot are called relative; the actual domain name is the concatenation of the relative part with an origin specified in a $ORIGIN, $INCLUDE, or as an argument to the master file loading routine. A relative name is an error when no origin is available.

So in textual DNS data that follows RFC 1035's format, 'host.domain.' is how you specify an absolute (fully qualified) DNS name, as opposed to one that is under the current origin. Bind uses this format (or something derived from it, here in 2024 I don't know if it's strictly RFC 1035 compliant any more), and in hand-maintained Bind format zone files you can find lots of use of both relative and absolute domain names.
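The master-file rule quoted above boils down to a one-line check (a sketch only; real zone file parsing has many more concerns, like $ORIGIN handling and escapes):

```go
package main

import (
	"fmt"
	"strings"
)

// qualify applies the RFC 1035 master-file rule: a name ending in a
// dot is absolute and taken as-is; any other name is relative and has
// the current origin (itself absolute) appended.
func qualify(name, origin string) string {
	if strings.HasSuffix(name, ".") {
		return name
	}
	return name + "." + origin
}

func main() {
	origin := "example.org." // a hypothetical $ORIGIN for illustration
	fmt.Println(qualify("www", origin))          // relative, gets qualified
	fmt.Println(qualify("host.domain.", origin)) // absolute, unchanged
}
```

This is why a forgotten trailing dot is a classic zone file mistake: 'host.domain' in a record silently becomes 'host.domain.example.org.'.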

DNS data doesn't have to be represented in text in RFC 1035 form (and doing so has some traps), either for use by DNS servers or for use by programs who do things like look up domain names. However, it's not quite accurate to say that 'host.domain.' is only a convention. A variety of things use a more or less RFC 1035 format, and in those things a terminal '.' means an absolute name because that's how RFC 1035 says to interpret and represent it.

Since RFC 1035 uses a '.' at the end of a domain name to mean a fully qualified domain name, it's become customary for code to accept one even if the code already only deals with fully qualified names (for example, DNS lookup libraries). Every program that accepts or reports this format creates more pressure on other programs to accept it.

(It's also useful as a widely understood signal that the textual domain name returned through some API is fully qualified. This may be part of why Go's net package consistently returns results from various sorts of name resolutions with a terminating '.', including in things like looking up the name(s) of IP addresses.)

At the same time, this syntax for fully qualified domain names is explicitly not accepted in certain contexts that have their own requirements. One example is in email addresses, where 'user@some.domain.' is almost invariably going to be rejected by mail systems as a syntax error.

In practice, abstractions hide their underlying details

By: cks
1 September 2024 at 01:58

Very broadly, there are two conflicting views of abstractions in computing. One camp says that abstractions simplify the underlying complexity but people still should know about what is behind the curtain, because all abstractions are leaky. The other camp says that abstractions should hide the underlying complexity entirely and do their best not to leak the details through, and that people using the abstraction should not need to know those underlying details. I don't particularly have a side, but I do have a pragmatic view, which is that many people using abstractions don't know the underlying details.

People can debate back and forth about whether people should know the underlying details and whether they are incorrect to not know them, but the well established pragmatic reality is that a lot of people writing a lot of code and building a lot of systems don't know more than a few of the details behind the abstractions that they use. For example, I believe that a lot of people in web development don't know that host and domain names can often have a dot at the end. And people who have opinions about programming probably have a favorite list of leaky abstractions that people don't know as much about as they should.

(One area where a lot of programming abstractions 'leak' is performance. For example, the (C)Python interpreter is often much faster if you use local variables inside a function than if you use global variables, because of implementation details hidden inside the abstraction it presents to you; local variable access is an indexed read from the function's frame, while global access is a name lookup.)
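You can see this particular detail behind the curtain with CPython's standard dis module, which shows that the two kinds of access compile to different bytecode operations:

```python
import dis

GLOBAL_X = 1

def uses_global():
    return GLOBAL_X + GLOBAL_X

def uses_local(x):
    # Function parameters are local variables.
    return x + x

g_ops = [i.opname for i in dis.get_instructions(uses_global)]
l_ops = [i.opname for i in dis.get_instructions(uses_local)]
# Global access compiles to LOAD_GLOBAL (a name lookup, cached in
# recent CPython versions); local access compiles to LOAD_FAST (an
# indexed read from the frame's array of locals).
```

The abstraction ('variables are variables') hides this entirely, which is exactly why most people never find out about it until it matters.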

That this happens should not be surprising. People have a limited amount of time and a limited amount of things that they can learn, remember, and keep track of. When presented with an abstraction, it's very attractive to not sweat the details, especially because no one can keep track of all of them. Computing is simply too complicated to see behind all of the abstractions all of the way down. Almost all of the time, your effort is better focused on learning and mastering your layer of the abstraction stack rather than trying to know 'enough' about every layer (especially when it's not clear how much is enough).

(Another reason to not dig too deeply into the details behind abstractions is that those details can change, especially if one reason the abstraction exists is to allow the details to change. We call some of these abstractions 'APIs' and discourage people investigating and using the specific details behind the current implementations.)

One corollary of this is that safety and security related abstractions need to be designed with the assumption that people using them won't know or remember all of the underlying details. If forgetting one of those details will leave people using the abstraction with security problems, the abstraction has a design flaw that will inevitably lead to a security issue sooner or later. This security issue is not the fault of the people using the abstraction, except in a mathematical security way.

My (current) view on open source moral obligations and software popularity

By: cks
24 August 2024 at 02:59

A while back I said something pretty strong in a comment on my entry on the Linux kernel CVE story:

(I feel quite strongly that the importance of a project cannot create substantial extra obligations on the part of the people working on the project. We do not get to insist that other people take on more work just because their project got popular. In my view, this is a core fallacy at the heart of a lot of "software supply chain security" stuff, and I think things like the Linux kernel CVE handling are the tip of an iceberg of open source reactions to it.)

After writing that, I thought about it more and I think I have a somewhat more complicated view on moral obligations (theoretically) attached to open source software. To try to boil it down, I feel that other people's decisions should not create a moral obligation on your part.

If you write a project to scratch your itch and a bunch of other people decide to use it too, that is on them, not on you. You have no moral obligation to them that accrues because they started using your software, however convenient it might be for them if you did or however much time might be saved if you did something instead of many or all of them doing something. Of course you may be a nice person, and you may also be the kind of person who is extremely conscious of how many people are relying on your software and what might happen to them if you did or didn't do various things, but that is your decision. You don't have a positive moral obligation to them.

(It's my view that this lack of obligations is a core part of what makes free software and open source software work at all. If releasing open source software came with firm moral or legal obligations, we would see far less of it.)

However, in a bit of a difference from what I implied in my comment, I also feel that while other people's actions don't create a moral obligation on you, your own actions may. If you go out and actively promote your software, try to get it widely used, put forward that you're responsive and stand ready to fix problems, and so on, then the moral waters are at least muddy. If you explicitly acted to put yourself and your software forward, other people sort of do have the (moral) right to assume that you're going to live up to your promises (whether they're explicit or implicit). However, there has to be a line somewhere; you shouldn't acquire an unlimited, open-ended obligation to do work for other people using your software just because you promoted your software a bit.

(The issue of community norms is another thing entirely. I'm sure there are some software communities where merely releasing something into the community comes with the social expectation that you'll support it.)

Staged rollouts of things still have limitations

By: cks
6 August 2024 at 02:45

One of the commonly suggested remedies for deploying things that can go wrong is to do staged rollouts, where you deploy to only a subset of the things at a time and look for problems before proceeding. Staged rollouts are in general a good idea, but it's important to understand that there are limits on how much they can improve the situation, especially if the staged rollouts are going out to outside people ('customers') instead of internally, within your organization in environments that you control.

The first limitation is that staged rollouts only help to the extent that you can actually detect problems before continuing with the rollout. Often what problems you can detect (and how soon) are limited by the telemetry you have available and the degree to which you can inspect and monitor the systems that you're rolling out to. If you're rolling out internally, this can possibly be quite high, but if you're rolling out to customers, you may have limited telemetry (partly because customers will object to your software constantly reporting things back to you, especially if you want to report lots of details) and no ability to reach out and inspect systems. A related issue is that when you build rollout telemetry and monitoring, you're probably basing the telemetry on what problems you expect. If your rollout triggers a problem that you didn't foresee, you may have no telemetry that would tell you about it.

(For a topical example, consider the telemetry you'd need to detect that your application has made your customer's machines crash and be unable to boot. Since the machines aren't booting, you can't send any telemetry from them to actually report the problem; instead you'd need some telemetry signal that your application was running fine and then monitor this signal for a rapid decrease in your staging group. Would you think to both build and monitor this telemetry signal in advance?)

The second limitation is that if your staged rollout detects problems, you've (still) inflicted problems on some people, just not as many of them as without a staged rollout. Again, this is more of a problem with external staged rollouts than with internal ones. When your staged rollout is internal, you're inflicting problems on yourself; when your staged rollout is external, you're inflicting problems on other people and they're going to be unhappy with you. Staged external rollouts don't eliminate problems, they merely reduce them.

(For instance, Ubuntu has a system of 'phased updates' for non-security updates of some packages, such as OpenSSH, but if an update is bad and detected in this phased update process, and you happen to be one of the people who got the update early, you get to sort out whatever mess it's made of your system.)
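As a sketch of the underlying mechanism (this is not how Ubuntu's phased updates actually work, and the names here are made up), one common way to decide who is in the current phase of a staged rollout is to deterministically hash a machine or user identifier into a percentage bucket:

```python
import hashlib

def rollout_bucket(machine_id: str, salt: str = "update-2024-08") -> int:
    """Deterministically map an identifier to a bucket in [0, 100).
    The salt varies per rollout so the same machines aren't always
    the early (unlucky) ones."""
    h = hashlib.sha256((salt + ":" + machine_id).encode()).digest()
    return int.from_bytes(h[:8], "big") % 100

def in_phase(machine_id: str, phase_percent: int) -> bool:
    """True if this machine should receive the update at the
    current phase (e.g. 10 for a 10% rollout)."""
    return rollout_bucket(machine_id) < phase_percent
```

The determinism matters: as you ramp the phase percentage up, machines already in the rollout stay in it, and no central server has to remember who got what.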

In addition, staged rollouts are in conflict with rapid updates. The slower and more carefully you do a staged rollout, the longer (on average) it takes for your update to reach people and become functional. This isn't vital for some updates, but we know update speed matters for some things. As an extreme example, if you're pushing out an update to deal with a security problem that's being actively exploited, most people are going to want it right now and the slower your staged rollout runs, the more people will wind up being exploited.

This doesn't excuse doing a non-staged rollout that blows up. Or even a staged rollout that only blows up some people. It's your job to only roll out good changes, and as part of that to test your changes (and your systems) before throwing them into the field. Staged rollouts are an emergency backup in case an error slipped through your other precautions, especially external staged rollouts, where you can't easily fix any problems that you caused.

(The corollary is that if staged rollouts are regularly saving you, you have additional problems and should probably fix them first.)

PS: There are probably situations where it's sensible to make internal staged rollouts your main defense against bad updates. But otherwise it's my view that staged rollouts should be your emergency backup to all of the other testing and validation you're doing.

Part of (computer) security is convincing people that it works

By: cks
20 July 2024 at 02:17

One of the ways that security is people, not math is that as part of security being ultimately about people, part of the work of computer security is convincing people that your security measures actually work. I don't mean this in a narrow technical sense of specific technical features working as designed; I mean this in the broader sense that they achieve the general goals of security, which is really about safety. People want to know that their data and what they do on the computer is safe, in the full sense of the confidentiality, integrity, and availability triad.

Often, convincing people that your security works requires making it legible to them, in the "Seeing Like a State" sense. One way to describe this situation is that partly due to the sorry history of computer security and people not doing effective computer security, many people and organizations have adopted a view that they assume computer security measures don't work or aren't effective until proven otherwise. If you can't convince them that your security measure works, in the process making it legible to them, they assume it doesn't. Historically they were often right.

One complication is that the people you're trying to make your security measures convincing and legible to are almost always people who don't have specialist knowledge in computer security. Often they have little to no knowledge in the field at all (just like you don't have expert-level knowledge in their fields). This means that you generally can't convince them by explaining the technical details, because they don't have the knowledge and experience they'd require to evaluate those details. Handling this has no straightforward solution, but it will often require some degree of building their trust in your skill and honesty coupled with some degree of using things that other independent and trusted (by the people you're trying to convince) parties have already called secure. This is part of what it means to have legibility in your security measures; you're making something that other people can understand and assess, even if it's not what you'd make for yourself.

Some system administrators and other computer people can wind up feeling upset about this, because from their perspective the organization is preferring inferior outside solutions (that have the social approval of the crowd) to the superior home grown work. However, all of us inclined to see things from this angle really should turn around and look at it from the organization's perspective. For the organization, it's not a choice between inferior but generally approved security and home grown 'real security', it's a choice between known (although maybe flawed) security and an unknown state where they may be more secure, as secure, less secure, or completely exposed. It's perfectly sensible for the organization to choose a known state over a risky unknown one.

(It's taken me a long time to come around to this perspective over the course of my career, because of course in the beginning I was solidly in the 'this is obviously better security, because of ...' camp. Even today I'm in the camp of 'real security', it's just that I've come to appreciate that part of my job is convincing the organization that what we're offering is not a 'risky unknown' state.)

My self-inflicted UPS and computer conundrum

By: cks
17 July 2024 at 04:44

Today the area where I live experienced what turned out to be an extended power outage, and I wound up going in to the office. In the process, I wound up shooting myself in the foot as far as my ability to tell from the office if power had returned to home, oddly because I have a UPS at home.

The easy way to use a UPS is just to let it run down until it powers off. But this is kind of abrupt on the computer, even if you've more or less halted it already, and it also means that the computer's load reduces the run time for everything else. In my case this matters a bit because after a power loss, my phone line is typically slow to get DSL signal and sync up, so that I can start doing PPPoE and bring up my Internet connection. So if it looks like the UPS's power is running low, my reflex is to power off the computer and hope that power will come back before the UPS bottoms out and the DSL modem turns off (and thus loses line sync).

The first problem is that this only really works if I'm going to stick around to turn the computer on should the power outage end early enough (before the UPS loses power). That turned out not to be a problem this time; the power outage lasted more than long enough to run the UPS out of power, even with only the minor load of the DSL modem, a five-port switch, and a few other small things. The bigger problem is that because of how I have my computer set up right now due to hardware reasons, if I want the computer to be drawing no power (as opposed to being 'off' in some sense), I have to turn the computer off using the hard power switch on the PSU. Once I've flipped this switch, the computer is off until I flip it back, and if I flip it back with (UPS) power available, the computer will power back up again and start drawing power and all that.

(My BIOS is set to 'always power up when AC power is restored', and apparently one side effect of this is that the chassis fans and so on keep spinning even when the system is 'powered off' from Linux.)

The magic UPS feature I'd like in order to fix this is a one-shot push-button switch for every outlet that temporarily switches the outlet to 'wait until AC power returns to give this outlet any power'. With this, I could run 'poweroff' on my computer, then push the button to cut power and have it come back when Toronto Hydro restored service. I believe it might be possible to do this with commands to the UPS, but that mostly doesn't help me since the host that would issue those commands is the one I'm running 'poweroff' on.

(The better solution would be a BIOS and hardware that turns everything off after 'poweroff' even when set to always power up after AC comes back. Possibly this is fixed in a later BIOS revision than I have.)

People at universities travel widely and unpredictably

By: cks
16 July 2024 at 02:11

Every so often, people make the entirely reasonable suggestion that if one day you see a particular person log in locally and then a few days later they're logging in from half way around the world, perhaps you should investigate. This may work for many organizations, but unfortunately it is one of the ways in which universities are peculiar places. At universities, a lot of people travel, they do it a fair bit (and unpredictably), and they go to all sorts of strange places, where they will connect back to the university to continue doing work (for professors and graduate students, at least).

There are all sorts of causes for this travel. Professors, postdocs, and graduate students go to conferences in various locations. Professors go on sabbatical, or go visit another university for a month or two, or even go hang out at a company for a while (perhaps as a visiting researcher). Graduate students also go home to visit their family, which can put them pretty much anywhere in the world, and they can also visit places for other reasons.

(Graduate students are often strongly encouraged to keep working all the time, including on holiday visits to their family. Even professors can feel similar pressures in the modern academic environment.)

Professors, postdocs, and graduate students will not tell you all of this information ahead of time, and even if you forced them to share their travel plans, it would not necessarily be useful because they may well have no idea how they will be connecting to the Internet at their destination (and what IP address ranges that would involve). Plus, geolocation of Internet IP addresses is not particularly exact or accurate, especially if you need to do it for free.

One corollary of this is that at a university, you often can't safely do broad 'geographic' blocks of logins (or VPN connections, or whatever) from IP address ranges, because there's no guarantee that one of your people isn't going to pop up there. The more populous the geographic area, the more likely that some of your people are going to be there sooner or later.

(An additional complication is people who move elsewhere (or are elsewhere) but maintain a relationship with your part of the university, and as part of that may visit in person every so often. These people travel too, and are even less likely to tell you their travel plans, since now you're a third party to them.)

Network switches aren't simple devices (not even basic switches)

By: cks
13 July 2024 at 03:13

Recently over on the Fediverse I said something about switches:

"Network switches are simple devices" oh I am so sorry. Hubs were simple devices. Switches are alarmingly smart devices even if they don't handle VLANs or support STP (and almost everyone wants them to support Spanning Tree Protocol, to stop loops). Your switch has onboard packet buffering, understands Ethernet addresses, often generates its own traffic and responds to network traffic (see STP), and is actually a (layer 2) high speed router with a fallback to being a hub.

(And I didn't even remember about multicast, plus I omitted various things. The trigger for my post was seeing a quote from Making a Linux-managed network switch, which is speaking (I believe) somewhat tongue in cheek and anyway is a fun and interesting article.)

Back in the old days, a network hub could simply repeat incoming packets out each port, with some hand waving about having to be aware of packet boundaries (see the Wikipedia page for more details). This is not the case with switches. Even a very basic switch must extract source and destination Ethernet addresses out of packets, maintain a mapping table between ports and Ethernet addresses, and route incoming packets to the appropriate port (or send them to all ports if they're to an unknown Ethernet address). This generally needs to be done at line speed and handle simultaneous packets on multiple ports at once.
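The core forwarding logic can be sketched in a few lines (a toy model only, ignoring buffering, table entry timeouts, line-rate concerns, and everything else a real switch does):

```python
class LearningSwitch:
    """Toy model of basic switch forwarding: learn which port each
    source MAC address is behind, forward to the learned port for
    known destinations, flood everything else."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # MAC address -> port

    def handle(self, in_port, src_mac, dst_mac):
        """Return the set of ports a frame should be sent out of."""
        self.mac_table[src_mac] = in_port      # learn (or refresh)
        out = self.mac_table.get(dst_mac)
        if out == in_port:
            return set()                       # same port: drop it
        if out is not None:
            return {out}                       # known destination
        return self.ports - {in_port}          # unknown: flood
```

Even this stripped-down version has to parse Ethernet headers and maintain state per address, which is already well beyond what a hub ever did.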

Switches must have some degree of internal packet buffering, although how much buffering switches have can vary (and can matter). Switches need buffering to deal with both a high speed port sending to a low speed one and several ports all sending traffic to the same destination port at the same time. Buffering implies that packet reception and packet transmission can be decoupled from each other, although ideally there is no buffering delay if the receive to transmit path for a packet is clear (people like low latency in switches).

A basic switch will generally be expected to both send and receive special packets itself, not just pass through network traffic. Lots of people want switches to implement STP (Spanning Tree Protocol) to avoid network loops (which requires the switch to send, receive, and process packets itself), and probably Ethernet flow control as well. If the switch is going to send out its own packets in addition to incoming traffic, it needs the intelligence to schedule this packet transmission somehow and deal with how it interacts with regular traffic.

If the switch supports VLANs, several things get more complicated (although VLAN support generally requires a 'managed switch', since you have to be able to configure the VLAN setup). In common configurations the switch will need to modify packets passing through to add or remove VLAN tags (as packets move between tagged and untagged ports). People will also want the switch to filter incoming packets, for example to drop a VLAN-tagged packet if the VLAN in question is not configured on that port. And they will expect all of this to still run at line speed with low latency. In addition, the switch will generally want to segment its Ethernet mapping table by VLAN, because bad things can happen if it's not.

(Port isolation, also known as "private VLANs", adds more complexity but now you're well up in managed switch territory.)

PS: Modern small network switches are 'simple' in the sense that all of this is typically implemented in a single chip or close to it; the Making a Linux-managed network switch article discusses a couple of them. But what is happening inside that IC is a marvel.

Using WireGuard as a router to get around reachability issues

By: cks
9 July 2024 at 03:33

Suppose that you have a machine, or a set of machines, that can't be readily reached from the outside world with random traffic (for example, your home LAN setup), and you also have a roaming machine that you want to use to reach those machines (for example, your phone). If you only had one of these problems, you could set up a straightforward WireGuard tunnel, where your roaming phone talked to the WireGuard machines on your home LAN. But on the surface, having both of them sounds like you need some degree of complex inbound NAT gateway on a fixed and reachable address in the cloud (your phone talks to the gateway with WireGuard, the gateway NATs the traffic and passes it over WireGuard to the home LAN, etc). However, with some tricks you don't need this; instead, you can use WireGuard on the fixed cloud machine as a router instead of a gateway.

(As someone who deals with non-WireGuard networking regularly, my reflex is that if two machines can't talk to each other with plain IP, we're going to need some kind of NAT or port forwarding somewhere. This leads to a situation where if two potential WireGuard peers can't talk to each other, my thoughts immediately jump to 'clearly we're going to need a NAT'.)

The basic idea is that you set up the fixed public machine as a router, although only for WireGuard connections, and then you arrange to route appropriate IP addresses and IP address ranges over the various WireGuard connections. The simplest approach is to give each WireGuard client an 'inside' IP address on the WireGuard interface on some subnet, and then have each client route the entire (rest of the) subnet to the WireGuard router machine. The router machine's routing table then sends the appropriate IP address (or address range) down the appropriate WireGuard connection. More complex setups are possible if you have existing IP address ranges that need to be reached over these WireGuard-based links, but the more distinct IPs or IP ranges you want to reach over WireGuard, the more routing entries each WireGuard client needs (the router's routing table also gets more complicated, but it was already a central point of complexity).
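As a concrete sketch with hypothetical addresses and keys (using 10.8.0.0/24 as the WireGuard 'inside' subnet and 192.168.1.0/24 as the home LAN), the router machine's WireGuard configuration might look something like:

```ini
# Router machine (fixed public IP); wg0 is 10.8.0.1/24.
# The host also needs IP forwarding enabled so it routes between peers.
[Interface]
Address = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <router private key>

[Peer]
# Roaming phone: AllowedIPs is just its own inside IP.
PublicKey = <phone public key>
AllowedIPs = 10.8.0.2/32

[Peer]
# Home LAN gateway: its inside IP plus the home LAN subnet behind it.
PublicKey = <home gateway public key>
AllowedIPs = 10.8.0.3/32, 192.168.1.0/24
```

Each client then lists the router as a peer with 'AllowedIPs = 10.8.0.0/24, 192.168.1.0/24', which is what routes the rest of the inside subnet (and the home LAN) through the router. WireGuard's AllowedIPs does double duty here: it is both the routing table entry and the filter for what source IPs a peer may use.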

(This isn't a new pattern; it used to appear in, for example, PPP servers. But those have been generally out of fashion for a while and not something people deal with. VPN servers also behave this way but often their VPN software handles this all for you without explicit routing table entries or you having to think about it. They may also automatically NAT traffic for you.)

Routing an existing home LAN IP address range or the like to the WireGuard machines is potentially a bit more complex. Unless you can use your existing home gateway as a WireGuard peer, you'll need to either NAT the WireGuard 'inside' IP addresses when they talk to your home LAN or establish a special route on your home LAN that sends traffic for those IPs to your WireGuard gateway. If you can set up WireGuard on your home gateway (by which I mean whatever machine is the default route for things on your LAN), life is simpler because the return traffic is already flowing through the machine; you just need to send it off to the WireGuard router instead of to the Internet. Another option is to assign unused home LAN IP addresses to your remote WireGuard machines, and then have your home LAN WireGuard gateway do 'proxy ARP' or IPv6 NDP for those IPs.

(In theory this is one of the situations where IPv6 may make your life easier, because if necessary you can create your own Unique local address space, carve it up between your home LAN and other areas, and route it around.)

Unix's fsync(), write ahead logs, and durability versus integrity

By: cks
3 July 2024 at 02:41

I recently read Phil Eaton's A write-ahead log is not a universal part of durability (via), which is about what it says it's about. In the process it discusses using Unix's fsync() to achieve durability, which woke up a little twitch I have about this general area, which is the difference between durability and integrity (which I'm sure Phil Eaton is fully aware of; their article was only about the durability side).

The core integrity issue of simple uses of fsync() is that while fsync() forces the filesystem to make things durable on disk, the filesystem doesn't promise to not write anything to disk until you do that fsync(). Once you write() something to the filesystem, it may write it to disk without warning at any time, and even during an fsync() the filesystem makes no promises about what order data will be written in. If you start an fsync() and the system crashes part way through, some of your data will be on disk and some won't be and you have no control over which part is which.

This means that if you overwrite data in place and use fsync(), the only time you are guaranteed that your data has both durability and integrity is in the time after fsync() completes and before you write any more data. Once you start (over)writing data again, that data could be partially written to disk even before you call fsync(), and your integrity could be gone. To retain integrity, you can't overwrite more than a tiny bit of data in place. Instead, you need to write data to a new place, fsync() it, and then overwrite one tiny piece of existing data to activate your new data (and fsync() that write too).
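On Unix, the most common concrete form of 'write data to a new place, then flip one small thing to activate it' is writing a new file and renaming it over the old one. A minimal Python sketch of the pattern (assuming POSIX rename semantics):

```python
import os

def atomic_replace(path: str, data: bytes) -> None:
    """Write data to a temporary file, fsync it, then rename it over
    the target. The rename is the single 'activation' step; readers
    see either the old contents or the new, never a mix."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # durability: the new data is on disk...
    finally:
        os.close(fd)
    os.rename(tmp, path)      # ...before it becomes the active copy
    # For durability of the rename itself, fsync the directory too.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Note that this costs two disk flushes per update, which is exactly the overhead that log-structured approaches try to amortize.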

(Filesystems can use similar two-stage approaches to make and then activate changes, such as ZFS's slight variation on this. ZFS does not quite overwrite anything in place, but it does require multiple disk flushes, possibly more than two.)

The simplest version of this condenses things down to one fsync() (or its equivalent) at the cost of having an append-only data structure, which we usually call a log. Logs need their own internal integrity protection, so that they can tell whether or not a segment of the log had all of its data flushed to disk and so is fully valid. Once your single fsync() of a log append finishes, all of the data is on disk and that segment is valid; before the fsync finishes, it's not necessarily so. Only some of the data might have been written, and it might have been written out of order (so that the last block made it to disk but an earlier block did not).
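A minimal sketch of such self-validating log framing (a made-up format, not any particular database's): each record carries its length and a checksum, and recovery stops at the first record that didn't fully make it to disk.

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    """Frame a record as [length][crc32][payload]. After a crash, a
    record whose CRC doesn't match was not fully written to disk."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def decode_records(buf: bytes):
    """Yield valid records from a log buffer, stopping at the first
    truncated or corrupt one, as crash recovery would."""
    off = 0
    while off + 8 <= len(buf):
        length, crc = struct.unpack_from("<II", buf, off)
        payload = buf[off + 8 : off + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            return                    # torn write: discard the tail
        yield payload
        off += 8 + length
```

The checksum is what lets a single fsync() suffice: the log doesn't need writes to land in order, because any partially-landed tail is simply detected and dropped on recovery.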

A write-ahead log normally increases the amount of data written to disk; you write data once to the WAL and once to the main database. However, a WAL may well reduce the number of fsync()s (and thus disk flushes) that you have to do in order to have both durability and integrity. In modern solid state storage systems, synchronous disk flushes can be the slowest operation and (asynchronous) write bandwidth relatively plentiful, so trading off more data written for fewer disk flushes can be a net performance win in practice for plenty of workloads.

(Again, I'm sure Phil Eaton knows all of this; their article was specifically about the durability side of things. I'm using it as a springboard for additional thoughts. I'm not sure I'd realized how a WAL can reduce the number of fsync()s required before now.)

Modifying and setting alarm times: a phone UI irritation

By: cks
2 July 2024 at 03:08

Over on the Fediverse, I mentioned a little irritation with my iPhone:

Pretty much every time I change the time of an alarm on my phone I am irritated all over again at the fundamental laziness and robotic computer-ness of time controls. What I want to do is move the time forward or backward, not to separately change (or set) the hours and the minutes. But separate 'hour' and 'minutes' spinners or options are the easy computer way out so that's how UIs implement it.

My phone's standard alarm app has what I believe is the common phone interface for setting and modifying alarm times, where you set the hour and the minute separately. There are two problems with this.

The first problem is what I mentioned in my post. In the case when I'm modifying the time of an existing alarm, what I want to do is move it forward or backward by some amount. Where this time change moves the time of the alarm over an hour boundary, I must separately adjust the minutes and the hours, and do the relevant math in my head. I can't just say 'make it half an hour earlier', I have to move the hour backward and then the minutes forward, in two separate actions.

The second problem is that this interface is also not all that great if I have an exact time for an alarm. If I want to set an alarm for exactly, say, 10:50, this interface forces me to first set '10' hour and then '50' minute, instead of just letting me type in, say, '1050' (the sensible interface is to infer the separation between hours and minutes, so you can use a basic number entry interface). The iPhone's standard alarm application actually supports direct entry of alarm times, but it's not exposed as an obvious feature; you have to know that you can tap on the time spinners to get a number pad for direct time entry.
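For what it's worth, neither behavior is hard to express in code. Here's a Python sketch of the two operations I actually want: shifting an alarm by a delta (with the hour and midnight math handled for me) and parsing direct entry like '1050' by inferring the hour/minute split:

```python
from datetime import datetime, timedelta

def shift_alarm(hour: int, minute: int, delta_minutes: int):
    """Move an alarm forward or backward by some number of minutes,
    handling hour (and midnight) wraparound; returns (hour, minute)."""
    t = datetime(2000, 1, 1, hour, minute) + timedelta(minutes=delta_minutes)
    return t.hour, t.minute

def parse_alarm(digits: str):
    """Parse direct entry like '1050' or '950' into (hour, minute),
    taking the last two digits as the minutes."""
    hour, minute = int(digits[:-2] or 0), int(digits[-2:])
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError(digits)
    return hour, minute
```

The logic is trivial, which is rather the point: the obstacle isn't implementation difficulty, it's that a custom adjustment UI would have to be designed and tested.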

How this situation probably came about feels relatively straightforward. Spinner fields for selecting between alternatives are a broadly used UI element and are available in standard forms in the system's UI libraries. A UI to adjust times forward and backward would have to be specifically designed for this purpose and would have limited use outside a few contexts. You don't even have to assume laziness on the part of the phone UI designers; if you want to do a good job of UI control design, it needs things like user testing to make sure people can understand it, and you can only do so much of that. It's not difficult to imagine that user testing for something with narrow usage would get pushed way down the priority list in favour of things with more usage and higher importance.

(There is also the issue of UI standardization. Spinner controls may not be ideal for this purpose, but because they're commonly used, people will likely be able to immediately recognize and use them. A custom UI does not have this advantage, and you can argue that setting alarms is not important enough to make people remember a UI just for it. After all, how often do you change the time of alarms? I'm likely an outlier here.)

Security is not really part of most people's jobs

By: cks
26 June 2024 at 02:23

A while back I said something on the Fediverse:

In re people bypassing infosec policies at work, I feel that infosec should understand that "getting your job done" is everyone's first priority, because in this capitalistic society, not getting your job done gets you fired. You might get fired if you bypass IT security, but you definitely will if you can't do your work. Trying to persuade everyone that it's IT's fault, not yours, is a very uphill battle and not one anyone wants to bet on.

(This is sparked by <Fediverse post>)

Let's look at this from the perspective of positive motivations. By and large, people don't get hired, promoted, praised, given bonuses, and so on for doing things securely, developing secure code, and so on. People get hired for being able to program or otherwise do their job, and they get rewarded for things like delivering new features. Sure, you require people to do things securely, but you (probably) also require them to wear clothes, and people are rewarded about equally for them (which is to say they get to keep being employed and paid money). People may or may not fear losing their job if they don't perform well enough because security is getting in their way, but they definitely do get rewarded for performing the non-security aspects of their job well, especially in programming and other computing jobs.

(Perhaps their current employer doesn't really reward them, but they're probably improving their odds of being rewarded by their next employer.)

It's a trite observation that what you reward is what you get. When you hire and promote people for their ability to program and deliver features, that is what they will prioritize. People are generally not indifferent to security issues (especially today), but what you don't reward has turned it into an overhead, one that potentially gets in the way of getting a promotion, a raise, or a bonus. Will a team kill a feature because they can't make it secure enough, when the feature is on their road map and thus their job ratings for this quarter? You already know the answer to that.

Also, people are going to focus on developing their skills at what you reward (and what the industry rewards in general). When you interview and promote and so on based on people being able to write code and solve problems and ship features, that's what they get good at. When you provide no particular rewards for doing things (more) securely, people have no motivation to work on it, and also they generally have little or no feedback on whether they're doing it right and are improving their skills, instead of flailing around and wasting their time.

(My feeling is that industry practices also make it hard to get useful feedback on the long term consequences of design and programming decisions, in large part because most people don't stay around for the long term, although to be fair a bunch of programs and systems don't either.)

(Many years ago I wrote that people don't care about security and consider it an overhead. I'm not sure that this is still true, but it's probably still somewhat so, along with how security is not the most important thing to most people.)

Gender Discrimination Lawsuit Filed Against Apple

By: Nick Heer
19 June 2024 at 01:50

Patrick McGee, Financial Times, August 2022:

In interviews with 15 female Apple employees, both current and former, the Financial Times has found that Mohr’s frustrating experience with the People group has echoes across at least seven Apple departments spanning six US states.

The women shared allegations of Apple’s apathy in the face of misconduct claims. Eight of them say they were retaliated against, while seven found HR to be disappointing or counterproductive.

Ashley Belanger, Ars Technica, last week:

Apple has spent years β€œintentionally, knowingly, and deliberately paying women less than men for substantially similar work,” a proposed class action lawsuit filed in California on Thursday alleged.

[…]

The current class action has alleged that Apple continues to ignore complaints that the company culture fosters an unfair and hostile workplace for women. It’s hard to estimate how much Apple might owe in back pay and other damages should women suing win, but it could easily add up if all 12,000 class members were paid thousands less than male counterparts over the complaint’s approximately four-year span. Apple could also be on the hook for hundreds in civil penalties per class member per pay period between 2020 and 2024.

I pulled the 2022 Financial Times investigation into this because one of the plaintiffs in the lawsuit filed last week also alleges sexual harassment by a colleague which was not adequately addressed.

Stephen Council, SFGate:

The lawyer said that asking women about pay expectations β€œlocks” past pay discrimination in and that the requirements of a job should determine pay. Finberg isn’t new to the fight over tech pay; he represented employees suing Oracle and Google for gender-based pay discrimination, securing $25 million and $118 million settlements, respectively.

Last year, Apple paid $25 million to settle claims it discriminated in U.S. hiring in favour of people whose ability to remain in the U.S. depended on their employment status.

βŒ₯ Permalink

Account recovery is still a hard problem in public key management

By: cks
8 June 2024 at 02:30

Soatok recently published their work on a part of end to end encryption for the Fediverse, Towards Federated Key Transparency. To summarize the article, it is about the need for a Fediverse public key directory and a proposal for how to build one (this is a necessary ingredient for trustworthy end to end encryption). Soatok is a cryptographer and security expert and I'm not, so I have nothing to say about the specifics of the proposed protocol and so on. But as a system administrator, one thing did catch my eye right away, and that is that Soatok's system has no method of what I will call "account recovery".

How this manifests in the protocol is that registering in the key directory is a one-way action for a given Fediverse identity. Once you (as a specific Fediverse identity) register your first key in the key directory, you cannot reset from this state and start over again. If you somehow lose all of your registered private keys, there is no natural or easy way to register a new one under your current Fediverse identity; your only option is to start a new Fediverse identity, which can register from scratch.

(While the proposal allows you to revoke keys if you have more than one active one, it specifically doesn't allow you to revoke your last key. This has the additional effect that you can't advertise that all of your previous keys are no longer trusted and you can't be reached over whatever they enable at all. The closest you can come is to leave a single public key registered that you've destroyed the private key for, rendering it useless in practice; however, this still leaves people able to retrieve your 'current key' and then use it in things that will never work.)

Of course, there are good security reasons to not allow this sort of re-registration and account recovery, which is undoubtedly why Soatok's proposal doesn't attempt to include them. Telling the difference between account recovery by a good person and account recovery by an attacker is ultimately a very hard problem, so if you absolutely have to prevent the latter, you can't allow account recovery at all. Even partially and reasonably solving account recovery generally requires human involvement, and that is hard and doesn't scale well (and it's hard to write into protocol specifications).

However, I think it's meaningful to note the tradeoffs being made. One of the lenses to look at security related things is through the triad of confidentiality, availability, and integrity. As with any system that doesn't have account recovery, Soatok's proposal is prioritizing confidentiality over availability. Sometimes this is the right tradeoff, and sometimes it isn't.

To me, all of this demonstrates that account recovery remains a hard and unsolved problem in this area (and in a variety of others). I pessimistically suspect that there will never be good solutions to it, but at the same time I hope that clever people will prove me wrong. Good, secure account recovery would enable a lot of good things.

The BHU Covaxin study and ICMR bait

By: VM
28 May 2024 at 04:51

Earlier this month, a study by a team at Banaras Hindu University (BHU) in Varanasi concluded that fully 1% of Covaxin recipients may suffer severe adverse events. One percent is a large number because the multiplier (x in 1/100 * x) is very large β€” several million people. The study first hit the headlines for claiming it had the support of the Indian Council of Medical Research (ICMR) and reporting that both Bharat Biotech and the ICMR are yet to publish long-term safety data for Covaxin. The latter is probably moot now, with the COVID-19 pandemic well behind us, but it’s the principle that matters. Let it go this time and who knows what else we’ll be prepared to let go.

But more importantly, as The Hindu reported on May 25, the BHU study is too flawed to claim Covaxin is harmful, or claim anything for that matter. Here’s why (excerpt):

Though the researchers acknowledge all the limitations of the study, which is published in the journal Drug Safety, many of the limitations are so critical that they defeat the very purpose of the study. β€œIdeally, this paper should have been rejected at the peer-review stage. Simply mentioning the limitations, some of them critical to arrive at any useful conclusion, defeats the whole purpose of undertaking the study,” Dr. Vipin M. Vashishtha, director and pediatrician, Mangla Hospital and Research Center, Bijnor, says in an email to The Hindu. Dr. Gautam Menon, Dean (Research) & Professor, Departments of Physics and Biology, Ashoka University shares the same view. Given the limitations of the study one can β€œcertainly say that the study can’t be used to draw the conclusions it does,” Dr. Menon says in an email.

Just because you’ve admitted your study has limitations doesn’t absolve you of the responsibility to interpret your research data with integrity. In fact, the journal needs to speak up here: why did Drug Safety publish the study manuscript? Too often when news of a controversial or bad study is published, the journal that published it stays out of the limelight. While the proximal cause is likely that journalists don’t think to ask journal editors and/or publishers tough questions about their publishing process, there is also a cultural problem here: when shit hits the fan, only the study’s authors are pulled up, but when things are rosy, the journals are out to take credit for the quality of the papers they publish. In either case, we must ask what they actually bring to the table other than capitalising on other scientists’ tendency to judge papers based on the journals they’re published in instead of their contents.

Of course, it’s also possible to argue that unlike, say, journalistic material, research papers aren’t required to be in the public interest at the time of publication. Yet the BHU paper threatens to undermine public confidence in observational studies, and that can’t be in anyone’s interest. Even at the outset, experts and many health journalists knew that observational studies don’t carry the same weight as randomised controlled trials, and that such studies still serve a legitimate purpose, just not the one to which the BHU study pressed its conclusions.

After the paper’s contents hit the headlines, the ICMR shot off a letter to the BHU research team saying it hasn’t β€œprovided any financial or technical support” to the study and that the study is β€œpoorly designed”. Curiously, the BHU team’s repartee to the ICMR’s makes repeated reference to Vivek Agnihotri’s film The Vaccine War. In the same point in which two of these references appear (no. 2), the team writes: β€œWhile a study with a control group would certainly be of higher quality, this immediately points to the fact that it is researchers from ICMR who have access to the data with the control group, i.e. the original phase-3 trials of Covaxin – as well publicized in β€˜The Vaccine War’ movie. ICMR thus owes it to the people of India, that it publishes the long-term follow-up of phase-3 trials.”

I’m not clear why the team saw fit to appeal to statements made in this of all films. As I’ve written earlier, The Vaccine War β€” which I haven’t watched but which directly references journalistic work by The Wire during and of the pandemic β€” is most likely a mix of truths and fictionalisation (and not in the clever, good-faith ways in which screenwriters adopt textual biographies for the big screen), with the fiction designed to serve the BJP’s nationalist political narratives. So when the letter says in its point no. 5 that the ICMR should apologise to a female member of the BHU team for allegedly β€œspreading a falsehood” about her and offers The Vaccine War as a counterexample (β€œWhile β€˜The Vaccine War’ movie is celebrating women scientists…”), I can’t but retch.

Together with another odd line in the letter β€” that the β€œICMR owes it to the people of India” β€” the appeals read less like a debate between scientists on the merits and the demerits of the study and more like they’re trying to bait the ICMR into doing better. I’m not denying the ICMR started it, as a child might say, but saying that this shouldn’t have prevented the BHU team from keeping it dignified. For example, the BHU letter reads: β€œIt is to be noted that interim results of the phase-3 trial, also cited by Dr. Priya Abraham in β€˜The Vaccine War’ movie, had a mere 56 days of safety follow-up, much shorter than the one-year follow-up in the IMS-BHU study.” Surely the 56-day period finds mention in a more respectable and reliable medium than a film that confuses you about what’s real and what’s not?

In all, the BHU study seems to have been designed to draw attention to gaps in the safety data for Covaxin β€” but by adopting such a provocative route, all that took centerstage was its spat with the ICMR plus its own flaws.

CVEs are not what I'll call security reports

By: cks
4 June 2024 at 01:38

Today I read Josh Bressers' Why are vulnerabilities out of control in 2024? (via), which made me realize that I, along with other people, had been unintentionally propagating a misinterpretation of what a CVE was (for example when I talked about the Linux kernel giving CVEs to all bugfixes). To put it simply, a CVE is not what I'll call a (fully baked) security report. It's more or less in the name, as 'CVE' is short for 'Common Vulnerabilities and Exposures'. A CVE is a common identifier for a vulnerability that is believed to have a security impact, and that's it.

A CVE as such is thus an identifier and a description of the vulnerability. It does not intrinsically tell you what software and versions of the software the vulnerability is present in, or how severe or exploitable the vulnerability is in any specific environment, or the like, which is to say that it doesn't come with an analysis of its security impact. All of that is out of scope for a basic CVE. We think of all of these things as being part of a 'CVE' because people have traditionally 'enriched' the basic CVE information with these additional details; sometimes this has been done by the people reporting the vulnerability and sometimes it has been done by third parties.

(One reason that early CVEs were enriched by the reporters themselves was that in the beginning, people often didn't believe that certain bugs were security vulnerabilities unless you held their hand with demonstration exploits and so on. As general exploit technology has evolved, entire classes of bugs, such as use after free, are now considered likely exploitable and so are presumed to be security vulnerabilities even without demonstration exploits.)

As Why are vulnerabilities out of control in 2024? notes, the amount of work required to do this enrichment is steadily increasing because the number of CVEs is steadily increasing (even outside the Linux kernel situation). This work won't happen for free, and I mean that in a broad sense, since collectively there is only so much free time people have for (unpaid) vulnerability discovery and reporting. Our options are to (fully) fund vulnerability enrichment (which seems increasingly unlikely), live with basic CVE reporting, or get fewer vulnerabilities reported by insisting that only enriched vulnerabilities can be reported in the first place.

(The current state of CVE reporting and assignment is biased toward getting vulnerabilities reported, which in my view is the correct choice.)

It's certainly convenient for system administrators and other people when we get fully baked, fully enriched security (vulnerability) reports instead of bare CVEs or bare vulnerabilities. But not only does no one owe that to us, we also can't have our cake and eat it too. If we insist on only receiving and acting on fully enriched security reports, we will leave some number of vulnerabilities active in our systems (which may or may not be known ones, depending on whether people bother to report them and make them into CVEs).

(This elaborates a bit on some Fediverse posts of mine.)

Stand-alone downloads of program assets have a security implication

By: cks
3 June 2024 at 03:42

I recently read Engineering for Slow Internet (via), which is about what it talks about and also about the practical experience of trying to use the Internet in Antarctica (in 2023), which has (or had) challenging network conditions. One of the recommendations in the article was that as much as possible you allow people to do stand-alone downloads with their own tools, rather than forcing them to download assets through your program (which, to put it kindly, may not be well prepared for the Internet conditions of Antarctica). In general, I am all for having programs cope better with limited Internet (I used to be on a PPP dialup modem link long after most people in Canada had upgraded to DSL or cable, and it was a bit unpleasant). But as I was reading the article, it occurred to me that letting people fetch the assets your program will use through their own downloads can change the security picture of your application a bit, possibly requiring additional changes in how you do things.

When a modern application fetches assets of some sort over HTTPS from a URL that you fully specified (for example, a spot on your website), most of the time you can assume that the contents you fetched are trustworthy. The entire institution of modern web PKI is working (quite well) to keep bad people from easily intercepting and altering that flow of data. Only in relatively high security situations do you need to add some sort of additional end to end security verification, like digital signatures; a lot of the time you can just assume 'we got it over HTTPS from our URL so it's good'.

(Even with fetching assets over HTTPS, signing your assets provides safety against various attacks, including attackers who compromise your website but not your signing infrastructure.)

This is obviously not true any more if you accept files that were downloaded outside of your program's control. Then you're relying on the person using your software to have not been fooled about where they got the files from and to not have had the files quietly swapped out or provided by malicious other software on their machine. Since you didn't fetch these assets yourself, if you need trust in them it will have to be provided in some additional way. If you aren't already digitally signing things, you may need to start doing so (with all of the key management hassles this involves, and potential key expiry, and so on), or perhaps fetch a small list of cryptographic hashes of the assets from your website while allowing the person to provide you the asset files themselves.

(On common systems, some things you want to download may already be signed due to general system requirements, for example program updates.)
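
A minimal sketch of the hash-list approach might look like the following. The asset name and hash here are purely illustrative; in practice the small hash list would be fetched over HTTPS from your own website, while the asset file itself can arrive however the person likes:

```python
import hashlib

# Hypothetical published hash list; this is the small file you would
# fetch over HTTPS from your own site.
KNOWN_SHA256 = {
    "assets.dat": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_asset(name, path):
    """Return True if a locally supplied file matches its published hash."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large assets don't have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest() == KNOWN_SHA256.get(name)
```

The design advantage over full digital signatures is that all of the trust lives in the HTTPS fetch of the hash list, so you avoid key management, expiry, and the rest; the disadvantage is that the hash list must be updated in lockstep with the assets.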

This is not just about the security of your program. This is also somewhat about the security of people using your program, in terms of what they can be tricked into doing by a malicious asset that they accidentally download from the wrong place. Attackers definitely already use various forms of fake program updates, compromised installers, and so on, with various additional tricks to direct people to those things.

Phish tests and (not) getting people to report successful phish attacks

By: cks
2 June 2024 at 02:52

One of the very important things for dealing with phish attacks is for people to rapidly self-report successful phish attacks, ones that obtained their password or other access token. If you don't know that an access token has been compromised, even briefly, you can't take steps to investigate any access it may have been used for, mitigate it, and so on. And the sooner you know about it, the better.

So-called "phish tests" in their current form are basically excuses to explicitly or implicitly blame people who 'fall for' the phish test. Explicit blame is generally obvious, but you might wonder about the implicit blame. If the phish test reports how many people in each unit 'fell for' the phish test message, or requires those people to take additional training, or things like that, it is implicitly blaming those people; they or their managers will be exhorted to 'do better' and maybe required to do extra work.

When you conduct phish tests and blame people who 'fall for' those tests, you're teaching people that falling for phish attacks will cause them to be blamed. You are training them that this is a failure on their part and there will be consequences for their failure. When people know that they will be blamed for something, some number of them will try to cover it up, or will delay reporting it, or decide that they didn't really fall for it and they changed their password right away or didn't approve the MFA request or whatever, or the like. This is an entirely natural and predictable human reaction to the implicit training that your phish tests have delivered. And, as covered, this reaction is very bad for your organization's ability to handle a real, successful phish attack (which is going to happen sometime).

Much like you want "blameless incident postmortems", my view is that you want "blameless reporting of successful phishes". I'm not sure how you get it, but I'm pretty sure that the current approach to "phish tests" isn't it (beyond the other issues that Google discussed in On Fire Drills and Phishing Tests). Instead, I think phish tests most likely create a counterproductive mindset in people subjected to them, one where the security team is the opposition, out to trick people and then punish those who were tricked.

(This is the counterproductive effect I mentioned in my entry on how phish tests aren't like fire drills.)

Feel the pain

By: V.M.
9 April 2024 at 11:43

Emotional decision making is in many contexts undesirable – but sometimes it definitely needs to be part of the picture, insofar as our emotions hold a mirror to our morals. When machines make decisions, the opportunity to consider the emotional input goes away. This is a recurring concern I’m hearing about from people working with or responding to AI in some way. Here are two recent examples I came across that set this concern out in two different contexts: loneliness and war.

This is Anna Mae Duane, director of the University of Connecticut Humanities Institute, in The Conversation:

There is little danger that AI companions will courageously tell us truths that we would rather not hear. That is precisely the problem. My concern is not that people will harm sentient robots. I fear how humans will be damaged by the moral vacuum created when their primary social contacts are designed solely to serve the emotional needs of the β€œuser”.

And this is from Yuval Abraham’s investigation for +972 Magazine on Israel’s chilling use of AI to populate its β€œkill lists”:

β€œIt has proven itself,” said B., the senior source. β€œThere’s something about the statistical approach that sets you to a certain norm and standard. There has been an illogical amount of [bombings] in this operation. This is unparalleled, in my memory. And I have much more trust in a statistical mechanism than a soldier who lost a friend two days ago. Everyone there, including me, lost people on October 7. The machine did it coldly. And that made it easier.”

Phish tests aren't like fire drills

By: cks
31 May 2024 at 03:01

Google recently wrote a (blog) article, On Fire Drills and Phishing Tests, which discusses the early history of what we now call fire drills. As the article covers, the early "fire evacuation tests" focused mostly on how individual people performed, complete with telling people that things were their own fault for not doing the evacuation well enough. It then analogizes this to the current way "phish tests" are done. As I read this, I had a reaction on the Fediverse to the general thought of fire drills and phish tests:

In re comparing fire drills to phishing tests[1], if phishing tests were like fire drills, they would test the response to a successful phish. Was the person phished able to rapidly report and mitigate things? Do the organization's phish alarms work and reach people? Etc etc.

Current "phishing tests" are like testing people to see if they accidentally start fires if they're handed (dangerously) flammable materials. That's not a fire drill.

1: <fediverse link>

The purpose of fire drills is to test what happens once the fire alarm goes off and to make sure that it works. Do all of the fire alarms actually generate enough noise that people can hear? Are there visual indicators for people with bad or no hearing? Can people see (or hear) where they should go to get out of the building? And so on and so forth. In other words, fire drills test the response to the problem, not whether the problem happens in the first place.

(They also somewhat implicitly test if people respond to fire alarms, because if people don't you have another problem.)

As I mentioned in my Fediverse post, current "phish tests" aren't doing anything like this. Current "phish tests" are testing people to see if they recognize and (don't) respond to phish messages (and then blaming people if they don't handle the phish right, which is one of the things that the Google article is calling out). A "phish drill" that was like a "fire drill" would test all of the mitigation and response processes that you wanted to happen after someone fell for a phish, whatever these were. Of course, one awkward aspect of testing these processes is that you actually have to have them and they need to be made effective. But this is exactly why you should test them, just as part of the reason for fire drills is to make sure you have enough alarms, evacuation routes, and so on (and that they all work).

(I personally think that current blame the person "phish tests" are counterproductive in an additional way not covered by the Google article, but that's another entry.)

The BHU Covaxin study and ICMR bait

By: VM
28 May 2024 at 04:18

Earlier this month, a study by a team at Banaras Hindu University (BHU) in Varanasi concluded that fully 1% of Covaxin recipients may suffer severe adverse events. One percent is a large number because the multiplier (x in 1/100 * x) is very large — several million people. The study first hit the headlines for claiming it had the support of the Indian Council of Medical Research (ICMR) and reporting that both Bharat Biotech and the ICMR are yet to publish long-term safety data for Covaxin. The latter is probably moot now, with the COVID-19 pandemic well behind us, but it’s the principle that matters. Let it go this time and who knows what else we’ll be prepared to let go.

But more importantly, as The Hindu reported on May 25, the BHU study is too flawed to claim Covaxin is harmful, or claim anything for that matter. Here’s why (excerpt):

Though the researchers acknowledge all the limitations of the study, which is published in the journal Drug Safety, many of the limitations are so critical that they defeat the very purpose of the study. “Ideally, this paper should have been rejected at the peer-review stage. Simply mentioning the limitations, some of them critical to arrive at any useful conclusion, defeats the whole purpose of undertaking the study,” Dr. Vipin M. Vashishtha, director and pediatrician, Mangla Hospital and Research Center, Bijnor, says in an email to The Hindu. Dr. Gautam Menon, Dean (Research) & Professor, Departments of Physics and Biology, Ashoka University shares the same view. Given the limitations of the study one can “certainly say that the study can’t be used to draw the conclusions it does,” Dr. Menon says in an email.

Just because you’ve admitted your study has limitations doesn’t absolve you of the responsibility to interpret your research data with integrity. In fact, the journal needs to speak up here: why did Drug Safety publish the study manuscript? Too often when news of a controversial or bad study is published, the journal that published it stays out of the limelight. While the proximal cause is likely that journalists don’t think to ask journal editors and/or publishers tough questions about their publishing process, there is also a cultural problem here: when shit hits the fan, only the study’s authors are pulled up, but when things are rosy, the journals are out to take credit for the quality of the papers they publish. In either case, we must ask what they actually bring to the table other than capitalising on other scientists’ tendency to judge papers based on the journals they’re published in instead of their contents.

Of course, it's also possible to argue that unlike, say, journalistic material, research papers aren't required to be in the public interest at the time of publication. Yet the BHU paper threatens to undermine public confidence in observational studies, and that can't be in anyone's interest. Even at the outset, experts and many health journalists knew that observational studies don’t carry the same weight as randomised controlled trials, and also that such studies still serve a legitimate purpose, just not the one to which the BHU study pressed its conclusions.

After the paper’s contents hit the headlines, the ICMR shot off a letter to the BHU research team saying it hasn’t "provided any financial or technical support" to the study and that the study is "poorly designed". Curiously, the BHU team’s repartee to the ICMR's letter makes repeated reference to Vivek Agnihotri's film The Vaccine War. In the same point in which two of these references appear (no. 2), the team writes: "While a study with a control group would certainly be of higher quality, this immediately points to the fact that it is researchers from ICMR who have access to the data with the control group, i.e. the original phase-3 trials of Covaxin – as well publicized in 'The Vaccine War' movie. ICMR thus owes it to the people of India, that it publishes the long-term follow-up of phase-3 trials."

I'm not clear why the team saw fit to appeal to statements made in this of all films. As I've written earlier, The Vaccine War — which I haven't watched but which directly references journalistic work by The Wire during and of the pandemic — is most likely a mix of truths and fictionalisation (and not in the clever, good-faith ways in which screenwriters adapt textual biographies for the big screen), with the fiction designed to serve the BJP's nationalist political narratives. So when the letter says in its point no. 5 that the ICMR should apologise to a female member of the BHU team for allegedly “spreading a falsehood” about her and offers The Vaccine War as a counterexample ("While 'The Vaccine War' movie is celebrating women scientists…"), I can’t but retch.

Together with another odd line in the letter — that the "ICMR owes it to the people of India" — the appeals read less like a debate between scientists on the merits and the demerits of the study and more like they’re trying to bait the ICMR into doing better. I'm not denying the ICMR started it, as a child might say, but saying that this shouldn't have prevented the BHU team from keeping it dignified. For example, the BHU letter reads: "It is to be noted that interim results of the phase-3 trial, also cited by Dr. Priya Abraham in 'The Vaccine War' movie, had a mere 56 days of safety follow-up, much shorter than the one-year follow-up in the IMS-BHU study." Surely the 56-day period finds mention in a more respectable and reliable medium than a film that confuses you about what’s real and what’s not?

In all, the BHU study seems to have been designed to draw attention to gaps in the safety data for Covaxin β€” but by adopting such a provocative route, all that took centerstage was its spat with the ICMR plus its own flaws.

Realizing the hidden complexity of cloud server networking

By: cks
19 May 2024 at 01:52

We have our first cloud server. This cloud server has a public IP address that we can talk to, which is good because we need it and feels straightforward; we have lots of machines with public IP addresses. This public IP address has a firewall that we have to set rules for, which feels perfectly normal; we have firewalls too. Although if I think about it, the cloud provider is working at a much bigger scale, which makes it harder and more impressive. Except that our actual cloud server has a RFC 1918 IP address and is on an internal private network segment, so what we actually are working with is a NAT firewall gateway. And the RFC 1918 address is a sufficiently straightforward /24 that it's clear it's not unique to us; plenty of cloud customer servers must have their own version of the RFC 1918 /24.

That was when I realized how complex all of the infrastructure for this networking has to be behind the scenes. The cloud provider is not merely operating a carrier-grade NAT, which is already non-trivial. They're operating a CGNAT firewall system that can connect a public IP to an IP on a specific internal virtual network, where the IP (and subnet) aren't unique across all of the (internal) networks being NAT'd. I feel that I'm reasonably knowledgeable about networking and I'm not sure how I'd even approach designing a system that did that. It's different in kind from the NAT firewalls I work on, not merely in size (the way plain CGNAT sometimes feels).
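One way to picture the extra complexity: a plain NAT can map a public IP straight to an internal IP, but the cloud version has to key its translations by tenant network as well, because the same RFC 1918 address exists on many tenant networks at once. A minimal sketch of the idea (all addresses and network names here are made up):

```python
# A cloud-style inbound NAT table must map a public IP to a
# (tenant virtual network, internal IP) pair, not just to an IP,
# because internal addresses repeat across tenant networks.
INBOUND_NAT = {
    "203.0.113.5": ("vpc-aaaa", "172.16.0.10"),
    "203.0.113.6": ("vpc-bbbb", "172.16.0.10"),  # same internal IP, different tenant
}

def translate_inbound(public_ip):
    network, internal_ip = INBOUND_NAT[public_ip]
    # The firewall must then deliver the packet on that specific
    # virtual network; 'the' 172.16.0.10 is ambiguous on its own.
    return network, internal_ip
```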

Intellectually, I knew that cloud environments were fearsomely complex behind the scenes, with all sorts of spectacular technical underpinnings (and thus all sorts of things to go wrong). But running 'ip -br a' on our first cloud server and then thinking a bit about how it all worked was the first time it really came home to me. Things like virtual machine provisioning, replicated storage, and so on were sufficiently far outside what I work on that I just admired them from a distance. Connecting our cloud server's public IP with its actual IP was the first time I had the 'I work in this area and nothing I know of could pull that off' feeling.

(Of course if we'd all switched over to IPv6 we might not need this complex NAT environment, because in theory all of those cloud servers could have globally unique IPv6 addresses and subnets and all you'd need would be a carrier grade firewall system. I'm not sure that would work in practice, though, and I don't know how clouds handle IPv6 allocation for customer servers. Our cloud server didn't get assigned an IPv6 address when we set it up.)

UEFI, BIOS, and other confusing x86 PC (firmware) terms

By: cks
4 May 2024 at 03:22

IBM compatible x86 PCs have come with firmware since their first days. This firmware was called (the) BIOS, and so over time 'BIOS' became the generic term for 'IBM compatible x86 PC firmware' (which could come from various companies who carefully reimplemented it from scratch in ways that didn't violate IBM's copyrights). Over time, PC firmware ('BIOS') got more complex and acquired more (boot time) user interface features, like all sorts of splash screens, tuning options, semi-graphical interfaces, and so on. However, the actual BIOS API, primarily used at boot time, stayed more or less unchanged and as a result PCs kept booting in a simple and limited way, (mostly) using the Master Boot Record (MBR).

Various people in the x86 PC world have wanted more sophisticated firmware for a long time (firmware that was more like the firmware that non-x86 servers and workstations often had). The 'BIOS MBR' boot sequence was very limited and awkward, and a variety of features that people wanted had to be wedged in with tricks and extensions. This led to UEFI, which is technically a standard for the APIs and behavior of 'UEFI' firmware (with multiple implementations from various 'BIOS' (firmware) vendors). As part of this standard, UEFI boots machines in a completely different and more powerful way than through the MBR (and UEFI provides some official ways of controlling what should get booted).

Today (and for some time) basically all x86 PCs have firmware that officially supports and implements the UEFI standards (although how well has varied over time; early UEFI support had various problems). This is variously called 'UEFI firmware', 'UEFI BIOS', just 'UEFI', or even 'BIOS with UEFI' (which is how some of the earliest implementations actually felt, as if the UEFI features and requirements were bolted on the side of the existing BIOS). And these days, because 'BIOS' became the generic name for x86 PC firmware, people may say 'BIOS' (eg 'changing BIOS settings') and in practice mean 'UEFI firmware' as opposed to 'BIOS without UEFI support'.

(The giant exception to pervasive UEFI firmware is various virtualization systems, for example on Linux. Unless you specifically ask for firmware with UEFI support, these often provide virtual machines with firmware that is truly BIOS firmware, with no UEFI features. There are various reasons for this beyond the scope of this entry.)

When people talk about doing things with x86 PC firmware, such as booting the system, they often say 'UEFI' to mean 'booting through UEFI native processes and APIs' and 'BIOS' to mean MBR booting. Since most x86 PCs have UEFI firmware these days, MBR booting is generally using UEFI's optional support for this, as opposed to an actual BIOS firmware (except on (some) virtual machines).
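On Linux there's a simple way to tell which way the running system was booted, since the kernel exposes /sys/firmware/efi only when it was started via UEFI:

```python
import os

def boot_mode():
    # On Linux, /sys/firmware/efi is present only when the kernel was
    # booted through UEFI; otherwise this was a BIOS/MBR style boot.
    return "UEFI" if os.path.isdir("/sys/firmware/efi") else "BIOS (MBR)"
```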

As a corollary to this, if someone talks about a 'UEFI only' machine, what they probably mean is a machine that has no support for MBR booting. In practice, probably most or all x86 firmware on real hardware has been fundamentally UEFI firmware for years (although it's possible that PC firmware vendors have built frankenfirmware that was one part UEFI and one part genuine BIOS).

All of this (mis)usage persists partly because it's short, especially when you get to phrases like 'this server is UEFI only'. And generally people know what you mean.

PS: My impression is that server firmware is more likely to stick to the UEFI standard and specification, while firmware in consumer focused desktop motherboards and systems may be more inclined to do things like hunt around randomly to find plausible UEFI boot targets.

Thinking about filesystem space allocation policies and SSDs

By: cks
3 May 2024 at 03:14

Historically, many filesystems have devoted a significant amount of effort to sophisticated space allocation policies. For example, in Unix one of the major changes from V7 to 4.x BSD was the change to the Berkeley Fast File System (also) with its concept of 'cylinder groups' that drastically improved the locality of file data, directory data, and inodes. Various other (Unix) filesystem allocation related technologies have been developed since, for example the idea of delaying deciding where exactly data will live in the filesystem until it's about to be written out, which allows the filesystem to group data better (especially in the face of the fsync problem, where only some of the data may get written out right now).

Traditionally, filesystems really cared about this (and spent so much effort on allocation policies) because disk seeks (on HDDs) were very expensive and issuing extra commands to disks was somewhat expensive even when they didn't require seeks. Solid state disks demolish much of this. Obviously they don't 'seek' as such, and their internal divisions are opaque (and they change, as logical blocks are rewritten on different areas of internal flash). SATA SSDs do still have some limits on the number of commands that can be issued to them, and I believe SAS SSDs do as well. NVMe SSDs famously can handle huge numbers of commands and I believe generally do better with multiple commands being issued to them at once. I believe that there is still an advantage on NVMe SSDs to doing relatively large IOs, so even a SSD-focused filesystem would like to store data in large contiguous chunks rather than scattering its data randomly across the NVMe's storage in 4 Kbyte chunks.

Where this becomes potentially relevant to ordinary people running systems (as opposed to filesystem authors) is that some filesystems will switch between different space allocation strategies depending on various things, like how much free space is left on the filesystem. If you're using SATA/SAS SSDs or especially NVMe SSDs, it may make sense to change when this strategy shift occurs. However, if you have a generally low rate of writes, it's probably not going to make much of a difference (this is Amdahl's Law poking its head up again).

(However, you may have periodic periods of high write rates where you really care about the write latency and thus you care about this issue along with things like disk write buffering and its interactions with write flushes.)

In addition, sometimes what the filesystem is switching between is not really a faster or a slower allocation strategy but instead, for example, how fragmented free space gets (for example, ZFS space allocation from metaslabs). Even if the 'more fragmented' option is faster, you may not want to change where that mode starts (or ends) unless you really know what you're doing.
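The general pattern of such a strategy shift can be sketched in a few lines; the threshold and the strategy names here are invented for illustration and don't correspond to any specific filesystem's tunables.

```python
def pick_allocation_strategy(free_pct, threshold_pct=10.0):
    # Hypothetical sketch: with plenty of free space, use a fast,
    # locality-friendly strategy; once free space drops below the
    # threshold, switch to a slower strategy that packs free space
    # more carefully. On SSDs you might reasonably move this
    # threshold, since seek costs no longer dominate.
    if free_pct > threshold_pct:
        return "fast/first-fit"
    return "careful/best-fit"
```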

(Space allocation isn't the only place where filesystems have or had tuning and settings for HDDs that aren't necessarily applicable to SSDs.)

Thoughts on potentially realistic temperature trip limit for hardware

By: cks
22 April 2024 at 02:46

Today one of the machine rooms that we have network switches in experienced some kind of air conditioning issue. During the issue, one of our temperature monitors recorded a high temperature of 44.1 C (it normally sees the temperature as consistently below 20C). The internal temperatures of our network switches undoubtedly got much higher than that, seeing as the one that I can readily check currently reports an internal temperature of 41 C while our temperature monitor says the room temperature is just under 20 C. Despite likely reaching very high internal temperatures, this switch (and probably others) did not shut down to protect themselves.

It's not news to system administrators that when hardware has temperature limits at all, those limits are generally set absurdly high. We know from painful experience that our switches experience failures and other problems when they get sufficiently hot during AC issues such as this, but I don't think we've ever seen a switch (or a server) shut down because of too-high temperatures. I'm sure that some of them will power themselves off if cooked sufficiently, but by that point a lot of damage will already be done.

So hardware vendors should set realistic temperature limits and we're done, right? Well, maybe not so fast. First off, there's some evidence that what we think of as typical ambient and internal air temperatures are too conservative. Google says they run data centers at 80 F or up to 95 F, depending on where you look, although this is with Google's custom hardware instead of off the shelf servers. Second, excess temperature in general is usually an exercise in probabilities and probable lifetimes; often the hotter you run systems, the sooner they will fail (or become more likely to fail). This gives you a trade off between intended system lifetime and operating temperature, where the faster you expect to replace hardware (eg in N years) the hotter you can probably run it (because you don't care if it starts dying after N+1 instead of N+2 years, in either case it'll be replaced by then).

And on the third hand, hardware vendors probably don't want to try to make tables and charts that explain all of this and, more importantly, more or less promise certain results from running their hardware at certain temperatures. It's much simpler and safer to promise less and then leave it up to (large) customers to conduct their own experiments and come up with their own results.

Even if a hardware vendor took the potential risk of setting 'realistic' temperature limits on their hardware, either they might still be way too high for us, because we want to run our hardware much longer than the hardware vendor expects, or alternately they could be too conservative and low, because we would rather take a certain amount of risk to our hardware than have everything aggressively shut down in the face of air conditioning problems (that aren't yet what we consider too severe) and take us entirely off the air.
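One way to frame the trade-off is as tiered thresholds, where the numbers are local policy decisions rather than vendor facts. The values below are made up for illustration:

```python
def temperature_action(room_temp_c, warn_c=25.0, crit_c=45.0):
    # Hypothetical policy: alert humans well before hardware limits,
    # and only shut down at a point where damage seems likely to be
    # worse than the outage. Where to put crit_c is exactly the
    # lifetime-versus-availability trade-off.
    if room_temp_c >= crit_c:
        return "shutdown"
    if room_temp_c >= warn_c:
        return "alert"
    return "ok"
```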

(And of course we haven't even considered modifying any firmware temperature limits on systems where we could potentially do that. We lack the necessary data to do anything sensible, so we just stick with whatever the vendor has set.)

Having IPv6 for public servers is almost always merely nice, not essential

By: cks
16 April 2024 at 02:22

Today on lobste.rs I saw a story about another 'shame people who don't have IPv6' website. People have made these sites before and they will make them again and as people in the comments note, it will have next to no effect. One of the reasons for that is a variant on how IPv6 has often had low user benefits.

As a practical matter, almost all servers that people want to be generally accessible need to be accessible via IPv4, because there are still a lot of places and people that are IPv4 only (including us, for various reasons). And as the inverse version of this, practically everyone needs to be able to talk to public servers that are IPv4 only, even if this requires 6-to-4 carrier grade NAT somewhere in the network. So people operating generally accessible public servers can almost never go IPv6 only, and since they have to be reachable through IPv4 and approximately everyone can talk to them over IPv4, adding IPv6 support has only a moderate benefit. Maybe some people can avoid going through carrier grade NAT; maybe some people will get to feel nicer.

(You can choose to operate a website or a service as IPv6 only, but in that case you're cutting off a potentially significant amount of your general audience. This is not something that many site and service operators are enthusiastic about. Being IPv4 only has much less effect on your audience. This is related to how IPv6 mostly benefits new people on the Internet, not incumbents. Of course IPv6 only can make sense if your target audience is narrower and you happen to know that they all have working IPv6.)

When you have a service feature that is merely nice instead of essential and which potentially involves some significant engineering complexity, is it any surprise that many organizations put it rather far down their priority list? In my view, it's basically what one would expect from both an engineering and business perspective.

(In my view the corollary to this is that general server side IPv6 adoption could be best helped by some combination of making it easier to add IPv6 and making it more useful to have IPv6. Unfortunately a whole raft of historical decisions make it hard to do much about the former, cf.)

Solving the hairpin NAT problem with policy based routing and plain NAT

By: cks
6 April 2024 at 04:06

One use of Network Address Translation (NAT) is to let servers on your internal networks be reached by clients on the public internet. You publish public IP addresses for your servers in DNS, and then have your firewall translate those public IPs to their internal IPs as the traffic passes through. If you do this with straightforward NAT rules, someone on the same internal network as those servers may show up with a report that they can't talk to those public servers. This is because you've run into what I call the problem of 'triangular' NAT, where only part of the traffic is flowing through the firewall.

The ability to successfully NAT traffic to a machine that is actually on the same network is normally called hairpin NAT (after the hairpin turn packets make as they turn around to head out the same firewall interface they arrived on). Not every firewall likes hairpin NAT or makes it easy to set up, and even if you do set it up through cleverness, using hairpin NAT necessarily means that the server won't see the real client IP address; it will instead see some IP address associated with the firewall, as the firewall has to NAT the client IP to force the server's replies to flow back through it.

However, it recently struck me that there is another way to solve this problem, by using policy based routing. If you add an additional IP address on the server, set a routing policy so that outgoing traffic from that IP can never be sent to the local network but is always sent to the firewall, and then make that IP the internal IP that the firewall NATs to, you avoid the triangular NAT problem without the firewall having to change the client IP (which means that the internal server gets to see the true client IP for its logs or other purposes). This sort of routing policy is possible with at least some policy based routing frameworks, because at one point I accidentally did this on Linux.

(You almost certainly don't want to set up this routing policy for the internal server's primary IP address, the one it will use when making its own connections to machines. I'd expect various problems to come up.)
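As a concrete sketch, on Linux this policy routing setup might look like the following generated `ip` commands. The helper function, the addresses, the interface name, and the table number are all made up for illustration:

```python
def hairpin_routing_commands(extra_ip, iface, firewall_ip, table=100):
    """Generate Linux 'ip' commands for the policy routing approach:
    add the extra IP as a /32 (so no connected route to the local
    subnet is created for it), then give traffic sourced from that IP
    its own routing table whose only route is a default via the
    firewall, forcing replies through the firewall even for clients
    on the same local network."""
    return [
        f"ip addr add {extra_ip}/32 dev {iface}",
        f"ip rule add from {extra_ip} table {table}",
        f"ip route add default via {firewall_ip} dev {iface} table {table}",
    ]

for cmd in hairpin_routing_commands("192.168.10.200", "eth0", "192.168.10.1"):
    print(cmd)
```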

You still need a firewall that will send NAT'd packets back out the same interface they came in on. Generally, routers will do this for ordinary traffic, but firewall rules on routers may come with additional requirements. However, it should be possible on any routing firewall that can do full hairpin NAT, since that also requires sending packets back out the same interface after firewall rules. I believe this is generally going to be challenging on a bridging firewall, or outright impossible (we once ran into issues with changing the destination on a bridging firewall, although I haven't checked the state of affairs today).

Why I think you shouldn't digitally sign things casually

By: cks
5 April 2024 at 04:11

Over on the Fediverse, I said:

My standard attitude on digital signatures for anything, Git commits included, is that you should not sign anything unless you understand what you're committing to when you do so. This usually includes "what people expect from you when you sign things". Signing things creates social and/or legal liability. Do not blindly assume that liability without thought, especially if people want you to.

In re: (a Fediverse post encouraging signing Git commits)

If people are asking you to sign something, they are attributing a different meaning to an unsigned thing from you than to a signed thing from you. Before you go along with this and sign, you want to understand what that difference in meaning is and whether you're prepared to actually deliver that difference in practice. Are people assuming that you have your signing key in a hardware token that you keep careful custody of? Are people assuming you take some sort of active responsibility for commits you digitally sign? What is going to happen (even just socially) if your signing key is compromised?

For a very long time, I've felt that people's likely expectations of the security of my potential digital signatures did not match up with the actual security I was prepared to provide (for example, my old entry on why I don't have a GPG key). Nothing in the modern world of security has changed my views, especially as I've become more aware of my personal limits on how much I care about security. And while it's true that a certain amount of modern security practices make things not what they're labeled, the actual reality doesn't necessarily change people's expectations.

If you understand what people are really asking you for and expecting, and you feel that you can live up to that, then sure, sign away. Or if you feel that actual problems are unlikely enough and the social benefits of signing are high enough. But don't do it blindly.

(And if you have no choice about it because some organization is insisting that you sign things if you want to publish software packages, push changes, or whatever, then you mostly have no choice. Either you can sign or you can drop out. Just remember that sometimes dropping out is the right (or the only) answer.)

PS: There is also a tangle of issues around non-repudiation that I'm not going to try to get into.

The many possible results of turning an IP address into a 'hostname'

By: cks
24 March 2024 at 03:07

One of the things that you can do with the DNS is ask it to give you the DNS name for an IP address, in what is called a reverse DNS lookup. A full and careful reverse DNS lookup is more complex than it looks and has more possible results than you might expect. As a result, it's common for system administrators to talk about validated reverse DNS lookups versus plain or unvalidated reverse DNS lookups. If you care about the results of the reverse DNS lookup, you want to validate it, and this validation is where most of the extra results come into play.

(To put the answer first, a validated reverse DNS lookup is one where the name you got from the reverse DNS lookup also exists in DNS and lists your initial IP address as one of its IP addresses. This means that the organization responsible for the name agrees that this IP is one of the IPs for that name.)

The result of a plain reverse DNS lookup can be zero, one, or even many names, or a timeout (which is in effect zero results but which takes much longer). Returning more than one name from a reverse DNS lookup is uncommon and some APIs for doing this don't support it at all, although DNS does. However, you cannot trust the name or names that result from reverse DNS, because reverse DNS lookups are done using a completely different set of DNS zones than domain names use, and as a result can be controlled by a completely different person or organization. I am not Google, but I can make reverse DNS for an IP address here claim to be a Google hostname.

(Even within an organization, people can make mistakes with their reverse DNS information, precisely because it's less used than the normal (forward) DNS information. If you have a hostname that resolves to the wrong IP address, people will notice right away; if you have an IP address that resolves to the wrong name, people may not notice for some time.)

So for each name you get in the initial reverse DNS lookup, there are a number of possibilities:

  • The name is actually an (IPv4, generally) IP address in text form. People really do this even if they're not supposed to, and your DNS software probably won't screen these out.

  • The name is the special DNS name used for that IP address's reverse DNS lookup (or at least some IP's lookup). It's possible for such names to also have IP addresses, and so you may want to explicitly screen them out and not consider them to be validated names.

  • The name is for a private or non-global name or zone. People do sometimes leak internal DNS names into reverse DNS records for public IPs.
  • The name is for what should be a public name but it doesn't exist in the DNS, or it doesn't have any IP addresses associated with it in a forward lookup.

    In both of these cases we can say the name is unknown. If you don't treat 'the name is an IP address' specially, such a name will also turn up as unknown here if you make a genuine DNS query.

  • The name exists in DNS with IP addresses, but the IP address you started with is not among the IP addresses returned for it in a forward lookup. We can say that the name is inconsistent.

  • The name exists in DNS with IP addresses, and one of those IP addresses is the IP address you started with. The name is consistent and the reverse DNS lookup is valid; the IP address you started with is really called that name.

(There may be a slight bit of complexity in doing the forward DNS lookup.)

If a reverse DNS lookup for an IP address gave you more than one name, you may only care whether there is one valid name (which gives you a name for the IP), you may want to know all of the valid names, or you may want to check that all names are valid and consider it an error if any of them aren't. It depends on why you're doing the reverse DNS lookup and validation. And you might also care about why a name doesn't validate for an IP address, or that an IP address has no reverse DNS lookup information.

Of course if you're trying to find the name for an IP address, you don't necessarily have to use a reverse DNS lookup. In some sense, the 'name' or 'names' for an IP address are whatever DNS names point to it as (one of) their IP address(es). If you have an idea what those names might be, you can just directly check them all to see if you find the IP you're curious about.

If you're writing code that validates IP address reverse DNS lookups, one reason to specifically check for and care about a name that is an IP address is that some languages have 'name to IP address' APIs that will helpfully give you back an IP address if you give them one in text form. If you don't check explicitly, you can look up an IP address, get the IP address in text form, feed it into such an API, get the IP address back again, and conclude that this is a validated (DNS) name for the IP.
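The case analysis above can be sketched as a small validation helper. The forward lookup is injected as a callable so the sketch stays self-contained; in real code it would make an actual DNS query, and all the names below are made-up examples.

```python
import ipaddress

def classify_reverse_name(ip, name, forward_lookup):
    """Classify one name returned by a reverse DNS lookup of `ip`.

    `forward_lookup` maps a hostname to a list of IP address strings,
    raising LookupError for names that don't resolve. Returns one of
    'ip-literal', 'unknown', 'inconsistent', or 'valid'.
    """
    # Screen out "names" that are really IP addresses in text form,
    # so a name-to-IP API can't hand us our own IP back as "valid".
    try:
        ipaddress.ip_address(name)
        return "ip-literal"
    except ValueError:
        pass
    try:
        addrs = forward_lookup(name)
    except LookupError:
        return "unknown"
    if not addrs:
        return "unknown"
    return "valid" if ip in addrs else "inconsistent"
```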

It's extremely common for IP addresses to have names that are unknown or inconsistent. It's also pretty common for IP addresses to not have any names, and not uncommon for reverse DNS lookups to time out because the people involved don't operate DNS servers that return timely answers (for one reason or another).

PS: It's also possible to find out who an IP address theoretically belongs to, but that's an entirely different discussion (or several of them). Who an IP address belongs to can be entirely separate from what its proper name is. For example, in common colocation setups and VPS services, the colocation provider or VPS service will own the IP, but its proper name may be a hostname in the organization that is renting use of the provider's services.

About DRAM-less SSDs and whether that matters to us

By: cks
20 March 2024 at 03:15

Over on the Fediverse, I grumbled about trying to find SATA SSDs for server OS drives:

Trends I do not like: apparently approximately everyone is making their non-Enterprise ($$$) SATA SSDs be kind of terrible these days, while everyone's eyes are on NVMe. We still use plenty of SATA SSDs in our servers and we don't want to get stuck with terrible slow 'DRAM-less' (QLC) designs. But even reputable manufacturers are nerfing their SATA SSDs into these monsters.

(By the '(QLC)' bit I meant SATA SSDs that were both DRAM-less and used QLC flash, which is generally not as good as other flash cell technology but is apparently cheaper. The two don't have to go together, but if you're trying to make a cheap design you might as well go all the way.)

In a reply to that post, @cesarb noted that the SSD DRAM is most important for caching internal metadata, and shared links to Sabrent's "DRAM & HMB" and Phison's "NAND Flash 101: Host Memory Buffer", both of which cover this issue from the perspective of NVMe SSDs.

All SSDs need to use (and maintain) metadata that tracks things like where logical blocks are in the physical flash, what parts of physical flash can be written to right now, and how many writes each chunk of flash has had for wear leveling (since flash can only be written to so many times). The master version of this information must be maintained in flash or other durable storage, but an old-fashioned conventional SSD with DRAM had some amount of DRAM that was used in large part to cache this information for fast access and perhaps fast bulk updating before it was flushed to flash. A DRAMless SSD still needs to access and use this metadata, but it can only hold a small amount of it in the controller's internal memory, which means it must spend more time reading and re-reading bits of metadata from flash and may not have as comprehensive a view of things like wear leveling or the best ready-to-write flash space.
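As a rough back-of-the-envelope sketch of why this metadata can't fit in a controller's internal memory (my own simplified model, not any specific controller's design): with the common page-level FTL scheme of one 4-byte mapping entry per 4 KiB logical page, the logical-to-physical mapping table alone scales to roughly a gigabyte per terabyte of flash.

```python
def ftl_map_size(capacity_bytes, page_size=4096, entry_size=4):
    """Rough size of a page-level FTL logical-to-physical mapping table.

    Simplified model (one fixed-size entry per logical page); real
    controllers use fancier schemes, but the scale is similar.
    """
    return (capacity_bytes // page_size) * entry_size

# A 1 TB drive: roughly a gigabyte of mapping table, which is part of
# why drives with DRAM traditionally carry about 1 GB per 1 TB of flash.
one_tb_map = ftl_map_size(10**12)  # 976562500 bytes, ~0.98 GB
```

A controller's internal SRAM is a few megabytes at most, so a DRAMless design can only ever cache a small slice of this table at a time.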

Because they're PCIe devices, DRAMless NVMe SSDs can borrow some amount of host RAM from the host (your computer), much like some or perhaps all integrated graphics 'cards' (which are also nominally PCIe devices) borrow host RAM to use for GPU purposes (the NVMe "Host Memory Buffer (HMB)" of the links). This option isn't available to SATA (or SAS) SSDs, which are entirely on their own. The operating system generally caches data read from disk and will often buffer data written before sending it to the disk in bulk, but it can't help with the SSD's internal metadata.

(DRAMless NVMe drives with a HMB aren't out of the woods, since I believe the HMB size is typically much smaller than the amount of DRAM that would be on a good NVMe drive. There's an interesting looking academic article from 2020, HMB in DRAM-less NVMe SSDs: Their usage and effects on performance (also).)

How much the limited amount of metadata affects the drive's performance depends on what you're doing, based on both anecdotes and Sabrent's and Phison's articles. It seems that the more internal metadata whatever you're doing needs, the worse off you are. The easily visible case is widely distributed random reads, where a DRAMless controller will apparently spend a visible amount of time pulling metadata off the flash in order to find where those random logical blocks are (enough so that it clearly affects SATA SSD latency, per the Sabrent article). Anecdotally, some DRAMless SATA SSDs can experience terrible write performance under the right (or wrong) circumstances and actually wind up performing worse than HDDs.

Our typical server doesn't need much disk space for its system disk (well, the mirrored pair that we almost always use); even a generous Ubuntu install barely reaches 30 GBytes. With automatic weekly TRIMs of all unused space (cf), the SSDs will hopefully easily be able to find free space during writes and not feel too much metadata pressure then, and random reads will hopefully mostly be handled by Linux's in RAM disk cache. So I'm willing to believe that a competently implemented DRAMless SATA SSD could perform reasonably for us. One of the problems with this theory is finding such a 'competently implemented' SATA SSD, since the reason that SSD vendors are going DRAMless on SATA SSDs (and even NVMe drives) is to cut costs and corners. A competent, well performing implementation is a cost too.

PS: I suspect there's no theoretical obstacle to a U.2 form factor NVMe drive being DRAMless and using a Host Memory Buffer over its PCIe connection. In practice U.2 drives are explicitly supposed to be hot-swappable and I wouldn't really want to do that with a HMB, so I suspect DRAM-less NVMe drives with HMB are all M.2 in practice.

(I also have worries about how well the HMB is protected from stray host writes to that RAM, and how much the NVMe disk is just trusting that it hasn't gotten corrupted. Corrupting internal flash metadata through OS faults or other problems seems like a great way to have a very bad day.)

Disk write buffering and its interactions with write flushes

By: cks
18 March 2024 at 01:59

Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously or when you specially request for it to be flushed to disk, which gives you general questions about how much write buffering you want. Now suppose, not hypothetically, that you're doing write IO that is pretty much always going to be specifically flushed to disk (with fsync() or the equivalent) before the programs doing it consider this write IO 'done'. You might get this situation where you're writing and rewriting mail folders, or where the dominant write source is updating a write ahead log.

In this situation where the data being written is almost always going to be flushed to disk, I believe the tradeoffs are a bit different than in the general write case. Broadly, you can never actually write at a rate faster than the write rate of the underlying storage, since in the end you have to wait for your write data to actually get to disk before you can proceed. I think this means that you want the OS to start writing out data to disk almost immediately as your process writes data; delaying the write out will only take more time in the long run, unless for some reason the OS can write data faster when you ask for the flush than before then. In theory and in isolation, you may want these writes to be asynchronous (up until the process asks for the disk flush, where you have to synchronously wait for them), because the process may be able to generate data faster if it's not stalling waiting for individual writes to make it to disk.

(In OS tuning jargon, we'd say that you want writeback to start almost immediately.)
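To make the write-then-flush pattern concrete, here's a minimal illustrative Python sketch (the function name and sizes are my own) that times the buffered writes separately from the fsync(). The writes usually return almost immediately; the real cost shows up when the flush has to wait for the disk.

```python
import os
import time

def timed_write_and_flush(path, nbytes, chunk=1 << 20):
    """Write nbytes of zeros to path, then fsync(), timing each phase.

    The buffered writes usually complete quickly; the flush is where
    the process actually waits for the disk.
    """
    buf = b"\0" * chunk
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        t0 = time.monotonic()
        written = 0
        while written < nbytes:
            written += os.write(fd, buf[: nbytes - written])
        t1 = time.monotonic()
        os.fsync(fd)  # the synchronous wait for data to reach the disk
        t2 = time.monotonic()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1  # (buffered write time, flush time)
```

On a system with lazy writeback, the gap between the two times grows with the amount of data you let pile up before flushing, which is the overhang problem discussed below.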

However, journaling filesystems and concurrency add some extra complications. Many journaling filesystems have the journal as a central synchronization point, where only one disk flush can be in progress at once and if several processes ask for disk flushes at more or less the same time they can't proceed independently. If you have multiple processes all doing write IO that they will eventually flush and you want to minimize the latency that processes experience, you have a potential problem if different processes write different amounts of IO. A process that asynchronously writes a lot of IO and then flushes it to disk will obviously have a potentially long flush, and this flush will delay the flushes done by other processes writing less data, because everything is running through the chokepoint that is the filesystem's journal.

In this situation I think you want the process that's writing a lot of data to be forced to delay, to turn its potentially asynchronous writes into more synchronous ones that are restricted to the true disk write data rate. This avoids having a large overhang of pending writes when it finally flushes, which hopefully avoids other processes getting stuck with a big delay as they try to flush. Although it might be ideal if processes with less write volume could write asynchronously, I think it's probably okay if all of them are forced down to relatively synchronous writes with all processes getting an equal fair share of the disk write bandwidth. Even in this situation the processes with less data to write and flush will finish faster, lowering their latency.

To translate this to typical system settings, I believe that you want to aggressively trigger disk writeback and perhaps deliberately restrict the total amount of buffered writes that the system can have. Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.

(This is in contrast to typical operating system settings, which will often allow you to use a relatively large amount of system RAM for asynchronous writes and not aggressively start writeback. This especially would make a difference on systems with a lot of RAM.)

Something I don't know: How server core count interacts with RAM latency

By: cks
3 March 2024 at 03:54

When I wrote about how the speed of improvement in servers may have slowed down, I didn't address CPU core counts, which is one area where the numbers have been going up significantly. Of course you have to keep those cores busy, but if you have a bunch of CPU-bound workloads, the increased core count is good for you. Well, it's good for you if your workload is genuinely CPU bound, which generally means it fits within per-core caches. One of the areas I don't know much about is how the increasing CPU core counts interact with RAM latency.

RAM latency (for random requests) has been relatively flat for a while (it's been flat in time, which means that it's been going up in cycles as CPUs got faster). Total memory access latency has apparently been 90 to 100 nanoseconds for several memory generations (although individual DDR5 memory module access is apparently only part of this, also). Memory bandwidth has been going up steadily between the DDR generations, so per-core bandwidth has gone up nicely, but this is only nice if you have the kind of sequential workloads that benefit from it. As far as I know, the kind of random access that you get from things like pointer chasing is all dependent on latency.

(If the total latency has been basically flat, this seems to imply that bandwidth improvements don't help too much. Presumably they help for successive non-random reads, and my vague impression is that reading data from successive addresses from RAM is faster than reading random addresses (and not just because RAM typically transfers an entire cache line to the CPU at once).)
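The classic way to measure this kind of latency is a pointer-chasing chain walk, where each load's address depends on the previous load so nothing can be prefetched or overlapped. Here's an illustrative Python sketch of the access pattern; interpreter overhead swamps actual DRAM latency, so a real measurement would be done in C or similar with an array much larger than the CPU caches.

```python
import random
import time

def make_chain(n):
    """Build a single random cycle: nxt[i] is the next index to visit."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt, steps):
    """Walk the chain; each step's address depends on the last load."""
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]
    return (time.perf_counter() - t0) / steps, i
```

Because the chain is one cycle through the whole array, a sufficiently large array defeats both the caches and the hardware prefetchers, leaving you with something close to raw memory latency per step (in a compiled language, at least).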

So now we get to the big question: how many memory reads can you have in flight at once with modern DDR4 or DDR5 memory, especially on servers? Where the limit falls presumably matters, since if you have a bunch of pointer-chasing workloads that are limited by 'memory latency' and you run them on a high core count system, at some point it seems that they'll run out of simultaneous RAM read capacity. I've tried to do some reading and gotten confused, which may be partly because modern DRAM is a pretty complex thing.

(I believe that individual processors and multi-socket systems have some number of memory channels, each of which can be in action simultaneously, and then there are memory ranks (also) and memory banks. How many memory channels you have depends partly on the processor you're using (well, its memory controller) and partly on the motherboard design. For example, 4th generation AMD Epyc processors apparently support 12 memory channels, although not all of them may be populated in a given memory configuration (cf). I think you need at least N (or maybe 2N) DIMMs for N channels. And here's a look at AMD Zen4 memory stuff, which doesn't seem to say much on multi-core random access latency.)

The speed of improvement in servers may have slowed down

By: cks
1 March 2024 at 03:43

One of the bits of technology news that I saw recently was that AWS was changing how long it ran servers, from five years to six years. Obviously one large motivation for this is that it will save Amazon a nice chunk of money. However, I suspect that one enabling factor for this is that old servers are more similar to new servers than they used to be, as part of what could be called the great slowdown in computer performance improvement.

New CPUs and to a lesser extent memory are somewhat better than they used to be, both on an absolute measure and on a performance per watt basis, but the changes aren't huge the way they used to be. SATA SSD performance has been more or less stagnant for years; NVMe performance has improved, but from a baseline that was already very high, perhaps higher than many workloads could take advantage of. Network speeds are potentially better but it's already hard to truly take advantage of 10G speeds, especially with ordinary workloads and software.

(I don't know if SAS SSD bandwidth and performance has improved, although raw SAS bandwidth has and is above what SATA can provide.)

For both AWS and people running physical servers (like us) there's also the question of how many people need faster CPUs and more memory, and related to that, how much they're willing to pay for them. It's long been observed that a lot of what people run on servers is not a voracious consumer of CPU and memory (and IO bandwidth). If your VPS runs at 5% or 10% CPU load most of the time, you're probably not very enthused about paying more for a VPS with a faster CPU that will run at 2.5% almost all of the time.

(Now that I've written this it strikes me that this is one possible motivation for cloud providers to push 'function as a service' computing, because it potentially allows them to use those faster CPUs more effectively. If they're renting you CPU by the second and only when you use it, faster CPUs likely mean more people can be packed on to the same number of CPUs and machines.)

We have a few uses for very fast single-core CPU performance, but other than those cases (and our compute cluster) it's hard to identify machines that could make much use of faster CPUs than they already have. It would be nice if our fileservers had U.2 NVMe drives instead of SATA SSDs but I'm not sure we'd really notice; the fileservers only rarely see high IO loads.

PS: It's possible that I've missed important improvements here because I'm not all that tuned in to this stuff. One possible area is PCIe lanes directly supported by the system's CPU(s), which enable all of those fast NVMe drives, multiple 10G or faster network connections, and so on.

Open source culture and the valorization of public work

By: cks
26 February 2024 at 04:21

A while back I wrote about how doing work that scales requires being able to scale your work, which in the open source world requires time, energy, and the willingness to engage in the public sphere of open source regardless of the other people there and your reception. Not everyone has this sort of time and energy, and not everyone gets a positive reception by open source projects even if they have it.

This view runs deep in open source culture, which valorizes public work even at the cost of stress and time. Open source culture on the one hand tacitly assumes that everyone has those available, and on the other hand assumes that if you don't do public work (for whatever reason), you are less virtuous or not virtuous at all. To be a virtuous person in open source is to contribute publicly at the cost of your time, energy, stress, and perhaps money, and to not do so is to not be virtuous (sometimes this is phrased as 'not being dedicated enough').

(Often the most virtuous public contribution is 'code', so people who don't program are already intrinsically not entirely virtuous and lesser no matter what they do.)

Open source culture has some reason to praise and value 'doing work that scales', public work; if this work does not get done, nothing happens. But it also has a tendency to demand that everyone do it and to judge them harshly when they don't. This is the meta-cultural issue behind things like the cultural expectations that people will file bug reports, often no matter what the bug reporting environment is like or if filing bug reports does any good (cf).

I feel that this view is dangerous for various reasons, including because it blinds people to other explanations for a lack of public contributions. If you can say 'people are not contributing because they're not virtuous' (or not dedicated, or not serious), then you don't have to take a cold, hard look at what else might be getting in the way of contributions. Sometimes such a cold hard look might turn up rather uncomfortable things to think about.

(Not every project wants or can handle contributions, because they generally require work from existing project members. But not all such projects will admit up front in the open that they either don't want contributions at all or they gatekeep contributions heavily to reduce time burdens on existing project members. And part of that is probably because openly refusing contributions is in itself often seen as 'non-virtuous' in open source culture.)
