A Tech Blog Diff

By: LGoto
Camel caravan in the Amatlich erg, Mauritania, Valerian Guillot

The Developer Outreach team is happy to announce that we will be migrating the Tech Blog into Diff. This move will allow us to provide better support and more visibility for the incredible work of the technical community. Diff is the community news and event blog supported by the Movement Communications team. Diff sees about 20,000 visits a month and has 1,200 email subscribers. 

  • What will happen to the Tech Blog content?
    • All Tech Blog posts will be accessible on Diff, clearly tagged with “techblog”. Old links will automatically redirect to their new location. New posts with a technical focus will be tagged with “techblog” so they will be easily discoverable. You’ll be able to find all techblog posts – old and new – on the landing page at https://diff.wikimedia.org/techblog
  • When is this happening?
    • The migration should be complete in April 2026.
  • How do I submit a blog post with a technical focus?
    • For now, please hold your post until we complete the migration.
    • After the migration is done: The process remains the same. For WMF staff, talk to your manager about your interest in writing a blog post so they are not surprised when you ask them to approve it once it is written. For folks outside WMF, if you are part of a team or other larger organization, be sure they are aware and approve. Then, see the Diff submission process and select the category “Technology” and the tag “techblog” when writing your draft. After you submit, the Developer Outreach team will review your draft. When it’s ready to go, we will schedule your post to be published.

We’re excited for the Tech Blog to evolve and thank the Movement Communications team for helping us make this possible!

cash issuing terminals

In the United States, we are losing our fondness for cash. As in many other countries, cards and other types of electronic payments now dominate everyday commerce. To some, this is a loss. Cash represented a certain freedom from intermediation, a comforting simplicity that you just don't get from Visa. It's funny to consider, then, how cash is in fact quite amenable to automation. Even Benjamin Franklin's face on a piece of paper can feel like a mere proxy for a database transaction. How different is cash itself from "e-cash", when it starts and ends its lifecycle through automation?

Increasing automation of cash reflects the changing nature of banking: decades ago, a consumer might have interacted with banking primarily through a "passbook" savings account, where transactions were so infrequent that the bank recorded them directly in the patron's copy of the passbook. Over the years, increasing travel and nationwide communications led to the ubiquitous use of inter-bank money transfers, mostly in the form of the check. The accounts that checks typically drew on—checking accounts—were made for convenience and ease of access. You might deposit your entire paycheck into an account—it might even be sent there automatically—and then when you needed a little walking-around money, you would withdraw cash with the assistance of a teller. By the time I was a banked consumer, even the teller was mostly gone. Today, we get our cash from machines so that it can be deposited into other machines.

IBM 2984 ATM

Cash handling is fraught with peril. Bills are fairly small and easy to hide, and yet quite valuable. Automation in the banking world first focused on solving this problem: reliable and secure cash handling within the bank branch. The primary measure against theft by insiders was that the theft would be discovered, as a result of the careful bookkeeping that typifies banks. But, well, that bookkeeping was surprisingly labor-intensive even in the bank of the 1950s.

Histories of the ATM usually focus on just that: the ATM. It's an interesting story, but one that I haven't been particularly inclined to cover due to the lack of a compelling angle. Let's try IBM. IBM is such an important, famous player in business automation that it forms something of a synecdoche for the larger industry. Even so, in the world of bank cash handling, IBM's efforts ultimately failed... a surprising outcome, given their dominance in the machines that actually did the accounting.

In this article, we'll examine the history of ATMs—by IBM. IBM was just one of the players in the ATM industry and, by its maturity, not even one of the more important ones. But the company has a legacy of banking products that put the ATM in a more interesting context, and despite lackluster adoption of later IBM models, their efforts were still influential enough that later ATMs inherited some of IBM's signature design concepts. I mean that more literally than you might think. But first, we have to understand where ATMs came from. We'll start with branch banking.

When you open a bank account, you typically do so at a "branch," one of many physical locations that a national bank maintains. Let us imagine that you are opening an account at your local branch of a major bank sometime around 1930; whether before or after that year's bank run is up to you. Regardless of the turbulent economic times, the branch becomes responsible for tracking the balance of your account. When you deposit money, a teller writes up a slip. When you come back and withdraw money, a different teller writes up a different slip. At the end of each business day, all of these slips (which basically constitute a journal in accounting terminology) have to be rounded up by the back office and posted to the ledger for your account, which is naturally kept as a card in a big binder.

A perfectly practicable 1930s technology, but you can already see the downsides. Imagine that you appear at a different branch to withdraw money from your account. Fortunately, this was not very common at the time, and you would be more likely to use other means of moving money in most scenarios. Still, the bank tries to accommodate. The branch at which you have appeared can dispense cash, write a slip, and then send it to the correct branch for posting... but they also need to post it to their own ledger that tracks transactions for foreign accounts, since they need to be able to reconcile where their cash went. And that ignores the whole issue of who you are, whether or not you even have an account at another branch, and whether or not you have enough money to cover the withdrawal. Those are problems that, mercifully, could mostly be sorted out with a phone call to your home branch.

Bank branches, being branches, do not exist in isolation. The bank also has a headquarters, which tracks the finances of its various branches—both to know the bank's overall financial posture (critical considering how banks fail), and to provide controls against insider theft. Yes, that means that each of the branch banks had to produce various reports and ledger copies and then send them by courier to the bank headquarters, where an army of clerks in yet another back office did yet another round of arithmetic to produce the bank's overall ledgers.

As the United States entered World War II, an expanding economy, rapid industrial buildup, and a huge increase in national mobility (brought on by things like the railroads and highways) caused all of these tasks to occur on larger and larger scales. Major banks expanded into a tiered system, in which branches reported their transactions to "regional centers" for reconciliation and further reporting up to headquarters. The largest banks turned to unit record equipment or "business machines," arguably the first form of business computing: punched card machines that did not evaluate programs, but sorted and summed.

Simple punched card equipment gave way to more advanced automation, with innovations like the "posting machine." These did exactly what they promised: given a stack of punched cards encoding transactions, they produced a ledger with accurately computed sums. Specialized posting machines were made for industries ranging from hospitality (posting room service and dining charges to room folios) to every part of finance, and might be built custom to the business process of a large customer.

If tellers punched transactions into cards, the bank could come much closer to automation by shipping the cards around for processing at each office. But then, if transactions are logged in a machine readable format, and then processed by machines, do we really need to courier them to rooms full of clerks?

Well, yes, because that was the state of technology in the 1930s. But it would not stay that way for long.

In 1950, Bank of America approached SRI about the feasibility of an automated check processing system. Use of checks was rapidly increasing, as were total account holders, and the resulting increase in inter-branch transactions was clearly overextending BoA's workforce—to such an extent that some branches were curtailing their business hours to make more time for daily closing. By then, computer technology had advanced to such a state that it was obviously possible to automate this activity, but it still represented one of the most ambitious efforts in business computing to date.

BoA wanted a system that would not only automate the posting of transactions prepared by tellers, but actually automate the handling of the checks themselves. SRI and, later, their chosen manufacturing partner General Electric ran a multi-year R&D campaign on automated check handling that ultimately led to the design of the checks that we use today: preprinted slips with the account holder's information and account number already in place. And, most importantly, certain key fields (like account number and check number) represented in a newly developed machine-readable format called "MICR," for magnetic ink character recognition. This format remains in use today, to the extent that checks remain in use, although as a practical matter MICR has given way to the more familiar OCR (aided greatly by the constrained and standardized MICR character set).

The machine that came out of this initiative was called ERMA, the Electronic Recording Machine, Accounting. I will no doubt one day devote a full article to ERMA, as it holds a key position in the history of business computing while also managing to not have much of a progeny due to General Electric's failure to become a serious contender in the computer industry. ERMA did not lead to a whole line of large-scale "ERM" business systems as GE had hoped, but it did firmly establish the role of the computer in accounting, automate parts of the bookkeeping through almost the entirety of what would become the nation's largest bank, and inspire generations of products from other computer manufacturers.

The first ERMA system went into use in 1959. While IBM was the leader in unit record equipment and very familiar to the banking industry, it took a few years for Big Blue to bring their own version to market. Still, IBM had their own legacy to build on, including complex electromechanical machines that performed some of the tasks that ERMA was taking over. Since the 1930s, IBM had produced a line of check processing or "proofing" machines. These didn't exactly "automate" check handling, but they did allow a single operator to handle a lot of documents.

The IBM 801, 802, and 803 line of check proofers used what were fundamentally unit record techniques—keypunch, sorting bins, mechanical totalizers—to present checks one at a time in front of the operator, who read information like the amount, account number, and check number off of the paper slip and entered it on a keypad. The machine then whisked the check away, printing the keyed data (and reference numbers for auditing) on the back of the check, stamped an endorsement, added the check's amounts to the branch's daily totals (including subtotals by document type), and deposited the check in an appropriate sorter bin to be couriered to the drawer's bank. While all this happened, the machines also printed the keyed check information and totals onto paper tapes.

By the early 1960s, with ERMA on the scene, IBM started to catch up. Subsequent check processing systems gained support for MICR, eliminating much (sometimes all!) of the operator's keying. Since the check proofing machines could also handle deposit slips, a branch that generated MICR-marked deposit slips could eliminate most of the human touchpoints involved in routine banking. A typical branch bank setup might involve an IBM 1210 document reader/sorter machine connected by serial channel to an IBM 1401 computer. This system behaved much like the older check proofers, reading documents, logging them, and calculating totals. But it was now all under computer control, with the flexibility and complexity that entails.

One of these setups could process almost a thousand checks a minute with a little help from an operator, and adoption of electronic technology at other stages made clerks' lives easier. For example, IBM's mid-1960s equipment introduced solid-state memory. The IBM 1260 was used for adding machine-readable MICR data to documents that didn't already have it. Through an innovation that we would now call a trivial buffer, the 1260's operator could key in the numbers from the next document while the printer was still working on the previous.

Along with improvements in branch bank equipment came a new line of "high-speed" systems. In a previous career, I worked at a Federal Reserve bank, where "high-speed" was used as the name of a department in the basement vault. There, huge machines processed currency to pick out bad bills. This use of "high-speed" seems to date to an IBM collaboration with the Federal Reserve to build machines for central clearinghouses, handling checks by the tens of thousands. By the time I found myself in central banking, the use of "high-speed" machinery for checks was a thing of the past—"digital substitute" documents or image-based clearing having completely replaced physical handling of paper checks. Still, the "high-speed" staff labored on in their ballistic glass cages, tending to the green paper slips that the institution still dispenses by the millions.

IBM service documentation

One of the interesting things about the ATM is when, exactly, it pops up in the history of computers. We are, right now, in the 1960s. The credit card is in its nascent stages; MasterCard's predecessor pops up in 1966 to compete with Bank of America's own partially ERMA-powered charge card offering. With computer systems maintaining account sums, and document processing machines communicating with bookkeeping computers in real-time, it would seem that we are on the very cusp of online transaction authorization, which must be the fundamental key to the ATM. ATMs hand out cash, and one thing we all know about cash is that, once you give yours to someone else, you are very unlikely to get it back. ATMs, therefore, must not dispense cash unless they can confirm that the account holder is "good for it." Otherwise the obvious fraud opportunity would easily wipe out the benefits.

So, what do you do? It seems obvious, right? You connect the ATM to the bookkeeping computer so it can check account balances before dispensing cash. Simple enough.

But that's not actually how the ATM evolved, not at all. There are plenty of reasons. Computers were very expensive, so banks centralized functions; not every branch had one. Long-distance computer communication links were very expensive as well, and still, in general, an unproven technology. Besides, the computer systems used by banks were fundamentally batch-mode machines, and it was difficult to see how you would shove an ATM's random interruptions into the programming model.

Instead, the first ATMs were token-based. Much like an NYC commuter of the era could convert cash into a subway token, the first ATMs were machines that converted tokens into cash. You had to have a token—and to get one, you appeared at a teller during business hours, who essentially dispensed the token as if it were a routine cash withdrawal.

It seems a little wacky to modern sensibilities, but keep in mind that this was the era of the traveler's check. A lot of consumers didn't want to carry a lot of cash around with them, but they did want to be able to get cash after hours. By seeing a teller to get a few ATM tokens (usually worth $10 or £10 and sometimes available only in that denomination), you had the ability to retrieve cash, but only carried a bank document that was thought (due to features like revocability and the presence of ATMs under bank surveillance) to be relatively secure against theft. Since the tokens were later "cleared" against accounts much like checks, losing them wasn't necessarily a big deal, as something analogous to a "stop payment" was usually possible.

Unlike subway tokens, these were not coin-shaped. The most common scheme was a paper card, often the same dimensions as a modern credit card, but with punched holes that encoded the denomination and account holder information. The punched holes were also viewed as an anti-counterfeiting measure, probably not one that would hold up today, but still a roadblock to fraudsters who would have a hard time locating a keypunch and a valid account number. Manufacturers also explored some other intriguing opportunities, like the very first production cash dispenser, 1967's Barclaycash machine. This proto-ATM used punched paper tokens that were also printed in part with a Carbon-14 ink. Carbon-14 is unstable and emits beta radiation, which the ATM detected with a simple electrostatic sensor. For some reason difficult to divine, the radioactive ATM card did not catch on.

For roughly the first decade of the "cash machine," they were offline devices that issued cash based on validating a token. The actual decision making, on the worthiness of a bank customer to withdraw cash, was still deferred to the teller who issued the tokens. Whether or not you would even consider this an ATM is debatable, although historical accounts generally do. They are certainly of a different breed than the modern online ATM, but they also set some of the patterns we still follow. Consider, for example, the ATMs within my lifespan that accepted deposits in an envelope. These ATMs did nothing with the envelopes other than accumulate them into a bin to go to a central processing center later on—the same way that early token-based ATMs introduced deposit boxes.

In this theory of ATM evolution, the missing link that made 1960s–1970s ATMs so primitive was the lack of computer systems that were amenable to real-time data processing using networked peripherals. The '60s and '70s were a remarkable era in computer history, though, seeing the introduction of IBM's System/360 and System/370 line. These machines were more powerful, more flexible, and more interoperable than any before them. I think it's fair to say that, despite earlier dabbling, it was the 360/370 that truly ushered in the era of business computing. Banks didn't miss out.

One of the innovations of the System/360 was an improved and standardized architecture for the connection of peripherals to the machine. While earlier IBM models had supported all kinds of external devices, there was a lot of custom integration to make that happen. With the System/360, this took the form of "Bisync," which I might grandly call a far ancestor of USB. Bisync allowed a 360 computer to communicate with multiple peripherals connected to a common multi-drop bus, even using different logical communications protocols. While the first Bisync peripherals were "remote job entry" terminals for interacting with the machine via punched cards and teletype, IBM and other manufacturers found more and more applications in the following years.

IBM 3214 ATM

IBM had already built document processing machines that interacted with their computers. In 1971, IBM joined the credit card fray with the 2730, a "transaction" terminal that we would now recognize as a credit card reader. It used a Bisync connection to a System/360-class machine to authorize a credit transaction in real time. The very next year, IBM took the logical next step: the IBM 2984 Cash Issuing Terminal. Like many other early ATMs, the 2984 had its debut in the UK as Lloyds Bank's "Cashpoint."

The 2984 similarly used Bisync communications with a System/360. While not the very first implementation of the concept, the 2984 was an important step in ATM security and the progenitor of an important line of cryptographic algorithms. To withdraw cash, a user inserted a magnetic card that contained an account number, and then keyed in a PIN. The 2984 sent this information, over the Bisync connection, to the computer, which then responded with a command such as "dispense cash." In some cases, the computer was immediately on the other side of the wall, but it was already apparent that banks would install ATMs in remote locations controlled via leased telephone lines—and those telephone lines were not well-secured. A motivated attacker (and with cash involved, it's easy to be motivated!) could probably "tap" the ATM's network connection and issue it spurious "dispense cash" commands. To prevent this problem, and assuage the concerns of bankers who were nervous about dispensing cash so far from the branch's many controls, IBM decided to encrypt the network connection.

The concept of an encrypted network connection was not at all new; encrypted communications were widely used in the military during the second World War and the concept was well-known in the computer industry. As IBM designed the 2984, in the late '60s, encrypted computer links were nonetheless very rare. There were not yet generally accepted standards, and cryptography as an academic discipline was immature.

IBM, to secure the 2984's network connection, turned to an algorithm recently developed by an IBM researcher named Horst Feistel. Feistel, for silly reasons, had named his family of experimental block ciphers LUCIFER. For the 2984, IBM used a modified version of one of the LUCIFER implementations called DSD-1[1]. Through a Bureau of Standards design competition and the twists and turns of industry politics, DSD-1 later reemerged (with just slight changes) as the Data Encryption Standard, or DES. We owe the humble ATM honors for its key role in computer cryptography.

The 2984 was a huge step forward. Unlike the token-based machines of the 1960s, it was pretty much the same as the ATMs we use today. To use a 2984, you inserted your ATM card and entered a PIN. You could then choose to check your balance, and then enter how much cash you wanted. The machine checked your balance in real time and, if it was high enough, debited your account immediately before coughing up money.

The 2984 was not as successful as you might expect. The Lloyds Bank rollout was big, but very few were installed by other banks. Collective memory of the 2984 is vague enough that I cannot give a definitive reason for its limited success, but I think it likely comes down to a common tale about IBM: price and flexibility. The 2984 was essentially a semi-custom peripheral, designed for Lloyds Bank and the specific System/360 environment already in place there. Adoption by other banks was quite costly. Besides, despite the ATM's lead in the UK, the US industry had quickly caught up. By the time the 2984 would be considered by other banks, there were several different ATMs available in the US from other manufacturers (some of them the same names you see on ATMs today). The 2984 is probably the first "modern" ATM, but since IBM spent 4-5 years developing it, it was not as far ahead of the curve on launch day as you might expect. Just a year or two later, a now-forgotten company called Docutel was dominating the US market, leaving IBM little room to fit in.

Because most other ATMs were offered by companies that didn't control the entire software stack, they were more flexible, designed to work with simpler host support. There is something of an inverse vertical integration penalty here: when introducing a new product, close integration with an existing product family makes it difficult to sell! Still, it's interesting that the 2984 used pretty much the same basic architecture as the many ATMs that followed. It's worth reflecting on the 2984's relationship with its host, a close dependency that generally holds true for modern ATMs as well.

The 2984 connected to its host via a Bisync channel (possibly over various carrier or modem systems to accommodate remote ATMs), a communications facility originally provided for remote job entry, the conceptual ancestor of IBM's later block-oriented terminals. That means that the host computer expected the peripheral to provide some input for a job and then wait to be sent the results. Remote job entry devices, and block terminals later, can be confusing when compared to more familiar Unix-family terminals. In some ways, they were quite sophisticated, with the host computer able to send configuration information like validation rules for input. In other ways, they were very primitive, capable of no real logic other than receiving computer output (which was dumped to cards, TTY, or screen) and then sending computer input (from much the same devices). So, the ATM behaved the same way.

In simple terms, the ATM's small display (called a VDU or Video Display Unit in typical IBM terminology) showed whatever the computer sent as the body of a "display" command. It dispensed whatever cash the computer indicated with a "dispense cash" command. Any user input, such as reading a card or entry of a PIN number, was sent directly to the computer. The host was responsible for all of the actual logic, and the ATM was a dumb terminal, just doing exactly what the computer said. You can think of the Cash Issuing Terminal as, well, just that: a mainframe terminal with a weird physical interface.
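
To make that division of labor concrete, here is a minimal sketch of the host-driven model in Python. The message names, transport, and device objects are invented for illustration; the real 2984 spoke Bisync to mainframe software, not anything resembling this.

    # Sketch of the "dumb terminal" model described above: the ATM forwards
    # every user action to the host, then does exactly what the host says.
    # All names here are hypothetical, not IBM's actual protocol.
    def atm_session(host, card_reader, keypad, display, cash_unit):
        # User input goes straight upstream...
        host.send({"type": "card", "data": card_reader.read()})
        host.send({"type": "pin", "data": keypad.read_pin()})
        # ...and the terminal acts only on explicit commands coming back.
        while True:
            command = host.receive()
            if command["type"] == "display":
                display.show(command["text"])          # screen text comes from the host
            elif command["type"] == "dispense":
                cash_unit.dispense(command["amount"])  # cash moves only on host command
            elif command["type"] == "end":
                break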

IBM 4700 series documentation

Most modern ATMs follow this same model, although the actual protocol has become more sophisticated and involves a great deal more XML. You can be reassured that when the ATM takes a frustratingly long time to advance to the next screen, it is at least waiting to receive the contents of that screen from a host computer that is some distance away or, even worse, in The Cloud.

Incidentally, you might wonder about the software that ran on the host computer. I believe that the IBM 2984 was designed for use with CICS, the Customer Information Control System. CICS will one day get its own article, but it originated in 1966, built specifically for Michigan Bell to manage customer and billing data. Over the following years, CICS was extensively expanded for use in the utility and later finance industries. I don't think it's inaccurate to call CICS the first "enterprise customer relationship management system," the first voyage in an adventure that took us through Siebel before grounding on the rocks of Salesforce. Today we wouldn't think of a CRM as the system of record for depository finance institutions like banks, but CICS itself was very finance-oriented from the start (telephone companies sometimes felt like accounting firms that ran phones on the side) and took naturally to gathering transactions and posting them against customer accounts. Since CICS was designed as an online system to serve telephone and in-person customer service reps (in fact making CICS a very notable early real-time computing system), it was also a good fit for handling ATM requests throughout the day.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


Despite the 2984's lackluster success, IBM moved on. I don't think IBM was particularly surprised by the outcome; the 2984 was always a "request quotation" (i.e., custom) product. IBM probably regarded it as a prototype or pilot with their friendly customer Lloyds Bank. More than actual deployment, the 2984's achievement was paving the way for the IBM 3614 Consumer Transaction Facility.

In 1970, IBM had replaced the System/360 line with the System/370. The 370 is directly based on the 360 and uses the same instruction set, but it came with numerous improvements. Among them was a new approach to peripheral connectivity that developed into the IBM Systems Network Architecture, or SNA, basically IBM's entry into the computer networking wars of the 1970s and 1980s. While SNA would ultimately cede to IP (with, naturally, an interregnum of SNA-over-IP), it gave IBM the foundations for networked systems that are almost modern in their look and feel.

I say almost because SNA was still very much a mainframe-oriented design. An example SNA network might look like this: An S/370 computer running CICS (or one of several other IBM software packages with SNA support) is connected via channel (the high-speed peripheral bus on mainframe computers, analogous to PCI) to an IBM 3705 Communications Controller running the Network Control Program (analogous to a network interface controller). The 3705 had one or more "scanners" installed, which supported simple low-speed serial lines or fast, high-level protocols like SDLC (synchronous data link control) used by SNA. The 3705 fills a role sometimes called a "front-end processor," doing the grunt work of polling (scanning) communications lines and implementing the SDLC protocol so that the "actual computer" was relieved of these menial tasks.

At the other end of one of the SDLC links might be an IBM 3770 Data Communications System, which was superficially a large terminal that, depending on options ordered, could include a teletypewriter, card reader and punch, diskette drives, and a high speed printer. Yes, the 3770 is basically a grown-up remote job entry terminal, and the SNA/SDLC stack was a direct evolution from the Bisync stack used by the 2984. The 3770 had a bit more to offer, though: in order to handle its multiple devices, like the printer and card punch, it acted as a sort of network switch—the host computer identified the 3770's devices as separate endpoints, and the 3770 interleaved their respective traffic. It could also perform that interleaving function for additional peripherals connected to it by serial lines, which depending on customer requirements often included additional card punches and readers for data entry, or line printers for things like warehouse picking slips.

In 1973, IBM gave banks the SNA treatment with the 3600 Finance Communication System[2]. A beautifully orange brochure tells us:

The IBM 3600 Finance Communication System is a family of products designed to provide the Finance Industry with remote on-line teller station operation.

System/370 computers represented an enormous investment, generally around a million dollars and more often above that point than below. They were also large and required both infrastructure and staff to support them. Banks were already not inclined to install an S/370 in each branch, so it became a common pattern to place a "full-size" computer like an S/370 in a central processing center to support remote peripherals (over leased telephone line) in branches. The 3600 was a turn-key product line for exactly this use.

An S/370 computer with a 3704 or 3705 running the NCP would connect (usually over a leased line) to a 3601 System, which IBM describes as a "programmable communications controller" although they do not seem to have elevated that phrase to a product name. The 3601 is basically a minicomputer of its own, with up to 20KB of user-available memory and a diskette drive. A 3601 includes, as standard, a 9600 bps SDLC modem for connection to the host, and a 9600 bps "loop" interface for a local multidrop serial bus. For larger installations, you could expand a 3601 with additional local loop interfaces or 4800 or 9600 bps modems to extend the local loop interface to a remote location via telephone line.

In total, a 3601 could interface up to five peripheral loops with the host computer over a single interleaved SDLC link. But what would you put on those peripheral loops? Well, the 3604 Keyboard Display Unit was the mainstay, with a vacuum fluorescent display and choice of "numeric" (accounting, similar to a desk calculator) or "data entry" (alphabetic) keyboard. A bank would put one of these 3604s in front of each teller, where they could inquire into customer accounts and enter transactions. In the meantime, 3610 printers provided general-purpose document printing capability, including back-office journals (logging all transactions) or filling in pre-printed forms such as receipts and bank checks. Since the 3610 was often used as a journal printer, it was available with a take-up roller that stored the printed output under a locked cover. In fact, basically every part of the 3600 system was available with a key switch or locking cover, a charming reminder of the state of computer security at the time.

The 3612 is a similar printer, but with the addition of a dedicated passbook feature. Remember passbook savings accounts, where the bank writes every transaction in a little booklet that the customer keeps? They were still around, although declining in use, in the 1970s. The 3612 had a slot on the front where an appropriately formatted passbook could be inserted, and like a check validator or slip printer, it printed the latest transaction onto the next empty line. Finally, the 3618 was a "medium-speed" printer, meaning 155 lines per minute. A branch bank would probably have one, in the back office, used for printing daily closing reports and other longer "administrative" output.

IBM 4700 series documentation

A branch bank could carry out all of its routine business through the 3600 system, including cash withdrawals. In fact, since a customer withdrawing cash would end up talking to a teller who simply keyed the transaction into a 3604, it seems like a little more automation could make an ATM part of the system.

Enter the 3614 Consumer Transaction Facility, the first IBM ATM available as a regular catalog item. The 3614 is actually fairly obscure, and doesn't seem to have sold in large numbers. Some sources suggest that it was basically the same as the 2984, but with a general facelift and adaptations to connect to a 3601 Finance Communication Controller instead of directly to a front-end processor. Some features which were optional on the 2984, like a deposit slot, were apparently standard on the 3614. I'm not even quite sure when the 3614 was introduced, but based on manual copyright dates it must have been around by 1977.

One of the reasons the 3614 is obscure is that its replacement, the IBM 3624 Consumer Transaction Facility, hit the market in 1978—probably very shortly after the 3614. The 3624 was functionally very similar to the 3614, but with maintainability improvements like convenient portable cartridges for storing cash. It also brought a completely redesigned front panel that is more similar to modern ATMs. I should talk about the front panels—the IBM ATMs won a few design awards over their years, and they were really very handsome machines. The backlit logo panel and function-specific keys of the 3624 look more pleasant to use than most modern ATMs, although they would, of course, render translation difficult.

The 3614/3624 series established a number of conventions that are still in use today. For example, they added an envelope deposit system in which the machine accepted an envelope (with cash or checks) and printed a transaction identifier on the outside of the envelope for lookup at the processing center. This relieved the user of writing up a deposit slip when using the ATM. They were also capable of not only reading but, optionally, writing to the magnetic stripes on ATM cards. To the modern reader that sounds strange, but we have to discuss one of the most enduring properties of the 3614/3624: their handling of PIN numbers.

I believe the 2984 did something fairly similar, but the details are now obscure (and seem to get mixed up with its use of LUCIFER/DSD-1/DES for communications). The 3614/3624, though, so firmly established a particular approach to PIN numbers that it is now known as the 3624 algorithm. Here's how it works: the ATM reads the card number (called Primary Account Number or PAN) off of the ATM card, reads a key from memory, and then applies a convoluted cryptographic algorithm to calculate an "intermediate PIN" from it. The "intermediate PIN" is then summed with a "PIN offset" stored on the card itself, modulo 10, to produce the PIN that the user is actually expected to enter. This means that your "true" PIN is a static value calculated from your card number and a key, but as a matter of convenience, you can "set" a PIN of your choice by using an ATM that is equipped to rewrite the PIN offset on your card. This same system, with some tweaks and a lot of terminological drift, is still in use today. You will sometimes hear IBM's intermediate PIN called the "natural PIN," the one you get with an offset of 0, which is a use of language that I find charming.
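
As a rough illustration of that arithmetic, here is a sketch in Python. The key, account number, decimalization table, and padding are all made-up values chosen for the example, and real deployments differ in how they format the PAN; only the overall shape (encrypt the account number, decimalize the result, then add the card's offset digit by digit modulo 10) follows the description above. It requires the pycryptodome package for the DES primitive.

    # Illustrative sketch of the 3624-style PIN offset scheme described above.
    from Crypto.Cipher import DES  # pycryptodome

    # One common convention maps hex digits A-F to 0-5; real tables vary.
    DECIMALIZE = str.maketrans("ABCDEF", "012345")

    def intermediate_pin(pan: str, key: bytes, length: int = 4) -> str:
        # Pack 16 digits of the account number into an 8-byte DES block
        # (a simplifying assumption made for this sketch).
        block = bytes.fromhex(pan[-16:].rjust(16, "0"))
        ciphertext = DES.new(key, DES.MODE_ECB).encrypt(block)
        return ciphertext.hex().upper().translate(DECIMALIZE)[:length]

    def customer_pin(natural: str, offset: str) -> str:
        # The PIN the user must enter is the natural PIN plus the card's
        # offset, digit by digit, modulo 10.
        return "".join(str((int(n) + int(o)) % 10) for n, o in zip(natural, offset))

    key = bytes.fromhex("0123456789ABCDEF")       # hypothetical PIN verification key
    natural = intermediate_pin("4000001234567899", key)
    print(customer_pin(natural, "0000"))          # offset 0: the "natural PIN" itself
    print(customer_pin(natural, "1234"))          # a customer-chosen offset

Seen this way, "changing your PIN" is just a matter of computing and writing a new offset to the card, which is why a machine that could rewrite the magnetic stripe was enough to let customers pick their own PINs.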

Another interesting feature of the 3624 was a receipt printer—I'm not sure if it was the first ATM to offer a receipt, but it was definitely an early one. The exact mechanics of the 3624 receipt printer are amusing and the result of some happenstance at IBM. Besides its mainframes and their peripherals, IBM in the 1970s was increasingly invested in "midrange computers" or "midcomputers" that would fill in a space between the mainframe and minicomputer—and, most importantly, make IBM more competitive with the smaller businesses that could not afford IBM's mainframe systems and were starting to turn to competitors like DEC as a result. These would eventually blossom into the extremely successful AS/400 and System i, but not easily, and the first few models all suffered from decidedly soft sales.

For these smaller computers, IBM reasoned that they needed to offer peripherals like card punches and readers that were also smaller. Apparently following that line of thought to a misguided extent, IBM also designed a smaller punch card: the 96-column three-row card, which was nearly square. The only computer ever to support these cards was the very first of the midrange line, the 1969 System/3. One wonders if the System/3's limited success led to excess stock of 96-column card equipment, or perhaps they just wanted to reuse tooling. In any case, the oddball System/3 card had a second life as the "Transaction Statement Printer" on the 3614 and 3624. The ATM could print four lines of text, 34 characters each, onto the middle of the card. The machines didn't actually punch them, and the printed text ended up over the original punch fields. You could, if you wanted, actually order a 3624 with two printers: one that presented the slip to the customer, and another that retained it internally for bank auditing. A curious detail that would so soon be replaced by thermal receipt printers.

Unlike IBM's ATMs before it, and, as we will see, unlike those after it as well, the 3624 was a hit. While IBM never enjoyed the dominance in ATMs that they did in computers, and companies like NCR and Diebold had substantial market share, the 3624 was widely installed in the late 1970s and would probably be recognized by anyone who was withdrawing cash in that era. The machine had technical leadership as well: NCR built their successful ATM line in part by duplicating aspects of the 3624 design, allowing interoperability with IBM backend systems. Ultimately, as so often happens, it may have been IBM's success that became its undoing.

In 1983, IBM completely refreshed their branch banking solution with the 4700 Finance Communication System. While architecturally similar, the 4700 was a big upgrade. For one, the CRT had landed: the 4700 peripherals replaced several-line VFDs with full-size CRTs typical of other computer terminals, and conventional computer keyboards to boot. Most radically, though, the 4700 line introduced distributed communications to IBM's banking offerings. The 4701 Communications Controller was optionally available with a hard disk, and could be programmed in COBOL. Disk-equipped 4701s could operate offline, without a connection to the host, or in a hybrid mode in which they performed some transactions locally and only contacted the host system when necessary. Local records kept by the 4701 could be automatically sent to the host computer on a scheduled basis for reconciliation.

Along with the 4700 series came a new ATM: the IBM 473x Personal Banking Machines. And with that, IBM's glory days in ATMs came crashing to the ground. The 473x series was such a flop that it is hard to even figure out the model numbers; the 4732 is most often referenced, but others clearly existed, including the 4730, 4731, 4736, 4737, and 4738. These various models were introduced from 1983 to 1988, making up almost a decade of IBM's efforts and very few sales. The 4732 had a generally upgraded interface, including a CRT, but a similar feature set—unsurprising, given that the 3624 had already introduced most of the features ATMs have today. It also didn't sell. I haven't been able to find any numbers, but the trade press referred to the 4732 with terms like "debacle," so they couldn't have been great.

There were a few faults in the 4732's stars. First, IBM had made the decision to handle the 4700 Finance Communication System as a complete rework of the 3600. The 4700 controllers could support some 3600 peripherals, but 4700 peripherals could not be used with 3600 controllers. Since 3600 systems were widely installed in banks, the compatibility choice created a situation where many of the 4732's prospective buyers would end up having to replace a significant amount of their other equipment, and then likely make software changes, in order to support the new machine. That might not have been so bad on its own had IBM's competitors not provided another way out.

NCR made their fame in ATMs in part by equipping their contemporary models with 3624 software emulation, making them a drop-in modernization option for existing 3600 systems. Other ATM manufacturers had pursued a path of interoperability, with multiprotocol ATMs that supported multiple hosts, and standalone ATM host products that could interoperate with multiple backend accounting systems. For customers, buying an NCR or Diebold product that would work with whatever they already used was a more appealing option than buying the entire IBM suite in one go. It also matched the development cycle of ATMs better: as a consumer-facing device, ATMs became part of the brand image of the bank, and were likely to see replacement more often than back-office devices like teller terminals. NCR offered something like a regular refresh, while IBM was still in a mode of generational releases that would completely replace the bank's computer systems.

IBM 3614 promo photo

The 4732 and its 473x compatriots became the last real IBM ATMs. After a hiatus of roughly a decade, IBM reentered the ATM market by forming a joint venture with Diebold called InterBold. The basic terms were that Diebold would sell its ATMs in the US, and IBM would sell them overseas, where IBM had generally been the more successful of the two brands. The IBM 478x series ATMs, which you might encounter in the UK for example, are the same as the Diebold 1000 series in the US. InterBold was quite successful, becoming the dominant ATM manufacturer in the US, and in 1998 Diebold bought out IBM's share.

IBM had won the ATM market, and then lost it. Along the way, they left us with so much texture: DES's origins in the ATM, the 3624 PIN format, the dumb terminal or thin client model... even InterBold, IBM's protracted exit, gave us quite a legacy: now you know the reason that so many later ATMs ran OS/2. IBM, a once great company, provided Diebold with their once great operating system. Unlike IBM, Diebold made it successful.

  1. Wikipedia calls it DTD-1 for some reason, but IBM sources consistently say DSD-1. I'm not sure if the name changed, if DSD-1 and DTD-1 were slightly different things, or if Wikipedia is simply wrong. One of the little mysteries of the universe.

  2. I probably need to explain that I am pointedly not explaining IBM model numbers, which do follow various schemes but are nonetheless confusing. Bigger numbers are sometimes later products but not always; some prefixes mean specific things, other prefixes just identify product lines.

forecourt networking

The way I see it, few parts of American life are as quintessentially American as buying gas. We love our cars, we love our oil, and an industry about as old as automobiles themselves has developed a highly consistent, fully automated, and fairly user-friendly system for filling the former with the latter.

I grew up in Oregon. While these rules have since been relaxed, many know Oregon for its long identity as one of two states where you cannot pump your own gas (the other being New Jersey). Instead, an attendant, an employee of the gas station, operates the equipment. Like Portland's lingering indoor gas station, Oregon's favor for "full-service" is a holdover. It makes sense, of course, that all gas stations used to be full-service.

The front part of a gas station, where the pumps are and where you pull up your car, is called the forecourt. The practicalities of selling gasoline, namely that it is a liquid sold by volume, make the forecourt more complex than you might realize. It's a set of devices that many of us interact with on a regular basis, but we rarely think about the sheer number of moving parts and long-running need for digital communications. Hey, that latter part sounds interesting, doesn't it?

Electric vehicles are catching on in the US. My personal taste in vehicles tends towards "old" and "cheap," but EVs have been on the market for long enough that they now come in that variety. Since my daily driver is an EV, I don't pay my dues at the Circle K nearly as often as I used to. One of the odd little details of EVs is the complexity hidden in the charging system or "EVSE," which requires digital communications with the vehicle for protection reasons. As consumers across the country install EVSE in their garages, we're all getting more familiar with these devices and their price tags. We might forget that, well, handling a fluid takes a lot of equipment as well... we just don't think about it, having shifted the whole problem to a large industry of loosely supervised hazardous chemical handling facilities.

Well, I don't mean to turn this into yet another discussion of the significant environmental hazard posed by leaking underground storage tanks. Instead, we're going to talk about forecourt technology. Let's start, then, with a rough, sketchy history of the forecourt.

Illustration from Triangle MicroSystems manual

The earliest volumetric fuel dispensers used an elevated glass tank where fuel was staged and measured before gravity drained it through the hose into the vehicle tank. Operation of these pumps was very manual, with an attendant filling the calibrated cylinder with the desired amount of gas, emptying it into the vehicle, and then collecting an appropriate sum of money. As an upside, the customer could be quite confident of the amount of fuel they purchased, since they could see it temporarily stored in the cylinder.

As cars proliferated in the 1910s, a company called Gilbarco developed a fuel dispenser that actually measured the quantity of fuel as it was being pumped from storage tank to vehicle... with no intermediary step in a glass cylinder required. The original Gilbarco design involved a metal turbine in a small glass sphere; the passing fuel spun the turbine which drove a mechanical counter. In truth, the design of modern fuel dispensers hasn't changed that much, although the modern volumetric turbines are made more accurate with a positive displacement design similar to a Roots blower.

Even with the new equipment, fuel was sold in much the same way: an attendant operated the pump, read the meter, and collected payment. There was, admittedly, an increased hazard of inattentive or malicious gas stations overcharging. Volumetric dispensers thus led to dispensers that automatically calculated the price (now generally a legal requirement) and the practice of a regulatory authority like the state or tribal government testing fuel dispensers for calibration. Well, if consumers were expected to trust the gas station, perhaps the gas station ought to trust the consumer... and these same improvements to fuel dispensers made it more practical for the motorist to simply pump their own gas.

At the genesis of self-serve gasoline, most stations operated on a postpayment model. You pulled up, pumped gas, and then went inside to the attendant to pay whatever you owed. Of course, a few unscrupulous people would omit that last step. A simple countermeasure spread in busy cities: the pumps were normally kept powered off. Before dispensing gasoline, you would have to speak with the attendant. Depending on how trustworthy they estimated you to be, they might just turn on power to the pump or they might require you to deposit some cash with them in advance. This came to be known as "prepayment," and is now so universal in the US that the "prepay only" stickers on fuel dispensers seem a bit anachronistic[1].

It's simple enough to imagine how this scheme worked, electronically. There was separate power wiring to the pumps for each dispenser (and these stations usually only had two dispensers anyway), and that wiring ran to the counter where the attendant could directly switch power. Most gas stations do use submersible pumps in the tank rather than in the actual dispenser, but older designs still had one pump per dispenser and were less likely to use submersible pumps anyway.

Soon, things became more complex. Modern vehicles have big gas tanks, and gas has become fairly expensive. What happens when a person deposits, say, $20 of "earnest cash" to get a pump turned on, and then pumps $25 worth of gas? Hopefully they have the extra $5, but the attendant doesn't know that. Besides, gas stations grew larger and it wasn't always feasible for the attendant to see the dispenser counters out the window. You wouldn't want to encourage people to just lie about the amount of gas they'd dispensed.

Gas stations gained remote control: using digital communications, fuel dispensers reported the value of their accumulators to a controller at the counter. The attendant would use the same controller to enable a dispenser, potentially setting a limit at which the dispenser would automatically shut off. If you deposit $20, they enable the pump with a limit of $20. If you pay by card, they will likely authorize the card for a fixed amount (this used to routinely be $40 but has gone up for reasons you can imagine), enable the dispenser with no limit or a high limit, and then capture the actual amount after you finish dispensing[2].
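
A sketch of the controller's bookkeeping for those two cases might look like the following; the function and object names are hypothetical rather than any vendor's actual API, but the difference between cash prepay and card preauthorization is just where the limit comes from and when the final amount is settled.

    # Illustrative sketch of the prepay / preauthorization flows described above.
    def cash_prepay(controller, pump, cash_deposited):
        # Cash prepay: enable the dispenser with a hard cutoff at the amount
        # the customer handed over.
        controller.enable(pump, limit=cash_deposited)

    def card_sale(controller, payment_host, pump, card, hold=125.00):
        # Card sale: place a hold for a fixed amount, enable the pump with a
        # generous limit, then capture only what was actually dispensed.
        authorization = payment_host.authorize(card, amount=hold)
        controller.enable(pump, limit=hold)
        dispensed = controller.wait_for_total(pump)   # total reported by the dispenser
        payment_host.capture(authorization, amount=dispensed)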

And that's how gas stations worked for quite a few decades. Most gas stations that you use today still have this exact same system in operation, but it may have become buried under additional layers of automation. There are two things that have caused combinatorial complexity in modern forecourt control: first, any time you automate something, there is a natural desire to automate more things. With a digital communications system between the counter and the forecourt, you can do more than just enable the dispensers! You might want to monitor the levels in the tanks, update the price on the big sign, and sell car wash vouchers with a discount for a related fuel purchase. All of these capabilities, and many more, have been layered on to forecourt control systems through everything from serial bus accessories to REST API third party integrations.

Speaking of leaking underground storage tanks, you likely even have a regulatory obligation to monitor tank levels and ensure they balance against bulk fuel deliveries and dispenser totals. This detects leakage, but it also detects theft, still a surprisingly common problem for gas stations. Your corporate office, or your bulk fuel provider, may monitor these parameters remotely to schedule deliveries and make sure that theft isn't happening with the cooperation of the station manager. Oh, and prices, those may be set centrally as well.

The second big change is nearly universal "CRIND." This is an awkward industry acronym for everyone's favorite convenience feature, Card Reader IN Dispenser. CRIND fuel dispensers let payment card customers complete the whole authorize, dispense, and capture process right at the dispenser, without coming inside at all. CRIND is so common today that it's almost completely displaced even its immediate ancestor, "fuel island" outdoor payment terminals (OPTs) that provide a central kiosk where customers make payments for multiple dispensers. This used to be a pretty common setup in California where self-service caught on early but, based on my recent travels, has mostly evaporated there.

So you can see that we have a complicated and open-ended set of requirements for communication and automation in the forecourt: enabling and monitoring pumps, collecting card payments, and monitoring and controlling numerous accessories. Most states also require gas stations to have an intercom system so that customers can request help from the attendant inside. Third-party loyalty systems were briefly popular although, mercifully, the more annoying of them have mostly died out... although only because irritating advertising-and-loyalty technology has been better integrated into the dispensers themselves.

Further complicating things, gas station forecourts are the epitome of legacy integration. Fuel dispensers are expensive, concrete slabs are expensive, and gas stations run on thin margins. While there aren't very many manufacturers of fuel dispensers, or multi-product dispensers as they're typically called today, the industry of accessories, control systems, and replacement parts is vast. Most gas stations have accumulated several different generations of control systems and in-dispenser accessories like tree rings. New features like CRIND, chip payment, touchless payment, and "Gas Station TV" have each motivated another round of new communications protocols.

And that's how we get to our modern world, where the brochure for a typical gas station forecourt controller lists 25+ different communications protocols—and assures that you can use "any mix."

Variability between gas stations increases when you consider the differing levels of automation available. It used to be common for gas stations to use standalone pump controllers that didn't integrate with much else—when you prepaid, for example, the cashier would manually enter the pump number and prepayment limit on a separate device from the cash register.

Here in New Mexico, quite a few stations used to use the Triangle MicroSystems MPC family, a wedge-shaped box with an industrial-chic membrane keypad in grey and bright red. Operation of the MPC is pretty simple, basically pressing a pump number and then entering a dollar limit. Of course, the full set of features runs much deeper, including financial reporting and fleet fueling contracts.

This is another important dimension of the gas station control industry: fleet fueling. It used to be that gas stations were divided into two categories, consumer stations that took cash payment and "cardlock" stations that used an electronic payment system. Since cardlock stations originally relied on proprietary, closed payment agreements, they didn't sell to consumers and had different control requirements (often involving an outside payment terminal). As consumers widely adopted card payments, the lines between the two markets blurred. Modern cardlock fueling networks, like CFN and Wex, are largely just another set of payment processors. Most major gas stations participate in most major cardlock networks, just the same as they participate in most major ATM networks for lower-cost processing of debit cards.

Of course, more payment networks call for more integrations. The complexity of the modern payment situation has generally outgrown standalone controllers, and they seem to be fading away. Instead, the typical gas station today has forecourt control completely integrated into their POS system. Forecourt integration is such an important requirement that gas station convenience stores, mostly handling normal grocery-type transactions, nevertheless rely almost exclusively on dedicated gas station POS solutions. In other words, next time you buy a can of Monster and a bag of chips, the cashier most likely rings you up and takes payment through a POS solution offered by the dispenser manufacturer (like Gilbarco Passport Retail) or one of dozens of vendors that caters specifically to gas stations (including compelling names like Petrosoft). Control of fuel dispensers is just too weird of a detail to integrate into other POS platforms... or so it was thought, although things clearly get odd as Gilbarco has to implement basic kitchen video system integration for the modern truck stop.

So how does this all work technically? That's the real topic of fascination, right? Well, it's a mess and hard to describe succinctly. There are so many different options, and particularly legacy retrofit options, that one gas station will be very different from the next.

In the days of "mechanical pumps," simple designs with mechanical counters, control wiring was simple: the dispenser (really a mechanical device called a pulser) was expected to provide "one pulse per penny" of product dispensed on a counting circuit, each pulse incrementing a synchronized counter on the controller. For control the other way, the controller just closed relays to open "fast" or "slow" valves on the dispenser. The controller might also get a signal when a handle lever was activated, to alert the attendant that someone was trying to use a dispenser, but that was about it.
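
To give a sense of how little intelligence this scheme required, here's a minimal sketch of the controller's side of a pulse-per-penny interface. It's hypothetical, modern Python rather than anything a real controller ran, but the logic really was about this simple:

    # Minimal sketch of a pulse-per-penny sale counter, as a hypothetical
    # controller might implement it. Each pulse from the dispenser's pulser
    # represents one cent of product dispensed.
    class PumpChannel:
        def __init__(self, pump_number):
            self.pump_number = pump_number
            self.cents = 0           # running total for the current sale
            self.limit_cents = None  # optional prepay limit
            self.authorized = False  # relay state: fast/slow valves enabled

        def authorize(self, limit_cents=None):
            """Close the control relays so the dispenser's valves can open."""
            self.authorized = True
            self.limit_cents = limit_cents
            self.cents = 0

        def pulse(self):
            """Called once per pulse on the counting circuit (one cent)."""
            if not self.authorized:
                return
            self.cents += 1
            # A prepay limit is enforced by simply dropping the relays.
            if self.limit_cents is not None and self.cents >= self.limit_cents:
                self.authorized = False

        def total(self):
            return self.cents / 100.0  # dollars dispensed this sale

    # A $10.00 prepay on pump 3:
    pump3 = PumpChannel(3)
    pump3.authorize(limit_cents=1000)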

Later on, particularly as multi-product dispensers with two hoses and four rates (due to diesel and three grades) became common, wiring all the different pulse and valve circuits became frustrating. Besides, pumps with digital counters no longer needed mechanical adjustment when prices changed, allowing for completely centralized price calculation. To simplify wiring while enabling new features, fuel dispenser manufacturers introduced simple current-loop serial buses. These are usually implemented as a single loop that passes through each dispenser, carrying small packets with addressed commands or reports, usually at a pretty low speed. The dispensers designed for use with these systems are much more standalone than the older mechanical dispensers, and perform price accumulation internally, so they only needed to report periodic totals during fueling and at the end of the transaction.
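
The actual wire protocols are proprietary and differ between manufacturers, but the general shape is a poll/response cycle over the shared loop. Here's a rough sketch with entirely invented framing and addresses, just to illustrate the pattern; real dispenser protocols look nothing like this in detail:

    # Rough sketch of a poll/response cycle on a shared current-loop bus. The
    # framing, addresses, and status codes are invented for illustration.

    # Stand-in for the loop itself: in a real controller this would be a slow
    # serial write followed by a timed wait for the addressed dispenser's reply.
    def bus_transact(request, simulated_dispensers):
        address = request[1]
        dispenser = simulated_dispensers.get(address)
        if dispenser is None:
            return None  # no reply before the timeout
        status, cents = dispenser
        return bytes([status]) + cents.to_bytes(3, "big")

    def polling_cycle(simulated_dispensers):
        """Walk the loop, asking each dispenser for its running sale total."""
        for address in sorted(simulated_dispensers):
            request = bytes([0x01, address])  # hypothetical "report status" frame
            reply = bus_transact(request, simulated_dispensers)
            if reply is None:
                continue  # dispenser offline
            status, cents = reply[0], int.from_bytes(reply[1:], "big")
            print(f"pump {address}: status {status:#04x}, ${cents / 100:.2f} so far")

    # Two dispensers mid-sale, one idle:
    polling_cycle({1: (0x10, 1234), 2: (0x00, 0), 3: (0x10, 875)})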

An upside of these more standalone dispensers is that they made CRIND easier to implement: the payment terminal in the dispenser could locally enable the pump, including setting limits, by a direct interface to the pump controller. Still, the CRIND needed some way to actually authorize and report transactions. Solution: another current loop. Most CRIND installations involved a second, similar, but usually higher-speed serial bus that specifically handled payment processing. The CRIND terminals in such a system usually communicated with a back-office payment server using a very simple protocol, sending card information in the clear. That back-office server might be in the back of the convenience store, but it could also be remote.

As gas stations introduced CRIND, plastic card sales became a key part of the business. Card volume is much greater than cash volume at most stations, and it's known that customers will often leave rather than go inside if there is a problem with CRIND payment. So gas stations prioritized reliability of payments. To this day, if you look at the roof of many gas stations, you'll find a small parabolic antenna pointed aimlessly skywards. By the end of the 1990s, many chain gas stations used satellite networks for payment processing, either routinely (cheaper than a leased telephone line!) or as a contingency. Cisco's VSAT terminal modules for edge routers, combined with a boutique industry of Mbps-class data networks on leased transponders, made satellite a fairly inexpensive and easy-to-obtain option for handling small payment processing messages.

This arrangement of one current loop for dispenser control and one current loop for payment terminals lasted for long enough that it became a de facto wiring standard for the gas station forecourt. New construction gas stations provided conduits from the convenience store to the pumps, and those conduits were usually spec'd for an A/C power circuit (controlled, per code, by an emergency stop button) and two low-voltage data circuits. The low-voltage data circuits were particularly important because the electrical code and fire code impose specific rules on electrical systems used in proximity to flammable fluids—what's called a "hazardous environment" in the language of safety codes. Dispenser manufacturers sold specialized field interconnection enclosures that isolated the data circuits to the required safety standard, lowering the complexity of the installation in the dispensers themselves 3.

Illustration from Gilbarco manual

The next event to challenge forecourt infrastructure was the introduction of EMV chip and tap-to-pay payment cards. Many Americans will remember how fuel dispensers routinely had tap-to-pay terminals physically installed for years, even a decade, before they actually started working. Modernizing dispensers usually meant installing a new CRIND system with EMV support, but upgrades to the underlying network to support them took much longer. The problem was exactly the simplicity of the CRIND current loop design: EMV standards required that all data be encrypted (you couldn't just send card numbers to the backend in the clear as older systems did), and required larger and more numerous messages between the payment network, the terminal, and the card itself. Even if supporting EMV transactions on the serial bus was possible, most manufacturers chose not to, opting for the vastly simpler design of direct IP connectivity to each CRIND terminal.

But how do you put IP over a simple two-wire serial bus? Well, there are a lot of options, and the fuel dispenser industry chose basically all of them. There were proprietary solutions, but more common were IP networking technologies adapted to the forecourt application. Consider DSL: for a good decade, many forecourt interconnection boxes and fuel dispenser controllers supported good old fashioned DSL over the payment loop (not to be confused with DSL as in Diesel, an abbreviation also used in the fuel industry).

Bandwidth requirements increased yet further, though, with the introduction of Maria Menounos. "Forecourt media" advertising systems can deliver full video to each dispenser, a golden opportunity to pitch credit cards and monetize something called "Cheddar." While there was a long era of satellite transponders delivering analog video to chains for in-store marketing (I will one day write about WalMart TV), the "GSTV" phenomenon is newer and completely internet-based. For HD video you need a little better than the 5Mbps performance that industrial DSL systems were delivering. Enter HomePlug.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


Despite HomePlug's limited market success, it has been widely adopted in gas station forecourts. The advantage of HomePlug is clear: it dispenses with the control wiring loops entirely, providing IP communications with dispensers over the electrical supply wiring. It usually presents an almost zero-wiring upgrade, just adding HomePlug boards on both ends, so even in stations with good forecourt serial loops, dispenser upgrades often end in a switch to HomePlug.

The most interesting thing about these networks is just how modular it all still is: somewhere in your local gas station, there is a forecourt controller. Depending on the age of the system, that might be a bespoke embedded system with plug-in modules, or it might be a generic N100 Mini PC with a few serial ports and mostly IP connectivity. There is likely a forecourt interconnection box that holds not just the wiring terminals but also adapter boards that convert between various serial protocols, IP carriers, and control signals. The point of sale backend server might interact with the forecourt controller via IP, but older systems used RS-232... and systems in between might use the same logical protocol as they did with RS-232, but encapsulated in TCP. The installation manuals for all of these products include pages of wiring diagrams for each different scenario.
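
To give a sense of how thin that TCP encapsulation can be, here's a minimal sketch of the kind of serial-to-TCP bridge an adapter board might implement. The device name, host, and TCP port are made up, and I'm using pyserial for the serial side; the point is that the logical protocol passes through untouched:

    # Minimal sketch of a serial-to-TCP bridge: the logical protocol is untouched,
    # its bytes are just shuttled between an RS-232 port and a TCP socket.
    # Device name, host, and port number here are hypothetical.
    import socket
    import serial  # pyserial

    def bridge(serial_port="/dev/ttyS0", baud=9600, host="192.0.2.10", tcp_port=5001):
        ser = serial.Serial(serial_port, baud, timeout=0)  # non-blocking reads
        sock = socket.create_connection((host, tcp_port))
        sock.settimeout(0.05)
        while True:
            # serial -> TCP
            data = ser.read(256)
            if data:
                sock.sendall(data)
            # TCP -> serial
            try:
                data = sock.recv(256)
                if data:
                    ser.write(data)
            except socket.timeout:
                pass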

Next time you stop at a gas station and find the CRIND not working, think about all of that: whatever technician comes out to fix it will have their work cut out for them, just to figure out which way that gas station is set up.

  1. In more rural areas of poorer states such as my own, you will still find gas stations where the attendant turns the pump on after eyeing you. These are mostly stations that just haven't had the money to install newer equipment, which as we will see can be a big project. I have lived here for about a decade, long enough to have noticed a significant decline in the number of these stations still operating.

  2. For most payment card technologies, "authorizing" and "capturing" are separate steps that can be done with different dollar amounts. This model of paying for gas is one of the reasons why.

  3. For example, UL standards require physical separation between mains voltage wiring and plumbing components inside of fuel dispenser enclosures. The enclosures are actually rather crowded spaces, so that can turn into a real hassle—and a selling point for low-voltage-only control systems. Fuel dispenser enclosures are also required to contain a fuel fire due to leaking plumbing, which is why you see fairly heavy sheet metal construction with the sides forming chimney-like vents.

the essence of frigidity

The front of the American grocery store contains a strange, liminal space: the transitional area between parking lot and checkstand, along the front exterior and interior of the building, that fills with oddball commodities. Ice is a fixture at nearly every store, filtered water at most, firewood at some. This retail purgatory, both too early and too late in the shopping journey for impulse purchases, is mostly good only for items people know they will need as they check out. One of the standard residents of this space has always struck me as peculiar: dry ice.

Carbon dioxide ice is said to have been invented, or we might better say discovered, in the 1830s. For whatever reason, it took just about a hundred years for the substance to be commercialized. Thomas B. Slate was a son of Oregon, somehow ended up in Boston, and then realized that the solid form of CO2 was both fairly easy to produce and useful as a form of refrigeration. With an eye towards marketing, he coined the name Dry Ice—and founded the DryIce Corporation of America. The year was 1925, and word quickly spread. In a widely syndicated 1930 article, "Use of Carbon Dioxide as Ice Said to be Developing Rapidly," the Alamogordo Daily News and others reported that "the development of... 'concentrated essence of frigidity' for use as a refrigerant in transportation of perishable products, is already taxing the manufacturing facilities of the Nation... So rapidly has the use of this new form of refrigeration come into acceptance that there is not sufficient carbon dioxide gas available."

The rush to dry ice seems strange today, but we must consider the refrigeration technology of the time. Refrigerated transportation first emerged in the US during the middle of the 19th century. Train boxcars, packed thoroughly with ice, carried meat and fruit from midwestern agriculture to major cities. This type of refrigerated transportation greatly expanded the availability of perishables, and the ability to ship fruits and vegetables between growing regions made it possible, for the first time, to get some fresh fruit out of season. Still, it was an expensive proposition: railroads built extensive infrastructure to support the movement of trains loaded down with hundreds of tons of ice. The ice itself had to be quarried from frozen lakes, some of them purpose-built, a whole secondary seasonal transportation economy.

Mechanical refrigeration, using some kind of phase change process as we are familiar with today, came about a few decades later and found regular use on steamships by 1900. Still, this refrigeration equipment was big and awkward; steam power was a practical requirement. As the Second World War broke out, tens of thousands of refrigerated railcars and nearly 20,000 refrigerated trucks were in service—the vast majority still cooled by ice, not mechanical refrigeration.

You can see, then, the advantages of a "dryer" and lighter form of ice. The sheer weight of the ice significantly reduced the capacity of refrigerated transports. "One pound of carbon dioxide ice at 110 degrees below zero is declared to be equivalent to 16 pounds of water ice," the papers explained, for the purposes of transportation. The use of dry ice could reduce long-haul shipping costs for fruit and vegetables by 50%, the Department of Commerce estimated, and dry ice even opened the door to shipping fresh produce from the West Coast to the East—without having to "re-ice" the train multiple times along the way. Indeed, improvements in refrigeration would remake the American agricultural landscape. Central California was being irrigated so that produce could grow, and refrigeration would bring that produce to market.

1916 saw the American Production Company drilling on the dusty plains of northeastern New Mexico, a few miles south of the town of Bueyeros. On the banks of an anonymous wash, in the shadow of Mesa Quitaras, they hoped to strike oil. Instead, at about 2,000 feet, they struck something else: carbon dioxide. The well blew wide open, and spewed CO2 into the air for about a year, the production estimated at 25,000,000 cubic feet of gas per day under natural pressure. For American Production, this was an unhappy accident. They could identify no market for CO2, and a year later, they brought the well under control, only to plug and abandon it permanently.

Though the "No. 1 Bueyeros" well was a commercial failure at the time, it was not wasted effort. American Production had set the future for northeastern New Mexico. There was oil, if you looked in the right place. American Production found its own productive wells, and soon had neighbors. Whiting Brothers, once operator of charismatic service stations throughout the Southwest and famously along Route 66, had drilled their own wells by 1928. American Production became part of British Petroleum. Breitburn Production of Texas has now consolidated much of the rest of the field, and more than two million cubic feet of natural gas come from northeastern New Mexico each month.

If you looked elsewhere, there was gas—not natural gas, but CO2. Most wells in the region produced CO2 as a byproduct, and the less fortunate attempts yielded nothing but CO2. The clear, non-flammable gas was mostly a nuisance in the 1910s and 1920s. By the 1930s, though, promotion by the DryIce Corporation of America (in no small part through the Bureau of Commerce) had worked. CO2 started to be seen as a valuable commodity.

Harding County dry ice plant

The production of dry ice is deceptively simple. Given my general knowledge about producing and handling cryogenic gases, I was surprised to read of commercial-scale production with small plants in the 1930s. There is, it turns out, not that much to it. One of the chief advantages of CO2 as an industrial gas is how readily it liquefies: its critical temperature sits just above room temperature, and its critical pressure is modest. If you take yourself back to high school chemistry, and picture a phase diagram, we can think about liquefying the CO2 gas coming out of a well. The triple point of carbon dioxide, the lowest pressure and temperature at which the liquid phase can exist at all, is at around -57 Celsius and 5 atmospheres. The critical point, beyond which CO2 becomes a supercritical fluid that is neither quite gas nor liquid, is at only about 31 degrees Celsius and 73 atmospheres. In terms more familiar to us Americans, that's about 88 degrees F and a little over 1,000 PSI.
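
Those figures are easy enough to sanity-check with the usual conversions:

    # Quick check of the critical-point figures in more American units.
    def c_to_f(c):
        return c * 9 / 5 + 32

    def atm_to_psi(atm):
        return atm * 14.696

    print(f"{c_to_f(31):.0f} F")        # ~88 F
    print(f"{atm_to_psi(73):.0f} PSI")  # ~1073 PSI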

In other words, CO2 gas becomes a liquid at temperatures and pressures that were readily achievable, even with the early stages of chemical engineering in the 1930s. With steam-powered chillers and compressors, it wasn't difficult to produce liquid CO2 in bulk. But CO2 makes the next step even more convenient: liquid CO2, released into open air, boils very rapidly. As it bubbles away, the phase change absorbs energy, leaving the remaining liquid CO2 even colder. Some of it freezes into ice: much as evaporating seawater leaves behind salt, evaporating liquid CO2 leaves behind a snow-like mass of flaky, loose CO2 ice. Scoop that snow up, pack it into forms, and use steam power or weight to compress it, and you have a block of the product we call dry ice.

The Bueyeros Field, as it was initially known, caught the interest of CO2 entrepreneurs in 1931. A company called Timmons Carbonic, or perhaps Southern Dry Ice Company (I suspect these to be two names for the same outfit), produced a well about a mile east, up on the mesa.

Over the next few years, the Estancia Valley Carbon Dioxide Development Company drilled a series of wells to be operated by Witt Ice and Gas. These were located in the Estancia field, further southwest and closer to Albuquerque. Witt built New Mexico's first production dry ice plant, which operated from 1932 to 1942 off of a pipeline from several nearby wells. Low pressure and difficult drilling conditions in the Estancia field limited the plant's output, so by the time it shut down Witt had already built a replacement. This facility, known as the Bueyeros plant, produced 17 tons of dry ice per day starting in 1940. It is located just a couple of miles from the original American Production well, north of Mesa Quitaras.

About 2,000' below the surface at Bueyeros lies the Tubb Sandstone, a loose aggregation of rock stuck below the impermeable Cimarron Anhydrite. Carbon dioxide can form underground through several processes, including the breakdown of organic materials under great heat and pressure (a process that creates petroleum oil as well) and chemical reactions between different minerals, especially when volcanic activity causes rapid mixing with plenty of heat. There are enough mechanisms of formation, either known or postulated, that it's hard to say where exactly the CO2 came from. Whatever its source, the gas flowed upwards underground into the sandstone, where it became trapped under the airtight layer of Anhydrite. It's still there today, at least most of it, and what stands out in particular about northeastern New Mexico's CO2 is its purity. Most wells in the Bueyeros field produce 99% pure CO2, suitable for immediate use.

Near Solano, perhaps 20 miles southwest of Bueyeros by air, the Carbonic Chemical Co built the state's largest dry ice plant. Starting operation in 1942, the plant seems to have initially gone by the name "Dioxice," immortalized as a stop on the nearby Union Pacific branch. Dioxice is an occasional synonym for Dry Ice, perhaps intended to avoid the DryIce Corporation's trademark, although few bothered. The Carbonic Chemical Plant relied on an 18-mile pipeline to bring gas from the Bueyeros field. Uniquely, this new plant used a "high pressure process." Feeding the plant only from high-pressure wells (hundreds of PSI, as much as 500 PSI of natural pressure at some) made the pipeline more efficient and reliable. Further, the already high pressure of the gas appreciably raised the temperature at which it would liquefy.

The Carbonic Chemical plant's ammonia chillers only had to cool the CO2 to -15 degrees F, liquifying it before spraying it into "snow chambers" that filled with white carbon dioxide ice. A hydraulic press, built directly into the snow chamber, applied a couple of hundred tons of force to create a solid block of dry ice weighing some 180 pounds. After a few saw cuts, the blocks were wrapped in paper and loaded onto insulated train cars for delivery to customers throughout the west—and even some in Chicago.

The main application of CO2, a 1959 New Mexico Bureau of Mines report explains, was dry ice for shipping. Secondarily, liquid CO2 was shipped in tanks for use in carbonating beverages. Witt Ice and Gas in particular built a good business out of distributing liquid CO2 for beverage and industrial use, and for a time was a joint venture with Chicago-based nationwide gas distributor Cardox. Bueyeros's gas producers found different customers over time, so it is hard to summarize their impact, but we know some salient examples. Most beverage carbonation in mid-century Denver, and perhaps all in Albuquerque, used Bueyeros gas. Dry ice from Bueyeros was used to pack train cars passing through from California, and accompanied them all the way to the major cities of the East Coast.

By the 1950s, much of the product went to a more modern pursuit. Experimental work pursued by the military and the precursors to the Department of Energy often required precise control of low temperatures, and both solid and liquid CO2 were suitable for the purpose. In the late 1950s, Carbonic Chemical listed Los Alamos Scientific Laboratory, Sandia Laboratories, and White Sands Missile Range as their primary customers.

Bueyeros lies in Harding County, New Mexico. Harding County is home to two incorporated cities (Roy and Mosquero), a couple of railroad stops, a few highways, and hardly 650 people. It is the least populous county of New Mexico, but it's almost the size of Delaware. Harding County has never exactly been a metropolis, but it used to be a more vital place. In the 1930s, as the CO2 industry built out, there were almost 4,500 residents. Since then, the population has declined about 20% from each census to the next.

Harding County dry ice plant

CO2 production went into a similar decline. After the war, significant improvements in refrigeration technology made mechanical refrigeration inevitable, even for road transportation. Besides, the growing chemical industry had designed many industrial processes that produced CO2 as a byproduct. CO2 for purposes like carbonation and gas blanketing was often available locally at lower prices than shipped-in well CO2, leading to a general decline in the CO2 industry.

Growing understanding of New Mexico geology and a broader reorganization of the stratigraphic nomenclature led the Bueyeros Field to become part of the Bravo Dome. Bravo Dome CO2 production in the 1950s and 1960s was likely supported mostly by military and weapons activity, as by the end of the 1960s the situation once again looked much like it did in the 1910s: the Bravo Dome had a tremendous amount of gas to offer, but there were few applications. The rate of extraction was limited by the size of the market. Most of the dry ice plants closed, contributing, no doubt, to the depopulation of Harding County.

The whole idea of drilling for CO2 is now rather amusing. Our modern problems are so much different: we have too much CO2, and we're producing even more without even intending to. It has at times seemed like the industry of the future will be putting CO2 down into the ground, not taking it out. What happened out in Harding County was almost the opening of Pandora's box. A hundred years ago, before there was a dry ice industry in the US, newspaper articles already speculated as to the possibility of global warming by CO2. At the time, it was often presented as a positive outcome: all the CO2 released by burning coal would warm the environment and thus reduce the need for that coal, possibly even a self-balancing problem. It's even more ironic that CO2 was extracted mostly to make things colder, given the longer-term consequences. Given all that, you would be forgiven for assuming that drilling for CO2 was a thing of the past.

The CO2 extraction industry has always been linked to the oil industry, and oil has always been boom and bust. In 1982, there were 16 CO2 wells operating in the Bravo Dome field. At the end of 1985, just three years later, there were 258. Despite the almost total collapse of demand for CO2 refrigeration, demand for liquid CO2 was higher than ever. It turns out that American Production hadn't screwed up in 1917, or at least they wouldn't have, had they known a little more about petroleum engineering.

In 1972, the Scurry Area Canyon Reef Operators Committee of West Texas started an experiment, attempting industrial application of a technique first proposed in the 1950s. Through a network of non-productive oil wells in the Permian Basin, they injected liquid CO2 deep underground. The rapidly evaporating liquid raised the pressure in the overall oil formation, and even lubricated and somewhat fractured the rock, all of which increased the flow rate at nearby oil wells. A decade later, the concept was proven, and CO2 Enhanced Oil Recovery (EOR) swept across the Permian Basin.

Today, it is estimated that about 62% of the global industrial production of CO2 is injected into the ground somewhere in North America to stimulate oil production. The original SACROC system is still running, now up to 414 injection wells. There are thousands more. Every day, over two billion cubic feet of CO2 are forced into the ground, pushing back up 245,000 barrels of additional oil.

British Petroleum's acquisition of American Production proved fortuitous. BP became one of the country's largest producers of CO2, extracted from the ground around Bueyeros and transported by pipeline directly to the Permian Basin for injection. In 2000, BP sold their Bravo Dome operations to Occidental Petroleum 1. Now going by Oxy, the petroleum giant has adopted a slogan of "Zero In". That's zero as in carbon emissions.

I would not have expected to describe Occidental Petroleum as "woke," but in our contemporary politics they stand out. Oxy mentions "Diversity, Inclusion, and Belonging" on the front page of their website, which was once attractive to investors but now seems mostly to attract the attention of our nation's increasingly vindictive federal government. Still, Oxy is sticking to a corporate strategy that involves acknowledging climate change as real, which I suppose counts as refreshing. From a 2025 annual report:

Oxy is building an integrated portfolio of low-carbon projects, products, technologies and companies that complement our existing businesses; leveraging our competitive advantages in CO2 EOR, reservoir management, drilling, essential chemicals and major infrastructure projects; and are designed to sustain long term shareholder value as we work to implement our Net-Zero Strategy.

Yes, Oxy has made achieving net-zero carbon a major part of their brand, and yes, this model of reducing carbon emissions relies heavily on CO2 EOR: the extraction of CO2 from the ground.

In a faltering effort to address carbon emissions, the United States has leaned heavily on the promise of Carbon Capture and Storage (CCS) technologies. The idea is to take CO2 out of the environment (potentially by separating it from the air but, more practically, by capturing it in places where it is already concentrated by industrial processes) and to put it somewhere else. Yes, this has shades of the Australian television sketch about the ship whose front fell off, but the key to "sequestration" is time. If we can put enough carbon somewhere that it will stay for enough time, we can reduce the "active" greenhouse gas content of our environment. The main way we have found of doing this is injecting it deep underground. How convenient, then, that the oil industry is already looking for CO2 for EOR.

CCS has struggled in many ways, chief among them that the majority of planned CCS projects have never been built. As with most of our modern carbon reduction economy, even the CCS that has been built is, well, a little bit questionable. There is something of a Faustian bargain with fossil fuels. As we speak, about 45 megatons of CO2 are captured from industrial processes each year for CCS. Of that 45 Mt, 9 Mt are injected into dedicated CO2 sequestration projects. The rest, 80%, is purchased by the oil industry for use in EOR.

This form of CCS, in which the captured CO2 is applied to an industrial process that leads to the production of more CO2, has taken to the name CCUS. That's Carbon Capture, Utilization, and Storage. Since the majority of the CO2 injected for EOR never comes back up, it is a form of sequestration. Although the additional oil produced will generally be burned, producing CO2, the process can be said to be inefficient in terms of CO2. In other words, the CO2 produced by burning oil from EOR is less in volume than the CO2 injected to stimulate recovery of that oil.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


Mathematically, CCUS, the use of CO2 to produce oil, leads to a net reduction in released CO2. Philosophically, though, it is deeply unsatisfying. This is made all the worse by the fact that CCUS has benefited from significant government support. Outright subsidies for CCS are uncommon, although they do exist. What are quite common are grants and subsidized financing for the capital costs of CCS facilities. Nearly all CCS in the US has been built with some degree of government funding, totaling at least four billion dollars, and regulatory requirements for CCS to offset new fossil fuel plants may create a de facto electrical ratepayer subsidy for CCS. Most of that financial support, intended for our low-carbon future, goes to the oil producers.

The Permian Basin is well-positioned for CCS EOR because it produces mostly natural gas. Natural gas in its raw form, "well gas," almost always includes CO2. Natural gas processing plants separate the combustible gases from noncombustible ones, producing natural gas that has a higher energy content and burns more cleanly—but, in the process, venting large quantities of CO2 into the atmosphere. Oxy is equipping its Permian Basin natural gas plants with a capture system that collects the CO2 and compresses it for use in EOR.

The problem is that CO2 consumption for EOR has, as always, outpaced production. There aren't enough carbon capture systems to supply the Permian Basin fields, so "sequestered" CO2 is mixed with "new" CO2. Bravo Dome CO2 production has slowly declined since the 1990s, due mostly to declining oil prices. Even so, northeastern New Mexico is still full of Oxy wells bringing up CO2 by the millions of cubic feet. 218 miles of pipeline deliver Bueyeros CO2 into West Texas, and 120 miles of pipeline the other way land it in the oil fields of Wyoming. There is very nearly one producing CO2 well per person in Harding County.

Considering the totality of the system, it appears that government grants, financing incentives, and tax credits for CCS are subsidizing not only natural gas production but the extraction of CO2 itself. Whether this is progress on climate change or a complete farce depends on a mathematical analysis. CO2 goes in, from several different sources; CO2 goes out, to several different dispositions. Do we remove more from the atmosphere than we end up putting back? There isn't an obvious answer.
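
Just to show the shape of that analysis, here's a toy version of the balance. Every number is a placeholder I made up, not a measurement from any real project; the structure of the accounting is the only point:

    # Toy CO2 balance for a CCUS EOR project. All figures are illustrative
    # placeholders, not data from any real project.
    captured_from_plant = 1_000_000  # tonnes/yr captured that would otherwise be vented
    new_co2_from_wells = 400_000     # tonnes/yr of "new" CO2 extracted to make up supply
    retained_fraction = 0.95         # share of injected CO2 that stays underground
    incremental_barrels = 3_000_000  # additional barrels/yr of oil attributable to EOR
    co2_per_barrel = 0.43            # tonnes CO2 emitted when a barrel is eventually burned

    # The atmosphere's view: captured CO2 that stays down is an avoided emission,
    # well CO2 that escapes is a new emission, and the incremental oil gets burned.
    net_removed = (captured_from_plant * retained_fraction
                   - new_co2_from_wells * (1 - retained_fraction)
                   - incremental_barrels * co2_per_barrel)

    print(f"net CO2 removed from the atmosphere: {net_removed:,.0f} tonnes/yr")
    # With these made-up numbers the balance comes out negative; change the
    # assumptions and the sign flips, which is why there is no obvious answer.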

The oil industry maintains that CCS is one of the most practical means of reducing carbon emissions, with more CO2 injected than produced and a resulting reduction in the "net CO2 impact" of the product natural gas.

As for more independent researchers, well, a paper finding that CCS EOR "cannot contribute to reductions" isn't the worst news. A 2020 literature review of reports on CCS EOR projects found that they routinely fail to account for significant secondary carbon emissions and that, due to a mix of the construction and operational realities of CCS EOR facilities and the economics of oil consumption, CCS EOR has so far produced a modest net increase in greenhouse gas emissions.

They're still out there today, drilling for carbon dioxide. The reports from the petroleum institute today say that the Permian Basin might need even more shipped in. New Mexico is an oil state; Texas gets the reputation but New Mexico has the numbers. Per-capita oil production here is significantly higher than Texas and second only to North Dakota. New Mexico now produces more oil than Old Mexico, if you will, the country to our south.

Per capita, New Mexico ranks 12th for CO2 emissions, responsible for about 1% of the nation's total. Well, I can do a bit better: for CO2 intentionally extracted from the ground, New Mexico is #3, behind only Colorado and Mississippi for total production. We produce something around 17% of the nation's supply of extracted CO2, and we even use most of it locally. I guess that's something you could put a good spin on.

  1. By this time, Armand Hammer was no longer CEO of Occidental, which is unfortunate since it deprives me of an excuse to talk at length about how utterly bizarre Armand Hammer was, and about the United World College he founded in Las Vegas, NM. Suffice it to say, for now, that Occidental had multiple connections to New Mexico.

air traffic control: the IBM 9020

Previously on Computers Are Bad, we discussed the early history of air traffic control in the United States. The technical demands of air traffic control are well known in computer history circles because of the prominence of SAGE, but what's less well known is that SAGE itself was not an air traffic control system at all. SAGE was an air defense system, designed for the military with a specific task of ground-controlled interception (GCI). There is natural overlap between air defense and air traffic control: for example, both applications require correlating aircraft identities with radar targets. This commonality led the Federal Aviation Agency (precursor to today's FAA) to launch a joint project with the Air Force to adapt SAGE for civilian ATC.

There are also significant differences. In general, SAGE did not provide any safety functions. It did not monitor altitude reservations for uniqueness, it did not detect loss of separation, and it did not integrate instrument procedure or terminal information. SAGE would need to gain these features to meet FAA requirements, particularly given the mid-century focus on mid-air collisions (a growing problem, with increasing air traffic, that SAGE did nothing to address).

The result was a 1959 initiative called SATIN, for SAGE Air Traffic Integration. Around the same time, the Air Force had been working on a broader enhancement program for SAGE known as the Super Combat Center (SCC). The SCC program was several different ideas grouped together: a newer transistorized computer to host SAGE, improved communications capabilities, and the relocation of Air Defense Direction Centers from conspicuous and vulnerable "SAGE Blockhouses" to hardened underground command centers, specified as an impressive 200 PSI blast overpressure resistance (for comparison, the hardened telecommunication facilities of the Cold War were mostly specified for 6 or 10 PSI).

At the program's apex, construction of the SCCs seemed so inevitable that the Air Force suspended the original SAGE project under the expectation that SCC would immediately obsolete it. For example, my own Albuquerque was one of the last Air Defense Sectors scheduled for installation of a SAGE computer. That installation was canceled; while a hardened underground center had never been in the cards for Albuquerque, the decision was made to otherwise build Albuquerque to the newer SCC design, including the transistorized computer. By the same card, the FAA's interest in a civilian ATC capability, and thus the SATIN project, came to be grouped together with the SCC program as just another component of SAGE's next phase of development.

SAGE had originally been engineered by MIT's Lincoln Laboratory, then the national center of expertise in all things radar. By the late 1950s a large portion of the Lincoln Laboratory staff were working on air defense systems and specifically SAGE. Those projects had become so large that MIT opted to split them off into a new organization, which through some obscure means came to be called the MITRE Corporation. MITRE was to be a general military R&D and consulting contractor, but in its early years it was essentially the SAGE company.

The FAA contracted MITRE to deliver the SATIN project, and MITRE subcontracted software to the Systems Development Corporation, originally part of RAND and among the ancestors of today's L3Harris. For the hardware, MITRE had long used IBM, who designed and built the original AN/FSQ-7 SAGE computer and its putative transistorized replacement, the AN/FSQ-32. MITRE began a series of engineering studies, and then an evaluation program on prototype SATIN technology.

There is a somewhat tenuous claim that you will oft see repeated, that the AN/FSQ-7 is the largest computer ever built. It did occupy the vast majority of the floorspace of the four-story buildings built around it. The power consumption was around 3 MW, and the heat load required an air conditioning system at the very frontier of HVAC engineering (you can imagine that nearly all of that 3 MW had to be blown out of the building on a continuing basis). One of the major goals of the AN/FSQ-32 was reduced size and power consumption, with the lower heat load in particular being a critical requirement for installation deep underground. Of course, the "deep underground" part more than wiped out any savings from the improved technology.

From Air Defense to Air Traffic Control

By the late 1950s, enormous spending for the rapid built-out of defense systems including SAGE and the air defense radar system (then the Permanent System) had fatigued the national budget and Congress. The winds of the Cold War had once again changed. In 1959, MITRE had begun operation of a prototype civilian SAGE capability called CHARM, the CAA High Altitude Remote Monitor (CAA had become the FAA during the course of the CHARM effort). CHARM used MIT's Whirlwind computer to process high-altitude radar data from the Boston ARTCC (Air Route Traffic Control Center), which it displayed to operators while continuously evaluating aircraft movements for possible conflicts. CHARM was designed for interoperability with SAGE, the ultimate goal being the addition of the CHARM software package to existing SAGE computers. None of that would ever happen; by the time the ball dropped for the year 1960 the Super Combat Center program had been almost completely canceled. SATIN, and the whole idea of civilian air traffic control with SAGE, became blast damage.

In 1961, the Beacon Report concluded that there was an immediate need for a centralized, automated air traffic control system. Mid-air collisions had become a significant political issue, subject of congressional hearings and GAO reports. The FAA seemed to be failing to rise to the task of safe civilian ATC, a perilous situation for such a new agency... and after the cancellation of the SCCs, the FAA's entire plan for computerized ATC was gone.

During the late 1950s and 1960s, the FAA adopted computer systems in a piecemeal fashion. Many enroute control centers (ARTCCs), and even some terminal facilities, had some type of computer system installed. These were often custom software running on commodity computers, limited to tasks like recording flight plans and making them available to controllers at other terminals. Correlation of radar targets with flight plans was generally manual, as were safety functions like conflict detection.

These systems were limited in scale—the biggest problem being that some ARTCCs remained completely manual even in the late 1960s. On the upside, they demonstrated much of the technology required, and provided a test bed for implementation. Many of the individual technical components of ATC were under development, particularly within IBM and Raytheon, but there was no coordinated nationwide program. This situation resulted in part from a very intentional decision by the FAA to grant more decision making power to its regional offices, a concept that was successful in some areas but in retrospect disastrous in others. In 1967, the Department of Transportation was formed as a new cabinet-level executive department. The FAA, then the Federal Aviation Agency, was reorganized into DOT and renamed the Federal Aviation Administration. The new Administration had a clear imperative from both the President and Congress: figure out air traffic control.

In the late 1960s, the FAA coined a new term: the National Airspace System 1, a fully standardized, nationwide set of procedures and systems that would safely coordinate air traffic into the indefinite future. Automation of the NAS began with NAS Enroute Stage A, which would automate the ARTCCs that handled high-altitude aircraft on their way between terminals. The remit was more or less "just like SAGE but with the SATIN features," and when it came to contracting, the FAA decided to cut out the middlemen and go directly to the hardware manufacturer: IBM.

The IBM 9020

It was 1967 by the time NAS Enroute Stage A was underway, nearly 20 years since SAGE development had begun. IBM would thus benefit from considerable advancements in computer technology in general. Chief among them was the 1964 introduction of the System/360. S/360 was a milestone in the development of the computer: a family of solid-state, microcoded computers with a common architecture for software and peripheral interconnection. S/360's chief designer, Gene Amdahl, was a genius of computer architecture who developed a particular interest in parallel and multiprocessing systems. Soon after the S/360 project, he left IBM to start the Amdahl Corporation, briefly one of IBM's chief competitors. During his short 1960s tenure at IBM, though, Amdahl contributed IBM's concept of the "multisystem."

A multisystem consisted of multiple independent computers that operated together as a single system. There is quite a bit of conceptual similarity between the multisystem and modern concepts like multiprocessing and distributed computing, but remember that this was the 1960s, and engineers were probing out the possibilities of computer-to-computer communication for the first time. Some of the ideas of S/360 multisystems read as strikingly modern and prescient of techniques used today (like atomic resource locking for peripherals and shared memory), while others are more clearly of their time (the general fact that S/360 multisystems tended to assign their CPUs exclusively to a specific task).

One of the great animating tensions of 1960s computer history is the ever-moving front between batch processing systems and realtime computing systems. IBM had its heritage in manufacturing unit record data processing machines, in which a physical stack of punched cards was the unit of work, and input and output ultimately occurred between humans on two sides of a service window. IBM computers were designed around the same model: a "job" was entered into the machine, stored until it reached the head of the queue, processed, and then the output was stored for later retrieval. One could argue that all computers still work this way, it's just process scheduling, but IBM had originally envisioned job queuing times measured in hours rather than milliseconds.

The batch model of computing was fighting a battle on multiple fronts: rising popularity of time-sharing systems meant servicing multiple terminals simultaneously and, ideally, completing simple jobs interactively while the user waited. Remote terminals allowed clerks to enter and retrieve data right where business transactions were taking place, and customers standing at ticket counters expected prompt service. Perhaps most difficult of all, fast-moving airplanes and even faster-moving missiles required sub-second decisions by computers in defense applications.

IBM approached the FAA's NAS Enroute Stage A contract as one that required a real-time system (to meet the short timelines necessary in air traffic control) and a multisystem (to meet the FAA's exceptionally high uptime and performance requirements). They also intended to build the NAS automation on an existing, commodity architecture to the greatest extent possible. The result was the IBM 9020.

IBM 9020 concept diagram

The 9020 is a fascinating system, exemplary of so many of the challenges and excitement of the birth of the modern computer. On the one hand, a 9020 is a sophisticated, fault-tolerant, high-performance computer system with impressive diagnostic capabilities and remarkably dynamic resource allocation. On the other hand, a 9020 is just six to seven S/360 computers married to each other with a vibe that is more duct tape and baling wire than aerospace aluminum and titanium.

The first full-scale 9020 was installed in Jacksonville, Florida, late in 1967. Along with prototype systems at the FAA's experimental center and at Raytheon (due to the 9020's close interaction with Raytheon-built radar systems), the early 9020 computers served as development and test platforms for a complex and completely new software system written mostly in JOVIAL. JOVIAL isn't a particularly well-remembered programming language; it was based on ALGOL, with modifications to better suit real-time computer systems. The Air Force was investing extensively in real-time computing capabilities for air defense and JOVIAL was, for practical purposes, an Air Force language.

It's not completely clear to me why IBM selected JOVIAL for enroute stage A, but we can make an informed guess. There were very few high-level programming languages that were suitable for real-time use at all in the 1960s, and JOVIAL had been created by Systems Development Corporation (the original SAGE software vendor) and widely used for both avionics and air defense. The SCC project, if it had been completed, would likely have involved rewriting large parts of SAGE in JOVIAL. For that reason, JOVIAL had been used for some of the FAA's earlier ATC projects including SATIN. At the end of the day, JOVIAL was probably an irritating (due to its external origin) but obvious choice for IBM.

More interesting than the programming language is the architecture of the 9020. It is, fortunately, well described in various papers and a special issue of IBM Systems Journal. I will simplify IBM's description of the architecture to be more legible to a modern reader who hasn't worked for IBM for a decade.

Picture this: seven IBM S/360 computers, of various models, are connected to a common address and memory bus used for interaction with storage. These computers are referred to as Compute Elements and I/O Control Elements, forming two pools of machines dedicated to two different sets of tasks. Also on that bus are something like 10 Storage Elements, specialized machines that function like memory controllers with additional features for locking, prioritization, and diagnostics. These Storage Elements provide either 131 kB or about 1 MB of memory each; due to various limitations the maximum possible memory capacity of a 9020 is about 3.4 MB, not all of which is usable at any given time due to redundancy.

At least three Compute Elements, and up to four, serve as the general-purpose part of the system where the main application software is executed. Three I/O Control Elements serve mostly as "smart" controllers for peripherals connected to their "channels," IBM parlance for what we might now call an expansion bus.

The 9020 received input from a huge number of sources (radar digitizers, teletypes at airlines and flight service stations, controller workstations, other ARTCCs). Similarly, it sent output to most of these endpoints as well. All of these communications channels, with perhaps the exception of the direct 9020-to-9020 links between ARTCCs, were very slow even by the standards of the time. The I/O Control Elements each used two of their high-speed channels for interconnection with display controllers (discussed later) and tape drives in the ARTCC, while the third high-speed channel connected to a multiplexing system called the Peripheral Adapter Module that connected the computer to dozens of peripherals in the ARTCC and leased telephone lines to radar stations, offices, and other ATC sites.

Any given I/O Control Element had a full-time job of passing data between peripherals and storage elements, with steps to validate and preprocess data. In addition to ATC-specific I/O devices, the I/O Control Elements also used their Peripheral Adapter Modules to communicate with the System Console. The System Console is one of the most distinctive properties of the 9020, and one of the achievements of which IBM seems most proud.

Multisystem installations of S/360s were not necessarily new, but the 9020 was one of the first attempts to present a cluster of S/360s as a single unified machine. The System Console manifested that goal. It was, on first glance, not that different from the operator's consoles found on each of the individual S/360 machines. It was much more than that, though: it was the operator's console for all seven of them. During normal 9020 operation, a single operator at the system console could supervise all components of the system through alarms and monitors, interact with any element of the system via a teletypewriter terminal, and even manually interact with the shared storage bus for troubleshooting and setup. The significance of the System Console's central control was such that the individual S/360 machines, when operating as part of the Multisystem, disabled their local operator's consoles entirely.

One of the practical purposes of the System Console was to manage partitioning of the system. A typical 9020 had three compute elements and three I/O control elements; an especially large system could have a fourth compute element for added capacity. The system was sized to provide 50% redundancy during peak traffic. In other words, a 9020 could run the full normal ATC workload on just two of the compute elements and two of the I/O control elements. The remaining elements could be left in a "standby" state in which the multisystem would automatically bring them online if one of the in-service elements failed, and this redundancy mechanism was critical to meeting the FAA's reliability requirement. You could also use the out-of-service elements for other workloads, though.

For example, you could remove one of the S/360s from the multisystem and then operate it manually or run "offline" software. An S/360 operating this way is described as "S/360 compatibility mode" in IBM documentation, since it reduces the individual compute element to a normal standalone computer. IBM developed an extensive library of diagnostic tools that could be run on elements in standby mode, many of which were only slight modifications of standard S/360 tools. You could also use the offline machines in more interesting ways, by bringing up a complete ATC software chain running on a smaller number of elements. For training new controllers, for example, one compute element and one I/O control element could be removed from the multisystem and used to form a separate partition of the machine that operated on recorded training data. This partition could have its own assigned peripherals and storage area and largely operate as if it were a complete second 9020.

Multisystem Architecture

You probably have some questions about how IBM achieved these multisystem capabilities, given the immature state of operating systems design at the time. The 9020 used an operating system derived from OS/360 MVT, an advanced form of OS/360 with a multitasking capability that was state-of-the-art in the mid-1960s but nonetheless very limited and with many practical problems. Fortunately, IBM was not exactly building a general-purpose machine, but a dedicated system with one function. This allowed the software to be relatively simple.

The core of the 9020 software system is called the control program, which is similar to what we would call a scheduler today. During routine operation of the 9020, any of the individual computers might begin execution of the control program at any time—typically either because the computer's previous task was complete (along the lines of cooperative multitasking) or because an interrupt had been received (along the lines of preemptive multitasking). To meet performance and timing requirements, especially with the large number of peripherals involved, the 9020 extensively used interrupts which could either be generated and handled within a specific machine or sent across the entire multisystem bus.

The control program's main function is to choose the next task to execute. Since it can be started on any machine at any time, it must be reentrant. The fact that all of the machines have shared memory simplifies the control program's task, since it has direct access to all of the running programs. Shared memory also adds complexity: the control program has to implement locking and conflict detection to ensure that it doesn't start the same task on multiple machines at once, or start multiple tasks that will require interaction with the same peripheral.
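
In modern terms, that's just a shared work queue protected by locks, with every processor running the same dispatch code. A toy sketch, in Python rather than anything resembling the actual JOVIAL and assembly, looks about like this:

    # Toy analogue of the control program's dispatch loop: several "compute
    # elements" (threads here) all run the same reentrant scheduler, taking tasks
    # from shared storage under locks so no task or peripheral is claimed twice.
    import threading, queue, time

    task_queue = queue.Queue()  # stands in for the shared task list in storage
    peripheral_locks = {"tape0": threading.Lock(), "display0": threading.Lock()}

    def control_program(element_name):
        """Reentrant dispatcher: any element may run this at any time."""
        while True:
            try:
                task = task_queue.get(timeout=0.5)
            except queue.Empty:
                return  # nothing left to do; this element goes idle
            name, peripheral, work = task
            # Conflict detection: don't start a task whose peripheral is busy.
            lock = peripheral_locks.get(peripheral)
            if lock and not lock.acquire(blocking=False):
                task_queue.put(task)  # requeue it and look for other work
                continue
            try:
                work()
            finally:
                if lock:
                    lock.release()
            print(f"{element_name} completed {name}")

    # Three compute elements running the same dispatcher over a shared queue:
    for i in range(3):
        task_queue.put((f"track-update-{i}", "display0", lambda: time.sleep(0.1)))
    elements = [threading.Thread(target=control_program, args=(f"CE{n}",)) for n in range(3)]
    for t in elements:
        t.start()
    for t in elements:
        t.join()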

You might wonder about how, exactly, the shared memory was implemented. The storage elements were not complete computers, but did implement features to prevent conflicts between simultaneous access by two machines, for example. By necessity, all of the memory management used for the multisystem is quite simple. Access conflicts were resolved by choosing one machine and making the other wait until the next bus cycle. Each machine had a "private" storage area, called the preferential storage area. A register on each element contained an offset added to all memory addresses that ensured the preferential storage areas did not overlap. Beyond that, all memory had to be acquired by calling system subroutines provided by the control program, so that the control program could manage memory regions. Several different types of memory allocations were available for different purposes, ranging from arbitrary blocks for internal use by programs to shared buffer areas that multiple machines could use to queue data for an I/O Control Element to send elsewhere.
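
The address arithmetic itself is about as simple as memory management gets: add the offset register to every address. A sketch of the idea, with made-up sizes and layout:

    # Sketch of preferential-storage addressing: each element has an offset
    # register added to every address, so each element's "private" area lands in
    # a different storage element. Sizes and layout are invented for illustration.
    STORAGE_ELEMENT_SIZE = 131_072  # bytes per (small) storage element

    class ComputeElement:
        def __init__(self, name, psa_base):
            self.name = name
            self.psa_base = psa_base  # contents of the offset register

        def physical_address(self, logical_address):
            # Every access is relocated by the offset register.
            return self.psa_base + logical_address

        def storage_element_for(self, logical_address):
            return self.physical_address(logical_address) // STORAGE_ELEMENT_SIZE

    ce1 = ComputeElement("CE1", psa_base=0)
    ce2 = ComputeElement("CE2", psa_base=STORAGE_ELEMENT_SIZE)

    # Both elements use the same logical address for their own preferential area,
    # but it lands in different storage elements, so they never collide.
    print(ce1.storage_element_for(0x100))  # 0
    print(ce2.storage_element_for(0x100))  # 1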

At any time during execution of normal programs, an interrupt could be generated indicating a problem with the system (IBM gives the examples of high temperature or loss of A/C power detected in one of the compute elements). Whenever the control program began execution, it would potentially detect other error conditions using its more advanced understanding of the state of tasks. For example, the control program might detect that a program has exited abnormally, or that allocation of memory has failed, or that an I/O operation has timed out without completing. All of these situations constitute operational errors, and result in the Control Program ceding execution to the Operational Error Analysis Program or OEAP.

The OEAP is where error-handling logic lives, but also a surprising portion of the overall control of the multisystem. The OEAP begins by performing self-diagnosis. Whatever started the OEAP, whether the control program or a hardware interrupt, is expected to leave some minimal data on the nature of the failure in a register. The OEAP reads that register and then follows an automated data-collection procedure that could involve reading other registers on the local machine, requesting registers from other machines, and requesting memory contents from storage elements. Based on the diagnosis, the OEAP has different options: some errors are often transient (like communications problems), so the OEAP might do nothing and simply allow the control program to start the task again.

On the other hand, some errors could indicate a serious problem with a component of the system, like a storage element that is no longer responding to read and write operations in its address range. In those more critical cases, the OEAP will rewrite configuration registers on the various elements of the system and then reset them—and on initialization, the configuration registers will cause them to assume new states in terms of membership in the multisystem. In this way, the OEAP is capable of recovering from "solid" hardware failures by simply reconfiguring the system to no longer use the failed hardware. Most of the time, that involves changing the failed element's configuration from "online" to "offline," and choosing an element in "online standby" and changing its configuration to "online." During the next execution of the control program, it will start tasks on the newly "online" element, and the newly "offline" element may as well have never existed.
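
Stripped of the register-level details, that reconfiguration step is a small state machine over the elements. A toy sketch, with invented states and data structures rather than IBM's actual configuration registers:

    # Toy sketch of OEAP-style reconfiguration: a failed element goes offline and
    # an online-standby element of the same kind takes its place.
    elements = {
        "CE1": {"kind": "compute", "state": "online"},
        "CE2": {"kind": "compute", "state": "online"},
        "CE3": {"kind": "compute", "state": "online standby"},
        "IOCE1": {"kind": "io", "state": "online"},
        "IOCE2": {"kind": "io", "state": "online"},
        "IOCE3": {"kind": "io", "state": "online standby"},
    }

    def reconfigure_after_failure(failed):
        """Take the failed element out of service and promote a standby, if any."""
        kind = elements[failed]["kind"]
        elements[failed]["state"] = "offline"
        for name, info in elements.items():
            if info["kind"] == kind and info["state"] == "online standby":
                info["state"] = "online"
                return name  # the element that will pick up the workload
        return None  # no spare left: run degraded and raise an alarm

    replacement = reconfigure_after_failure("CE2")
    print(f"CE2 offline, {replacement} brought online")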

The details are, of course, a little more complex. In the case of a failed storage element, for example, there's a problem of memory addresses. The 9020 multisystem doesn't have virtual memory in the modern sense; addresses are more or less absolute (ignoring some logical addressing available for specific types of memory allocations). That means that if a storage element fails, any machines which have been using memory addresses within that element will need to have a set of registers for memory address offsets reconfigured and then execution reset. Basically, by changing offsets, the OEAP can "remap" the memory in use by a compute or I/O control element to a different storage element. Redundancy is also built into the software design to make these operations less critical. For example, some important parts of memory are stored in duplicate with an offset between the two copies large enough to ensure that they will never fall on the same physical storage element.

So far we have only really talked about the "operational error" part, though, and not the "analysis." In the proud tradition of IBM, the 9020 was designed from the ground up for diagnosis. A considerable part of IBM's discussion of the architecture of the Control Program, for example, is devoted to its "timing analysis" feature. That capability allows the Control Program to commit to tape a record of when each task began execution, on which element, and how long it took. The output is a set of start and duration times, with task metadata, remarkably similar to what we would now call a "span" in distributed tracing. Engineers used these records to analyze the performance of the system and more accurately determine load limits such as the number of in-air flights that could be simultaneously tracked. Of course, details of the timing analysis system remind us that computers of this era were very slow: the resolution on task-start timestamps was only 1/2 second, although durations were recorded at a comparatively precise 1/60th of a second.
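A timing-analysis record of this kind is easy to imagine; here's a sketch with hypothetical field names, using the resolutions IBM gives (half-second starts, 1/60th-second durations), expressed the way a modern tracing system would express a span.

```python
from dataclasses import dataclass

@dataclass
class TimingRecord:
    task: str
    element: str
    start_halves: int        # start time, in units of 1/2 second
    duration_sixtieths: int  # duration, in units of 1/60 second

    def as_span(self):
        """Express the record the way a modern tracing system would."""
        start_s = self.start_halves * 0.5
        return {
            "name": self.task,
            "resource": self.element,
            "start": start_s,
            "end": start_s + self.duration_sixtieths / 60.0,
        }

print(TimingRecord("radar-correlate", "CE-1", start_halves=7, duration_sixtieths=12).as_span())
```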

That was just the control program, though, and because timing analysis data tended to be produced faster than the tape drives could write it (even on these slow computers, it had to fit within a buffer memory area for practical purposes), the feature was only enabled as needed. The OEAP provided long-term analysis of the performance of the entire machine. Whenever the OEAP was invoked, even if it determined that a problem was transient or "soft" and took no action, it would write statistical records of the nature of the error and the involved elements. When the OEAP detected an unusually large number of soft errors from the same physical equipment, it would automatically reconfigure the system to remove that equipment from service and then generate an alarm.
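The bookkeeping is simple enough to sketch: count soft errors per physical element and, past some threshold (the number below is invented, not IBM's), reconfigure the offender out of service and raise an alarm.

```python
from collections import Counter

SOFT_ERROR_LIMIT = 8      # illustrative threshold, not IBM's figure
soft_errors = Counter()

def record_soft_error(element, config):
    soft_errors[element] += 1
    if soft_errors[element] >= SOFT_ERROR_LIMIT and config[element] == "online":
        config[element] = "offline"          # remove the suspect equipment
        raise_alarm(element)

def raise_alarm(element):
    print(f"ALARM: {element} removed from service after repeated soft errors")

config = {"SE-2": "online"}
for _ in range(SOFT_ERROR_LIMIT):
    record_soft_error("SE-2", config)
```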

Alerts generated by the OEAP were recorded by a printer connected to the System Console, and indicated by lights on the System Console. A few controls on the System Console allowed the operator manual intervention when needed, for example to force a reconfiguration.

One of the interesting aspects of the OEAP is where it runs. The 9020 multisystem is truly a distributed one in that there is no "leader." The control program, as we have discussed, simply starts on whichever machine is looking for work. In practice, it may sometimes run simultaneously on multiple machines, which is acceptable as it implements precautions to prevent stepping on its own toes.

This model is a little more complex for the OEAP, because it deals specifically with failures. Consider a specific failure scenario: loss of power. IBM equipped each of the functional components of the 9020 with a battery backup, but they only rate the battery backup for 5.5 seconds of operation. That isn't long enough for a generator to reliably pick up the load, so this isn't a UPS as we would use today. It's more of a dying gasp system: the computer can "know" that it has lost power and continue to operate long enough to stabilize the state of the system for faster resumption.

When a compute element or I/O control element loses power, an interrupt is generated within that single machine that starts the OEAP. The OEAP performs a series of actions, which include generating an interrupt across the entire system to trigger reconfiguration (it is possible, even likely given the physical installations, that the power loss is isolated to the single machine) and resetting task states in the control program so that the machine's tasks can be restarted elsewhere. The OEAP also informs the system console and writes out records of what has happened. Ideally, this all completes in 5.5 seconds while battery power remains reliable.

In the real world, there could be problems that lead to slow OEAP execution, or the batteries could fail to last long enough, or for that matter the compute element could encounter some kind of fundamentally different problem. The fact that the OEAP is executing on a machine means that something has gone wrong, and so until the OEAP completes analysis, the machine that it is running on should be considered suspect. The 9020 resolves this contradiction through teamwork: the beginning of OEAP execution on any machine in the total system generates an interrupt that starts the OEAP on other machines in a "time-down" mode. The "time-down" OEAPs wait for a random time interval and then check the shared memory to see if the original OEAP has marked its execution as completed. If not, the first OEAP to complete its time-down timer will take over OEAP execution and attempt to complete diagnostics from afar. That process can, potentially, repeat multiple times: in a scenario where two of the three compute elements have failed, the remaining third element will eventually give up on waiting for the first two and run the OEAP itself. In theory, someone will eventually diagnose every problem. IBM asserts that system recovery should always complete within 30 seconds.
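The arrangement is easier to see in a sketch. This is a loose Python rendering of the idea, not IBM's logic, with timings pulled out of thin air: the machine that hit the error runs the OEAP and sets a completion flag in shared storage, while every other machine waits a random "time-down" interval and takes over only if the flag never appears.

```python
import random
import time

shared = {"oeap_complete": False}   # stands in for a flag in a Storage Element

def run_oeap(machine):
    print(f"{machine}: running diagnosis")
    # ... analysis, reconfiguration, and logging would happen here ...
    shared["oeap_complete"] = True

def oeap_time_down(machine, max_wait_s=2.0):
    # Other machines enter "time-down" mode: wait a random interval, then
    # take over only if the original OEAP never marked itself complete.
    time.sleep(random.uniform(0.1, max_wait_s))
    if not shared["oeap_complete"]:
        print(f"{machine}: original OEAP never finished, taking over")
        run_oeap(machine)
```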

Let's work a couple of practical examples, to edify our understanding of the Control Program and OEAP. Say that a program running on a Compute Element sets up a write operation for an I/O Control Element, which formats and sends the data to a Peripheral Adapter Module which attempts to send it to an offsite peripheral (say an air traffic control tower teleprinter) but fails. A timer that tracks the I/O operation will eventually expire, triggering the OEAP on the I/O control element running the task. The OEAP reads out the error register on the machine it now finds itself running on, discovers that it is an I/O problem related to a PAM, and then speaks over the channel to request the value of state registers from the PAM. These registers contain flags for various possible states of peripheral connections, and from these the OEAP can determine that sending a message has failed because there was no response. These types of errors are often transient, due to telephone network trouble or bad luck, so the OEAP increments counters for future reference, looks up the application task that tried to send the message and changes its state to incomplete, clears registers on the PAM and I/O control element, and then hands execution back to the Control Program. The Control Program will most likely attempt to do the exact same thing over again, but in the case of a transient error, it'll probably work this time.
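The transient branch of that flow fits in a few lines; here's a rough rendering with invented names (the counters, task table, and device identifiers are all mine):

```python
from collections import Counter

transient_errors = Counter()

def handle_io_timeout(pam, task, tasks):
    transient_errors[pam] += 1          # statistics for future analysis
    tasks[task] = "incomplete"          # the Control Program will restart it
    clear_device_registers(pam)         # then hand execution back

def clear_device_registers(pam):
    print(f"cleared registers on {pam}")

tasks = {"send-to-tower-tty": "running"}
handle_io_timeout("PAM-3", "send-to-tower-tty", tasks)
```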

Consider a more severe case, where the Control Program starts a task on a Compute Element that simply never finishes. A timer runs down to detect this condition, and an interrupt at the end of the timer starts the Control Program, which checks the state and discovers the still-unfinished task. Throwing its hands in the air, the Control Program sets some flags in the error register and hands execution to the OEAP. The OEAP starts on the same machine, but also interrupts other machines to start the OEAP in time-down mode in case the machine is too broken to complete error handling. It then reads the error register and examines other registers and storage contents. Determining that some indeterminate problem has occurred with the Compute Element, the OEAP triggers what IBM confusingly calls a "logout" but we might today call a "core dump" (ironically an old term that was more appropriate in this era). The "logout" entails copying the contents of all of the registers and counters to the preferential storage area and then directing, via channel, one of the tape drives to write it all to a tape kept ready for this purpose—the syslog of its day. Once that's complete, the OEAP will reset the Compute Element and hand back to the Control Program to try again... unless counters indicate that this same thing has happened recently. In that case, the OEAP will update the configuration register on the running machine to change its status to offline, and choose a machine in online-standby. It will write to that machine's register, changing its status to online. A final interrupt causes the Control Program to start on both machines, taking them into their new states.

Lengthy FAA procedure manuals described what would happen next. These are unfortunately difficult to obtain, but from IBM documentation we know that basic information on errors was printed for the system operator. The system operator would likely then use the system console to place the suspicious element in "test" mode, which completely isolates it so that it behaves more or less like a normal S/360. At that point, the operator could use one of the tape drives attached to the problem machine to load IBM's diagnostic library and perform offline troubleshooting. The way the tape drives are hooked up to specific machines is important; in fact, since the OEAP is fairly large, it is only stored in one copy on one Storage Element. The 9020 requires that one of the tape drives always have a "system tape" ready with the OEAP itself, and low-level logic in the elements allows the OEAP to be read from the ready-to-go system tape in case the storage element that contains it fails to respond.

A final interesting note about the OEAP is a clever optimization called "problem program mode." During analysis and handling of an error, the OEAP can decide that the critical phase of error handling has ended and the situation is no longer time sensitive. For example, the OEAP might decide that no action is required except for updating statistics, which can comfortably happen with a slight delay. These lower-priority remaining tasks can be added to memory as "normal" application tasks, to be run by the Control Program like any other task after error handling is complete. Think of it as a deferral mechanism, to avoid the OEAP locking up a machine for any longer than necessary.
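A deferral like that might look something like this (a toy sketch, with a plain queue standing in for however the Control Program actually tracked pending tasks):

```python
from collections import deque

task_queue = deque()       # tasks the Control Program dispatches normally

def handle_error(error):
    critical_recovery(error)                          # must happen right now
    task_queue.append(("update-statistics", error))   # can comfortably wait

def critical_recovery(error):
    print(f"recovered from {error}")

handle_error("SE-2 parity error")
print(task_queue)
```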

For the sake of clarity, I'll note again an interesting fact by quoting IBM directly: "OEAP has sole responsibility for maintaining the system configuration." The configuration model of the 9020 system is a little unintuitive to me. Each machine has its own configuration register that tells it what its task is and whether it is online or offline (or one of several states in between like online-standby). The OEAP reconfigures the system by running on any one machine and writing the configuration registers of both the machine it's running on, and all of the other machines via the shared bus. Most reconfigurations happen because the OEAP has detected a problem and is working around it, but if the operator manually reconfigures the system (for example to facilitate testing or training), they also do so by triggering an interrupt that leads the Control Program to start the OEAP. The System Console has buttons for this, along with toggles to set up a sort of "main configuration register" that determines how the OEAP will try to set up the system.

The Air Traffic Control Application

This has become a fairly long article by my norms, and I haven't even really talked about air traffic control that much. Well, here it comes: the application that actually ran on the 9020, which seems to have had no particular name, besides perhaps Central Computer Complex (although this seems to have been adopted mostly to differentiate it from the Display Complex, discussed soon).

First, let's talk about the hardware landscape of the ARTCC and the 9020's role. An ARTCC handles a number of sectors, say around 30. Under the 9020 system, each of these sectors has three controllers associated with it, called the R, D, and A controllers. The R controller is responsible for monitoring and interpreting the radar, the D controller for managing flight plans and flight strips, and the A controller is something of a generalist who assists the other two. The three people sit at something like a long desk, made up of the R, D, and A consoles side by side.

Sector control consoles

The R console is the most recognizable to modern eyes, as its centerpiece is a 22" CRT plan-view radar display. The plan-view display (PVD) of the 9020 system is significantly more sophisticated than the SAGE PVD on which it is modeled. Most notably, the 9020 PVD is capable of displaying text and icons. No longer does a controller use a light gun to select a target for a teleprinter to identify; the "data blocks" giving basic information on a flight were actually shown on the PVD next to the radar contact. A trackball and a set of buttons even allowed the controller to select targets to query for more information or update flight data. This was quite a feat of technology even in 1970, and in fact one that the 9020 was not capable of. Well, it eventually was, but not initially.

The original NAS stage A architecture separated the air traffic control data function and radar display function into two completely separate systems. The former was contracted to IBM, the latter to Raytheon, owing to Raytheon's significant experience building similar systems for the military. Early IBM 9020 installations sat alongside a Raytheon 730 Display Channel, a very specialized system that was nearly as large as the 9020. The Display Channel's role was to receive radar contact data and flight information in digital form from the 9020, and convert it into drawing instructions sent over a high-speed serial connection to each individual PVD. A single Display Channel was responsible for up to 60 PVDs. Further complicating things, sector workstations were reconfigurable to handle changing workloads. The same sector might be displayed on multiple PVDs, and where sectors met, PVDs often overlapped so the same contact would be visible to controllers for both sectors. The Display Channel had a fairly complex task to get the right radar contacts and data blocks to the right displays, and in the right places.

Later on, the FAA opted to contract IBM to build a slightly more sophisticated version of the Display Channel that supported additional PVDs and provided better uptime. To meet that contract, IBM used another 9020. Some ARTCCs thus had two complete 9020 systems, called the Central Computer Complex (CCC) and the Display Channel Complex (DCC).

The PVD is the most conspicuous part of the controller console, but there's a lot of other equipment there, and the rest of it is directly connected to the 9020 (CCC). The R controller's position has a set of "hotkeys" for quickly entering flight data (like new altitudes) and a computer readout device (CRD), a CRT that displays 25x20 text for general output. For example, when a controller selects a target on the PVD to query for details, that query is sent to the 9020 CCC, which shows the result on the R controller's CRD above the PVD.

At the D controller's position, right next door, a large rack of slots for flight strips (small paper strips used to logically organize flight clearances, still in use today in some contexts) surrounds the D controller's CRD. The D controller also has a Computer Entry Device, or CED, a specialized keyboard that allows the D controller to retrieve and update flight plans and clearances based on requests from pilots or changes in the airspace situation. To their right, a modified teleprinter is dedicated to producing the flight strips that they arrange in front of them. Flight strips are automatically printed when an aircraft enters the sector, or when the controller enters changes. The A controller's position to the right of the flight strip printer is largely the same as the D controller's position, with another CRD and CED that operate independently from the D controller's—valuable during peak traffic.

While controller consoles are the most visible peripherals of the system, they're far from the only ones. Each 9020 system had an extensive set of teletypewriter circuits. Some of these were local; for example, the ATC supervisor had a dedicated TTY where they could not only interact with flight data (to assist a sector controller for example) but also interact with the status of the NAS automation itself (for example to query the status of a malfunctioning radar site and then remove it from use for PVDs).

Since the 9020 was also the locus of flight planning, TTYs were provided in air traffic control towers, terminal radar facilities, and even the dispatch offices of airlines. These allowed flight plans to be entered into the 9020 before the aircraft was handed off to enroute control. Flight service stations functioned more or less as the dispatch offices for general aviation, so they were similarly equipped with TTYs for flight plan management. In many areas, military controllers at air defense sectors were also provided with TTYs for convenient access to flight plans. Not least of all, each 9020 had high-speed leased lines to its neighboring 9020s. Flights passing from one ARTCC to the next had their flight strip "digitally passed" by transmission from one 9020 to the next.

A set of high-speed line printers connected to the 9020 printed diagnostic data as well as summary and audit reports on air traffic. Similar audit data, including a detailed record of clearances, was written to tape drives for future reference.

To organize the whole operation, IBM divided the software architecture of the system into the "supervisor state" and the "problem state." These are reasonably analogous to kernel and user space today, and "problem" is meant as in "the problem the computer solves" rather than "a problem has occurred." The Control Program and OEAP run in the supervisor state, everything else runs after the Control Program has set up a machine in the Problem State and started a given program.

IBM organized the application software into five modules, which they called the five Programs. These are Input Processing, Flight Processing, Radar Processing, Output Processing, and Liaison Management. Most of these are fairly self-explanatory, but the list reveals the remarkably asynchronous design of the system. Consider an example: we'll say a general aviation flight taking off from an airport inside one of the ARTCC's sectors.

The pilot first contacts a Flight Service Station, which uses their TTY to enter a flight plan into the 9020. Next, the pilot interacts with the control tower, which in the process of giving a takeoff clearance uses their TTY to inform the 9020 that the flight plan is active. They may also update the flight plan with the aircraft's planned movements shortly after takeoff, if they have changed due to operating conditions. The Input Processing program handles all of these TTY inputs, parsing them into records stored on a Storage Element. If any errors occur, like an invalid entry, those are also written to the Storage Element, where the Output Processing program picks them up and sends an appropriate message to the originating TTY. IBM notes that there were, as originally designed, about 100 types of input messages parsed by the input processing program.
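The flow is easy to picture as code. This is a toy sketch, not the FAA's message format (which I've invented wholesale here): parse the TTY message into a record on shared storage, and write any rejection where Output Processing will find it and answer the originating TTY.

```python
flight_plans = {}      # records kept on a Storage Element
output_queue = []      # picked up by the Output Processing program

def handle_tty_message(source_tty, text):
    fields = text.split("/")
    if len(fields) < 3:
        output_queue.append((source_tty, f"REJECT: {text}"))   # invalid entry
        return
    callsign, departure, route = fields[0], fields[1], fields[2:]
    flight_plans[callsign] = {"dep": departure, "route": route, "active": False}
    output_queue.append((source_tty, f"ACK {callsign}"))

handle_tty_message("FSS-ABQ", "N123AB/KABQ/TCS/ELP")
print(flight_plans, output_queue)
```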

As the aircraft takes off, it is detected by a radar site (such as a Permanent System radar or Air Route Surveillance Radar) which digitally encodes the radar contact (the digitizer itself being a Raytheon system) for transmission to the 9020. The Radar Processing program receives these messages, converts radial radar coordinates to the XY plane used by the system, correlates contacts with similar XY positions from multiple radar sites into a single logical contact, and computes each contact's apparent heading and speed to extrapolate future positions. Complicating things, the 9020 went into service during the development of secondary surveillance radar, also known as the transponder system 2. On appropriately equipped aircraft, the transponder provides altitude. The Radar Processing system makes an altitude determination on each aircraft, a slightly more complicated task than you might expect as, at the time, only some radar systems and some transponders provided altitude information. The Radar Processing program thus had to track whether it had altitude information at all and, if so, where from. In the meantime, the Radar Processing program tracked the state of the radar sites and reported any apparent trouble (such as loss of data or abnormal data) to the supervisor.
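Two of those steps are simple enough to sketch: converting a radial (range and azimuth) return to the system's XY plane, and extrapolating a future position from successive contacts. Units and the numbers in the example are illustrative only.

```python
import math

def radial_to_xy(site_x, site_y, range_nm, azimuth_deg):
    # Azimuth measured clockwise from north, the usual radar convention.
    theta = math.radians(azimuth_deg)
    return site_x + range_nm * math.sin(theta), site_y + range_nm * math.cos(theta)

def extrapolate(p_then, p_now, dt_s, ahead_s):
    """Estimate a future position from two successive contacts."""
    vx = (p_now[0] - p_then[0]) / dt_s
    vy = (p_now[1] - p_then[1]) / dt_s
    return p_now[0] + vx * ahead_s, p_now[1] + vy * ahead_s

p1 = radial_to_xy(0, 0, 42.0, 95.0)
p2 = radial_to_xy(0, 0, 41.2, 97.0)
print(extrapolate(p1, p2, dt_s=12.0, ahead_s=60.0))
```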


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


The Flight Processing program periodically evaluates all targets from the Radar Processing program against all filed flight plans, correlating radar targets with filed flight plans, calculating navigational deviations, and predicting future paths. Among other outputs, the Flight Processing program generated up-to-date flight strips for each aircraft and predicted their arrival times at each flight plan fix for controllers' planning purposes. The Flight Processing program hosted a set of rules used for safety protections, such as separation distances. This capability was fairly minimal during the 9020's original development, but was enhanced over time.
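The correlation step, stripped of all the real-world complications, is a nearest-match search with a tolerance; here's a sketch with invented data shapes and an invented tolerance:

```python
import math

def correlate(targets, predicted_positions, tolerance_nm=5.0):
    """targets and predicted_positions both map an id -> (x, y)."""
    matches = {}
    for tid, (tx, ty) in targets.items():
        best, best_d = None, tolerance_nm
        for callsign, (px, py) in predicted_positions.items():
            d = math.hypot(tx - px, ty - py)
            if d < best_d:
                best, best_d = callsign, d
        matches[tid] = best            # None if no flight plan is close enough
    return matches

print(correlate({"T1": (10.0, 4.0)}, {"N123AB": (11.0, 5.0), "N77X": (40.0, 2.0)}))
# {'T1': 'N123AB'}
```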

The Output Processing program had two key roles. First, it handled data that was specifically queued for it because of a reactive need to send data to a given output. For example, if someone made a data entry error or a controller queried for a specific aircraft's flight plan, the Input Processing program placed the resulting data in memory, where the Output Processing program would "find it" to format and send to the correct device. The Output Processing program also continuously prepared common outputs like flight data blocks and radar station status messages that were formatted once to a common memory buffer to be sent to many devices in bulk. For example, a new flight strip for an aircraft would be formatted and stored once, and then sent in sequence to every controller position with a relation to that aircraft.
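The "format once, send to many" path is the simpler of the two to sketch: the strip is formatted into a buffer a single time, then queued to every console position related to the aircraft. The queue names and strip format here are invented.

```python
def broadcast_flight_strip(strip_fields, related_positions, device_queues):
    buffer = " ".join(strip_fields)              # formatted exactly once
    for position in related_positions:
        device_queues[position].append(buffer)   # queued to each related console

queues = {"R-12": [], "D-12": [], "D-14": []}
broadcast_flight_strip(["N123AB", "FL230", "TCS..ELP"], ["R-12", "D-12"], queues)
print(queues)
```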

Legacy

The 9020 is just one corner of the evolution of air traffic control during the 1960s and 1970s, a period that also saw the introduction of secondary radar for civilian flights and the first effort to automate the role of flight service stations. These topics quickly spiral out into others: unlike the ARTCCs of the time, the flight service stations dealt extensively with weather and interacted with both FAA and National Weather Service teletype networks and computer systems. An early effort to automate the flight service function involved the use of a teletext system originally developed for agricultural use as a "flight briefing terminal." That wasn't the agricultural teletext system in Kentucky that I discussed, but a different one, in Kansas. Fascinating things everywhere you look!

This article has already become long, though, and so we'll have to save plenty for later. To round things out, let's consider the fate of the 9020. SAGE is known not only for its pioneering role in the computing art, but also for its remarkably long service life, roughly from 1958 to 1984. The 9020 was almost 20 years younger than SAGE, and indeed outlived it, but not by much. In 1982, IBM announced the IBM 3083, a newer implementation of the Enhanced S/370 architecture that was directly descended from S/360 but with greatly improved I/O capabilities. In 1986, the FAA accepted a new 3083-based system called "HOST." Over the following three years, all of the 9020 CCCs were replaced by HOST systems.

The 9020 was not to be forgotten so easily, though. First, the HOST project was mostly limited to hardware modernization or "rehosting." The HOST 3083 computers ran most of the same application code as the original 9020 system, incorporating many enhancements made over the intervening decades.

Second, there is the case of the Display Channel Complex. Once again, because of the complexity of the PVD subsystem the FAA opted to view it as a separate program. While an effort was started to replace the 9020 DCCs alongside the 9020 CCCs, it encountered considerable delays and was ultimately canceled. The 9020 DCCs remained in service controlling PVDs until the ERAM Stage A project replaced the PVD system entirely in the 1990s.

While IBM's efforts to market the 9020 overseas generally failed, a 9020 CCC system (complete with simplex test machine) was sold to the UK Civil Aviation Authority for use in the London Air Traffic Control Centre. This 9020 remained in service until 1990, and perhaps because of its singularity and unusually long life, it is better remembered as a historic object. There are photos.

Figure from study of typical 9020 message volume

  1. The term National Airspace System (NAS) is still in use today, but is now more of a concept than a physical thing. The NAS is the totality of the regulations, procedures, and communications systems used in air traffic control. During the NAS Enroute Stage A project, IBM and the FAA both seem to have used "NAS" to describe the ARTCC computer system as a physical object, although I think it was debatable even then whether or not this was an appropriate use of the term. One of the difficulties in researching the history of civilian air traffic control is that the FAA seems to have been particularly bad about names. "NAS Enroute Stage A" is not very wieldy but is one of the only terms that unambiguously refers to the late-'60s, early-'70s IBM 9020-based ARTCC system, and even then it is confusing with the later enroute automation modernization (ERAM) program, complete with its own stage A. I refer to the ARTCC automation system simply as "the IBM 9020" even though this is incorrect (consider for example that the complete system often involved a display subsystem built by Raytheon), and you will find contemporary references to it as "NAS," "NAS stage A," "NAS automation," etc.

  2. The 9020 was also responsible for the assignment of non-overlapping transponder codes.

Flock and Urban Surveillance

Some years ago, I had a frustrating and largely fruitless encounter with the politics of policing. As a member of an oversight commission, I was particularly interested in the regulation of urban surveillance. The Albuquerque Police Department, for reasons good and bad, has often been an early adopter of surveillance technology. APD deployed automated license plate readers, mounted on patrol cars and portable trailers, in 2013. Initially, the department kept a six-month history of license plate data. For six months, police could retrospectively search the database to reconstruct a vehicle's, or a person's, movements—at least, those movements that happened near select patrol cars and "your speed is" trailers. Lobbying by the American Civil Liberties Union and public pressure on APD and city council led to a policy change to retain data for only 14 days, a privacy-preserving measure that the ACLU lauded as one of the best ALPR policies in the nation.

Today, ALPR is far more common in Albuquerque. Lowering costs and a continuing appetite for solving social problems with surveillance technology means that some parts of the city have ALPR installed at every signalized intersection—every person's movements cataloged at a resolution of four blocks. The data is retained for a full year. Some of it is offered, as a service, to law enforcement agencies across the country.

One of the most frustrating parts of the mass surveillance debate is the ability of law enforcement agencies and municipal governments to advance wide-scale monitoring programs, weather the controversy, and then ratchet up retention and sharing after public attention fades. For years, expansive ALPR programs spread through most American cities with little objection. In my part of the country, it seemed that the controversy over ALPR had been completely forgotten until one particularly significant ALPR vendor—Flock Safety—started repeatedly stepping into long-festering controversies with such wild abandon that they are clearly either idiots or entirely unconcerned about public perception.

PTZ camera on light pole

I try not to be too cynical but I am, unfortunately, more inclined to the latter. Companies like Flock know that they are in treacherous territory, morally and legally. They know that their customers are mostly governments or organizations with elected leaders that are subject to popular opinion. They know that helping Texas law enforcement track down abortion seekers in other states is a "bad look." They know all of these things, but they do not particularly care. They don't have to care: decades of incipient corruption, legal and political maneuvering, and the routine inefficacy of municipal politics has created an environment where public opinion doesn't matter.

I can't definitively tell you where public opinion lies on ALPR, although it seems like the average person might be mildly in support. From at least my experience, in my corner of the world, I will tell you this: it doesn't matter. Police departments and the means by which they purchase and field technology are so isolated from the political process that it is extremely difficult to imagine a scenario where voters could effect change. Year by year, city by city, the police become more dug in. Law enforcement agencies across the country have found that the most straightforward way to address privacy concerns around surveillance technology is to keep the department's purchase and deployment of that technology a secret. Most city governments at least passively support this approach. The vendors of surveillance systems facilitate, support, and even demand secrecy through their contract terms.

More recently, Las Vegas and the Bay Area have offered a model even more opaque to public scrutiny: law enforcement surveillance technology is simply purchased by wealthy private donors, almost invariably from the software industry, and then either the systems or their use are donated to the city. If carefully designed, these programs can be completely exempt from public information rules. They can take the form, for example, of a business association that runs its own private surveillance state, involving the public and ostensibly accountable police only when an arrest is made. We are privatizing mass surveillance.

Flock continues to generate enormous press, mostly on the back of persistent investigation by 404 Media. More recently, security researchers have published significant defects in the design of Flock's technology that can make the original video publicly accessible. Just about every time that someone looks into Flock, the company turns out to be less ethical, the users less concerned about compliance, the design of the system itself less competent than charitable viewers had assumed.

I'm trying not to be a doomer for Christmas, but I am sometimes frustrated with Flock coverage because it can miss the entire history of this issue. I think that a more contextually complete discussion of urban surveillance could be useful. And I am sitting in a coffee shop, in a trendy part of town, looking out the window at a PTZ camera on the side of a traffic light. That camera, I know, is owned and operated by APD's Real Time Crime Center (RTCC). Between the police department itself, city agencies that participate in the RTCC, and businesses that volunteer real-time access to their surveillance, there are thousands of others like it. Courtesy of a transit station, there are at least a dozen within my view.

Whether or not this is a good or bad idea, whether or not it is effective in reducing crime, whether or not it will be leveraged against political opposition; these are tricky questions. I suppose what worries me is that it feels like hardly anyone is asking them any more. It took Flock's remarkable ability to step on rakes and the apparent victory of fascism in national politics for anyone to remember that the construction of ubiquitous surveillance is a project that started many years ago, and that has proceeded largely unhindered ever since.

Domestic Signals Intelligence

Within the tech community, there has historically been much attention to the ability to track people by passively observing Bluetooth traffic. This technique has been widely used, both in commercial and government applications. There are popular "smart city" street lighting systems, for example, that allow every street light to passively collect signatures of the people passing underneath it. To my knowledge, these techniques are not actually very widely used by law enforcement. There are, perhaps, two reasons: one of mechanisms of government, the other of countermeasures.

First, "smart city" data collection systems are usually funded and deployed by municipal works or environmental departments. While police could make arrangements for access to that data, those arrangements would require the kind of inter-agency Memoranda of Understanding that tend to lead to far more public scrutiny than police acquisitions of surveillance products. Besides, since they aren't deployed for law enforcement purposes, they're often not useful sources for the areas that law enforcement find most interesting: higher-crime areas that are usually lower-income and, thus, less likely to have working street lights at all—much less "smart" ones. Besides, smartphones have widely adopted randomization of Bluetooth and WiFi identifiers, and protocol revisions have reduced the number of identifiers transmitted in plaintext. Passive signals intelligence just doesn't work as well as it used to, at least at the capability level of a municipality.

Talking about Bluetooth and WiFi on phones does raise the question of the "phone" part, the cellular interface, which is targeted by the family of devices often known as "Stingrays" after the trademark of a particular manufacturer. Fortunately, improvements in the security design of cellular protocols are making these less effective over time. Unfortunately, the technology continues to advance, sometimes undoing the improvements of newer GSM revisions. Federal and local law enforcement continue to purchase and use these devices, largely in secret, benefiting from a particular model of federal ownership/local use that makes it especially difficult to get a police department to even confirm or deny that they have ever deployed them. "Stingrays" or IMSI surveillance is mostly a shadow world of rumors and carefully worded non-denials. Despite technical measures against them, they are clearly still in use, and thus clearly still useful.

Optics

Out in the rest of the world, visual surveillance is more salient than radio. Early rollouts of police-operated surveillance, dating back to at least the 1970s, generated some controversy over privacy implications. Two things have since happened: first, law enforcement have relied on an increasing number of public-private partnerships and commercial vendors to gain access to surveillance without directly owning it. Second, police video surveillance has largely been normalized, and no longer faces much opposition or even public notice.

One of the interesting changes here is one of visibility: police video surveillance has often edged in and out of public awareness to fit the politics. In periods of pro-privacy, anti-surveillance sentiment, departments rely more on voluntary arrangements for access to cameras installed by others. In periods of pro-police, anti-crime sentiment, departments install cameras with flashing lights and police badges. Both types tend to persist after the next change in the tide.

The public usually knows little about these systems, a result of intentional opacity by police departments and a general lack of interest by the press. That leads to a lot of confusion. Where I live, we do have an extensive network of police-operated cameras on intersection traffic signal arms. And yet, if you ask the average person to identify a police surveillance camera, they will point to the camera for the traffic signal's video-based lane occupancy sensing every single time. That's not a police camera, it's barely even a camera as the video is rarely retained. All of these people, it seems, have worked themselves into a sort of paranoia where they think that every camera is an eye of the police. Well, the police are watching them, from about ten feet over. The "speed dome" PTZ cameras get so much less attention, perhaps because they are usually mounted higher on the pole, out of the direct sightline, or perhaps because they are of a type more common for commercial surveillance systems that we have learned to simply ignore.

Video surveillance is an interesting topic to me, philosophically. I am largely unconcerned with the privacy implications of most video surveillance installations (such as the one on my own house) because, historically, the video was recorded locally and generally reviewed only when there was a specific reason. The simple fact that reviewing large amounts of video is so time consuming meant that the pervasive surveillance potential of video surveillance was, at one time not so long ago, quite limited.

Motorola ALPR on signal arm

Of course, the age of the computer has somewhat changed that situation. There are two phenomena of the automation of surveillance that I think should be considered separately: first, machine vision has improved to an extent that computers can automatically process surveillance video to extract events and identities. Second, the appified, everything-social-media attitude of consumer products creates new dynamics in access to surveillance data, and those dynamics are spreading upwards into the commercial segment.

Machine Vision

Historically, much of the attention to pervasive video surveillance has centered around facial recognition. Facial recognition has indeed been applied to video surveillance for years, but I think that the average person vastly overestimates how effective and widely used facial recognition is.

The vast majority of currently installed surveillance cameras do not produce video of sufficient quality for facial recognition. That has less to do with the quality of the video itself (although that is poorer than you think for most real systems) and more to do with the way that surveillance cameras are used and installed. Most cameras are positioned high up with wide coverage of a room; this perspective is ideal for reconstructing a series of events but just about the worst case for facial recognition. For most surveillance cameras, human faces are small and at indirect angles. There is little geometry that you can extract from a face that is about ten pixels wide and subjected to aggressive h264 compression, which is how most surveillance video comes out.

Practical facial recognition systems involve cameras installed specifically for that purpose, roughly at eye level where they will get close-up, straight-on images of people who pass by. Next time you visit a bank branch, look by the exit doors for a conspicuously thick height strip. Height strips by exit doors were invented to allow clerks to give police a more accurate description of a robber, but they have since evolved to serve largely as subterfuge. Somewhere around 5' 6", you will notice a small hole, and behind that hole is a camera. Several manufacturers offer these and they seem very popular in financial services.

Kroger is more to the point: they've just been installing dome cameras right at eye height on their exit doors, for at least a decade.

When discussing surveillance, it's important to remember that the vast majority of real-world video surveillance systems are old, inexpensive, and poorly maintained. Even where cameras are installed specifically for a good view of faces, there probably isn't any facial recognition in use, most of the time. Facial recognition products are expensive and don't currently manifest many benefits unless the organization is large enough to have a security department to work with the resulting data, which requires a degree of operational maturity beyond most surveillance users (e.g. gas stations).

All of that said, there are plenty of real-world facial recognition deployments. Rite Aid, for example, prominently rolled out facial recognition to flag known shoplifters at their stores... a rollout that went so poorly that it led to a lawsuit and an FTC settlement including the end of the facial recognition program and a five-year moratorium on further attempts. This is not to say that facial recognition on video surveillance isn't legal (although in some states it isn't or at least requires a lot of disclosure), but there is definitely a degree of legal and reputational hazard involved.

The ACLU has periodically conducted call-around surveys on use of facial recognition. The most notable trend is that most large chains now refuse to talk about it. I could be wrong, but my experience with corporate communications behavior leads me to interpret a refusal to comment along these lines: Home Depot, for example, is a very large company that will have various initiatives underway, and either confirming or denying their use of facial recognition would probably be wrong in some cases and expose them to compliance or legal risk in others. I would bet good money that Home Depot has some facial recognition technology deployed at some locations, but knowing how slowly security technology tends to roll out in that kind of company and how complicated the legal and compliance situation can become, it's probably limited. They are probably acutely aware of the controversy surrounding this type of surveillance and, given the example of Rite Aid, the ways a rollout could go wrong. That means that surveillance will spread slowly.

But it will spread. At this point, I think it is inevitable that facial recognition will become widely used in video surveillance. I just think the point at which "facial recognition is everywhere" remains some years away, due to all the normal reasons: technical limitations, slow-moving bureaucracies, and a somewhat complex and unclear regulatory situation. It will inevitably happen, for the same reasons as well: aggressive sales by facial recognition vendors.

There are some sectors where facial recognition is very common, although I don't get the impression that retail is one of these yet. Casinos, for example—some of the larger Las Vegas casinos have reportedly had facial recognition systems in use for decades, and Nevada law is very permissive of them. Casinos are, of course, pretty much ideal users. Large institutions with a lot of financial risk and large, sophisticated security departments. Few other businesses outside of Target can compete with the size and sophistication of casino security departments. There's a whole lot of money flying around, and they can spend some of it on expensive per-camera licensing without much leadership objection.

License Plate Reading

Popular attention to facial recognition has mostly fallen away as industry and media focus has shifted to another application of machine vision that is, it turns out, a whole lot easier: license plates. License plates are designed for readability, and most states use retroreflective paints that give you an absolutely beautiful high-contrast image under coaxial (i.e. mounted alongside the camera) infrared illumination. There's not much that is easier to read by machine vision. Automated license plate readers have been available for quite a long time: US CBP had an experimental ALPR installation at a Texas border crossing in 1994. That system was actually deemed a failure and removed, but technology improved and there were permanent installations at larger border crossings by the end of the 1990s.

For a long time, the dominant vendor of ALPR equipment in the US was Motorola. Motorola's product line remains popular for vehicle-mounted systems, but the high price of the cameras, controllers, and software package had a side benefit of limiting the pervasiveness of ALPR. The equipment was just too expensive to put up all over the place.

In Albuquerque, for example, the ALPR program long consisted of Motorola systems mounted on portable "your speed is" trailers. The portable nature of these setups made the expense more worthwhile, and besides, portability has its own utilities: an APD detective once told me, for example, of how they would leave ALPR trailers in front of the houses of people suspected to lead criminal gangs. While there was value in the intelligence collection, the main motivation was intimidation: while you might call the "your speed is" trailers a concealed system, they're not all that subtle, and one supposes that most vehicle-based criminals (the main kind here) are aware that they function as the eyes of the police.

At some point, in response to growing budgets or lowered costs (I'm not sure which), APD began installing fixed Motorola ALPR systems on the signal arms of major intersections. I know of around a dozen installations of this type in Albuquerque, which is the beginning of a widespread capability to monitor public movements but not exactly the dystopian pervasive surveillance of Minority Report.

ALPR works pretty well, but it is not perfect. Cameras need to be installed with fairly narrow optics aimed at the right spot, and infrared illumination makes reading far more reliable. Speaking of Las Vegas, I used to use a certain casino parking garage with an ALPR-based payment system with some regularity. It printed the license plate, as read by the ALPR, on the parking ticket, which is why I know that it was almost comically inept at reading my very legible California plate. It always got about half the characters wrong. I have gotten much better results in my own home experiments with budget equipment, so I figure that system must have been very poorly installed or maintained, but I'm sure there are plenty of others out there just like it.

That's the tricky thing about video surveillance, from the "blue team" side of the house: people don't tend to pay a lot of attention to it until there's been an incident, at which point they find out that the lens has had mud on it for the last three months (a much bigger problem with ALPR cameras that used to be mounted pretty low to the ground for a better look angle). I bring this up because I think that people tend to vastly overestimate the quality of real-world video surveillance, and I like to take every opportunity to remind people that the main failure case of retail video surveillance used to be failure to replace the continuous-loop tape cassette before it was completely demagnetized by repeated recording. Now, in 2025, the continuous-loop tape cassettes are pretty much gone, but maintenance practices haven't improved. Lots of the cameras you see in public only barely work or don't work at all. So it goes.

Flock

In 2017, though, a VC-backed (and specifically Y Combinator) company called Flock Safety introduced a bold new idea to ALPR: a Silicon Valley sales model. Flock's system is built to be low-cost, and the sensors are smaller, simpler, and cheaper than Motorola's. I suppose they might be less effective as a result, but a reduced "catch" rate doesn't really detract from mass-surveillance ALPR installations that much. Flock has also greatly expanded their customer base, emphasizing sales to private organizations as well as law enforcement and government. Speaking of Home Depot, for example, Home Depot seems to have installed their own Flock cameras in all of their parking lots. Lowe's Home Improvement has done the same.

I wanted to know more precisely how much Flock systems cost, because I suspected they were making significant inroads just through low pricing. It's a little tricky to say definitively because Flock is a "call for quote" kind of company and I think they offer contracts on different price bases. Scouring contracting documents, meeting notes, etc., it seems like a "typical" cost for a Flock camera is around $4,000 with about $3,000 a year in per-camera software licensing fees.

That might seem expensive but it compares well to the five-figure prices I have heard associated with Motorola systems, especially since the Flock offering is more "white glove" with installation and maintenance packaged. Motorola systems are usually purchased through an integrator who adds their own considerable margin.

Flock camera on light pole

Flock also designs their cameras to be amenable to solar power, which radically reduces install costs compared to Motorola systems that usually need a utility worker out to splice power from a streetlight. More even than a price reduction, it makes Flock cameras much more available to organizations like HOAs that control territory in a sense but do not have the full bucket truck or utility work order capabilities of a municipal government.

Another recent innovation in ALPR is less traceable to Flock but certainly seems associated with them: flexible funding sources. Police departments have limited budgets with which to acquire new technology, and technology vendors have to compete with other budget priorities like salaries, vehicles, and black-on-black tactical vinyl jobs for those vehicles. ALPR seems especially attractive for public-private partnership mechanisms, so there are a lot of Flock installations that were funded by business associations, HOAs, neighborhood associations, and other "indirect" sources. Some of these systems are owned and operated by the police with only the funding donated, others are owned and operated by the private group that paid for them. This can result in curious deployment decisions: sometimes the lowest-crime neighborhoods are the most replete with ALPR, as they tend to be wealthier and more politically organized communities with the wherewithal to put up the money.

The most important thing to understand about Flock, though, is that it has built on Amazon's concept of "Ring neighbors" to build a sort of nationwide, ALPR-centric Nextdoor. Flock customers can basically check a box that allows other Flock customers to access data from their sensors, and of course Flock strongly encourages users to turn sharing on. While there are some audit and access controls available on Flock data sharing, they seem like pretty minimal efforts that are often ignored.

Flock sharing has generated a lot of press, especially with some dramatic examples like use by a Texas sheriff to locate an abortion patient and use by ICE/CBP to track suspects. These are both examples that raise one of the most alarming aspects of the Flock situation: many states and municipalities have laws in place that limit or at least monitor collaboration of local police with other police agencies and federal law enforcement. Some people find this surprising, but it's important to understand that the United States is a republic of nominally independent governments. Laws, policies, and priorities can vary greatly from jurisdiction to jurisdiction. There is a specific historical thread, related to the tracing of escaped slaves, that has made resource sharing between different law enforcement agencies a known area of moral and legal treachery for a very long time.

And yet, it turns out, most Flock customers seem to have sharing turned on, possibly entirely without their knowledge. There are now multiple well-established cases of local law enforcement agencies violating state laws by having data sharing with ICE/CBP enabled. It is possible for Flock users to turn off or restrict sharing, but I think a lot of them honestly don't know that. Some state Attorneys General have ordered Flock users to disable sharing, some have restricted or banned Flock products entirely, but in general it's a very messy situation. It appears that a lack of care by law enforcement and other Flock customers, facilitated and no doubt encouraged by Flock's motivation towards "network effect" lock-in, has resulted in widespread and brazen violation of privacy laws that is only now making its way to the courts.

In other words, the tech industry happened.

Acoustics

I will not spend much time here discussing wide-area acoustic surveillance like ShotSpotter, in part because I have written a bit about it before. It's a complex issue: a well-designed gunshot detection system would probably be a good thing, but I find SoundThinking (manufacturer of the ShotSpotter system) to be profoundly untrustworthy.

Futures

The changes we are already seeing will continue: ALPR will become more ubiquitous, facial recognition will advance further into the public sphere, and the tech industry will continue to centralize data and facilitate queries by law enforcement. There's a lot of money to be made out of the whole thing, and funding for law enforcement or public safety purchases is usually politically safe. Pretty much everything is stacked in the direction of more pervasive surveillance in the United States.

Do you find that upsetting? It seems that some people do, and some people do not. I am probably not as opposed to surveillance of public spaces as the most vocal privacy advocates, but I am also convinced that vendor-enabled mass surveillance technology like Flock is subject to enormous abuse and will inevitably undermine constitutional protections. Unfortunately, vocal organizing against mass surveillance has become pretty limited. The ACLU is doing a lot of good work in this area, but I don't see much public organizing.

The best thing you can do is probably to advocate for transparency. The most alarming part of this whole thing, to me, is the way that police departments have brazenly structured purchases of surveillance technology to get around public record and approval requirements. Companies like Flock and SoundThinking encourage this, and write it into their contracts. The end result is that many police departments have installed cameras and microphones in all kinds of places, and will not disclose when, where, how many, or how they are used. We should not allow that kind of secrecy, but preventing it seems to require legislation. The federal situation seems like a loss, so the best pressure point might be to lobby for municipal or state legislation that will require police departments to disclose their surveillance programs. Even better would be a requirement for review and approval of surveillance purchases, but unfortunately that kind of rule often already exists and police departments still structure their purchase arrangements to avoid invoking it.

I suppose the bottom line is this: keep bringing it up. Mass surveillance in the US often feels like a lost cause, but I suppose it's only lost if we give up. It doesn't take that many people showing up at a city council meeting to make something a priority to the councilors; and perhaps the police can only stonewall for so long. It's worth a shot.

speed reading (the meaning of language)

One of the difficult things about describing a grift, or at least what became a grift, is judging the sincerity with which the whole thing started. Scams often crystallize around a kernel of truth: genuinely good intentions that start rolling down the hill to profitability and end up crashing through every solid object along the way. I'm not totally sure about Evelyn Wood; she seems to have had all the best in mind but still turned so quickly to hotel conference room seminars that I have trouble lending her the benefit of the doubt.

Still, she was a teacher, and I am inclined to be sympathetic to teachers. Funny, then, that Wood's journey to fame started with another teacher. His curious reading behavior, whether interpreted as intense attention or half-assed inattention, set into motion one of the mid-century's greatest and, perhaps, most embarrassing executive self-help sensations.

In 1929, Evelyn Wood earned a bachelor's in English at the University of Utah. The following two decades are a bit obscure; she took various high-school jobs around Utah leading ultimately to Salt Lake City's Jordan High School. There, as a counselor to girl students, Wood found that many students struggled because of their reading. Assigned books were arduous, handouts discarded. These students struggled to read so severely that it hampered their performance in every area. She launched a remedial reading program of her own design, during which she made her first discovery: as her students learned to read faster, their comprehension improved. Then their grades—in every subject—followed suit. Reading, she learned, was a foundational skill. A person could learn more, do more, achieve more, if only they could read faster.

Wood became fascinated with reading, probably the reason for her return to the University of Utah for a master's degree in speech. Around 1946, she turned her thesis in to Dr. Lowell Lees. Lees was the chair of the Speech and Theater Department, and had a hand in much of the development of Utah theater from the Great Depression until his death in the 1950s. A period photo of Lees depicts him with a breastplate-microphone intercom headset and a look of concentration, hands on the levers of a mechanical variac dimmer rack. He is backstage of either "Show Boat" or "A Midsummer Night's Dream" at the university's summer theater festival. A theater department chair on lights seems odd, yes, but theater was Lees's passion.

Perhaps reading was not. When Wood turned her thesis in to Lees, he "read, graded, and returned the thesis within a matter of minutes." Wood was amazed that he seemed to just leaf through the pages, but then still had insightful questions to ask. Perhaps I am too cynical. It feels most likely to me that Lees was already familiar with the contents (he was probably Wood's advisor and would have discussed the research plenty of times before) and just didn't bother to read the final document. To Wood, though, something more remarkable had happened. With a series of tests, she convinced herself that Dr. Lees could read over 6,000 words per minute with full comprehension.

Evelyn Wood Reading Dynamics advertisement

A typical American college graduate can read at about 250 words per minute, at least if the material isn't too challenging. Some people, Wood contends, are "10x readers." They read so quickly, and with such good understanding, that they simply outpace the rest of us at every intellectual pursuit. What's more, Wood could make you one of those people. As she tells it, she spent two years, probably in the 1950s, tracking down fifty-some examples of other exceptional readers. She published "Reading Skills" in 1958, a book evidently based on some of this research but more focused on remedial skills for grade students than executive achievement.

The introduction of Reading Skills tells us of ten different students. Anna was pretty, but she couldn't read. Joseph hated school, because he couldn't read. Carl's hair is a mess, and his parents neglectful. He also can't read. All of them became proficient readers through Wood's program. But Wood had more in mind than grade students. A year later, with her business-educated husband, she brought her reading program to adults by launching a chain of training centers under the name Evelyn Wood Reading Dynamics.

Books neither bored nor scared me any longer. I could read almost any book within an hour, and more important, I better understood that which I read.

Reading Dynamics became a sensation. Over the following years, Evelyn Wood institutes opened across the country. The speed reading movement received a considerable boost from President John F. Kennedy—he claimed to read at 1,200 words per minute, a skill he learned in part through a correspondence speed reading course. It wasn't one of Evelyn Wood's, but that detail was mostly lost on the public and the success of the Kennedys became linked to Reading Dynamics. He seems to have bought the same course for his brother Ted, and encouraged his staff to take speed reading courses as well. Reading Dynamics didn't miss the marketing opportunity, and indeed the very first dedicated Institute opened in Washington, D.C. and advertised specifically to politicians. Senators and representatives were among her earliest students and her strongest advocates.

Evelyn Wood Reading Dynamics underwent several changes of ownership through the 1960s, but Wood stayed on as developer of the training materials. Soon there were more than 60 institutes, and newspaper ads directed the interested public to "free mini-lessons" held in the meeting rooms of fine hotels across the country. It became a franchise system, with the Woods personally owning the franchise for Utah and Idaho. The company's fortunes have trended up and down with those of speed reading as a concept, but genuine Evelyn Wood speed reading courses are still available today from business training firm Pryor. There has been a bit of a reckoning: far from the 1,000+ WPM rates promised by early Evelyn Wood marketing material, Pryor now advertises "a potential rate of 400-700 words per minute." These numbers align with the upper end of reading speeds observed among the general population. In effect, Pryor no longer claims that speed reading courses will make you a faster reader than more conventional methods of training reading, like just doing a lot of it.

The science has never really been with speed reading. As early as 1959, when Reading Dynamics hit Washington, researchers and educators called Wood's data and methods into question. As with most self-help materials, Wood's writing was heavy on anecdotes and light on quantitative analysis. Certain elements of her method contradicted psychology's growing understanding of human language and perception. At the core of the problem, though, was her promise of comprehension.

It is obvious that a person can "read" a document very quickly, if we relax our definition of "read." This is just as obvious to the developers of speed reading courses. Many advise students to start by skimming, flipping through the whole book or document and taking in the headings and subjects. You can certainly get through a book in under an hour that way, but of course, you haven't exactly read it. Think of it as a lossy process: the less time you spend on a document, the less you comprehend and retain its contents. That seems pretty intuitive, doesn't it?

But Wood disagreed, or at least, the company she founded did. It can be a little difficult to untangle Evelyn Wood's original theory from the many generations of Reading Dynamics and competing speed reading systems that followed. Subsequent owners of Reading Dynamics, which included companies like the publisher of Encyclopedia Britannica, made significant revisions to the material. By the 1970s, Wood's role was more as a celebrity spokesperson than a teacher. In any case, Reading Dynamics came to emphasize a key principle that reading faster actually improves comprehension. The most skilled readers, Reading Dynamics taught, don't even read words. They scan a page vertically, not horizontally, taking in an entire line at a time by peripheral vision. There is no need to sound out, read, recognize, or even really see individual words, as the mind actually processes language in large chunks at a time. Reading occurs mostly subconsciously, so in a way all you have to do is see the text and believe that you have read it, and you will retain the content.

In a 2016 review paper on speed reading, a team of psychologists deliver bad news: it just doesn't hold up. Laboratory studies confirm that the eye only has the acuity to distinguish words in a small area, and that reading requires fixating on just about every word individually. That doesn't even matter, though, because other laboratory experiments strongly suggest that the limiting factor on reading speed is not the eyes at all but the mind. Even when clever computer techniques are developed to present text more quickly, comprehension trails off at about the same speed. In fact, when humans read, we regard an even smaller area of our vision than the limits of the fovea would suggest. When looking at a word, we are basically blind to anything further than about seven characters away.

The problems with speed reading are not merely theoretical, though. The researchers considered studies of actual speed readers, people who had either completed speed reading courses or claimed to naturally read at exceptional speeds. Almost no studies can be found that support the claim of faster reading with retained comprehension. Speed readers perform poorly on comprehension tests on new material. People who "speed read" a document generally show similar comprehension to people who have no speed reading training but skimmed the document in the same period of time. When speed readers have performed better, researchers suspect the result comes more from advanced familiarity with the material (a common problem with speed reading courses that use the same texts repeatedly), broader general education (you retain more from non-fiction material if you already knew the information to begin with), and greater experience and confidence in "interpolating" by speculating as to the content of the text that wasn't actually read.

Ultimately, eye tracking experiments tend to confirm the worst. People who speed read don't do all that much actual reading. Skilled speed readers skip much of the text completely, and tend to make things up when asked about things they never fixated on. Most interesting, there seems to be a certain Dunning-Kruger effect at play. People who have speed-read a book on a subject, for example, tend to rate their knowledge of the subject highly and then perform very poorly on questions about it (often scoring similar to chance on multiple choice tests). Speed reading, it turns out, is a placebo. It makes you feel like you have read something, even though you haven't.

And yet we still have speed reading. Wood's efforts were perhaps sincere, but the commercial imperative of the growing Reading Dynamics institutes steered the whole thing away from evidence-based methods and towards ideas with an increasingly tenuous connection to reality. The on-again, off-again success of Reading Dynamics left a lot of room for imitators, or innovators, depending on your perspective. Evelyn Wood's original strain of speed reading has mostly fallen away, replaced by a new set of courses that build on Wood's ideas—the worst of them.

Take, for example, the work of Paul Scheele. Scheele is one of those business conference motivational speaker types, the kind of person who is introduced with a vast and impressive resume but who doesn't seem to have really done anything. With a PhD in "Leadership and Change," he founded Scheele Learning Systems to market a series of innovations. His work is so interconnected with other self-help and new-age grifts that it can be hard to untangle what comes from where, but one of their key programs clearly builds on the Evelyn Wood method: PhotoReading.

The basic concept of PhotoReading is that the subconscious mind is able to process far more information than the conscious mind. In a marketing sheet, he writes:

Your conscious mind can handle seven pieces of information at a time, while your subconscious mind can handle a staggering 20,000 pieces of information. That's the difference between regular reading and PhotoReading.

So imagine a future in which you pick up a book, flip through the pages, and in a matter of minutes gain a full command of the material contained therein. The key is that you don't actually have to read anything, you just have to see it and your subconscious mind files every word away for later retrieval.

Well, of course, it's not quite that simple. There's a whole technique to it, a technique that you can learn from a self-guided digital course for only $530. Sure, that might seem a little steep, but consider that it's a package that includes not only the course but "The PhotoReading Activator Paraliminal CD." Paraliminal activation or paraliminal hypnosis is another major product from Scheele Learning Systems, although I think it's licensed at least in part from a different organization (Centerpointe Research Institute) founded by different cranks ("transcendental meditation" enthusiasts Bill Harris and Wes Wait). The idea of paraliminal activation is roughly halfway between subliminal inducement videos 1 and binaural beats 2, in that it's both of them mixed together. Incidentally, a lot of subliminal videos are like that anyway, so I'm not sure that Scheele is offering anything you can't get for free. All of these organizations offer rotating carousels of endorsements from famous and successful customers. The fact that these happy customers are almost invariably self-help authors or business conference motivational speakers goes unremarked upon.

Scheele's ultimate claim is that PhotoReading allows you "to 'mentally photograph' the printed page at 25,000 words per minute." 600-800 WPM is an excellent, exceptional reading rate among the normal population. For today's speed reading industry, though, 25,000 WPM is the bar to meet. "Harry Potter and the Deathly Hallows" counts up to about 198,000 words. An experienced PhotoReader, then, ought to be able to complete it in around eight minutes. Well, celebrity speed reader Ann Jones says it took her 47, so no one is perfect. She knocked out "Go Set a Watchman" in 25 and a half, and that on live television. Yes, for the most successful speed readers, people who claim rates in excess of 10,000 WPM, there's almost always some aspect of performance involved... whether that's television appearances or elected office. 25,000 WPM became cemented because it's the rate at which the Guinness World Records clocked celebrity speed reader Howard Berg. Actually, there's a woman who claims to have a Guinness World Record at 80,000 WPM, but it's hard to substantiate as Guinness stopped publishing the speed reading record at all decades ago. I suppose it became too questionable for even them.

The reason I'm so fascinated by speed reading is its close interconnection to the concept of the executive. One of the earliest newspaper ads for Reading Dynamics reads "For Executives, Businessmen, Students, Housewives." The housewives part doesn't quite fit the theme, but I think that might be better understood with the Utah LDS context of the housewife side hustle. Multi-level marketing schemes were becoming a cornerstone of the Salt Lake City business scene during the 1960s, a role they still fill today, and MLM brands like Avon found their success in part by melding the two worlds of the housewife and the business executive. Feminine products sold with masculine hustle, you might venture; some housewives were applying themselves to business with a zeal that would make a railroad baron blush.

For students, the motivation is more obvious. Much of education comes down to reading, and we all remember the feeling of a paper due in two days on a book that you haven't yet opened. For the student, getting good comprehension of a text in a fraction of the time is an incredible offer. So promising was speed reading for education that, in its early days, it found considerable adoption in the educational establishment. Many universities offered speed reading courses, some even made them core curriculum. A particularly prominent speed reading course at Harvard served as the pattern on which many others were taught. Besides a series of demonstration films developed at Harvard, devices called "reading regulators" or "reading accelerators" were popular lab equipment for these courses. They automated Evelyn Wood's idea of running a ruler down the page, sliding a metal shield down the page faster and faster to force the student to read at a higher and higher rate. For a few years, speed reading for universities became an entire industry, but it was short lived. Academic speed reading courses faded away as criticisms of Wood's theories became better known and attempts at validating speed reading continued to fail.

"Speed reading," it turns out, did not work out in education. But perhaps that's a matter of framing. If we consider the broader landscape of "things that promise to save you time reading," speed reading is just one in a long line of ideas. It turns out that students have been trying to skip the reading for just about as long as there has been reading—consider Monarch Notes, a line of book summaries and critical commentary already available a hundred years ago. From Monarch to CliffsNotes to Chegg, students have looked to a whole sector of the publishing industry to do the hard work of actually reading books for them. You could say that the purpose of these digests or study guides is to help a student maintain the appearance that they have read a text even though they have not, by imparting only the parts of the text that are most important... important either because they are key to the plot or theme, or because they are likely to appear on exams or be expected in papers.

While students have an obvious need for these types of summaries (scoring well on assignments with less time invested), the appeal to the business executive might seem a little fuzzier. Well, unless we take the cynical view that the ultimate goal of an executive is to look smart, and I'm not sure that you really have to be so cynical to accept that as truth.

Summarizations are obviously "lossy," in that a digest form of a book cannot possibly contain the full information of the original book. Similarly, the weight of scientific evidence, as well as most credible practical experience, tells us that speed reading is a lossy process. There is, as the psychologists put it, no silver bullet in reading. Comprehension takes time; less time means less comprehension; and while you likely can improve your reading speed it will take years of practice.

And yet book summaries are an even larger industry than speed reading, and one that is both older and better adapted to the modern age. There clearly is a market for fast, low-comprehension reading of large texts. The audience is not purely made up of people seeking to create the appearance of work they have not done, although that's clearly a large part of it. Consider the magazine book review: long a staple of magazines, book reviews serve two purposes. They give you an idea of whether or not a book is worth reading, but they also summarize the book, or at least explain the major themes. That gives you some of the content of the book, the major ideas and a few choice details, in just a page of three-column prose. A third of that might be taken up by a wine club ad, to boot.

The case of the magazine book review reminds us that there is a serious, a respectable application for summaries. The perfect example might be the lawyer or doctor, people who are paid explicitly for their expertise and education but who also make heavy use of digests and summaries and desk references. There is a lot of information in the modern world, even in any given field, and no one can keep track of all of it. You might need to speed read, to use the CliffsNotes, just to keep up with the state of the field and find the things that you do need to read in their full length.

And so we have seen the dual facets of the executive demand for speed reading: the businessman, the leader, the executive is the perfect intersection of the professional need to find what to read and the personal need to look like you have done a lot of reading. Executives are expected to know what's out there, but also to seem like they already know all of it. It's a matter of opinion which of these is more prominent, but I think we can all agree that publications like CTO Magazine are aimed at that dual purpose.

Well, these days, publications like CTO Magazine are mostly aimed at drumming up AI hype. That's the other thing about business publishing: it is itself a business, and as beholden to the trends as any other.

The funny thing about speed reading is that it has never been that credible. Evelyn Wood's theories were inconsistent with the research and, frankly, a bit "out there" even as she developed them into a business in the 1960s. Experiments on speed reading, some of them conducted by the same people selling courses, have always shown iffy to clearly negative results. And yet speed reading has, in its good times, enjoyed a level of credibility and popularity that seems out of step with even its promises and certainly with its outcomes.

US Presidents Kennedy, Carter, and Nixon were all speed readers. Carter and Nixon both arranged Evelyn Wood Reading Dynamics courses for their staff, and it seems that Kennedy probably purchased some sort of course for White House staff as well. This was very much perceived as an endorsement from the top, and speed reading became not just a new innovation in education, not just a trend, but practically a requirement for any serious leader. Marketing, and the celebrity adoption that it intentionally engineered, outpaced the results. Evelyn Wood's newspaper ads and reputation got so far out front of the actual pedagogy that today's speed reading industry, spinning ever farther from reason, continues to coast on the same set of presidents.

That's not to say that there has been nothing new in speed reading. In 1984, psychologist Mary C. Potter described a method called "rapid serial visual presentation" or RSVP. The idea of RSVP is to eliminate the whole eye movement part of reading entirely, instead using a computer to present one word at a time, each centered in the same location. In theory, the words can be presented faster and faster until the user is reading more quickly than the visual system allows. Well, that's a theory at least. It's inconsistent with later research suggesting that reading speed is limited by cognition rather than perception, but most of that wasn't yet known at the time. Even so, Potter doesn't seem to have viewed RSVP as a speed reading technique. She described it as a method for cognitive research, one that could enable new experiments and improve old results by controlling for the many variables involved in scanning a page of text.
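For the unfamiliar, the mechanics of RSVP are simple enough to sketch in a few lines. This is only a toy illustration of the general idea, not any particular product's implementation; the 40-character field width, the sample sentence, and the flat per-word pacing are all my own assumptions.

import sys
import time

def rsvp(text: str, wpm: int = 300) -> None:
    """Present one word at a time in a fixed screen position (RSVP-style)."""
    delay = 60.0 / wpm  # seconds allotted to each word
    for word in text.split():
        # The carriage return keeps every word in the same spot on the line,
        # which is the whole point: no eye movement, no scanning the page.
        sys.stdout.write("\r" + word.center(40))
        sys.stdout.flush()
        time.sleep(delay)
    sys.stdout.write("\n")

rsvp("The reader keeps up, or does not, at whatever pace the software sets", wpm=600)

Real RSVP tools dress this up, for example by lingering a little longer on long words or punctuation, but the basic trick is exactly this: the text comes to you, one word at a time, as fast as the software cares to push it.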

The idea of RSVP as a speed reading technique seems to have been popularized by software startup Spritz, who launched an RSVP speed reading application in 2014. Spritz seems to have spun it as "text streaming," although I think that might have been a later branding innovation. The claims of Spritz are relatively modest, only 1,000 WPM in most cases and sometimes as low as 600 WPM. These are speeds achievable (even if only narrowly) without technical assistance for exceptionally fast readers. Even so, it doesn't really work out. Research on the RSVP method of speed reading finds that comprehension decreases with increasing speed. Amusingly, some experiments show that RSVP results in decreased comprehension even when run at the same speed the subject reads naturally. Psychologists tend to attribute that effect to the fact that RSVP prevents going back and rereading a sentence that you didn't fully understand—a behavior that seems to be a natural and even required part of good reading, despite the fact that Evelyn Wood and virtually every speed reading theorist since has outlawed it.

The fact that the RSVP concept is fundamentally at odds with blinking is probably the major cause of a reported increase in fatigue, as well, but none of these shortcomings have prevented the massive popularity of RSVP within the tech industry especially. Spritz, the company, has gone basically nowhere, but the concept has graduated from TED talks to a huge inventory of browser extensions, mobile apps, CLI tools, and various and sundry GitHub projects that all make the same claims about increased reading speed. "Speed Reading Makes a Comeback" was the title of an NBC News spot on Iris Reading, more of a traditional Wood-style speed reading training company that has since wholeheartedly embraced the RSVP concept.

If software is part of the speed reading story, and a particularly core part of it today, we will have to take on the elephant in the room: in a certain sense, a very real sense, summarizing text is now the largest single driver of the US economy.

The appeal of summarization to the business executive has never gone away; the underlying technology has just evolved. Since the 2022 launch of ChatGPT, television spots, bus shelter ads, and the collective buzz of the south end of the San Francisco Peninsula have promised first and foremost that AI will relieve us of the obligation of reading. An LLM can read your email, read the news, read a book, or read the comments. Actually, the LLM has already read a lot of these things. On command, it can summarize them for you.

AI advertising seems to imagine a world that is, well, oddly familiar: one in which students, housewives, and, yes, business executives can save hours of each day by using the LLM to, in effect, read at 25,000 WPM. It also seems that the same basic principles apply: the LLM's output loses some of the content of the original material. It might also gain some content, a benefit of all of the other things that the LLM has also been trained on. Still: there's always a certain rounding out, a sanding down of the details.

What strikes me most about LLM summaries is just how long they are. When I have asked Claude to summarize reading notes, it has routinely produced output that is longer than the original notes. This problem can probably be addressed by prompting, although my efforts at appending everything from "be brief" to "for the love of God keep it to one paragraph" have failed to produce a good result. Maybe I'm holding it wrong, maybe I'm an idiot, I possess no qualifications in this area besides decades as a natural language user and an unfinished degree in technical writing. But experience suggests that my coworkers have the same problem. I see AI generated meeting summaries, AI generated issue descriptions, AI generated sales documents. One of their common properties is that they are astoundingly, uselessly verbose.

Of course, modern AI can do so much more than summarize text. "Generative AI" promises not only to summarize, but also to create something new. Perhaps that's why the LLM is so verbose. I, personally, find that I make up for my lackluster interpersonal skills by writing. Perhaps LLMs make up for their similar limitations, their fundamentally text-based, screen-resident nature, by using the one tool that they have. The LLM cannot think or feel, yet it can write. So it writes: a simple question answered with such energy that it merits four distinct bulleted lists, each with an emoji-laden heading and an introductory paragraph. I suppose I can sympathize. We must imagine Grok happy.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


I do not mean to criticize AI too harshly, although I think the level of criticism that this entire industry phenomenon deserves is high enough that you have to go big.

But the relationship between speed reading and the LLM—between Sam Altman and Evelyn Wood—is vague but vivid. The software industry's imagined future, in which people use LLMs to generate text that other people use LLMs to summarize, genuinely haunts me. AI has created a profound contradiction: it promises the productivity gains of speed reading, the ease of CliffsNotes, but it doesn't just shorten text. It also lengthens it. My joking reference to Camus, shoddy as it is, becomes more meaningful. ChatGPT pushes the written word up the hill, it watches it roll back down again.

I read "The Myth of Sisyphus" for the same reason everyone else did: high school. IB English HL. Yes, I went to one of those schools. If you are not familiar you can look it up and one of the top results, at least for me, is a clearly LLM-generated article that is four or five times longer than it should be based on the factual content. You can have your web browser's LLM feature summarize it back down for you, if you want. The result comes out a lot less useful than the Wikipedia article but it is, as they say, disruptive nonetheless.

If the purpose of reading is solely to acquire information and accumulate thought units, then surely speed and efficiency are the essential criteria. Regressing is obviously a morbid symptom since it is destructive of time and energy, while horizontal reading not only taxes the optic muscles, but requires that the same tome remain clutched by fingers which could be more profitably employed in reaching for yet another volume.

We're all full of opinions on the era of AI. I am perhaps not as pessimistic as you might think: the machine learning innovations of the last few years clearly do have useful applications. Even summarizing text has its time and place. I suppose that what frustrates me most about it all is the lack of ambition. LLMs train on text, take text as input, and generate text as output. A room of Silicon Valley visionaries, presented with this astounding tool, came up with such world-changing applications as "reading emails" and "writing emails." The whole industry is still struggling to move past this trivial, boring, frequently nonproductive use.

As for lip motions, any toddler knows that the mouth is tardier than the eye, and retardation is one of the most dreaded words in an educator's terminology. Furthermore, the speed cult is quick to point out that slow readers are rarely careful ones, and generally speaking, comprehension appears to increase with reading velocity. Speed reading, it would appear, is all profit and no loss and if it can make good its claims at linking efficiency and comprehension, then it is well that its methodology and objectives are incorporated into any reading program.

There is the potential, the AI industry's advocates say, of AI actually expanding human creativity. Machine learning methods of producing "art," whether text or image or audio or video, will lower the barrier of entry to artistic production. Of course, that depends a lot on how you define "artistic production," but at least it's a rare promise of a better future rather than a worse one. It only takes a brief interaction with the modern internet to realize that we do live in an age blessed with text. We are rich in the written word like never before, so wealthy with words that they crowd out the actual information. Search results are mostly AI-generated, but the search engine doesn't want you to look at them anyway, it's provided its own AI-generated treatise. The headings, the paragraphs, the bulleted lists, they run down the page, drip from our screens, they leave our desks filthy with content.

But prior to debating the plausibility of the claims in an Evelyn Wood brochure, it would seem logical to consider the desirability of the goals—goals which appear to have slipped unchallenged into the realm of pedagogical axioms. Are such facile reading practices worthy of unqualified adulation? A careful look at their implications suggests that such seeming saints can in fact be devils.

It's enough to drive you to madness. Why do we use computers to write text that no one will read? Why do we use computers to read text that no one wrote?

Years ago, in college, during a previous AI winter, I sat in my room reading a shitty science fiction novel. Leo, from across the hall, walked in. "What class is that for?" he asked.

"Not for a class," I responded.

"So you're just reading it?"

Taken by themselves, the cardinal virtues of reading efficiency can collectively demean the entire reading process by treating it as a function rather than as an art.

There is nothing new under the sun. We have done this all before: we have fixated on reading as production, production as profitable, and reading thus, ultimately, unimportant. A detail to be optimized away. An expense. ChatGPT didn't start this. It won't end it. That's what I remind myself: we are living through just another step in the evolution of culture.

But then I still worry. What if this is it? Between short-form video and AI, between social media's pivot to stoking fascism and the publishing industry's pivot to reprinting AO3, what if language arts are done?

None of these people care. That's the one thing I can say confidently, or at least say that I truly believe. These people building the cutting edge of natural language, these industry titans who style themselves as the loyalists of our nation and revolutionaries of the arts, they don't give a damn about writing or reading. Text is an asset, an asset to extract, refine, and dispense. They're just trying to make it through the news and their Twitter feeds and a half dozen pop-science books as fast as possible so that they can be, feel, or at least look like they're well-read. They assume that everyone else feels the same way.

How can any teacher extol the pleasures of reading when classroom practice implicitly asserts that books are mines to be stripped and not pastures in which to dwell and delight?

I've been quoting from Leonard R. Mendelsohn, whose paper "Jetting to Utopia: The Speed Reading Phenomenon" ignores the question of whether or not speed reading works and instead considers whether or not it is a good idea. His context was the classroom of the 1970s: speed reading had caught on in education, and Mendelsohn worried. Well-intentioned teachers were training their students to absolutely optimize the mechanics of reading. In the process, Mendelsohn feared, they had forgotten the point.

Reading can provide fodder for the brain by the ready conversion of wood pulp and printer's ink into social poise, persuasiveness, and a financially rewarding livelihood.

Things have changed a great deal since Mendelsohn's day. The wood pulp is gone, so too the printer ink, and so too the financial rewards. Writing is, I suppose, more of an art than ever before, as my chosen industry devotes its full might to destroying my chosen avocation.

Although reading might be branded with the explicit label "fun," it is not long before the apt student reaches the conclusion that speed, concepts, and information are all one knows and all one needs to know.

Mendelsohn's paper ran in the journal "Language Arts." It's about four pages long, about 2,300 words. It took me around ten minutes to read. An accomplished student of Evelyn Wood could read it in just a couple of minutes. With some chiding to stay brief and cut it out with the bulleted lists, Claude summarized it in a few sentences.

For the journal, though, the paper is not quite long enough. Its last page is only half full. The journal editor made up the difference, they found some filler. It's a poem about clouds.

ChatGPT can do so much, but it can't do the work of a poet. It can't match Christa Kessler, age 10, Powhatan School, Boyce, Virginia. She wrote "Clouds" almost fifty years ago, an editor used it to round out the layout of a journal, JSTOR coughed it up along with my article, and now I am thinking about how clouds really are interludes in the middle of a great blue sea.

That's what it's like to read slow. That's what it means to write.

  1. If you don't immediately know the exact kind of YouTube video I'm talking about, maybe "become a catgirl subliminal" will jog your mind. Or just look it up and find out for yourself. Remember to stay hydrated.

  2. One of the hard things about writing about these kinds of fringe or parascientific topics is that they get all tangled up in each other and I have a hard time not getting lost on tangents. Fortunately I think that many of my readers have the same kind of internet exposure that I do and are probably familiar with the concept or claims made about binaural beats. You might be less aware that the whole thing dates back to the 1970s and perennially pops up in any kind of self-help or "neurogenics" or whatever context, including many speed reading courses. To be fair, back in the 1970s the idea was new and full of potential. Now it is not; decades of scientific investigation have failed to produce clear evidence that binaural beats do anything.

CodeSOD: Awaiting A Reaction

Today's Anonymous submitter sends us some React code. We'll look at the code and then talk about the WTF:

// inside a function for updating checkboxes on a page
if (!e.target.checked) {
  const removeIndex = await checkedlist.findIndex(
    (sel) => sel.Id == selected.Id,
  )
  const removeRowIndex = await RowValue.findIndex(
    (sel) => sel == Index,
  )

// checkedlist and RowValue are both useState instances.... they should never be modified directly
  await checkedlist.splice(removeIndex, 1)
  await RowValue.splice(removeRowIndex, 1)

// so instead of doing above logic in the set state, they dont
  setCheckedlist(checkedlist)
  setRow(RowValue)
} else {
  if (checkedlist.findIndex((sel) => sel.Id == selected.Id) == -1) {
    await checkedlist.push(selected)
  }
// same, instead of just doing a set state call, we do awaits and self updates
  await RowValue.push(Index)
  setCheckedlist(checkedlist)
  setRow(RowValue)
}

Comments were added by our submitter.

This code works. It's still the wrong approach for React: it mutates objects that React owns as state instead of passing new values through the provided setters, and it awaits array methods that aren't asynchronous at all. Without the broader context, it's hard to point out all the other ways to do this, but honestly, that's not the interesting part.

I'll let our submitter explain:

This code is black magic, because if I update it, it breaks everything. Somehow, this is working in perfect tandem with the rest of the horrible page, but if I clean it up, it breaks the checkboxes; they're no longer able to be clicked. Its forcing React somehow to update asynchronously so it can use these updated values correctly, but thats the neat part, they aren't even being used anywhere else, but somehow the re-rendering page only accepts awaits. I've tried refactoring it 5 different ways to no avail

That's what makes truly bad code. Code so bad that you can't even fix it without breaking a thousand other things. Code that you have to carefully, slowly pick through and gently refactor, discovering all sorts of hidden side effects along the way. Code so bad that you actually have to live with it, at least for a while.


CodeSOD: All Docked Up

Aankhen has a peer who loves writing Python scripts to automate repetitive tasks. We'll call this person Ernest.

Ernest was pretty proud of some helpers he wrote to help him manage his Docker containers. For example, when he wanted to stop and remove all his running Docker containers, he wrote this script:

#!/usr/bin/env python
import subprocess

subprocess.run("docker kill $(docker ps -q)", shell=True)
subprocess.run("docker rm $(docker ps -a -q)", shell=True)

He aliased this script to docker-stop, so that with one command he could… run two.

"Ernest," Aankhen asked, "couldn't this just be a bash script?"

"I don't really know bash," Ernest replied. "If I just do it in bash, if the first command fails, the second command doesn't run."

Aankhen pointed out that you could make bash not do that, but Ernest replied: "Yeah, but I always forget to. This way, it handles errors!"

"It explicitly doesn't handle errors," Aankhen said.

"Exactly! I don't need to know when there are no containers to kill or remove."

"Okay, but why not use the Docker library for Python?"

"What, and make the software more complicated? This has no dependencies!"

Aankhen was left with a sinking feeling: Ernest was either the worst developer he was working with, or one of the best.
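
For what it's worth, Aankhen's suggestion wouldn't have been much more complicated. A version built on the Docker SDK for Python might look roughly like this; it's a sketch, not Ernest's actual script, and it assumes the docker package is installed and the Docker daemon is reachable:

#!/usr/bin/env python
import docker

client = docker.from_env()

# Kill anything currently running
for container in client.containers.list():
    container.kill()

# Then remove every container, running or stopped
for container in client.containers.list(all=True):
    container.remove()

One dependency, admittedly, but when there's nothing to kill or remove the loops simply do nothing, no reliance on swallowed errors required.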


CodeSOD: To Shutdown You Must First Shutdown

Every once in a while, we get a bit of terrible code, and our submitter also shares, "this isn't called anywhere," which is good, but also bad. Ernesto sends us a function which is called in only one place:

///
/// Shutdown server
///
private void shutdownServer()
{
    shutdownServer();
}

The "one place", obviously, is within itself. This is the Google Search definition of recursion, where each recursive call is just the original call, over and over again.

This is part of a C# service, and this method shuts down the server, presumably by triggering a stack overflow. Unless C# has added tail calls, anyway.


Anti-Simplification

Our anonymous submitter relates a tale of simplification gone bad. As this nightmare unfolds, imagine the scenario of a new developer coming aboard at this company. Imagine being the one who has to explain this setup to said newcomer.

Imagine being the newcomer who inherits it.

A "Storm P machine" - the Danish equivalent of a Rube Goldberg machine.

David's job should have been an easy one. His company's sales data was stored in a database, and every day the reporting system would query a SQL view to get the numbers for the daily key performance indicators (KPIs). Until the company's CTO, who was proudly self-taught, decided that SQL views are hard to maintain, and the system should get the data from one of those new-fangled APIs instead.

But how does one call an API? The reporting system didn't have that option, so the logical choice was Azure Data Factory to call the API, then output the data to a file that the reporting system could read. The only issue was that nobody on the team spoke Azure Data Factory, or for that matter SQL. But no problem, one of David's colleagues assured, they could do all the work in the best and most multifunctional language ever: C#.

But you can't just write C# in a data factory directly, that would be silly. What you can do is have the data factory pipeline call an Azure function, which calls a DLL compiled from the C# code. Oh, and a scheduler outside of the data factory to run the pipeline. To read multiple tables, the pipeline calls a separate function for each table. Each function would be based on a separate source project in C#, with 3 classes each for the HTTP header, content, and response; and a separate factory class for each of the actual classes.

After all, each table had a different set of columns, so you can't just re-use classes for that.

There was one little issue: the reporting system required an XML file, whereas the API would export data in JSON. It would be silly to expect a data factory, of all things, to convert this. So the CTO's solution was to have another C# program (in a DLL called by a function from a pipeline from an external scheduler) that reads the JSON document saved by the earlier program, uses foreach to go over each element, then saves the result as XML. A distinct program for each table, of course, requiring distinct classes for header, content, response, and factories thereof.

Now here's the genius part: to the C# class representing the output data, David's colleague decided to attach one different object for each input table required. The data class would use reflection to iterate over the attached objects, and for each object, use a big switch block to decide which source file to read. This allows the data class to perform joins and calculations before saving to XML.

To make testing easier, each calculation would be a separate function call. For example, calculating a customer's age was a function that took a CustomerWithBirthDate struct as input, used a foreach loop to copy all the data while replacing one field, and returned a CustomerWithAge struct to pass to the next function. The code performed a bit slowly, but that was an issue for a later year.

So basically, the scheduler calls the data factory, which calls a set of Azure functions, which call a C# function, which calls a set of factory classes to call the API and write the data to a text file. Then, the second scheduler calls a data factory, which calls Azure functions, which call C#, which calls reflection to check attachment classes, which read the text files, then call a series of functions for each join or calculation, then call another set of factory classes to write the data to an XML file, then call the reporting system to update.

Easy as pie, right? So where David's job could have been maintaining a couple hundred lines of SQL views, he instead inherited some 50,000 lines of heavily-duplicated C# code, where adding a new table to the process would easily take a month.

Or as the song goes, Somebody Told Me the User Provider should use an Adaptor to Proxy the Query Factory Builder ...


Error'd: That's What I Want

First up with the money quote, Peter G. remarks "Hi first_name euro euro euro, look how professional our marketing services are! "

"It takes real talent to mispell error" jokes Mike S. They must have done it on purpose.

I long wondered where the TikTok profits came from, and now I know. It's Daniel D. "I had issues with some incorrectly documented TikTok Commercial Content API endpoints. So I reached out to the support. I was delighted to know that it worked and my reference number was . PS: 7 days later I still have not been contacted by anyone from TikTok. You can see their support is also . "

Fortune favors the prepared, and Michael R. is very fortunate. "I know us Germans are known for planning ahead so enjoy the training on Friday, February 2nd 2029. "

Someone other than dragoncoder047 might have shared this earlier, but this time dragoncoder047 definitely did. "Digital Extremes (the developers of Warframe) were making many announcements of problems with the new update that rolled out today [February 11]. They didn’t mention this one!"


CodeSOD: Quadruple Negative

We mostly don't pick on bad SQL queries here, because mostly the query optimizer is going to fix whatever is wrong, and the sad reality is that databases are hard to change once they're running; especially legacy databases. But sometimes the code is just so hamster-bowling-backwards that it's worth looking into.

Jim J has been working on a codebase for about 18 months. It's a big, sprawling, messy project, and it has code like this:

AND CASE WHEN @c_usergroup = 50 AND NOT EXISTS(SELECT 1 FROM l_appl_client lac WHERE lac.f_application = fa.f_application AND lac.c_linktype = 840 AND lac.stat = 0 AND CASE WHEN ISNULL(lac.f_client,0) <> @f_client_user AND ISNULL(lac.f_c_f_client,0) <> @f_client_user THEN 0 ELSE 1 END = 1 ) THEN 0 ELSE 1 END = 1 -- 07.09.2022

We'll come back to what it's doing, but let's start with a little backstory.

This code is part of a two-tier application: all the logic lives in SQL Server stored procedures, and the UI is a PowerBuilder application. It's been under development for a long time, and in that time has accrued about a million lines of code between the front end and back end, and has never had more than 5 developers working on it at any given time. The backlog of feature requests is nearly as long as the backlog of bugs.

You may notice the little date comment in the code above. That's because until Jim joined the company, they used Visual Source Safe for version control. Visual Source Safe went out of support in 2005, and let's be honest: even when it was in support it barely worked as a source control system. And that's just the PowerBuilder side; the database side just didn't use source control. The source of truth was the database itself. When going from development to test to prod, you'd manually export object definitions and run the scripts in the target environment. Manually. Yes, even in production. And yes, environments did drift and assumptions made in the scripts would frequently break things.

You may also notice the fields above use a lot of Hungarian notation. Hungarian, in the best case, makes it harder to read and reason about your code. In this case, it's honestly fully obfuscatory. c_ stands for a codetable, f_ for entities. l_ is for a many-to-many linking table. z_ is for temporary tables. So is x_. And t_. Except not all of those "temporary" tables are truly temporary, a lesson Jim learned when trying to clean up some "junk" tables which were not actually junk.

I'll let Jim add some more detail around these prefixes:

an "application" may have a link to a "client", so there is an f_client field; but also it references an "agent" (which is also in the f_client table, surpise!) - this is how you get an f_c_f_client field. I have no clue why the prefix is f_c_ - but I also found c_c_c_channel and fc4_contact columns. The latter was a shorthand for f_c_f_c_f_c_f_contact, I guess.

"f_c_f_c_f_c_f_c" is also the sound I'd make if I saw this in a codebase I was responsible for. It certainly makes me want to change the c_c_c_channel.

With all this context, let's turn it back over to Jim to explain the code above:

And now, with all this background in mind, let's have a look at the logic in this condition. On the deepest level we check that both f_client and f_c_f_client are NOT equal to @f_client_user, and if this is the case, we return 0 which is NOT equal to 1 so it's effectively a negation of the condition. Then we check that records matching this condition do NOT EXIST, and when this is true - also return 0 negating the condition once more.

Honestly, the logic couldn't be clearer, when you put it that way. I jest, I've read that twelve times and I still don't understand what this is for or why it's here. I just want to know who we can prosecute for this disaster. The whole thing is a quadruple negative and frankly, I can't handle that kind of negativity.


CodeSOD: Repeating Your Existence

Today's snippet from Rich D is short and sweet, and admittedly, not the most TFs of WTFs out there. But it made me chuckle, and sometimes that's all we need. This Java snippet shows us how to delete a file:

if (Files.exists(filePath)) {
    Files.deleteIfExists(filePath);
}

If the file exists, then if it exists, delete it.

This commit was clearly submitted by the Department of Redundancy Department. One might be tempted to hypothesize that there's some race condition or something that they're trying to route around, but if they are, this isn't the way to do it, per the docs: "Consequently this method may not be atomic with respect to other file system operations." But also, I fail to see how this would do that anyway.

The only thing we can say for certain about using deleteIfExists instead of delete is that deleteIfExists will never throw a NoSuchFileException.


CodeSOD: Blocked Up

Agatha has inherited some Windows Forms code. This particular batch of such code falls into that delightful category of code that's wrong in multiple ways, multiple times. The task here is to disable a few panels worth of controls, based on a condition. Or, since this is in Spanish, "bloquear controles". Let's see how they did it.

private void BloquearControles()
{
	bool bolBloquear = SomeConditionTM; // SomeConditionTM = a bunch of stuff. Replaced for clarity.

	// Some code. Removed for clarity.
	
	// private System.Windows.Forms.Panel pnlPrincipal;
	foreach (Control C in this.pnlPrincipal.Controls)
	{
		if (C.GetType() == typeof(System.Windows.Forms.TextBox))
		{
			C.Enabled = bolBloquear;
		}
		if (C.GetType() == typeof(System.Windows.Forms.ComboBox))
		{
			C.Enabled = bolBloquear;
		}
		if (C.GetType() == typeof(System.Windows.Forms.CheckBox))
		{
			C.Enabled = bolBloquear;
		}
		if (C.GetType() == typeof(System.Windows.Forms.DateTimePicker))
		{
			C.Enabled = bolBloquear;
		}
		if (C.GetType() == typeof(System.Windows.Forms.NumericUpDown))
		{
			C.Enabled = bolBloquear;
		}
	}
	
	// private System.Windows.Forms.GroupBox grpProveedor;
	foreach (Control C1 in this.grpProveedor.Controls)
	{
		if (C1.GetType() == typeof(System.Windows.Forms.TextBox))
		{
			C1.Enabled = bolBloquear;
		}
		if (C1.GetType() == typeof(System.Windows.Forms.ComboBox))
		{
			C1.Enabled = bolBloquear;
		}
		if (C1.GetType() == typeof(System.Windows.Forms.CheckBox))
		{
			C1.Enabled = bolBloquear;
		}
		if (C1.GetType() == typeof(System.Windows.Forms.DateTimePicker))
		{
			C1.Enabled = bolBloquear;
		}
		if (C1.GetType() == typeof(System.Windows.Forms.NumericUpDown))
		{
			C1.Enabled = bolBloquear;
		}
	}

	// private System.Windows.Forms.GroupBox grpDescuentoGeneral;
	foreach (Control C2 in this.grpDescuentoGeneral.Controls)
	{
		if (C2.GetType() == typeof(System.Windows.Forms.TextBox))
		{
			C2.Enabled = bolBloquear;
		}
		if (C2.GetType() == typeof(System.Windows.Forms.ComboBox))
		{
			C2.Enabled = bolBloquear;
		}
		if (C2.GetType() == typeof(System.Windows.Forms.CheckBox))
		{
			C2.Enabled = bolBloquear;
		}
		if (C2.GetType() == typeof(System.Windows.Forms.DateTimePicker))
		{
			C2.Enabled = bolBloquear;
		}
		if (C2.GetType() == typeof(System.Windows.Forms.NumericUpDown))
		{
			C2.Enabled = bolBloquear;
		}
	}

	// Some more code. Removed for clarity.
}

This manages two group boxes and a panel. It checks a condition, then iterates across every control inside each container, and sets the Enabled property on each one. In order to do this, it checks the type of the control for some reason.

Now, a few things: every control inherits from the base Control class, which has an Enabled property, so we're not doing this check to make sure the property exists. And every built-in container control automatically passes its enabled/disabled state to its child controls. So there's a four-line version of this function where we just set the Enabled property on each container.

This leaves us with two possible explanations. The first, and most likely, is that the developer responsible just didn't understand how these controls worked, and how inheritance worked, and wrote this abomination as an expression of that ignorance. This is extremely plausible, extremely likely, and honestly, our best case scenario.

Because our worst-case scenario is that this code's job isn't to disable all of the controls. The reason they're doing type checking is that there are some controls used in these containers that don't match the types listed. The purpose of this code, then, is to disable some of the controls, leaving others enabled. Doing this by type would be a terrible way to manage that, and is endlessly confusing. Worse, I can't imagine how this behavior is interpreted by the end users, with the enabling and disabling of controls following no intuitive pattern, just a filter based on the kind of control in use.

The good news is that Agatha can point us towards the first option. She adds:

They decided to not only disable the child controls one by one but to check their type and only disable those five types, some of which aren't event present in the containers. And to make sure this was WTF-worthy the didn't even bother to use else-if so every type is checked for every child control

She also adds:

At this point I'm not going to bother commenting on the use of GetType() == typeof() instead of is to do the type checking.

Bad news, Agatha: you did bother commenting. And even if you didn't, don't worry, someone would have.


CodeSOD: Popping Off

Python is (in)famous for its "batteries included" approach to a standard library, but it's not that notable that it has plenty of standard data structures, like dicts. Nor is it surprising that dicts have all sorts of useful methods, like pop, which removes a key from the dict and returns its value.

Because you're here, reading this site, you'll also be unsurprised that this doesn't stop developers from re-implementing that built-in function, badly. Karen sends us this:

def parse_message(message):
    def pop(key):
        if key in data:
            result = data[key]
            del data[key]
            return result
        return ''

    data = json.loads(message)
    some_value = pop("some_key")
    # <snip>...multiple uses of pop()...</snip>

Here, they create an inner method that closes over the enclosing function's scope. While pop appears in the code before data is assigned, Python only resolves the name when pop is actually called, and by that point data exists. It isn't quite a global variable, but it's still a variable crossing between two scopes, which is always messy.

Also, this pop returns a default value, which is also something the built-in method can do. It's just that the built-in version requires you to explicitly pass the value, e.g.: some_value = data.pop("some_key", "")

Karen briefly wondered if this was a result of the Python 2 to 3 conversion, but no, pop has been part of dict for a long time. I wondered if this was just an exercise in code golf, writing a shorthand function, but even then- you could just wrap the built-in pop with your shorthand version (not that I'd recommend such a thing). No, I think the developer responsible simply didn't know the function was there, and just reimplemented a built-in method badly, as so often happens.
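
For comparison, here's roughly what the function looks like leaning on the built-in; this is just a sketch, with the some_key name and the empty-string default carried over from the snippet above as stand-ins:

import json

def parse_message(message):
    data = json.loads(message)
    # dict.pop removes the key if present, and falls back to the default otherwise
    some_value = data.pop("some_key", "")
    # ...the rest of the parsing...
    return some_value

No inner function, no variable sneaking across scopes, and the default value sits right there at the call site.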


Error'd: Perverse Perseveration

Pike pike pike pike Pike pike pike.

Lincoln KC repeated "I never knew Bank of America Bank of America Bank of America was among the major partners of Bank of America."

"Extra tokens, or just a stutter?" asks Joel "An errant alt-tab caused a needless google search, but thankfully Gemini's AI summary got straight-to-the-point(less) info. It is nice to see the world's supply of Oxford commas all in once place. "

Alessandro M. isn't the first one to call us out on our WTFs. "It’s adorable how the site proudly supports GitHub OAuth right up until the moment you actually try to use it. It’s like a door with a ‘Welcome’ sign that opens onto a brick wall." Meep meep.

Float follies found Daniel W. doubly-precise. "Had to go check on something in M365 Admin Center, and when I was on the OneDrive tab, I noticed Microsoft was calculating back past the bit. We're in quantum space at this point."

Weinliebhaber Michael R. sagt "Our German linguists here will spot the WTF immediately where my local wine shop has not. Weiẞer != WEIBER. Those words mean really different things." Is that 20 euro per kilo, or per the piece?


CodeSOD: The Counting Machine

Industrial machines are generally accompanied by "Human Machine Interfaces", HMIs. This is industrial slang for a little computerized box you use to control the industrial machine. All of the key logic, the core functionality, and especially the safety functionality are handled at a deeper layer in the system. The HMI is just buttons users can push to interact with the machine.

Purchasers of those pieces of industrial equipment often want to customize that user interface. They want to guide users away from functions they don't need, or make their specific workflow clear, or even just brand the UI. This means that the vendor needs to publish an API for their HMI.

Which brings us to Wendy. She works for a manufacturing company which wants to customize the HMI on a piece of industrial equipment in a factory. That means Wendy has been reading the docs and poking at the open-sourced portions of the code, and these raise more questions than they answer.

For example, the HMI's API provides its own set of collection types, in C#. We can wonder why they'd do such a thing, which is certainly a WTF in itself, but this representative line raises even more questions than that:

Int32 Count { get; set; }

What happens if you use the public set operation on the count of items in a collection? I don't know. Wendy doesn't either, as she writes:

I'm really tempted to set the count but I fear the consequences.

All I can hear in my head when I think about "setting the Count" is: "One! One null reference exception! Two! TWO null reference exceptions! HA HA HA HA!"
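
We have no idea what the vendor's setter actually does, but for contrast, here is a sketch of how a collection's count is conventionally exposed in C#; the type and element names are made up:

using System.Collections;
using System.Collections.Generic;

// A conventional collection derives Count from its backing storage and exposes it read-only.
public sealed class TagCollection : IReadOnlyCollection<string>
{
    private readonly List<string> _items = new List<string>();

    public int Count => _items.Count;   // get-only: there is no way to "set the count"

    public void Add(string item) => _items.Add(item);

    public IEnumerator<string> GetEnumerator() => _items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}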

[Image: Count von Count, via http://muppet.wikia.com/wiki/Count_von_Count]


CodeSOD: Safeguard Your Comments

I've had the misfortune of working in places which did source control via comments. Like one place which required that, with each section of code you changed, you add a comment with your name, the ticket number, and the reason the change was made. You know, the kind of thing you can just get from your source control service.

In their defense, that policy was invented for mainframe developers and then extended to everyone else, and their source control system was Visual SourceSafe. VSS was a) terrible, and b) a perennial destroyer of history, so maybe they weren't entirely wrong and VSS was the real WTF. I still hated it.

In any case, Alice's team uses more modern source control than that, which is why she's able to explain to us the story of this function:

public function calculateMassGrossPay(array $employees, Payroll $payroll): array
{
    // it shouldn't enter here, but if it does by any change, do nth
    return [];
}

Once upon a time, this function actually contained logic, a big pile of fairly complicated logic. Eventually, a different method was created which streamlined the functionality, but had a different signature and logic. All the callers were updated to use that method instead- by commenting out the line which called this one. This function had a comment added to the top: // it shouldn't enter here.

Then, the body of this function got commented out, and the return was turned into an empty array. The comment was expanded to what you see above. Then, eventually, the commented-out callers were all deleted. Years after that, the commented out body of this function was also deleted, leaving behind the skeleton you see here.

This function is not referenced anywhere else, not even in a comment. It's truly impossible for code to "enter here".

Alice writes: "Version control by commented out code does not work very well."

Indeed, it does not.


Representative Line: Years Go By

Henrik H's employer thought they could save money by hiring offshore, and save even more money by hiring offshore junior developers, and save even more money by basically not supervising them at all.

Henrik sends us just one representative line:

if (System.DateTime.Now.AddDays(-365) <= f.ReleaseDate) // 365 days means one year 

I appreciate the comment; that certainly "helps" explain the magic number. There's, of course, just one little problem: it's wrong. I mean, ~75% of the time, it works every time, but it happily disregards leap years. Which may or may not be a problem in this case, but if they got so far as learning about the AddDays method, they were inches from using AddYears.
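
The leap-year-safe version is barely any longer; f and ReleaseDate are taken from the original line, and assumed to be a DateTime:

// Same comparison, but letting the calendar handle leap years.
if (System.DateTime.Now.AddYears(-1) <= f.ReleaseDate)
{
    // released within the last year
}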

I guess it's true what they say: you can lead a dev to docs, but you can't make them think.


WTF: Home Edition

The utility closet Ellis had inherited and lived with for 17 years had been a cesspool of hazards to life and limb, a collection of tangible WTFs that had everyone asking an uncaring god, "What were they thinking?"

Every contractor who'd ever had to perform any amount of work in there had come away appalled. Many had even called over their buddies to come and see the stunning mess for themselves:

[Image: interior of a utility room showing a storage closet and a water heater closet. Bishop Creek Hydroelectric System, HAER CAL,14-BISH.V,7A-28]

  • All of the electrical components, dating from the 1980s, were scarily underpowered for what they were supposed to be powering.
  • To get to the circuit breaker box—which was unlabeled, of course—one had to contort oneself around a water heater almost as tall as Ellis herself.
  • As the house had no basement, the utility closet was on the first floor in an open house plan. A serious failure with said water heater would've sent 40 gallons (150 liters) of scalding-hot tsunami surging through the living room and kitchen.
  • The furnace's return air vent had been screwed into crumbling drywall, and only prayers held it in place. Should it have fallen off, it would never have been replaceable. And Ellis' cat would've darted right in there for the adventure of a lifetime.
  • To replace the furnace filter, Ellis had to put on work gloves, unscrew a sharp sheet-metal panel from the side of the furnace, pull the old filter out from behind a brick (the only thing holding it in place), manipulate the filter around a mess of water and natural gas pipes to get it out, thread the new filter in the same way, and then secure it in place with the brick before screwing the panel back on. Ellis always pretended to be an art thief in a museum, slipping priceless paintings around security-system lasers.
  • Between the water tank, furnace, water conditioning unit, fiber optical network terminal, and router, there was barely room to breathe, much less enough air to power ignition for the gas appliances. Some genius had solved this by cutting random holes in several walls to admit air from outside. One of these holes was at floor-level. Once, Ellis opened the closet door to find a huge puddle on the floor, making her fear her hot water heater was leaking. As it turned out, a power-washing service had come over earlier that day. When they'd power-washed the exterior of her home, some of that water shot straight through one of those holes she hadn't known about, giving her utility closet a bonus bath.
  • If air intake was a problem, venting the appliances' exhaust was an even worse issue. The sheet-metal vents had calcified and rusted over time. If left unaddressed, holes could've formed that would've leaked carbon monoxide into Ellis' house.

Considering all the above, plus the fact that the furnace and air conditioner were coming up on 20 years of service, Ellis couldn't put off corrective action any longer. Last week, over a span of 3 days, contractors came in to exorcise the demons:

  • Upgrading electricals that hadn't already been dealt with.
  • Replacing the hot water tank with a wall-mounted tankless heater.
  • Replacing the furnace and AC with a heat pump and backup furnace, controlled by a new thermostat.
  • Creating new pipes for intake and venting (no more reliance on indoor air for ignition).
  • Replacing the furnace return air vent with a sturdier one.
  • Putting a special hinged door on the side of the furnace, allowing the filter to be replaced in a matter of seconds (RIP furnace brick).

With that much work to be done, there were bound to be hiccups. For instance, when the Internet router was moved, an outage occurred: for no good reason, the optical network terminal refused to talk to Ellis' Wifi router after powering back up. A technician came out a couple days later, reset the Internet router, and everything was fine again.

All in all, it was an amazing and welcome transformation. As each new update came online, Ellis was gratefully satisfied. It seemed as though the demons were finally gone.

Unbeknownst to them all, there was one last vengeful spirit to quell, one final WTF that it was hell-bent on doling out.

It was late Friday afternoon. Despite the installers' best efforts, the new thermostat still wasn't communicating with the new heat pump. Given the timing, they couldn't contact the company rep to troubleshoot. However, the thermostat was properly communicating with the furnace. And so, Ellis was left with the furnace for the weekend. She was told not to mess with the thermostat at all except to adjust the set point as desired. They would follow back up with her on Monday.

For Ellis, that was perfectly fine. With the historically cold winter they'd been enduring in her neck of the woods, heat was all she cared about. She asked whom to contact in case of any issues, and was told to call the main number. With all that squared away, she looked forward to a couple of quiet, stress-free days before diving back into HVAC troubleshooting.

Everything was fine, until it wasn't. Around 11AM on Saturday, Ellis noticed that the thermostat displayed the word "Heating" while the furnace wasn't actually running. Maybe it was about to turn on? 15 minutes went by, then half an hour. Nothing had changed except for the temperature in her house steadily decreasing.

Panic set in at the thought of losing heat in her home indefinitely. That fell on top of a psyche that was already stressed out and emotionally exhausted from the last several days' effort. Struggling for calm, Ellis first tried calling that main number for help as directed. She noticed right away that it wasn't a real person on the other end asking for her personal information, but an AI agent. The agent informed her that the on-call technician had no availability that weekend. It would pencil her in for a service appointment on Monday. How did that sound?

"Not good enough!" Ellis cried. "I wanna speak to a representative!"

"I understand!" replied the blithe chatbot. "Hold on, let me transfer you!"

For a moment, Ellis was buoyed with hope. She'd gotten past the automated system. Soon, she'd be talking with a live person who might even be able to walk her through troubleshooting over the phone.

The new agent answered. Ellis began pouring her heart out—then stopped dead when she realized it was another AI agent, this time with a male voice instead of a female one. This one proceeded through nearly the same spiel as the first. It also scheduled her for a Monday service appointment even though the other chatbot had already claimed to have done so.

This was the first time an AI had ever pulled such a trick on Ellis. It was not a good time for it. Ellis hung up and called the only other person she could think to contact: her sales rep. When he didn't answer, she left a voicemail and texts: no heat all weekend was unacceptable. She would really appreciate a call back.

While playing the horrible waiting game, Ellis tried to think about what she could do to fix this. They had told her not to mess with the thermostat. Well, from what she could see, the thermostat was sending a signal to the furnace that the furnace wasn't responding to for whatever reason. It was time to look at the docs. Fortunately, the new furnace's manual was resting right on top of it. She spread it open on her kitchen table.

OK, Ellis thought, this newfangled furnace has an LED display which shows status codes. Her old furnace had lacked such a thing. Lemme find that.

Inside her newly remodeled utility closet, she located the blinking display, knelt, and spied the code: 1dL. Looking that up in the doc's troubleshooting section, she found ... Normal Operation. No action.

The furnace was OK, then? Now what?

Aside from documentation, another thing Ellis knew pretty well was tech support. She decided to break out the ol' turn-it-off-and-on-again. She shut off power to both the furnace and thermostat, waited a few minutes, then switched everything back on, crossing her fingers.

No change. The indoor temperature kept dropping.

Her phone rang: the sales rep. He connected her with the on-call technician for that weekend, who fortunately was able to arrive at her house within the hour.

One tiny thermostat adjustment later, and Ellis was enjoying a warm house once more.

What had happened?

This is where an understanding of heat pumps comes into play. In this configuration, the heat pump is used for cooling and for heating, unless the outside temperature gets very cold. At that point, the furnace kicks in, which is more efficient. (Technology Connections has some cool videos about this if you're curious.)

Everything had been running fine for Ellis while the temperatures had remained below freezing. The problem came when, for the first time in approximately 12 years, the temperature rose above 40F (4C). At that point, the new thermostat decided, without telling Ellis, I'm gonna tell the HEAT PUMP to heat the joint!

... which couldn't do anything just then.

Workaround: the on-call technician switched the thermostat to an emergency heat mode that used the furnace no matter what.

Ellis had been told not to goof around with the thermostat. Even if she had, as a heat pump neophyte, she wouldn't have known to go looking for such a setting. She might've dug it up in a manual. Someone could've walked her through it over the phone. Oh, well. There is heat again, which is all that matters.

They will attempt to bring the heat pump online soon. We shall see if the story ends here, or if this becomes The WTF That Wouldn't Die.

P.S. When Ellis explained the AI answering service's deceptive behavior, she was told that the agent had been universally complained about ever since they switched to it. Fed up, they told Ellis they're getting rid of it. She feels pretty chuffed about more people seeing the light concerning garbage AI that creates far more problems than it solves.


Error'd: Three Blinded Mice

...sent us five wtfs. And so on anon.

Item the first, an anon is "definitely not qualified" for this job. "These years of experience requirements are getting ridiculous."

Item the second unearthed by a farmanon has a loco logo. "After reading about the high quality spam emails which are indistinguishable from the company's emails, I got one from the spammer just starting his first day."

In thrid place, anon has only good things to say: "I'm liking their newsletter recommendations so far."

"A choice so noice, they gave it twoice," quipped somebody.

And foinally, a tdwtfer asks "I've seen this mixmastered calendering on several web sites. Is there an OSS package that is doing this? Or is it a Wordpress plugin?" I have a sneaking suspicion I posted this before. Call me on it.


Online Age Verification Tools for Child Safety Are Surveilling Adults

By: Nick Heer

Barbara Booth, CNBC:

Civil liberties’ advocates warn that concentrating large volumes of identity data among a small number of verification vendors can create attractive targets for hackers and government demands. Earlier this year, Discord disclosed a data breach that exposed ID images belonging to approximately 70,000 users through a compromised third-party service, highlighting the security risks associated with storing sensitive identity information.

[…]

According to Tandy, as more states adopt age-verification mandates and companies race to comply, the infrastructure behind those systems is likely to become a permanent fixture of online life. Taken together, industry leaders say the rapid spread of age-verification laws may push platforms toward systems that verify age once and reuse that credential across services.

The hurried implementation of age verification sounds fairly terrible, counterproductive, illegal in the U.S., and discriminatory, but we should not pretend that we are only now being subject to risky and overbearing surveillance on the web. The ecosystem powering behavioural ad targeting — including data brokers, the biggest of which have reported staggering data breaches for a decade — has all but ensured our behaviour on popular websites and in mobile apps is already tracked and tied to some proxy for our identity.

That is not an excuse for the poor implementation of age verification, nor justification for its existence. If anything, it is a condemnation of the current state of the web that this barely moves the needle on privacy. If I had to choose whether to compromise for commerce or for the children, it would be the latter, but the correct answer is, likely, neither.

⌥ Permalink

Grammarly’s ‘Expert Review’ Feature Presents Fake Advice in the Names of Real Journalists and Authors

By: Nick Heer

Casey Newton, Platformer:

On Friday I learned to my surprise that I had become an editor for Grammarly. The subscription-based writing assistant has introduced a feature named “expert review” that, in the company’s words, “is designed to take your writing to the next level — with insights from leading professionals, authors, and subject-matter experts.”

Read a little further, though, and you’ll learn that these “insights” are not actually “from” leading professionals, or any human person at all. Rather, they are AI-generated text, which may or may not reflect whichever “leading professional” Grammarly slapped their names on.

Miles Klee, Wired:

As advertised on a support page, Grammarly users can solicit tips from virtual versions of living writers and scholars such as Stephen King and Neil deGrasse Tyson (neither of whom responded to a request for comment) as well as the deceased, like the editor William Zinsser and astronomer Carl Sagan. Presumably, these different AI agents are trained on the oeuvres of the people they are meant to imitate, though the legality of this content-harvesting remains murky at best, and the subject of many, many copyright lawsuits.

I do not think a disclaimer explaining it does “not indicate any affiliation with Grammarly or endorsement by those individuals or entities” will sufficiently distance the company from its claim of providing “insights from leading professionals, authors, and subject-matter experts” attributed to the names of people who did not agree to participate in this. Apparently, it is incumbent upon them to opt out by emailing expertoptout@superhuman.com. Most people will obviously not do this — because why would anyone realize they need to opt out? — but especially those who are dead yet are still being called upon for their expertise. Let Carl Sagan rest.

⌥ Permalink

Apple Used to Design Its Laptops for Repairability

By: Nick Heer

Charlie Sorrel, of iFixit:

Apple’s MacBooks haven’t always been monolithic, barely repairable slabs of aluminum, glass, and glue. They used to be almost delightful in their repairable features, from their batteries to their Wi-Fi cards. Powerbooks, iBooks, and especially early MacBooks showed what happens when Apple applies its design skills directly to repairability and maintenance, instead of to thinness above all. Today we’re going to take a look at the best repairability features that Apple has ditched.

These four complaints range from the somewhat quaint — swappable Wi-Fi cards — to the stuff I actually miss, which is everything else. RAM and disk upgrades are a gimme since the cost-per-gigabyte (generally) declines over time, and I would love easily swappable batteries. But right now, nearly four years into owning this MacBook Pro, I would also really like to be able to swap in a new keyboard in the future. Not only are the keycaps unintentionally becoming polished, some oft-used keys feel a little mushy. Not much, and barely enough to notice, but I imagine their clickiness will not improve over time.

One quibble, emphasis mine:

[…] I have an old 2012 MacBook Air running Linux. I swapped the HDD for an SSD, maxed out the RAM, and dropped in a new battery, and I see no reason it wouldn’t easily keep rolling for another 10 years.

Unlikely. The 2012 MacBook Air only came with an SSD; a standard hard disk was not an option.

⌥ Permalink

Another Appearance Control Is Coming to Accessibility Settings in iOS 26.4

By: Nick Heer

Juli Clover, MacRumors:

Apple renamed the prior Reduce Highlighting Effects Accessibility setting to “Reduce Bright Effects,” and explained what it does.

Apple says the feature “minimizes highlighting and flashing when interacting with onscreen elements, such as buttons or the keyboard.”

In my testing, this does exactly what you would expect. In places like toolbar buttons — or the buttons in the area of what is left of a toolbar, anyhow — the passcode entry screen, and Control Centre, the glowing tap effects are minimized or removed.

I do not find those effects particularly distracting, and I think turning them off saps some of the life out of the Liquid Glass design language, but I can see why some would be bothered by them. It is not the case that iOS 26 would be better if none of these appearance controls were present, only that they should not be necessary.

⌥ Permalink

Minister for Innovation, Science, and Economic Development Announces ‘Guardrails’ for TikTok Canada Operations

By: Nick Heer

There are three agreed-upon policies which, in the airy language of a government press release, seem reasonable enough to apply to all social platforms, yet are only relevant to TikTok. The first is exceedingly vague:

TikTok will implement enhanced protection for Canadians’ personal information, including new security gateways and privacy-enhancing technologies to control access to Canadian user data in order to reduce the risk of unauthorized or prohibited access.

There are no details about what the “new security gateways and privacy-enhancing technologies” are, nor why the sole goal is preventing “prohibited access” rather than “exploitative access”.

The second — complying with the recommendations of the Privacy Commissioner — was already underway, and the third is an “independent third-party monitor”, which seems fine.

⌥ Permalink

Sponsor: Magic Lasso Adblock: Effortlessly Block Ads on Your iPhone, iPad, Mac, and Apple TV

By: Nick Heer

Do you want an all-in-one solution to block ads, trackers, and annoyances across all your Apple devices?

Then download Magic Lasso Adblock — the ad blocker designed for you.


With Magic Lasso Adblock you can effortlessly block ads on your iPhone, iPad, Mac, and Apple TV.

Magic Lasso is a single, native app that includes everything you need:

  • Safari Ad Blocking — Browse 2.0× faster in Safari by blocking all ads, with no annoying distractions or pop ups

  • YouTube Ad Blocking — Block all YouTube ads in Safari, including all video ads, banner ads, search ads, plus many more

  • App Ad Blocking — Block ads and trackers across the news, social media, and game apps on your device, including other browsers such as Chrome and Firefox

  • Apple TV Ad Blocking — Watch your favourite TV shows with less interruptions and protect your privacy from in-app ad tracking with Magic Lasso on your Apple TV

Best of all, with Magic Lasso Adblock, all ad blocking is done directly on your device, using a fast, efficient Swift-based architecture that follows our strict zero data collection policy.

With over 5,000 five star reviews, it’s simply the best ad blocker for your iPhone, iPad, Mac, and Apple TV.

And unlike some other ad blockers, Magic Lasso Adblock respects your privacy, doesn’t accept payment from advertisers, and is 100% supported by its community of users.

So, ensure your browsing history, app usage, and viewing habits stay private with Magic Lasso Adblock.

Join over 400,000 users and download Magic Lasso Adblock today.

⌥ Permalink

Mixing News Coverage and ‘Prediction Markets’ Is a Dangerous Gamble

By: Nick Heer

Nilay Patel and Liz Lopatto discussed “prediction markets” on the Verge’s “Decoder” podcast; here is Patel’s summary:

Insider trading is supposed to be illegal, and so is operating an unregulated sports book. So you’re now starting to see Kalshi and Polymarket getting hit from both sides of this broader regulatory debate, and 2026 is shaping up to be the year that all of this really comes to a head. To what end? It’s hard to say, especially as these companies cozy up to the Trump administration.

But it’s also becoming increasingly untenable for prediction markets to sit in the middle of the tension between gambling on the news and trying to self-regulate such that they don’t encourage insider trading.

A little under a month after Gallup announced it would stop polling for presidential approval, the Associated Press said it would begin integrating Kalshi bets into its election coverage. As Patel and Lopatto say, however, election betting is among the least problematic news gambling.

⌥ Permalink

The Central Lie of Prediction Markets

By: Nick Heer

Charlie Warzel, the Atlantic (gift link):

Prediction markets claim to harness the wisdom of crowds to provide reliable public data: Because people are putting real money behind their opinions, they are expressing what they actually believe is most likely to happen, which, according to the reasoning of these platforms, means that events will unfold accordingly. Many news organizations, and Substack, now have partnerships with prediction markets — the subtext being that they provide some kind of news-gathering function. Some users who distrust mainstream media turn to the markets in place of traditional journalism.

But in reality, prediction markets produce the opposite of accurate, unbiased information. They encourage anyone with an informational edge to use their knowledge for personal financial gain. In this way, prediction markets are the perfect technology for a low-trust society, simultaneously exploiting and reifying an environment in which believing the motives behind any person or action becomes harder.

I had no idea so-called “prediction markets” like Kalshi and Polymarket were promoting themselves as forecasters of real information, let alone that anyone believed them. I always assumed “prediction markets” was a euphemism.

A spokesperson for Kalshi told Warzel that betting on current events is a way to “create accurate, unbiased forecasts”, and that is something we can verify. If this were true, bettors should have been able to forecast, for example, the popular vote split of the 2024 U.S. presidential election. Polls had Harris and Trump neck and neck, but on election day, 75.8% of Kalshi bettors believed Harris would prevail. There is not much granularity to Kalshi’s charts, but the forecast on Polymarket was favourable to Harris at 5:00 PM on November 5 — election day — and it flipped to a Trump lead at the next available data point, 5:00 AM the following day, well after it was obvious Trump had won the popular vote.

This is just a way to gamble on current events, which is tragic and pathetic. We do not need to pretend these sites are anything more substantial than that.

⌥ Permalink

Google and Epic Games Announce Settlement

By: Nick Heer

Sameer Samat, Google’s president of Android Ecosystem:

Today we are announcing substantial updates that evolve our business model and build on our long history of openness globally.  We’re doing that in three ways: more billing options, a program for registered app stores, and lower fees and new programs for developers.

Epic Games CEO Tim Sweeney on X:

Google is opening up Android all the way with robust support for competing stores, competing payments, and a better deal for all developers. So, we’ve settled all of our disputes worldwide. THANKS GOOGLE!

Simon Sharwood, the Register:

Epic Games approved of the changes.

“These changes will evolve Android into a true open platform with competition among stores,” the company stated. “Globally, developers will have choices in how they make payments using Google Play’s payment system and competing payment systems, with reduced fees and the ability to point users outside apps to make purchases.”

Epic also said “Google will take steps to support the future open metaverse,” a probable reference to the deal that will see games made with the Unity engine made available within Fortnite.

Neither Sweeney nor Epic Games can express anything less than elation with this outcome, in no small part because they signed away their ability to do that. It still amazes me that this concession ended up in the final agreement. It seems like the kind of thing that Google’s very expensive lawyers would pitch as leverage with Epic Games’ not-quite-as-expensive lawyers. Judging by an interview with Dean Takahashi, of GamesBeat, it seems Epic was eager to settle with terms that apply worldwide:

Asked why Sweeney decided to settle rather than litigate in every court in the world, he said, “This is just a really important thing that people should understand. The Epic versus Google court decision in the United States only has effect in the United States. It does nothing about the rest of the world. And the United States is about 30% of Google Play revenue and about 5% of Google Play users.”

He said it was never going to be a complete worldwide solution, and the court, throughout the proceedings, very clearly, said that the court wanted to establish competition among stores and competition among payments without setting prices in the market.

Curiously, not long before this settlement, Google announced it would begin requiring Android developers to be verified for their software to be installable, even by side-loading. I wonder whether the combination of these changes meaningfully impacts users’ security or privacy. At a glance, the changes that settled this lawsuit seem like a welcome set of improvements, even though this was assuredly not an altruistic fight by Epic Games and will probably result in Sweeney getting even richer.

Regardless, it is notable that these sweeping changes will be brought to Android phones worldwide in the coming years, while Apple’s App Store is a patchwork of region-specific policies difficult for developers to navigate. It is too bad there is not really competition between these stores. Most people who buy smartphones choose the platform as a whole and accept whatever software experience they are provided. They do not need to bother themselves with the business terms of each store. The improvements to third-party stores on Android, though, set up the possibility of greater competition within that platform. Apple should do the same.

⌥ Permalink

Apple’s New Studio Displays, Plural

By: Nick Heer

In hardware terms alone, Apple has been delivering an incredible run of Macs arguably since 2020, and easily since 2021. There are quibbles, sure — the display notch still bugs some people, the keyboard material wears poorly, and repairability has declined — but these are, overall, pretty sweet machines. The Macs announced this week seem like they will continue that hot streak.

I happen to be in the market for a new Mac, perhaps this year, and I should be spoiled for choice. I kind of am — the Mac Mini and Mac Studio are both alluring. But I am sadly attached to the room offered by my beloved 27-inch iMac, and Apple’s new lineup of displays is a sore point.

Stephen Hackett, 512 Pixels:

Yes, those are two different products, but they both feature 27-inch, 5K displays in the same enclosure as the previous Studio Display.

Starting at $1599, the new Studio Display is a slight upgrade to the 2022 model.

[…]

The much more interesting of the pair is the $3299 Studio Display XDR.

Those prices are, respectively, $2,100 and $4,500 in Canada. I am not a stranger to spending a lot of money on a screen — I bought a Thunderbolt Display at $1,000 — but that is a lot of money for even the basest of base models, especially since I have no idea whether the sketchy firmware issues have been resolved.

It is not that these displays are bad — far from it — but it is extraordinary that we are ten years removed from 27-inch Retina iMacs that started at just $200 more than the Studio Display is today. Only recently are we seeing more choice in 27-inch 5K displays at considerably lower prices, though without Apple’s very nice stand and quality of materials. At least the XDR has a seemingly new panel.

Three of the seven models in the Mac lineup require an external display. Apple has two choices: one really advanced one that costs as much as a generously-specced Mac Studio, and another that feels like it is stumbling along.

Anyway, here I go again looking for a sick deal I will not find on a Pro Display XDR. Those things really hold their value. Pity.

⌥ Permalink

Tim Sweeney Is Contractually Prohibited from Criticizing Google’s Developer Terms for Years

By: Nick Heer

Sean Hollister, the Verge:

But Google has finally muzzled Tim Sweeney. It’s right there in a binding term sheet for his settlement with Google.

On March 3rd, he not only signed away Epic’s rights to sue and disparage the company, he signed away his right to advocate for any further changes to Google’s app store policies. He can’t criticize Google’s app store practices. In fact, he has to praise them.

The terms (PDF) helpfully clarify that Epic is still allowed to “advocat[e] changes to the policies or practices of […] other companies, including Apple”. This does not mean future criticism of Apple’s business practices — or past criticism, for that matter — is unwarranted or invalid, but it now carries the blunted quality of someone who is not allowed to make the same complaints about Google.

⌥ Permalink

A Toolkit for Hacking iPhones, Possibly Created for the U.S. Government, Has Leaked

By: Nick Heer

Google’s Threat Intelligence Group:

Google Threat Intelligence Group (GTIG) has identified a new and powerful exploit kit targeting Apple iPhone models running iOS version 13.0 (released in September 2019) up to version 17.2.1 (released in December 2023). The exploit kit, named “Coruna” by its developers, contained five full iOS exploit chains and a total of 23 exploits. The core technical value of this exploit kit lies in its comprehensive collection of iOS exploits, with the most advanced ones using non-public exploitation techniques and mitigation bypasses.

The Coruna exploit kit provides another example of how sophisticated capabilities proliferate. Over the course of 2025, GTIG tracked its use in highly targeted operations initially conducted by a customer of a surveillance vendor, then observed its deployment in watering hole attacks targeting Ukrainian users by UNC6353, a suspected Russian espionage group. We then retrieved the complete exploit kit when it was later used in broad-scale campaigns by UNC6691, a financially motivated threat actor operating from China. […]

Andy Greenberg, Wired:

Conspicuously absent from Google’s report is any mention of who the original surveillance company “customer” that deployed Coruna may have been. But the mobile security company iVerify, which also analyzed a version of Coruna it obtained from one of the infected Chinese sites, suggests the code may well have started life as a hacking kit built for or purchased by the US government. Google and iVerify both note that Coruna contains multiple components previously used in a hacking operation known as “Triangulation” that was discovered targeting Russian cybersecurity firm Kaspersky in 2023, which the Russian government claimed was the work of the NSA. (The US government didn’t respond to Russia’s claim.)

I am so curious to know how this thing made it outside the carefully guarded digital walls of the U.S. government or a contractor. While a rare event, it is not the first time the classified weapons of espionage have become public.

⌥ Permalink

U.S. Immigration Police Bought Real-Time Ad Bidding Data for Automated Tracking System

By: Nick Heer

Joseph Cox, 404 Media:

Customs and Border Protection (CBP) bought data from the online advertising ecosystem to track peoples’ precise movements over time, in a process that often involves siphoning data from ordinary apps like video games, dating services, and fitness trackers, according to an internal Department of Homeland Security (DHS) document obtained by 404 Media.

[…]

Although CBP described the move as a pilot, the DHS Office of the Inspector General (OIG) later found both CBP and ICE did not limit themselves to non-operational use. The OIG found that CBP, ICE, and the Secret Service all illegally used the smartphone location data, and found a CBP official used the data to track coworkers with no investigative purpose. CBP and ICE went on to repeatedly purchase access to location data.

There are people out there who will insist, to this day, that behaviourally targeted advertising is not actually a mechanism for surveillance despite all the evidence showing it is, in fact, an essential component.

⌥ Permalink

Annotators in Kenya Describe How They Review Sensitive Data Captured by Meta’s Ray-Bans

By: Nick Heer

Naipanoi Lepapa, Ahmed Abdigadir, and Julia Lindblom, Svenska Dagbladet:

The workers in Kenya say that it feels uncomfortable to go to work. They tell us about deeply private video clips, which appear to come straight out of Western homes, from people who use the glasses in their everyday lives.

Several describe video material showing bathroom visits, sex and other intimate moments.

Another worker talks about people coming out of bathrooms.

It is appalling that massively rich corporations like Meta continue to offload critical tasks like these onto people who receive little support or pay. I recently finished “Ghost Work” by Mary L. Gray and Siddharth Suri; while it is not my favourite book, nor does it surface anything conceptually new, it is worth your time. Meta can and should be doing far better, but it can avoid association with labour atrocities more easily than, say, Nike in the 1990s, in part because I doubt most people think too much about human intervention in artificial intelligence. Meta does not celebrate the hard work of its contract labour in Kenya; it does not even acknowledge them.

Speaking of not acknowledging the human labour involved, this story is the obvious nightmare you would expect. Some of these incidents of sensitive video recordings appear to be accidental, while others are seemingly deliberate. Without excusing the people who seem to be recording creepy videos on purpose, I assume few people would have believed it would be seen by someone at a company they probably have not heard about.

At first glance, it appears that we have significant control over our data. It states that voice recordings may only be saved and used for improvement or training of other Meta products if the user actively agrees.

But for the AI assistant to function, voice, text, image and sometimes video must be processed and may be shared onwards. This data processing is done automatically and cannot be turned off.

This is the kind of thing I would expect would be bundled into the additional diagnostic information Meta asks if you would like to opt into sharing. But Meta says this “does not include the photos and videos captured by your glasses”. That is, as this investigation found, part of the mandatory data collection.

This is offensive on behalf of users who might be less likely to consent if they had this full information. But it is also offensive to their romantic partners, friends, acquaintances, and passers-by, none of whom agreed to have their image or conversations adjudicated by these contractors.

⌥ Permalink

⌥ The Window Chrome of Our Discontent

By: Nick Heer

In a WWDC 2011 session, Dan Schimpf explained that some of the goals of the refreshed design for Aqua in Mac OS X Lion were “meant to focus the user attention on the active window content”. This sentiment was echoed by John Siracusa in his review of Lion for Ars Technica:

Apple says that its goal with the Lion user interface was to highlight content by de-emphasizing the surrounding user interface elements.

When Apple redesigned Mac OS X again in 2014 with Yosemite, it promised…

[…] a fresh modern look where controls are clearer, smarter and easier to understand, and streamlined toolbars put the focus on your content without compromising functionality.

Then, when it revealed the Big Sur redesign in 2020, it explained:

The entire experience feels more focused, fresh, and familiar, reducing visual complexity and bringing users’ content front and centre.

And you will never guess what it promised in 2025 with the announcement of MacOS Tahoe and Liquid Glass, as introduced by Alan Dye:

Our goal is a beautiful new design that brings joy and delight to every user experience. One that’s more personal, and puts greater focus on your content — all while still feeling instantly familiar.

It is not just Apple, either. Here is Microsoft’s Jensen Harris at Build 2011 describing a key goal for the company’s then-new Metro design language:

Metro-style apps have room to breathe. They’re not about the chrome, they’re about the content. […] For years, Windows was always about adding stuff. We added bars, and panes, and doodads, and widgets, and gadgets, and bars — and stuff everywhere. And that’s how we defined our U.I., based on what new widgets we added. Now, we’ve receded into the background, and the app is sitting out there on the stage.

And later, as Microsoft rolled out app updates with its Fluent Design language, it described them in familiar terms:

With the updated OneDrive, your content takes center stage. The improved visual design reduces clutter and distractions, allowing you to focus on what’s important – your content.

This is a laudable goal if the opposite is, I assume, increasing the amount of clutter in user interfaces and making them more distracting. Nobody wants that. Then again, while the objective may be quite reasonable, there are surely different ways of achieving it — but Apple has embraced a single strategy: make the interface blend into the document. (I will be focusing on MacOS here as it is the platform I am most familiar with.)

Here is what a Pages document looks like running under Mac OS X Lion:

[Screenshot: Pages running under Mac OS X Lion]

Here is that same document in a newer version of Pages running on MacOS Catalina, with the Yosemite-era design language that replaced the one that came before:

[Screenshot: Pages running under MacOS Catalina]

Here it is in the last version of Pages on MacOS Tahoe, using the design language introduced with Big Sur:

[Screenshot: Pages running under MacOS Tahoe]

And, finally, the newest version of Pages on MacOS Tahoe using the current Liquid Glass visual design language:

[Screenshot: Pages running under MacOS Tahoe]

There are welcome improvements in newer versions of this comparison, like the introduction of the “Format” panel on the right-hand side, which makes better use of widescreen landscape-oriented displays, and allows for larger controls. While I admire the density of the Lion-era screenshot, the mini-sized controls in that formatting menu are harder to click.1

Overall, however, what Apple has done to Pages over this period of time is representative of a broader trend of minimizing the delineation of user interface elements from each other and the document itself. This is the only tool in the toolbox, and I am skeptical it achieves what Apple intends.

Compare again the two more recent screenshots against the ones that came before, and focus on the toolbar at the top of each. In the older two, there is a well-defined separation between the toolbar — the window itself — and the document. In the Big Sur visual language, however, the toolbar is the same bright white as the document. By Tahoe and the Liquid Glass language, there is barely a distinction; the buttons simply float over the document. And, bizarrely, that degrades further with the “Reduce Transparency” accessibility preference enabled:

[Screenshot: Pages running under MacOS Tahoe with the Reduce Transparency setting enabled]

(Also, no, your eyes do not deceive you: the icons in the drop cap menu, barely visible in the lower-right, are indeed pixellated.)

For me, this means a constant distraction from my document because the whole window has a similar visual language. As the toolbar and its buttons become one with the document, they lose their ability to fade into the background. In the two older examples, the contrast of the well-defined toolbar allows me to treat them as an entirely separate thing I do not need to pay attention to.

This is further justified by the lower contrast within those two older toolbars. In Lion, the grey background and moderately saturated colours are a quiet reminder of tools that are available without them being intrusive. The mix of shapes is a sufficient differentiator, something Apple threw away in the following screenshot. By making all the buttons literal and with the same bright background, the toolbar becomes a little more distracting — but at least it does not blend into the document. Without the context of the previous screenshot, the colours of each icon seem almost random, and I find the yellow-on-white “Table” button difficult to distinguish at a glance from the black-on-yellow-on-white “Comment” button.

The Big Sur-era design language is, frankly, an atrocious regression. The heterogeneous shapes may have returned, but in the form of monochromatic medium-grey icons set against a uniform white background. The icons are not bad, per se — though putting “Add Page” and “Insert” next to each other in this default toolbar layout, both represented by a plus sign, is a little confusing. But I will bet you would not guess that some of these are buttons, while others are pop-up buttons with a submenu.

Finally, there is Liquid Glass, which, in its default form, has more contrast than the previous example; with “Reduce Transparency” enabled, which is how I use MacOS, it has even less. The buttons themselves have a greater amount of internal contrast, with bigger, darker grey icons on a white background. This is preferable within the context of the toolbar compared to the thin, small, and low-contrast buttons in the previous example, but it also means this toolbar has similar contrast to the document itself.

I would not go so far as to argue that Pages ’09 has a perfect user interface and that everything since has been a regression. The average colours used for the icon fill in both older toolbars generally fail accessibility contrast checks which, remarkably, the Big Sur design will pass. The icons in Pages ’09 rely on dark outlines and unique shapes to have sufficient contrast with the toolbar background. However, Apple has since discarded most variables it could change to design these interfaces. Every button contains an icon of a single uniform colour, within barely defined holding containers of the same shape, and without text labels by default.

This monochromatic look means any splash of colour is distracting. The yellow accent used in Pages is garish — though, thankfully, something that can mostly be mitigated by changing the Theme Colour in System Settings, under Appearance. (Unfortunately, the yellow background remains on the “Update” button in the most recent version of Pages regardless of the system accent colour.) But perhaps you also noticed the purple icon in the Liquid Glass screenshot above. Here is the full toolbar:

[Screenshot: the full Pages toolbar, with monochromatic dark grey icons except for a few purple ones]

Those purple icons signify features that are part of Apple Creator Studio, a paid subscription to Pages and other applications that allows you to — in the order they are presented above — generate an image, artificially boost the resolution of an image, and access a stock image library. If you would like to insert one of your own images into your Pages document, that feature has been moved to the paperclip icon. Yes, it is a menu and not a button, despite lacking the disclosure triangle of the zoom menu right beside it, and it also reminds you about the “Content Hub” and “Generate Image” features. In Pages under Lion, colour was used in the icons to help guide the user as they complete a task — click the green thing to add a shape; click the darker yellow thing to add a table. Colour is not being used in the newer version to signify these are A.I. features, as the “Writing Tools” icon remains dark grey. In this version, the coloured icons are there to guide the user to premium add-ons regardless of whether they are currently paying for them.

I decided to focus on Pages for this comparison because it has lived so many different lives in MacOS. However, it is perhaps an imperfect representation for the rest of the system. Across Mac OS X Lion, for example, the toolbars of first-party applications like Finder and Preview almost exclusively use monochromatic icons. This has been true since Mac OS X Leopard, which also introduced barely differentiated folder icons. Some toolbars in Tiger, introduced two years prior, featured icons inside uniform capsule shapes. These were questionable ideas at the time, but they still retained defining characteristics. The capsules, for example, may have had a uniform shape, but contained within were full-colour icons. Most importantly, they were all clearly controls that were differentiated from the document.

Perhaps Apple has some user studies that suggest otherwise, but I cannot see how dialling back the lines between interface and document is supposed to be beneficial for the user. It does not, in my use, result in less distraction while I am working in these apps. In fact, it often does the opposite. I do not think the prescription is rolling back to a decade-old design language. However, I think Apple should consider exploring the wealth of variables it can change to differentiate tools within toolbars, and to more clearly delineate window chrome from document.


  1. These screenshots are a bit limited as, to capture a high-resolution interface, I switched my mid-2012 MacBook Air to a 720 × 450 display output, which shrank the available space for Pages in the Lion and Catalina screenshots. ↥︎

Software Quality Postscript and Clarification

By: Nick Heer

I have a document open in BBEdit right now named “2025-06-22 – MacOS SaaS.markdown”. I started drafting this thing last year about how Apple has transitioned its operating systems to something closer to a software-as-a-service model. I was trying to describe how the difference between major versions has become generally more modest since many features are rolled out across the year, and how — particularly on Apple’s non-Mac platforms — updates are more-or-less forced since the company stops digitally certifying older versions.

It is not a perfect comparison and not quite a fully-developed idea — note the difference between the filename and the last sentence above — but I thought it was going somewhere. Of course, you had no idea about this because I never published, which is why it must have seemed strange when I dropped a reference to software-as-a-service in the middle of my piece about software quality:

There was a time when remaining on an older major version of an operating system or some piece of software meant you traded the excitement of new features for the predictability of stability. That trade-off no longer exists; software-as-a-service means an older version is just old, not necessarily more reliable.

Riccardo Mori was understandably confused by this:

[…] I very much enjoy using older Mac OS versions, but not being able to browse the Web properly and securely, not being able to correctly sign in to check a Gmail account, not being able to fetch some RSS feeds because you can’t authenticate securely or establish a secure connection is very frustrating. Not having Dropbox work on my 2009 MacBook Pro running OS X 10.11 El Capitan is a minor annoyance and means I just won’t have access to certain personal files and that I’ll have to sync manually whatever I do on this other machine.

But if I put these two factors aside, there’s nothing about those older Macs, nothing about the older Mac OS versions they run that makes them less reliable. […]

What Mori explains as this paragraph continues is what I had meant to write at the time. What I should have written was this (emphasis mine):

There was a time when remaining on an older major version of an operating system or some piece of software meant you traded the excitement of new features for the predictability of stability. That trade-off no longer exists; an operating system on a software-as-a-service treadmill means an older version is just old, not necessarily more reliable.

The cycle of having a major new version ready to preview by June and shipping in September means the amount of time Apple spends focusing on the current version must necessarily shrink. How many teams at the company do you suppose are, right now, working on MacOS 26 when WWDC is a little over three months away? Engineering efforts are undoubtedly beginning to prioritize MacOS 27. There are new features to prepare, after all.

So, yes, what Mori writes is what I was trying to express. I wish I had given that sentence a little more thought. Do read Mori’s piece — the second part, “On Software Frugality”, is thought-provoking.

⌥ Permalink

The Perfect Music App

By: Nick Heer

Jon Hicks, last year:

Music apps leave me wanting.

While I collect albums both physically (Vinyl + CD) and digitally (from Bandcamp), there are still missing pieces that streaming services provide: discovering new music, sharing playlists and seeing what friends are playing so that I can try their recommendations. They’re a valuable part of my listening habits, but none of them feel like ‘the one’. […]

I only stumbled across this today, but it remains a wonderful encapsulation of the state of music apps today. I share Hicks’ criteria, though I would add three things for myself:

  1. More expansive metadata. I would like genres that work more like tags. An artist may generally make records in one genre, but different albums have different influences. Even individual songs may considerably differ in sound and style. This is the kind of thing that would help me make playlists or find songs that sound better together.

    This would be a management challenge across the tens of thousands of songs in my library, but I feel like integration with RateYourMusic and other databases might help partially automate this.

  2. iPhone syncing over a wire. One of Hicks’ criteria is streaming and local library in the same app, and I completely agree. But I do not want anything — especially iPhone syncing — to be predicated on an assumption I have Apple’s first-party iCloud Music stuff turned on.

  3. No lock-in. I want to be able to point it at my existing library and for things to just work. I would like to be able to import my entire setup from Music — all my playlists, including smart playlists, plus all my stats and ratings — and I would like it to be stored in a format some other application could read if I ever need to move to a different client in the future.

There are many indie apps that get close to this. I checked out Radiccio recently, but it unfortunately does not work with the iMac on which my music library is stored. Maybe that is the fourth criterion: backwards compatibility as far as possible.

Nobody has ever said I am easy to please.

⌥ Permalink

The Political Effects of Twitter’s Feed Algorithm

By: Nick Heer

Germain Gauthier, et al., in a recent peer-reviewed paper in Nature:

Feed algorithms are widely suspected to influence political attitudes. However, previous evidence from switching off the algorithm on Meta platforms found no political effects. Here we present results from a 2023 field experiment on Elon Musk’s platform X shedding light on this puzzle. We assigned active US-based users randomly to either an algorithmic or a chronological feed for 7 weeks, measuring political attitudes and online behaviour. Switching from a chronological to an algorithmic feed increased engagement and shifted political opinion towards more conservative positions, particularly regarding policy priorities, perceptions of criminal investigations into Donald Trump and views on the war in Ukraine. In contrast, switching from the algorithmic to the chronological feed had no comparable effects. Neither switching the algorithm on nor switching it off significantly affected affective polarization or self-reported partisanship. […]

One can be pedantic about the use of “algorithmic” and “the algorithm” to describe a particular set of rules for recommending tweets, given that you could also say a reverse-chronological timeline is its own kind of algorithm. A simple one, to be sure, but an algorithm. I will not quibble with this.

Here is one thing I will be pedantic about, though: this study is not an examination of the “political effects of X’s feed algorithm”, as the title of the study suggests. It was conducted in 2023 — just a little bit after Elon Musk bought the platform and when it was still named Twitter. That is a long time ago in online platform terms, and the recommendations engine has probably changed a lot since — but almost certainly not in the direction of political even-handedness — even though the GitHub commit log suggests it has not been updated.

This study’s design seems better to me than a report published shortly after the 2024 U.S. presidential election, which I found limited and unconvincing.

There should always be a way for users to set a reverse-chronological timeline, and to opt out of recommendations features. We should be suspicious of any platform that refuses to trust us with control over our own experience.

⌥ Permalink

It Sure Looks to Me Like Meta Is Winding Down Its V.R. Efforts

By: Nick Heer

Samantha Ryan, “VP of Content” at Meta’s Reality Labs:

We’ve recently made some pretty big changes, including right-sizing our Reality Labs investment to ensure that our efforts remain sustainable over time. We’ve been in this space for over a decade, and we aren’t going anywhere. We’re in it for the long haul.

By “right-sizing”, Ryan means laying off ten percent of the Reality Labs workforce, and pouring money into the Ray-Ban partnership instead of metaverse initiatives. By “in it for the long haul”, Ryan means shifting the definition of the “metaverse” to meet Mark Zuckerberg’s latest obsession. They did not whiff by renaming the entire company around a crappy update to Second Life; you just are not getting it.

Ryan:

Our goal remains constant: to empower developers and creators as they build long-term, sustainable businesses. We used to have a pretty well-defined audience for VR, but as we’ve grown, we’ve attracted new audiences — who want different things — and the onus is on us to make sure that each of these distinct groups can find the apps and games that appeal to them.

That’s why we’re changing our roadmaps to increase your chances for success. We’re explicitly separating our Quest VR platform from our Worlds platform in order to create more space for both products to grow. We’re doubling down on the VR developer ecosystem while shifting the focus of Worlds to be almost exclusively mobile. By breaking things down into two distinct platforms, we’ll be better able to clearly focus on each.

Meta can say it is “doubling down on the V.R. developer ecosystem” all it wants, but it announced in January it would be shutting down its work-focused V.R. app with only a month’s notice, and it has cancelled third-party headsets. Now, it is saying Horizon Worlds is basically a phone app. Last February, Andrew Bosworth wrote in a memo about the importance of this very strategy:

[…] And Horizon Worlds on mobile absolutely has to break out for our long term plans to have a chance. […]

As I write this, Meta Horizon is the fifty-seventh most popular free game in the Canadian App Store, just two spots behind Hole.io, “the most addictive black hole game”. Maybe people do not, in general, want to wear a computer on their entire head — not for the thousands of dollars Apple is charging, and not for the hundreds Meta is.

⌥ Permalink

Personal (Computer) Assistants

By: Nick Heer

Omar Shahine:

For years, I’ve wanted a personal assistant. Someone who knows my preferences, manages my inbox, tracks my packages, and helps my family stay organized. The problem? Good assistants are expensive, require training, and still need constant direction.

So I built one. His name is Lobster. 🦞

The key insight that made this work wasn’t technical—it was conceptual. I stopped thinking “AI chatbot” and started thinking “new hire.”

I think this analogy is downright perfect.

When I first read this piece, my mind started to spin with all the things I could offload to my own digital personal assistant. Imagine how much time I could save by… wait. What could I use it for? Shahine says it helps summarize recent emails, figure out travel details, find event tickets, and more, all through iMessage conversations. This is a remarkable technical achievement. But what it drove home for me is how little I could ultimately relate to the scenarios presented by Shahine, even as I am trying to plan dinner with friends and a couple of trips later in the year.

Perhaps the same is true for you, too. Take a moment and think about what tasks you would give a personal assistant that can only work through software. Is it a long list? Is delegating checking your email saving you time? If you automate your vacation planning, does it make you happier than figuring that out alongside your partner or family? I am not saying Shahine is wrong or misguided. I just cannot see my life in this, and I do not think I am alone.

⌥ Permalink

The story of one of my worst programming failures

By: cks

Somewhat recently, GeePaw Hill shared the story of what he called his most humiliating experience as a skilled and successful computer programmer. It's an excellent, entertaining story with a lesson for all of us, so I urge you to read it. Today I'm going to tell the story of one of my great failures, where I may have quietly killed part of a professor's research project by developing on a too-small machine.

Once upon a time, back when I was an (advanced) undergraduate, I was hired as a part time research programmer for a Systems professor to work on one of their projects, at first with a new graduate student and then later alone (partly because the graduate student switched from Systems to HCI). One of this professor's research areas was understanding and analyzing disk IO patterns (a significant research area at the time), and my work was to add detailed IO tracing to the Ultrix kernel. Some of this was porting work the professor had done with the 4.x BSD kernel (while a graduate student and postdoc) into the closely related, BSD-derived Ultrix kernel, but we extended the original filesystem level tracing down all the way to capturing block IO traces (still specifically attributed to filesystem events).

We were working on Ultrix because my professor had a research and equipment grant from DEC. DEC was interested in this sort of information for improving the IO performance of the Ultrix kernel, and part of the benefit of working with DEC was that DEC could arrange for us to get IO traces from real customers with real workloads, instead of university research system workloads. Eventually the modified kernel worked, gathered all the data that we wanted (and gave us some insights even on our systems), and was ready for the customer site. We talked to DEC and it was decided that the best approach was that I would go down to Boston with the source code, meet with the DEC people involved, we'd build a kernel for the customer's setup, and then I'd go with the DEC people to the customer site to actually boot into it and turn the tracing on.

Very shortly after we booted the new kernel on the customer's machine and turned tracing on, the kernel paniced. It was a nice, clear panic message from my own code, basically an assertion failure, and what it said was more or less 'disk block number too large to fit into data field'. I looked at that and had a terrible sinking feeling.

This was long enough ago (with small enough disks) that having very compact trace data was extremely important, especially at the block IO layer (where we were generating a lot of trace records). As a result, I'd carefully designed the on-disk trace records to be as small as possible. As part of that I'd tried to cut down the size of fields to be only as big as necessary, and one of the fields I'd minimized was the disk block address of the IO. My minimized field was big enough for the block addresses on our Ultrix machines (donated by DEC), with not very big disks, but it was now obviously too small for the bigger disks that the company had bought from DEC for their servers. In a way I was lucky that I'd taken the precaution of putting in the size check that paniced, because otherwise we could have happily wasted time collecting corrupted traces with truncated block addresses.

(All of this was long enough ago that I can't remember how small the field was, although my mind wants to say 24 bits. If it was 24 bits, I had to be using 4 Kbyte filesystem block addresses, not 512-byte sector addresses.)

Once I saw the panic message, both the mistake and the fix were obvious, and the code and so on were well structured enough that it was simple to make the change; I could almost have done it on the spot (or at least while in Boston). But, well, you only get one kernel panic from your new "we assure you this is going to work" kernel on a customer's machine, especially if you only have one evening to gather your trace data and you can't rebuild a kernel from source while at the customer's site, so the DEC people and I had to pack up and go back empty handed. Afterward, I flew back to Toronto from Boston, made the simple change, and tested everything. But I never went back to Boston for another visit with DEC, and I don't think that part of my professor's research projects went anywhere much after that.

(My visit to Boston and its areas did feature getting driven around at somewhat unnervingly fast speeds on the Massachusetts Turnpike in the sports car of one of the DEC people involved.)

So that's the story of how I may have quietly killed one of my professor's research projects by developing on a too-small machine.

(That's obviously not the only problem. When I was picking the field size, I could have reached out somehow to ask how big DEC's disks got, or maybe run the field size past my professor to see if it made sense. But I was working alone and being trusted with all of this, and I was an undergraduate, although I had significant professional programming experience by then.)

Sidebar: Fixing an earlier spectacular failure

(All of the following is based on my fallible memory.)

The tracing code worked by adding trace records to a buffer in memory and then writing out the buffer to the trace file when it was necessary. The BSD version of the code that I started with (which traced only filesystem level IO) did this synchronously, created trace records even for writing out the trace buffer, and didn't protect itself against being called again. A recursive call would deadlock but usually it all worked because you didn't add too many new trace records while writing out the buffer.

(Basically, everything that added a trace record to the buffer checked to see if the buffer was too full and if it was, immediately called the 'flush the trace buffer' code.)

This approach blew up spectacularly when I added block IO tracing; the much higher volume of records being added made deadlocks relatively common. The whole approach to writing out the trace buffer had to change completely, into a much more complex one with multiple processes involved and genuinely asynchronous writeout. I still have a vivid memory of making this relatively significant restructuring and then doing an RCS ci with a commit message that included a long, then-current computing quote about replacing one set of code with known bugs with a new set of code with new unknown ones.

(At this remove I have no idea what the exact quote was and I can't find it in a quick online search. And unfortunately the code and its RCS history is long since gone.)

Power glitches can leave computer hardware in weird states

By: cks

Late Friday night, the university's downtown campus experienced some sort of power glitch or power event. A few machines rebooted, a number of machines dropped out of contact for a bit (which probably indicates some switches restarting), and most significantly, some of our switches wound up in a weird, non-working state despite being powered on. This morning we cured the situation by fully power cycling all of them.

This isn't the first time we've seen brief power glitches leave things in unusual states. In the past we've seen it with servers, with BMCs (IPMIs), and with switches. It's usually not every machine, either; some machines won't notice and some will. When we were having semi-regular power glitches, there were definitely some models of server that were more prone to problems than others, but even among those models it usually wasn't universal.

It's fun to speculate about reasons why some particular servers of a susceptible model would survive and others not, but that's somewhat beside today's point, which is that power glitches can get your hardware into weird states (and your hardware isn't broken when and because this happens; it can happen to hardware that's in perfectly good order). We'd like to think that the computers around us are binary, either shut off entirely or working properly, but that clearly isn't the case. A power glitch like this peels back the comforting illusion to show us the unhappy analog truth underneath. Modern computers do a lot of work to protect themselves from such analog problems, but obviously it doesn't always work completely.

(My wild speculation is that the power glitch has shifted at least part of the overall system into a state that's normally impossible, and either this can't be recovered from or the rest of the system doesn't realize that it has to take steps to recover, for example forcing a full restart. See also flea power, where a powered off system still retains some power, and sometimes this matters.)

PS: We've also had a few cases where power cycling the hardware wasn't enough, which is almost certainly flea power at work.

PPS: My steadily increasing awareness of the fundamentally analog nature of a lot of what I take as comfortably digital has come in part from exposure on the Fediverse to people who deal with fun things like differential signaling for copper Ethernet, USB, and PCIe, and the spooky world of DDR training, where very early on your system goes to some effort to work out the signal characteristics of your particular motherboard, RAM, and so on so that it can run the RAM as fast as possible (cf).

(Never mind all of the CPU errata about unusual situations that aren't quite handled properly.)

If there are URLs in your HTTP User-Agent, they should exist and work

By: cks

One of the things people put in their HTTP User-Agent header for non-browser software is a URL for their software, project, or whatever (I'm all for this). This is a good thing, because it allows people operating web servers to check out who and what you are and decide for themselves if they're going to allow it. Increasingly (and partly for social reasons), I block many 'generic' User-Agent values that come to my attention, for example through their volume.

(I don't block all of them, but if your User-Agent shows up and I can't figure out what it is and whether or not it's legitimate and used by real people, that's probably a block.)

However, there's an important and obvious thing about any URLs in your HTTP User-Agent, which is that they should actually work. The domain or host should exist, the URL should exist in the web server, and the URL's contents should actually explain the software, project, or organization involved. Plus, if you use a HTTPS website, the TLS certificate should be valid.
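A quick manual check is easy enough with curl; the URL here is just a stand-in for whatever the User-Agent claims, and these options are only one way of doing it:

# print the HTTP status of the claimed URL; for a https URL,
# curl will also fail outright if the TLS certificate is bad
curl -sS -o /dev/null --max-time 10 -w '%{http_code}\n' 'https://their-claimed-url.example/'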

(A related thing is a generic URL that doesn't give me anything to go on. For example, your URL on a code forge, and either it's not obvious which one of your repositories is doing things or you don't have any public repositories.)

For me, a non-working URL is much more suspicious than a missing URL. HTTP User-Agents without URLs are reasonably common (especially in feed readers), so I don't find them immediately suspicious. Non-working URLs in mysterious User-Agents certainly look like you're attempting to distract me with the appearance of a proper web agent but without the reality of it. If a User-Agent with such a non-working URL comes to my attention, I'm very likely to block it in some way (unless it's very clear that it's a legitimate program used by real people, and it merely has bad habits with its User-Agent).

You would think that people wouldn't make this sort of mistake, but I regret to say that I've seen it repeatedly, in all of the variations. One interesting version I've seen is User-Agent strings with the various 'example.<TLD>' domains in their URLs. I suspect that this comes from software that has some sort of 'operator URL' setting and provides a default value if you don't set one explicitly. I've also seen .lan and .local URLs in User-Agents, which takes somewhat more creativity.

As usual, my view is that software shouldn't provide this sort of default value; instead, it should refuse to work until you configure your own value. However, this makes it slightly more annoying to use, so it will be less popular than more accommodating software. Of course, we can change that calculation by blocking everything that mentions 'example.com', 'example.org', 'example.net' and so on in its User-Agent.

Restricting IP address access to specific ports in eBPF: a sketch

By: cks

The other day I covered how I think systemd's IPAddressAllow and IPAddressDeny restrictions work; unfortunately, they let you limit this to specific (local) ports only if you set up the sockets for those ports in a separate systemd.socket unit. Naturally this raises the question of whether there is a good, scalable way to restrict access to specific ports in eBPF that systemd (or other interested parties) could use. I think the answer is yes, so here is a sketch of how I think you'd do this.

Why we care about a 'scalable' way to do this is because systemd generates and installs its eBPF programs on the fly. Since tcpdump can do this sort of cross-port matching, we could write an eBPF program that did it directly. But such a program could get complex if we were matching a bunch of things, and that complexity might make it hard to generate on the fly (or at least make it complex enough that systemd and other programs didn't want to). So we'd like a way that still allows you to generate a simple eBPF program.

Systemd uses cgroup socket SKB eBPF programs, which attach to a cgroup and filter all network packets on ingress or egress. As far as I can understand from staring at code, these are implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls an LPM (Longest Prefix Match) map. The normal way to use an LPM map is to use the CIDR prefix length and the start of the CIDR network as the key (for individual IPv4 addresses, the prefix length is 32), and then match against them, so this is what systemd's cgroup program does. This is a nicely scalable way to handle the problem; the eBPF program itself is basically constant, and you have a couple of eBPF maps (for the allow and deny sides) that systemd populates with the relevant information from IPAddressAllow and IPAddressDeny.

However, there's nothing in eBPF that requires the keys to be just CIDR prefixes plus IP addresses. An LPM map key has to start with a 32-bit prefix length, but the size of the rest of the key can vary. This means that we can make our keys be 16 bits longer and stick the port number in front of the IP address (and increase the CIDR prefix size appropriately). So to match packets to port 22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix length then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm doing the math and understanding the structure right). When you query this LPM map, you put the appropriate port number in front of the IP address.
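To make that concrete, here is a rough sketch of creating and populating such a map with bpftool, assuming I have the syntax right; the pinned path, value size, and entry count are arbitrary, and the byte layout assumes a little-endian machine (the prefix length is a native-endian u32):

# a 10-byte key: 4 bytes of prefix length, 2 of port, 4 of IPv4 address
# (flags 1 is BPF_F_NO_PREALLOC, which LPM tries require)
bpftool map create /sys/fs/bpf/port-allow type lpm_trie \
    key 10 value 8 entries 1024 name portallow flags 1
# allow port 22 from 128.100.0.0/16: prefix length 32 (16 bits of
# port plus the /16), then 0x0016 for the port, then 128.100.0.0
bpftool map update pinned /sys/fs/bpf/port-allow \
    key hex 20 00 00 00 00 16 80 64 00 00 \
    value hex 01 00 00 00 00 00 00 00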

This does mean that each separate port with a separate set of IP address restrictions needs its own set of map entries. If you wanted a set of ports to all have a common set of restrictions, you could use a normally structured LPM map and a second plain hash map where the keys are port numbers. Then you check the port and the IP address separately, rather than trying to combine them in one lookup. And there are more complex schemes if you need them.

Which scheme you'd use depends on how you expect port based access restrictions to be used. Do you expect several different ports, each with its own set of IP access restrictions (or only one port)? Then my first scheme is only a minor change from systemd's current setup, and it's easy to extend it to general IP address controls as well (just use a port number of zero to mean 'this applies to all ports'). If you expect sets of ports to all use a common set of IP access controls, or several sets of ports with different restrictions for each set, then you might want a scheme with more maps.

(In theory you could write this eBPF program and set up these maps yourself, then use systemd resource control features to attach them to your .service unit. In practice, at that point you probably should write host firewall rules instead, it's likely to be simpler. But see this blog post and the related VCS repository, although that uses a more hard-coded approach.)

Your terminal program has to be where xterm's ziconbeep feature is handled

By: cks

I recently wrote about things that make me so attached to xterm. One of those things is xterm's ziconbeep feature, which causes xterm to visibly and perhaps audibly react when it's iconified or minimized and gets output. A commentator suggested that this feature should ideally be done in the window manager, where it could be more general. Unfortunately we can't do the equivalent of ziconbeep in the window manager, or at least we can't do all of it.

A window manager can sound an audible alert when a specific type of window changes its title in a certain way. This would give us the 'beep' part of ziconbeep in a general way, although we're treading toward a programmable window manager. But then, Gnome Shell now does a lot of stuff in JavaScript and its extensions are written in JS and the whole thing doesn't usually blow up. So we've got prior art for writing an extension that reacts to window title changes and does stuff.

What the window manager can't really do is reliably detect when the window has new output, in order to trigger any beeping and change the visible window title. As far as I know, neither X nor Wayland give you particularly good visibility into whether the program is rendering things, and in some ways of building GUIs, you're always drawing things. In theory, a program might opt to detect that it's been minimized and isn't visible and so not render any updates at all (although it will be tracking what to draw for when it's not minimized), but in practice I think this is unfashionable because it gets in the way of various sorts of live previews of minimized windows (where you want the window's drawing surface to reflect its current state).

Another limitation of this as a general window manager feature is that the window manager doesn't know what changes in the appearance of a window are semantically meaningful and which ones are happening because, for example, you just changed some font preference and the program is picking up on that. Only the program itself knows what's semantically meaningful enough to signal for people's attention. A terminal program can have a simple definition but other programs don't necessarily; your mail client might decide that only certain sorts of new email should trigger a discreet 'pay attention to me' marker.

(Even in a terminal program you might want more control over this than xterm gives you. For example, you might want the terminal program to not trigger 'zicon' stuff for text output but instead to do it when the running program finishes and you return to the shell prompt. This is best done by being able to signal the terminal program through escape sequences.)

How I think systemd IP address restrictions on socket units work

By: cks

Among the systemd resource controls are IPAddressAllow= and IPAddressDeny=, which allow you to limit what IP addresses your systemd thing can interact with. This is implemented with eBPF. A limitation of these as applied to systemd .service units is that they restrict all traffic, both inbound connections and things your service initiates (like, say, DNS lookups), while you may want only a simple inbound connection filter. However, you can also set these on systemd.socket units. If you do, your IP address restrictions apply only to the socket (or sockets), not to the service unit that it starts. To quote the documentation:

Note that for socket-activated services, the IP access list configured on the socket unit applies to all sockets associated with it directly, but not to any sockets created by the ultimately activated services for it.

So if you have a systemd socket activated service, you can control who can access the socket without restricting who the service itself can talk to.
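As a concrete example, a hypothetical drop-in for the socket unit (via something like 'systemctl edit ssh.socket') could look like this, with the network being a stand-in for whatever you actually want to allow:

[Socket]
IPAddressDeny=any
IPAddressAllow=128.100.0.0/16

The result is that only 128.100.0.0/16 can connect to the listening socket, while the sshd that gets started can still do DNS lookups, talk to authentication servers, and so on without restriction.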

In general, systemd IP access controls are done through eBPF programs set up on cgroups. If you set up IP access controls on a socket, such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs attached to the ssh.socket cgroup (and there is a ssh.socket cgroup, perhaps because of the eBPF programs):

# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID  AttachType      AttachFlags  Name
12  cgroup_inet_ingress   multi  sd_fw_ingress
11  cgroup_inet_egress    multi  sd_fw_egress

However, if you look, there are no processes or threads in the ssh.socket cgroup, which is not really surprising but also means there is nothing there for these eBPF programs to apply to. And if you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.

What I think must be going on is that the eBPF filtering program is connected to the SSH socket itself. Since I can't find any relevant looking uses in the systemd code of the `SO_ATTACH_*' BPF related options from socket(7) (which would be used with setsockopt(2) to directly attach programs to a socket), I assume that what happens is that if you create or perhaps start using a socket within a cgroup, that socket gets tied to the cgroup and its eBPF programs, and this attachment stays when the socket is passed to another program in a different cgroup.

(I don't know if there's any way to see what eBPF programs are attached to a socket or a file descriptor for a socket.)

If this is what's going on, it unfortunately means that there's no way to extend this feature of socket units to get per-port IP access control in .service units. Systemd isn't writing special eBPF filter programs for socket units that only apply to those exact ports, which you could in theory reuse for a service unit; instead, it's arranging to connect (only) specific sockets to its general, broad IP access control eBPF programs. Programs that make their own listening sockets won't be doing anything to get eBPF programs attached to them (and only them), so we're out of luck.

(One could experiment with relocating programs between cgroups, with the initial cgroup in which the program creates its listening sockets restricted and the other not, but I will leave that up to interested parties.)

Sometimes, non-general solutions are the right answer

By: cks

I have a Python program that calculates and prints various pieces of Linux memory information on a per-cgroup basis. In the beginning, its life was simple; cgroups had a total memory use that was split between 'user' and '(filesystem) cache', so the program only needed to display either one field or a primary field plus a secondary field. Then I discovered that there was additional important (ie, large) kernel memory use in cgroups and added the ability to report it as an additional option for the secondary field. However, this wasn't really ideal, because now I had a three-way split and I might want to see all three things at once.

A while back I wrote up my realization about flexible string formatting with named arguments. This sparked all sorts of thoughts about writing a general solution for my program that could show any number of fields. Recently I took a stab at implementing this and rapidly ran into problems figuring out how I wanted to do it. I had multiple things that could be calculated and presented, I had to print not just the values but also a header with the right field names, I'd need to think about how I structured argparse argument groups in light of argparse not supporting nested groups, and so on. At a minimum this wasn't going to be a quick change; I was looking at significantly rewriting how the program printed its output.

The other day, I had an obvious realization: while it would be nice to have a fully general solution that could print any number of additional fields, which would meet my needs now and in the future, all that I needed right now was an additional three-field version with the extra fields hard-coded and the whole thing selected through a new command line argument. And this command line argument could drop right into the existing argparse exclusive group for choosing the second field, even though this feels inelegant.

(The fields I want to show are added with '-c' and '-k' respectively in the two field display, so the morally correct way to select both at once would be '-ck', but currently they're exclusive options, which is enforced by argparse. So I added a third option, literally '-b' for 'both'.)

Actually implementing this hard-coded version was a bit annoying for structural reasons, but I put the whole thing together in not very long; certainly it was much faster than a careful redesign and rewrite (in an output pattern I haven't used before, no less). It's not necessarily the right answer for the long term, but it's definitely the right answer for now (and I'm glad I talked myself into doing it).

(I'm definitely tempted to go back and restructure the whole output reporting to be general. But now there's no rush to it; it's not blocking a feature I want, it's a cleanup.)

A taxonomy of text output (from tools that want to be too clever)

By: cks

One of my long standing gripes with Debian and Ubuntu is, well, I'll quote myself on the Fediverse:

I understand that Debian wants me to use 'apt' instead of apt-get, but the big reason I don't want to is because you can't turn off that progress bar at the bottom of your screen (or at least if you can it's not documented). That curses progress bar is something that I absolutely don't want (and it would make some of our tooling explode, yes we have tooling around apt-get).

Over time, I've developed opinions on what I want to see tools do for progress reports and other text output, and what I feel is increasingly too clever in tools that makes them more and more inconvenient for me. Today I'm going to try to run down that taxonomy, from best to worst.

  1. Line by line output in plain text with no colours.
  2. Represent progress by printing successive dots (or other characters) on the line until finally you print a newline. This is easy to capture and process later, since the end result is a newline terminated line with no control characters.

  3. Reporting progress by printing dots (or other characters) and then backspacing over them to erase them later. Pagers like less have some ability to handle backspaces, but this will give you heartburn in your own programs.

  4. Reporting progress by repeatedly printing a line, backspacing over it, and reprinting it (as apt-get does). This produces a lot more output, but I think less and anything that already deals with backspacing over things will generally be able to handle this.

  5. Any sort of line output with colours (which don't work in my environment, and when they do work they're usually unreadable). Any sort of terminal codes in the output make it complicated to capture the output with tools like script and then look over them later with pagers like less, although less can process a limited amount of terminal codes, including colours.

  6. Progress bar animation on one line with cursor controls and other special characters. This looks appealing but generates a lot more output and is increasingly hard for programs like less to display, search, or analyze and process. However, your terminal program of choice is probably still going to see this as line by line output and preserve various aspects of scrollback and so on.

  7. Progress output that moves the cursor and the output from its normal line to elsewhere on screen, such as at the bottom (as 'apt autoremove' and other bits of 'apt' do). Now you have a full screen program; viewing, reconstructing, and searching its output later is extremely difficult, and its output will blow up increasingly spectacularly if it's wrong about your window size (including if you resize things while it's running) or what terminal sequences your window responds to. Terminal programs and terminal environments such as tmux or screen may well throw up their hands at doing anything smart with the output, since you look much like a full screen editor, a pager, or programs like top. In some environments this may damage or destroy terminal scrollback.

    An additional reason I dislike this style is that it causes output to not appear at the current line. When I run your command line program, I want your program to print its output right below where I started it, in order, because that's what everything else does. I don't want the output jumping around the screen to random other locations. The only programs I accept that from are genuine full screen programs like top. Programs that insist on displaying things at random places on the screen are not really command line programs, they are TUIs cosplaying being CLIs.

  8. Actual full screen output, as a text UI, with the program clearing the screen and printing status reports all over the place. Fortunately I don't think I've seen any 'command line' programs do this; anything that does tends to be clearly labeled as a TUI program, and people mostly don't provide TUIs for command line tools (partly because it's usually more work).

My strong system administrator's opinion is that if you're tempted to do any of these other than the first, you should provide a command line switch to turn these off. Also, you should detect unusual settings of the $TERM environment variable, like 'dumb' or perhaps 'vt100', and automatically disable your smart output. And you should definitely disable your smart output if $TERM isn't set or you're not outputting to a (pseudo-)terminal.
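The check itself is only a few lines of Bourne shell (the 'fancy' variable and the exact $TERM values here are just illustrative):

fancy=yes
# plain output only if stdout isn't a terminal or $TERM is absent
if [ ! -t 1 ] || [ -z "${TERM:-}" ]; then
    fancy=no
else
    case "$TERM" in
    dumb|vt100) fancy=no ;;
    esac
fi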

(Programs that insist on fancy output no matter what make me very unhappy.)

Log messages are mostly for the people operating your software

By: cks

I recently read Evan Hahn's The two kinds of error (via), which talks very briefly in passing about logging, and it sparked a thought. I've previously written my system administrator's view of what an error log level should mean, but that entry leaves out something fundamental about log messages, which is that under most circumstances, log messages are for the people operating your software (I've sort of said this before in a different context). When you're about to add a non-debug log message, one of the questions you should ask is what does someone running your program get out of seeing the message.

Speaking from my own experience, it's very easy to write log messages (and other messages) that are aimed at you, the person developing the program, script, or what have you. They're useful for debugging and for keeping track of the state of the program, and it's natural to write them that way since you're immersed in the program and have all of the context (this is especially a problem for infrequent error messages, which I've learned to make as verbose as possible, and a similar thing applies for infrequently logged messages). But if your software is successful (especially if it gets distributed to other people), most of the people running it won't be the developers, they'll only be operating it.

(This can include a future version of you when you haven't touched this piece of software for months.)

If you want your log messages to be useful for anything other than being mailed to you as part of a 'can you diagnose this' message, they need to be useful for the people operating the software. This doesn't mean 'only report errors that they can fix and need to', although that's part of it. It also means making the information you provide through logs be things that are useful and meaningful to people operating your software, and that they can understand without a magic decoder ring.

If people operating your software won't get anything out of seeing a log message, you probably shouldn't log it by default in the first place (or you need to reword it so that people will get something from it). In Evan Hahn's terminology, this applies to the log messages for both expected errors and unexpected errors, although if the program aborts, it should definitely tell system administrators why it did.

For a system administrator, log messages about expected errors let us diagnose what went wrong to cause something to fail, and how interested we are in them depends partly on how common they are. However, how common they are isn't the only thing. MTAs often have what would be considered relatively verbose logs of message processing and will log every expected error like 'couldn't do a DNS lookup' or 'couldn't connect to a remote machine', even though they can happen a lot. This is very useful because one thing we sometimes care a lot about is what happened to and with a specific email message.

The things that make me so attached to xterm as my terminal program

By: cks

I've said before in various contexts (eg) that I'm very attached to the venerable xterm as my terminal (emulator) program, and I'm not looking forward to the day that I may have to migrate away from it due to Wayland (although I probably can keep running it under XWayland, now that I think about it). But I've never tried to write down a list of the things that make me so attached to it over other alternatives like urxvt, much less more standard ones like gnome-terminal. Today I'm going to try to do that, although my list is probably going to be incomplete.

  • Xterm's ziconbeep feature, which I use heavily. Urxvt can have an equivalent but I don't know if other terminal programs do.

  • I routinely use xterm's very convenient way of making large selections, which is supported in urxvt but not in gnome-terminal (and it can't be since gnome-terminal uses mouse button 3 for its own purposes).

  • The ability to turn off all terminal colours, because they often don't work with my preferred terminal colours. Other terminal programs have somewhat different and sometimes less annoying colours, but it's still far too easy for programs to display things in unreadable colours.

    Yes, I can set my shell environment and many programs to not use colours, but I can't set all of them; some modern programs simply always use colours on terminals. Xterm can be set to completely ignore them.

  • I'm very used to xterm's specific behavior when it comes to what is a 'word' for double-click selection. You can read the full details in the xterm manual page's section on character classes. I'm not sure if it's possible to fully emulate this behavior in other terminal programs; I once made an incomplete attempt in urxvt, while gnome-terminal is quite different and has little or no options for customizing that behavior (in the Gnome way). Generally the modern double click selection behavior is too broad for me.

    (For instance, I'm extremely attached to double-click selecting only individual directories in full paths, rather than the entire thing. I can always swipe to select an entire path, but if I can't pick out individual path elements with a double click my only choice is character by character selection, which is a giant pain.)

    Based on a quick experiment, I think I can make KDE's konsole behave more or less the way I want by clearing out its entire set of "Word characters" in profiles. I think this isn't quite how xterm behaves but it's probably close enough for my reflexes.

  • Xterm doesn't treat text specially because of its contents, for example by underlining URLs or worse, hijacking clicks on them to do things. I already have well evolved systems for dealing with things like URLs and I don't want my terminal emulator to provide any 'help'. I believe that KDE's konsole can turn this off, but gnome-terminal doesn't seem to have any option for it.

  • Many of xterm's behaviors can be controlled from command line switches. Some other terminal emulators (like gnome-terminal) force you to bundle these behaviors together as 'profiles' and only let you select a profile. Similarly, a lot of xterm's behavior can be temporarily changed on the fly through its context menus, without having to change the profile's settings (and then change them back).

  • Every xterm window is a completely separate program that starts from scratch, and xterm is happy to run on remote servers without complications; this isn't something I can say for all other competitors. Starting from scratch also means things like not deciding to place yourself where your last window was, which is konsole's behavior (and infuriates me).

Of these, the hardest two to duplicate are probably xterm's double click selection behavior of what is a word and xterm's large selection behavior. The latter is hard because it requires the terminal program to not use mouse button 3 for a popup menu.

I use some other xterm features, like key binding, including duplicating windows, but I could live without them, especially if the alternate terminal program directly supports modern cut and paste in addition to xterm's traditional style. And I'm accustomed to a few of xterm's special control characters, especially Ctrl-space, but I think this may be pretty universally supported by now (Ctrl-space is in gnome-terminal).

There are probably things that other terminal programs like konsole, gnome-terminal and so on do that I don't want them to (and that xterm doesn't). But since I don't use anything other than xterm (and a bit of gnome-terminal and once in a while a bit of urxvt), I don't know what those undesired features are. Experimenting with konsole for this entry taught me some things I definitely don't want, such as it automatically placing itself where it was before (including placing a new konsole window on top of one of the existing ones, if you have multiple ones).

(This elaborates on a comment I made elsewhere.)

Sometimes the simplest version of a text table is printed from a command

By: cks

Back when we had just started with our current metrics and dashboards adventure, I wrote about how sometimes the simplest version of a graph is a text table. Today I will extend that further: sometimes the simplest version of a text table is to have a command that prints it out, rather than making people look at a web page.

We recently had a major power outage at work, and in the aftermath not all of our machines came back. One of my co-workers is an extreme early bird and he came in to the university about as early as it's possible to on the TTC, and started work on troubleshooting what was going on. One of the things he needed to know was what machines were still down, so he could figure out any common elements to them (and see what machines were stubbornly not coming back on even though they ought to be).

We have Grafana dashboards for this, and the information about what machines are down is present in some of them in tabular form. But it's a table embedded in a widget in a web page, and you need a browser to look at it, which you may not have from the server console of some server you just powered up. Since I like command line tools, at one point I wrote some little scripts that make queries to our Prometheus server with curl and run the result through 'jq' to extract things. One of them is called 'promdownhosts' and it prints out what you'd expect. Initially this was just something I used, but several years ago I mentioned my collection of these scripts to my co-workers and we wound up making them group scripts in a central location.
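Something in the spirit of 'promdownhosts' is only a few lines of shell; the Prometheus URL, the 'up == 0' query, and the label that holds the host name are all assumptions about a particular setup:

#!/bin/sh
# ask Prometheus which scrape targets are currently down and
# print just their instance names, one per line
curl -s --data-urlencode 'query=up == 0' \
    http://prometheus.example.org:9090/api/v1/query |
    jq -r '.data.result[].metric.instance' | sort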

(I initially wrote this script and a few others for use during our planned power outages and other downtimes, because it was a convenient way of seeing what we hadn't yet turned on or might have missed.)

Early in the morning of that Tuesday, bringing machines back up after the power outage and finding dead PDUs, my co-worker used the 'promdownhosts' script extensively to troubleshoot things. One of the nice aspects of it being a script was that he could put the names of uninteresting machines in a file and then exclude them easily with things like 'promdownhosts | fgrep -v -f /tmp/ignore-these' (something that's much harder to do in a web page dashboard interface, especially if the designer hasn't thought of that). And in general, the script made (and makes) this information quite readily accessible in a compact format that was quick to skim and definitely free of distractions.

Not everything can be presented this way, in a list or a table printed out in plain text from a command line tool. Sometimes tables on a web page are the better option, and it's good to have options in general; sometimes we want to look at this information along with other information too. As I've found out the hard way sometimes, there's only so much information you can cram into a plain text table before the result is increasingly hard to read.

(I have a command that summarizes our current Prometheus alerts and its output is significantly harder to read because I need it to be compact and there's more information to present. It's probably only really suitable for my use because I understand all of its shorthand notations, including the internal Prometheus names for our alerts.)

On the Bourne shell's distinction between shell variables and exported ones

By: cks

One of the famous things that people run into with the Bourne shell is that it draws a distinction between plain shell variables and special exported shell variables, which are put into the environment of processes started by the shell. This distinction is a source of frustration when you set a variable, run a program, and the program doesn't have the variable available to it:

$ GODEBUG=...
$ go-program
[doesn't see your $GODEBUG setting]
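The cure is to explicitly export the variable, or to put the assignment on the command line of the program itself:

$ export GODEBUG=...
$ go-program
[sees your $GODEBUG setting]
$ GODEBUG=... go-program
[also works; the assignment goes into the environment of only that one command]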

It's also a source of mysterious failures, because more or less all of the environment variables that are present automatically become exported shell variables. So whether or not 'GODEBUG=..; echo running program; go-program' works can depend on whether $GODEBUG was already set when your shell started. The environment variables of regular shell sessions are usually fairly predictable, but the environment variables present when shell scripts get run can be much more varied. This makes it easy to write a shell script that only works right for you, because in your environment it runs with certain environment variables set and so they automatically become exported shell variables.

I've told you all of that because despite these pains, I believe that the Bourne shell made the right choice here, in addition to a pragmatically necessary choice at the time it was created, in V7 (Research) Unix. So let's start with the pragmatics.

The Bourne shell was created alongside environment variables themselves, and on the comparatively small machines that V7 ran on, you didn't have much room for the combination of program arguments and the new environment. If either grew too big, you got 'argument list too long' when you tried to run programs. This made it important to minimize and control the size of the environment that the shell gave to new processes. If you want to do that without limiting the use of shell variables so much, a split between plain shell variables and exported ones makes sense and requires only a minor bit of syntax (in the form of 'export').

Both machines and exec() size limits are much larger now, so you might think that getting rid of the distinction is a good thing. The Bell Labs Research Unix people thought so, so they did do this in Tom Duff's rc shell for V10 Unix and Plan 9. Having used both the Bourne shell and a version of rc for many years, I both agree and disagree with them.

For interactive use, having no distinction between shell variables and exported shell variables is generally great. If I set $GODEBUG, $PYTHONPATH, or any number of any other environment variables that I want to affect programs I run, I don't have to remember to do a special 'export' dance; it just works. This is a sufficiently nice (and obvious) thing that it's an option for the POSIX 'sh', in the form of 'set -a' (and this set option is present in more or less all modern Bourne shells, including Bash).
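Used interactively, that looks like:

$ set -a
$ GODEBUG=...
$ go-program
[sees $GODEBUG without an explicit 'export']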

('Set -a' wasn't in the V7 sh, but I haven't looked to see where it came from. I suspect that it may have come from ksh, since POSIX took a lot of the specification for their 'sh' from ksh.)

For shell scripting, however, not having a distinction is messy and sometimes painful. If I write an rc script, every shell variable that I use to keep track of something will leak into the environment of programs that I run. The shell variables for intermediate results, the shell variables for command line options, the shell variables used for for loops, you name it, it all winds up in the environment unless I go well out of my way to painfully scrub them all out. For shell scripts, it's quite useful to have the Bourne shell's strong distinction between ordinary shell variables, which are local to your script, and exported shell variables, which you deliberately act to make available to programs.

(This comes up for shell scripts and not for interactive use because you commonly use a lot more shell variables in shell scripts than you do in interactive sessions.)

For a new Unix shell today that's made primarily or almost entirely for interactive use, automatically exporting shell variables into the environment is probably the right choice. If you wanted to be slightly more selective, you could make it so that shell variables with upper case names are automatically exported and everything else can be manually exported. But for a shell that's aimed at scripting, you want to be able to control and limit variable scope, only exporting things that you explicitly want to.

How to redirect a Bash process substitution into a while loop

By: cks

In some sorts of shell scripts, you often find yourself wanting to work through a bunch of input in the shell; some examples of this for me are here and here. One of the tools for this is a 'while read -r ...' loop, using the shell's builtin read to pull in one or more fields of data (hopefully not making a mistake). Suppose, not hypothetically, that you have a situation where you want to use such a 'while read' loop to accumulate some information from the input, setting shell variables, and then using them later. The innocent and non-working way to write this is:

accum=""
sep=""
some-program |
while read -r avalue; do
   accum="$accum$sep$avalue"
   sep=" or "
done

# Now we want to use $accum

(The recent script where I ran into this issue does much more complex things in the while loop that can't easily be done in other ways.)

This doesn't work because the 'while' is actually happening in a subshell, so the shell variables it sets are lost at the end. To make this work we have to wrap everything from the 'while ...' onward up into a subshell, with that part looking like:

some-program |
(
while read -r avalue; do
   accum="$accum$sep$avalue"
   sep=" or "
done
[...]
)

(You can't get around this with '{ while ...; ... done; }'; Bash will still put the 'while' in a subshell, because each part of a pipeline runs in its own subshell.)

The way around this starts with how you can use a file redirection with a while loop (it goes on the 'done'):

some-program >/some/file
while read -r avalue; do
  [...]
done </some/file
# $accum is still set

So far this is all generic Bourne shell things. Bash has a special feature, process substitution, which allows us to use a process instead of a file, using the otherwise illegal syntax '<(...)'. This is great and exactly what we want, since it avoids creating a temporary file that we then have to clean up. So the innocent and obvious way to try to write things is this:

while read -r avalue; do
  [...]
done <(some-program)

If you try this, you will get the sad error message from Bash of:

line N: syntax error near unexpected token `<(some-program)'
line N: 'done <(some-program)'

This is not a helpful error message. I will start by telling you the cure, and then what is going on at a narrow technical level to produce this error message. The cure is:

while read -r avalue; do
  [...]
done < <(some-program)

Note that you must have a space between the two <'s, writing this as '<<(some-program)' will get you a similar syntax error.

The technical reason for this error is that although it looks like redirection, process substitution is a form of substitution, like '$var' (it's in the name, but you, like me, may not know what Bash calls it off the top of your head). The result of process substitution will be, for example, a /dev/fd/N name (and a subprocess that is running our 'some-program' and feeding into the other end of the file descriptor). We can see this directly:

$ echo <(cat /dev/null)
/dev/fd/63

(Your number may vary.)

You can't write 'while ...; done /dev/fd/63'. That's a syntax error. Even though the pre-substitution version looks like redirection, it's not, so it's not accepted.

That '<(...)' is actually a substitution is why our revised version works. Reading '< <(some-program)' right to left, the '<(some-program)' is process substitution, and it (along with other shell expansions) is done first, before redirections. After substitution this looks like '< /dev/fd/NN', which is acceptable syntax. If we leave out the space and write this as '<<(some-program)', the shell throws up its hands at the '<<' bit.

(So from Bash's perspective, this is very similar to 'file=/some/file; while ... ; done < $file', which is perfectly legal.)

PS: Before I wrote this entry, I didn't know how to get around the 'done <(some-program)' syntax error. Until the penny dropped about the difference between redirections and process substitution, I thought that Bash simply forbade this to make its life easier.

With disk caches, you want to be able to attribute hits and misses

By: cks

Suppose that you have a disk or filesystem cache in memory (which you do, since pretty much everything has one these days). Most disk caches will give you simple hit and miss information as part of their basic information, but if you're interested in the performance of your disk cache (or in improving it), you want more information. The problem with disk caches is that there are a lot of different sources and types of disk IO, and you can have hit rates that are drastically different between them. Your hit rate for reading data from files may be modest, while your hit rate on certain sorts of metadata may be extremely high. Knowing this is important because it means that your current good performance on things involving that metadata is critically dependent on that hit rate.

(Well, it may be, depending on what storage media you're using and what its access speeds are like. A lot of my exposure to this dates from the days of slow HDDs.)

This potential vast difference is why you want more detailed information in both cache metrics and IO traces. The more narrowly you can attribute IO and the more you know about it, the more useful things you can potentially tell about the performance of your system and what matters to it. This is not merely 'data' versus 'metadata', or synchronous versus asynchronous; ideally you want to know the sort of metadata read being done, and whether the file data being read is synchronous or not, and whether this is a prefetching read or a 'demand' read that really needs the data.

A lot of the time, operating systems are not set up to pass this information down through all of the layers of IO from the high level filesystem code that knows what it's asking for to the disk driver code that's actually issuing the IOs. Part of the reason for this is that it's a lot of work to pass all of this data along, which means extra CPU and memory on what is an increasingly hot path (especially with modern NVMe based storage). These days you may get some of these fine grained details in metrics and perhaps IO traces (eg, for (Open)ZFS), but probably not all the way to types of metadata.

Of course, disk and filesystem caches (and IO) aren't the only place that this can come up. Any time you have a cache that stores different types of things that are potentially queried quite differently, you can have significant divergence in the types of activity and the activity rates (and cache hit rates) that you're experiencing. Depending on the cache, you may be able to get detailed information from it or you may need to put more detailed instrumentation into the code that queries your somewhat generic cache.
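As a sketch of that last point, instrumentation around a generic cache can be as simple as counting hits and misses per category of lookup; this is hypothetical Python with made-up category names, not any particular system's API:

from collections import defaultdict

class AttributedCache:
    # Wrap a plain dict-style cache with per-category hit/miss counters.
    def __init__(self):
        self._cache = {}
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def get(self, key, category, load_func):
        # 'category' might be 'file-data', 'dir-metadata', and so on.
        if key in self._cache:
            self.hits[category] += 1
            return self._cache[key]
        self.misses[category] += 1
        value = load_func(key)
        self._cache[key] = value
        return value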

Modern general observability features in operating systems can sometimes let you gather some of this detailed attribution yourself (if the OS doesn't already provide it). However, it's not a certain thing and there are limits; for example, you may have trouble tracing and tracking IO once it gets dispatched asynchronously inside the OS (and most OSes turn IO into asynchronous operations before too long).

Systemd resource controls on user.slice and system.slice work fine

By: cks

We have a number of systems where we traditionally set strict overcommit handling, and for some time this has caused us some heartburn. Some years ago I speculated that we might want to use resource controls on user.slice or system.slice if they worked, and then recently in a comment here I speculated that this was the way to (relatively) safely limit memory use if it worked.

Well, it does (as far as I can tell, without deep testing). If you want to limit how much of the system's memory people who log in can use so that system services don't explode, you can set MemoryMin= on system.slice to guarantee some amount of memory to it and all things under it. Alternately, you can set MemoryMax= on user.slice, collectively limiting all user sessions to that amount of memory. In either case my view is that you might want to set MemorySwapMax= on user.slice so that user sessions don't spend all of their time swapping. Which one you set things on depends on which is easier and which you trust more; my inclination is MemoryMax, although that means you need to dynamically size it depending on each machine's total memory.

(If you want to limit user memory use you'll need to make sure that things like user cron jobs are forced into user sessions, rather than running under cron.service in system.slice.)
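As a concrete sketch, a drop-in for user.slice might look like the following (the path is the conventional one and the numbers are placeholders you'd size per machine); you could equally use 'systemctl set-property user.slice MemoryMax=...', which applies immediately and writes a persistent drop-in for you:

# /etc/systemd/system/user.slice.d/50-memory.conf
[Slice]
MemoryMax=48G
MemorySwapMax=2G

(Followed by a 'systemctl daemon-reload' to have systemd pick the drop-in up.)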

Of course this is what you should expect, given systemd's documentation and the kernel documentation. On the other hand, the Linux kernel cgroup and memory system is sufficiently opaque and ever changing that I feel the need to verify that things actually do work (in our environment) as I expect them to. Sometimes there are surprises, or settings that nominally work but don't really affect things the way I expect.

This does raise the question of how much memory you want to reserve for the system. It would be nice if you could use systemd-cgtop to see how much memory your system.slice is currently using, but unfortunately the number it will show is potentially misleadingly high. This is because the memory attributed to any cgroup includes (much) more than program RAM usage. For example, on our servers it seems typical for system.slice to be using under a gigabyte of 'user' RAM but also several gigabytes of filesystem cache and other kernel memory. You probably want to allow for some of that in what memory you reserve for system.slice, but maybe not all of the current usage.

(You can get the current version of the 'memdu' program I use as memdu.py.)

Gnome, GSettings, gconf, and which one you want

By: cks

On the Fediverse a while back, I said:

Ah yes, GNOME, it is of course my mistake that I used gconf-editor instead of dconf-editor. But at least now Gnome-Terminal no longer intercepts F11, so I can possibly use g-t to enter F11 into serial consoles to get the attention of a BIOS. If everything works in UEFI land.

Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern Gnome program, especially a standard Gnome program like gnome-terminal, it will use GSettings and you will want to use dconf-editor to modify its settings outside of whatever Preferences dialogs it gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command line.

(This can include Gnome-derived desktop environments like Cinnamon, which has updated to using GSettings.)

If the program you're using hasn't been updated to the latest things that Gnome is doing, for example Thunderbird (at least as of 2024), then it will still be using GConf. You need to edit its settings using gconf-editor or gconftool-2, or possibly you'll need to look at the GConf version of general Gnome settings. I don't know if there's anything in Gnome that synchronizes general Gnome GSettings settings into GConf settings for programs that haven't yet been updated.

(This is relevant for programs, like Thunderbird, that use general Gnome settings for things like 'how to open a particular sort of thing'. Although I think modern Gnome may not have very many settings for this because it always goes to the GTK GIO system, based on the Arch Wiki's page on Default Applications.)

Because I've made this mistake between gconf-editor and dconf-editor more than once, I've now created a personal gconf-editor cover script that prints an explanation of the situation when I run it without a special --really argument. Hopefully this will keep me sorted out the next time I run gconf-editor instead of dconf-editor.

PS: Probably I want to use gsettings instead of dconf-editor and dconf as much as possible, since gsettings works through the GSettings layer and so apparently has more safety checks than dconf-editor and dconf do.
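(For example, the same setting can be read or written at either layer; gsettings validates the key and value against the installed schema, while dconf deals in raw paths and GVariant values. The key below exists on most Gnome setups:)

$ gsettings get org.gnome.desktop.interface gtk-theme
$ gsettings set org.gnome.desktop.interface gtk-theme 'Adwaita'
$ dconf read /org/gnome/desktop/interface/gtk-theme
$ dconf write /org/gnome/desktop/interface/gtk-theme "'Adwaita'"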

PPS: Don't ask me what the equivalents are for KDE. KDE settings are currently opaque to me.

PDUs can fail (eventually) and some things related to this

By: cks

Early last Tuesday there was a widespread power outage at work, which took out power to our machine rooms for about four hours. Most things came back up when the power was restored, but not everything. One of the things that had happened was that one of our rack PDUs had failed. Fixing this took a surprising amount of work.

We don't normally think about our PDUs very much. They sit there, acting as larger and often smarter versions of power bars, and just, well, work. But both power bars and PDUs can fail eventually, and in our environment rack PDUs tend to last long enough to reach that point. We may replace servers in the racks in our machine rooms, but we don't pull out and replace entire racks all that often. The result is that a rack's initial PDU is likely to stay in the rack until it fails.

(This isn't universal; there are plenty of places that install and remove entire racks at a time. If you're turning over an entire rack, you might replace the PDU at the same time you're replacing all of the rest of it. Whole rack replacement is certainly going to keep your wiring neater.)

A rack PDU failing is not a great thing for the obvious reason; it's going to take out much or all of the servers in the rack unless you have dual power supplies on your servers, each connected to a separate PDU. For racks that have been there for a while and gone through a bunch of changes, often it will turn out to be hard to remove and replace the PDU. Maintaining access to remove PDUs is often not a priority either in placing racks in your machine room or in wiring things up, so it's easy for things to get awkward and encrusted. This was one of the things that happened with our failed PDU last Tuesday; it took quite some work to extract and replace it.

(Some people might have pre-deployed spare PDUs in each rack, but we don't. And if those spare PDUs are already connected to power and turned on, they too can fail over time.)

We're fortunate that we already had spare (smart) PDUs on hand, and we had also pre-configured a couple of them for emergency replacements. If we'd had to order a replacement PDU, things would obviously have been more of a problem. There are probably some research groups around here with their own racks who don't have a spare PDU, because it's an extra chunk of money for an unlikely or uncommon contingency, and they might choose to accept a rack being down for a while.

The importance of limiting syndication feed requests in some way

By: cks

People sometimes wonder why I care so much about HTTP conditional GETs and rate limiting for syndication feed fetchers. There are multiple reasons, including social reasons to establish norms, but one obvious one is transfer volumes. To illustrate that, I'll look at the statistics for yesterday for feed fetches of the main syndication feed for Wandering Thoughts.

Yesterday there were 7492 feed requests that got HTTP 200 responses, 9419 feed requests that got HTTP 304 Not Modified responses, and 11941 requests that received HTTP 429 responses. The HTTP 200 responses amounted to about 1.26 GBytes, with the average response size being 176 KBytes. This average response size is actually a composite; typical compressed syndication feed responses are on the order of 160 KBytes, while uncompressed ones are on the order of 540 KBytes (but there look to have been only 313 of them, which is fortunate; even still they're 12% of the transfer volume).

If feed readers didn't do any conditional GETs and I didn't have any rate limiting (and all of the requests that got HTTP 429s would still have been made), the additional feed requests would have amounted to about another 3.5 GBytes of responses sent out to people. Obviously feed readers did do conditional GETs, and 66% of their non rate limited requests were successful conditional GETs. A HTTP 200 response ratio of 44% is probably too pessimistic once we include rate limited requests, so as an extreme approximation we'll guess that 33% of the rate limited requests would have received HTTP 200 responses with a changed feed; that would amount to another 677 MBytes of response traffic (which is less than I expected). If we use the 44% HTTP 200 ratio, it's still only 903 MBytes more.

(This 44% rate may sound high but my syndication feed changes any time someone leaves a comment on a recent entry, because the syndication feed of entries includes a comment count for every entry.)
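(For anyone who hasn't looked at the mechanics, a conditional GET is just the client echoing back validators from an earlier response; the URL here is a placeholder, and an unchanged feed should come back as a bodyless HTTP 304:)

# Ask for the feed's validators:
curl -sI https://example.org/blog/atom.xml | grep -iE '^(etag|last-modified):'
# Repeat the request conditionally; an unchanged feed should yield a 304
# status with no body instead of the full feed:
curl -s -o /dev/null -w '%{http_code}\n' \
     -H 'If-Modified-Since: Tue, 10 Feb 2026 00:00:00 GMT' \
     https://example.org/blog/atom.xml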

Another statistic is that 41% of syndication feed requests yesterday got HTTP 429 responses. The most prolific single IP address received 950 HTTP 429s, which maps to an average request interval of less than two minutes between requests. Another prolific source made 779 requests, which again amounts to an interval of just less than two minutes. There are over 20 single IPs that received more than 96 HTTP 429 responses (which corresponds to an average interval of 15 minutes). There is a lot of syndication feed fetching software out there that is fetching quite frequently.

(Trying to figure out how many HTTP 429 sources did conditional requests is too complex with my current logs, since I don't directly record that information.)

You can avoid the server performance impact of lots of feed fetching by arranging to serve syndication feeds from static files instead of a dynamic system (and then you can limit how frequently you update those files, effectively forcing a maximum number of HTTP 200 fetches per time interval on anything that does conditional GETs). You can't avoid the bandwidth effects, and serving from static files generally leaves you with only modest tools for rate limiting.

PS: The syndication feeds for Wandering Thoughts are so big because I've opted to default to 100 entries in them, but I maintain you should be able to do this sort of thing without having your bandwidth explode.

Consider mentioning your little personal scripts to your co-workers

By: cks

I have a habit of writing little scripts at work for my own use (perhaps like some number of my readers). They pile up like snowdrifts in my $HOME/adm, except they don't melt away when their time is done but stick around even when they're years obsolete. Every so often I mention one of them to my co-workers; sometimes my co-workers aren't interested, but sometimes they find the script appealing and have me put it into our shared location for 'production' scripts and programs. Sometimes, these production-ized scripts have turned out to be very useful.

(Not infrequently, having my co-workers ask me to move something into 'production' causes me to revise it to make it less of a weird hack. Occasionally this causes drastic changes that significantly improve the script.)

When I say that I mentioned my scripts to my co-workers, that makes it sound more intentional than it often is. A common pattern is that I'll use one of my scripts to get some results that I share, and then my co-workers will ask how I did it and I'll show them the command line, and then they'll ask things like 'what is this ~cks/adm/<program> thing' and 'can you put that somewhere more accessible, it sounds handy'. I do sometimes mention scripts unprompted, if I think they're especially useful, but I've written a lot of scripts over time and many of them aren't of much use for anyone beside me (or at least, I think they're too weird to be shared).

If you have your own collection of scripts, maybe your co-workers would find some of them useful. It probably can't hurt to mention some of them every so often. You do have to mention specific scripts; in my experience 'here is a directory of scripts with a README covering what's there' doesn't really motivate people to go look. Mentioning a specific script with what it can do for people is the way to go, especially if you've just used the script to deal with some situation.

(One possible downside of doing this is the amount of work you may need to do in order to turn your quick hack into something that can be operated and maintained by other people over the longer term. In some cases, you may need to completely rewrite things, preserving the ideas but not the implementation.)

PS: Speaking from personal experience, don't try to write a README for your $HOME/adm unless you're the sort of diligent person who will keep it up to date as you add, change, and ideally remove scripts. My $HOME/adm's README is more than a decade out of date.

Parsing hours and minutes into a useful time in basic Python

By: cks

Suppose, not hypothetically, that you have a program that optionally takes a time in the past to, for example, report on things as of that time instead of as of right now. You would like to allow people to specify this time as just 'HH:MM', with the meaning being that time today (letting people do 'program --at 08:30'). This is convenient for people using your program but irritatingly hard today with the Python standard library.

(In the following code examples, I need a Unix timestamp and we're working in local time, so I wind up calling time.mktime(). We're working in local time because that's what is useful for us.)

As I discovered or noticed a long time ago, the time module is a thin shim over the C library time functions and inherits their behavior. One of these behaviors is that if you ask time.strptime() to parse a time format of '%H:%M', you get back a struct_time object that is in 1900:

>>> import time
>>> time.strptime("08:10", "%H:%M")
time.struct_time(tm_year=1900, tm_mon=1, tm_mday=1, tm_hour=8, tm_min=10, tm_sec=0, tm_wday=0, tm_yday=1, tm_isdst=-1)

There are two solutions I can think of, the straightforward brute force approach that uses only the time module and a more theoretically correct version using datetime, which comes in two variations depending on whether you have Python 3.14 or not.

The brute force solution is to re-parse a version of the time string with the date added. Suppose that you have a series of time formats that people can give you, including '%H:%M', and you try them all until one works, with code like this:

 for fmt in tfmts:
     try:
         r = time.strptime(tstr, fmt)
         # Fix up %H:%M and %H%M
         if r.tm_year == 1900:
             dt = time.strftime("%Y-%m-%d ", time.localtime(time.time()))
             # replace original r with the revised one.
             r = time.strptime(dt + tstr, "%Y-%m-%d "+fmt)
         return time.mktime(r)
     except ValueError:
         continue

I think the correct, elegant way using only the standard library is to use datetime to combine today's date and the parsed time into a correct datetime object, which can then be turned into a struct_time and passed to time.mktime. Before Python 3.14, I believe this is:

         r = time.strptime(tstr, fmt)
         if r.tm_year == 1900:
             tm = datetime.time(hour=r.tm_hour, minute=r.tm_min)
             today = datetime.date.today()
             dt = datetime.datetime.combine(today, tm)
             r = dt.timetuple()
         return time.mktime(r)

There are variant approaches to the basic transformation I'm doing here but I think this is the most correct one.

If you have Python 3.14 or later, you have datetime.time.strptime() and I think you can do the slightly clearer:

[...]
             tm = datetime.time.strptime(tstr, fmt)
             today = datetime.date.today()
             dt = datetime.datetime.combine(today, tm)
             r = dt.timetuple()
[...]

If you can work with datetime.datetime objects, you can skip converting back to a time.struct_time object. In my case, the eventual result I need is a Unix timestamp so I have no choice.

You can wrap this up into a general function:

import time
import datetime

def strptime_today(tstr, fmt):
    # Parse tstr with fmt; if fmt has no year, treat the result as today.
    r = time.strptime(tstr, fmt)
    if r.tm_year != 1900:
        return r
    tm = datetime.time(hour=r.tm_hour, minute=r.tm_min, second=r.tm_sec)
    today = datetime.date.today()
    dt = datetime.datetime.combine(today, tm)
    return dt.timetuple()

This version of time.strptime() will return the time today if given a time format with only hours, minutes, and possibly seconds. Well, technically it will do this if given any format without the year, but dealing with all of the possible missing fields is left as an exercise for the energetic, partly because there's no (relatively) reliable signal for missing months and days the way there is for years. For many programs, a year of 1900 is not even close to being valid and is some sort of mistake at best, but January 1st is a perfectly ordinary day of the year to care about.
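(For instance, in my case where the end result needs to be a Unix timestamp, using it looks like this, with strptime_today() being the function above:)

import time
when = time.mktime(strptime_today("08:30", "%H:%M"))
print(when)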

(Now that I've written this function I may update my code to use it, instead of the brute force time package only version.)

How GNU Tar handles deleted things in incremental tar archives

By: cks

Suppose, not hypothetically, that you have a system that uses GNU Tar for its full and incremental backups (such as Amanda). Or maybe you use GNU Tar directly for this. If you have an incremental backup tar archive, you might be interested in one or both of two questions, which are in some ways mirrors of each other: what files were deleted between the previous incremental and this incremental, or what's the state of the directory tree as of this incremental (if it and all previous backups it depends on were properly restored).

(These questions are of deep interest to people who may have deleted some number of files but aren't sure exactly which files have been deleted.)

Handling deleted files is one of the challenges of incremental backups, with various approaches. How GNU Tar handles deleted files is sort of documented in Using tar to perform incremental dumps and Dumpdir, but the documentation doesn't explain it specifically. The simple version is that GNU Tar doesn't explicitly record deletions; instead, every incremental tar archive carries a full listing of the directory tree, covering both things that are in this incremental archive and things that come from previous ones. To deduce deleted files, you have to compare two listings of the directory tree.

(As part of this full listing, an incremental tar archive records every directory, even unchanged ones.)

You can get at these full listings with 'tar --list --incremental --verbose --verbose --file ...', but tar prints them in an inconvenient format. You don't get a directory tree, the way you do with plain 'tar -t'; instead you get the Dumpdir contents of each directory printed out separately, and it's up to you to post-process the results to assemble a directory tree with full paths and so on. People have probably written tools to do this, either from tar's output or by directly reading the GNU Tar incremental tar archive format.

In my view, GNU Tar's approach is sensible and it comes with some useful properties (although there are tradeoffs). Conveniently, you can reconstruct the full directory tree as of that point in time from any single incremental archive; you don't have to go through a series of them to build up the picture. This probably also makes things somewhat more resilient if you're missing some incremental archives in the middle, since at least you know what's supposed to be there even if you don't have any copy of it. Finding where a single file was deleted is better than it would be if there were explicit deletion records, since you can do a binary search across incrementals to find the first one where it doesn't appear. The lack of explicit deletion reports does make it inconvenient to determine everything that was deleted between two successive incrementals, but on the other hand you can determine what was deleted (or added) between any two tar archives without having to go through every incremental between them.

(You could say that GNU Tar incremental archives have a snapshot of the directory tree state instead of carrying a journal of changes to the state.)

Two challenges of incremental backups

By: cks

Roughly speaking, there are two sorts of backups that you can make, full backups and incremental backups. At the abstract level, full backups are pretty simple; you save everything that you find. Incremental backups are more complicated because they save only the things that changed since whatever they're relative to. People want incremental backups despite the extra complexity because they save a lot of space compared to backing up everything all the time.

There are two general challenges that make incremental backups more complicated than full backups. The first challenge is reliably finding everything that's changed, in the face of all of the stuff that can change in filesystems (or other sources of data). Full backups only need to be able to traverse all of the filesystem (or part of it), or in general the data source, and this is almost always a reliable thing because all sorts of things and people use it. Finding everything that has changed has historically been more challenging because it's not something that people do often outside of incremental backups.

(And when people do it they may not notice if they're missing some things, the way they absolutely will notice if a general traversal skips some files.)

The second challenge is handling things that have gone away. Once you have a way to find everything that's changed it's not too difficult to build a backup system that will faithfully reproduce everything that definitely was there as of the incremental. All you need to do is save every changed file and then unpack the sequence of full and incremental backups on top of each other, with the latest version of any particular file overwriting any previous one. But people often want their incremental restore to reflect the state of directories and so on as of the incremental, which means removing things that have been deleted (both files and perhaps entire directory trees). This means that your incrementals need some way to pass on information about things that were there in earlier backups but aren't there now, so that the restore process can either not restore them or remove them as it restores the sequence of full and incremental backups.

While there are a variety of ways to tackle the first challenge, backup systems that want to run quickly are often constrained by what features operating systems offer (and also what features your backup system thinks it can trust, which isn't always the same thing). You can checksum everything all the time and keep a checksum database, but that's usually not going to be the fastest thing. The second challenge is much less constrained by what the operating system provides, which means that in practice it's much more on you (the backup system) to come up with a good solution. Your choice of solution may interact with how you solve the first challenge, and there are tradeoffs in various approaches you can pick (for example, do you represent deletions explicitly in the backup format or are they implicit in various ways).

There is no single right answer to these challenges. I'll go as far as to say that the answer depends partly on what sort of data and changes you expect to see in the backups and partly where you want to put the costs between creating backups and handling restores.

Understanding the limitation of 'do in new frame/window' in GNU Emacs

By: cks

GNU Emacs has a core model for how it operates, and some of its weird seeming limitations are easier to understand if you internalize that model. One of them is what you have to do in GNU Emacs to get the perfectly sensible operation of 'do <X> in a new frame or window'. For instance, one of the things I periodically want to do in MH-E is 'open a folder in a new frame', so that I can go through it while keeping my main MH-E environment on my inbox to process incoming email.

If you dig through existing GNU Emacs ELisp functions, you won't find a 'make-frame-do-operation' function, which is a bit frustrating. GNU Emacs has a whole collection of operations for making a new frame, and I can run mh-visit-folder in the context of this frame, so it seems like there should be a simple function I could invoke to do this and create my own 'C-x 5 v' binding for 'visit MH-E folder in other frame'.

The clue to what's going on is in the description of C-x 5 5 from the Creating Frames page of the manual, with the emphasis mine:

A more general prefix command that affects the buffer displayed by a subsequent command invoked after this prefix command (other-frame-prefix). It requests the buffer to be displayed by a subsequent command to be shown in another frame.

GNU Emacs frames (and windows) don't run commands and show their output; they display (GNU Emacs) buffers. In order to create a frame, you must have some buffer to display on that frame, and GNU Emacs must know what it is. GNU Emacs has some relatively complex and magical code to implement the 'C-x 5 5' and 'C-x 4 4' prefix commands, but it's all still fundamentally starting from having some buffer to display, not from running a command. The code basically assumes you're running a command that will at some point try to display a buffer, and it hooks into that 'please display this buffer' operation to make the new frame or window and then display the buffer in it.

(Buffers can be created to show files, but they can also be created for a lot of other purposes, including non-file buffers created by ELisp commands that want to present text to you. All of MH-E's buffers are non-file ones, as are things like Magit's information displays.)

The corollary of this is that the most straightforward way to write our own ELisp code to run a command in a new frame is to start out by switching to some buffer in another frame, such as '*scratch*', and then run our command. In an extremely minimal form, this looks like:

(defun mh-visit-folder-other-frame (folder &optional argp)
  "...."
  (interactive [...])
  (switch-to-buffer-other-frame "*scratch*")
  (mh-visit-folder folder argp))

If you know that your command displays a specific buffer, ideally you'll check to see if that buffer exists already and switch to it instead of to some scratch buffer that you're only using because you need to tell Emacs to display some buffer (any buffer) in the new frame.

(In normal GNU Emacs environments you can be pretty confident that there's a *scratch* buffer sitting around. GNU Emacs normally creates it on startup and most people don't delete it. And if you're writing your own code, you can make a point of not deleting it yourself.)

Now that I've written this entry, maybe I'll remember 'C-x 5 5' and also stop feeling vaguely irritated every time I do the equivalent by hand ('C-x 5 b', pick *scratch*, and then run my command in the newly created frame).

PS: It's probably possible to write a general ELisp function to run another function and make any buffers it wants to show come up on another frame, using the machinery that 'C-x 5 5' does. I will leave writing this function as an exercise for my readers (although maybe it already exists somewhere).

Sometimes giving syndication feed readers good errors is a mistake

By: cks

Yesterday I wrote about the problem of giving feed readers error messages that people will actually see, because you can't just give them HTML text; in practice you have to wrap your HTML text up in a stub, single-entry syndication feed (and then serve it with a HTTP 200 success code). In many situations you're going to want to do this by replying to the initial feed request with a HTTP 302 temporary redirection that winds up on your stub syndication feed (instead of, say, a general HTML page explaining things, such as "this resource is out of service but you might want to look at ...").

Yesterday I put this into effect for certain sorts of problems, including claimed HTTP User-Agents that are for old browsers. Then several people reported that this had caused Feedly to start presenting my feed as the special 'your feed reader is (claiming to be) a too-old browser' single entry feed. The apparent direct cause of this is that Feedly made some syndication feed requests with HTTP User-Agent headers of old versions of Chrome and Firefox, which wound up getting a series of HTTP 302 temporary redirections to my new 'your feed reader is a too-old browser' stub feed. Feedly then decided to switch its main feed fetcher over to directly using this new URL for various feeds, despite the HTTP redirections being temporary (and not served for its main feed fetcher, which uses "Feedly/1.0" for its User-Agent).

Feedly has been making these fake browser User-Agent syndication feed fetch attempts for some time, and for some time they've been getting HTTP 302 redirections. However, up until late yesterday, what Feedly wound up on was a regular HTML web page. I have to assume that since this wasn't a valid syndication feed, Feedly ignored it. Only when I did the right thing to give syndication feed readers a good, useful error result did Feedly receive a valid syndication feed and go over the cliff.

Providing a stub syndication feed to communicate errors and problems to syndication feed fetchers is clearly the technically correct answer. However, I'm now somewhat less convinced that it's the most useful answer in practice. In practice, plenty of syndication feed fetchers keep fetching and re-fetching these stub feeds from me, suggesting that people either aren't seeing them or aren't doing anything about it. And now I've seen a feed reader malfunction spectacularly and in a harmful way because I gave it a valid syndication feed result at the end of a temporary HTTP redirection.

(I will probably stick to the current situation, partly because I no longer feel like accepting bad behavior from web agents.)

PS: If you're a feed fetching system, please give your feeds IDs that you put in the User-Agent, so that when they all wind up shifted to the same URL through some misfortune, the website involved can sort them out and redirect them back to the proper URLs.

The problem of delivering errors to syndication feed readers

By: cks

Suppose, not hypothetically, that there are some feed readers (or at least things fetching your syndication feeds) that are misbehaving or blocked for one reason or another. You could just serve these feed readers HTTP 403 errors and stop there, but you'd like to be more friendly. For regular web browsers, you can either serve a custom HTTP error page that explains the situation or answer with a HTTP 302 temporary redirection to a regular HTML page with the explanation. Often the HTTP 302 redirection will be easier because you can use various regular means to create the HTML pages (and even host them elsewhere if you want). Unfortunately, this probably leaves syndication feed readers out in the cold.

(This can also come up if, for example, you decommission a syndication feed but want to let people know more about the situation than a simple HTTP 404 would give them.)

As far as I know, most syndication feed readers expect that the reply to their HTTP feed fetching request is in some syndication feed format (Atom, RSS, etc), which they will parse, process, and display to the person involved. If they get a reply in a different format, such as text/html, this is an error and it won't be shown to the person. Possibly the HTML <title> element will make it through, or the HTTP status code of an error response, or maybe both. But your carefully written HTML error page is unlikely to be seen.

(Since syndication feed readers need to be able to display HTML in general, they could do something to show people at least the basic HTML text they got back. But I don't think this is very common.)

As a practical thing, if you want people using blocked syndication feed readers to have a chance to see your explanation, you need to reply with a syndication feed with an entry that is your (HTML) message to them (either directly or through HTTP 302 redirections). Creating this stub feed and properly serving it to appropriate visitors may be anywhere from annoying to challenging. Also, you can't reply with HTTP error statuses (and the feed) even though that's arguably the right thing to do. If you want syndication feed readers to process your stub feed, you need to provide it as part of a HTTP 200 reply.

(Speaking from personal experience I can say that hand-writing stub Atom syndication feeds is a pain, and it will drive you to put very little HTML in the result. Which is okay, you can make it mostly a link to your regular HTML page about whatever issue it is.)
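For concreteness, a minimal stub Atom feed of this sort might look something like the following (all URLs, dates, and IDs are placeholders; the single entry just points at a regular HTML explanation page):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>This feed is unavailable</title>
  <id>https://example.org/feed-problem</id>
  <updated>2026-02-10T00:00:00Z</updated>
  <author><name>example.org</name></author>
  <entry>
    <title>Your feed fetcher is being blocked or rate limited</title>
    <id>https://example.org/feed-problem#notice</id>
    <updated>2026-02-10T00:00:00Z</updated>
    <link href="https://example.org/feed-problem.html"/>
    <content type="html">&lt;p&gt;Please see the &lt;a href="https://example.org/feed-problem.html"&gt;explanation page&lt;/a&gt;.&lt;/p&gt;</content>
  </entry>
</feed>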

If you're writing a syndication feed reader, I urge you to optionally display the HTML of any HTTP error response or regular HTML page that you receive. If I was writing some sort of blog system today, I would make it possible to automatically generate a syndication feed version of any special error page the software could serve to people (probably through some magic HTTP redirection). That way people can write each explanation only once and have it work in both contexts.

The (very) old "repaint mode" GUI approach

By: cks

Today I ran across another article that talked in passing about "retained mode" versus "immediate mode" GUI toolkits (this one, via), and gave some code samples. As usual when I read about immediate mode GUIs and see source code, I had a pause of confusion because the code didn't feel right. That's because I keep confusing "immediate mode" as used here with a much older approach, which I will call repaint mode for lack of a better description.

A modern immediate mode system generally uses double buffering; one buffer is displayed while the entire window is re-drawn into the second buffer, and then the two buffers are flipped. I believe that modern retained mode systems also tend to use double buffering to avoid screen tearing and other issues (and I don't know if they can do partial updates or have to re-render the entire new buffer). In the old days, the idea of having two buffers for your program's window was a decided luxury. You might not even have one buffer and instead be drawing directly onto screen memory. I'll call this repaint mode, because you directly repainted some or all of your window any time you needed to change anything in it.

You could do an immediate mode GUI without double buffering, in this repaint mode, but it would typically be slow and look bad. So instead people devoted a significant amount of effort to not repainting everything but instead identifying what they were changing and repainting only it, along with any pixels from other elements of your window that had been 'damaged' from prior activity. If you did do a broader repaint, you (or the OS) typically set clipping regions so that you wouldn't actually touch pixels that didn't need to be changed.

(The OS's display system typically needed to support clipping regions in any situation where windows partially overlapped yours, because it couldn't let you write into their pixels.)

One reason that old display systems worked this way is that it required as little memory as possible, which was an important consideration back in the day (which was more or less the 1980s to the early to mid 1990s). People could optimize their repaint code to be efficient and do as little work as possible, but they couldn't materialize RAM that wasn't there. Today, RAM is relatively plentiful and we care a lot more about non-tearing, coherent updates.

The typical code style for a repaint mode system was that many UI elements would normally only issue drawing commands to update or repaint themselves when they were altered. If you had a slider or a text field and its value was updated as a result of input, the code would typically immediately call its repaint function, which could lead to a relatively tight coupling of input handling to the rendering code (a coupling that I believe Model-view-controller was designed to break). Your system had to be capable of a full window repaint, but if you wanted to look good, it wasn't a common operation. A corollary of this is that your code might spend a significant amount of effort working out what was the minimal amount of repainting you needed to do in order to correctly get between two states (and this code could be quite complicated).

(Some of the time this was hidden from you in widget and toolkit internals, although they didn't necessarily give you minimal repaints as you changed widget organization. Also, just because a drawing operation was issued right away didn't mean that it took effect right away. In X, server side drawing operations might be batched up to be sent to the X server only when your program was about to wait for more X events.)

Because I'm used to this repaint mode style, modern immediate mode code often looks weird to me. There's no event handler connections, no repaint triggers, and so on, but there is an explicit display step. Alternately, you aren't merely configuring widgets and then camping out in the toolkit's main loop, letting it handle events and repaints for you (the widgets approach is the classical style for X applications, including PyTk applications such as pyhosts).

These days, I suspect that any modern toolkit that still looks like a repaint mode system is probably doing double buffering behind the scenes (unless you deliberately turn that off). Drawing directly to what's visible right now on screen is decidedly out of fashion because of issues like screen tearing, and it's not how modern display systems like Wayland want to operate. I don't know if toolkits implement this with a full repaint on the new buffer, or if they try to copy the old buffer to the new one and selectively repaint parts of it, but I suspect that the former works better with modern graphics hardware.

PS: My view is that even the widget toolkit version of repaint mode isn't a variation of retained mode because the philosophy was different. The widget toolkit might batch up operations and defer redoing layout and repainting things until you either returned to its event loop or asked it to update the display, but you expected a more or less direct coupling between your widget operations and repaints. But you can see it as a continuum that leads to retained mode when you decouple and abstract things enough.

(Now that I've written this down, perhaps I'll stop having that weird 'it's wrong somehow' reaction when I see immediate mode GUI code.)

Testing Linux memory limits is a bit of a pain

By: cks

For reasons outside of the scope of this entry, I want to test how various systemd memory resource limits work and interact with each other (which means that I'm really digging into cgroup v2 memory controls). When I started trying to do this, it turned out that I had no good test program (or programs), although I had some ones that gave me partial answers.

There are two complexities in memory usage testing programs in a cgroups environment. First, you may be able to allocate more memory than you can actually use, depending on your system's settings for strict overcommit. So it's not enough to see how much memory you can allocate using the mechanism of your choice (I tend to use mmap() rather than go through language allocators). After you've either determined how much memory you can allocate or allocated your target amount, you have to at least force the kernel to materialize your memory by writing something to every page of it. Since the kernel can probably swap out some amount of your memory, you may need to keep repeatedly reading all of it.

The second issue is that if you're not in strict overcommit (and sometimes even if you are), the kernel can let you allocate more memory than you can actually use and then hit you with the OOM killer when you try to use it. For my testing, I care about the actual usable amount of memory, not how much memory I can allocate, so I need to deal with this somehow (and this is where my current test programs are inadequate). Since the OOM killer can't be caught by a process (that's sort of the point), the simple approach is probably to have my test program progressively report on how much memory it's touched so far, so I can see how far it got before it was OOM-killed. A more complex approach would be to do the testing in a child process with progress reports back to the parent so it could try to narrow in on how much it could use rather than me guessing that I wanted progress reports every, say, 16 MBytes or 32 MBytes of memory touching.

(Hopefully the OOM killer would only kill the child and not the parent, but with the OOM killer you can never be sure.)
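For what it's worth, a minimal sketch of the simple progressive-reporting approach could look like this in Python (the total size and the reporting interval are arbitrary placeholders, and a real version would take them as arguments):

import mmap

SIZE = 2 * 1024 * 1024 * 1024   # how much memory to try to use
STEP = 32 * 1024 * 1024         # report every 32 MBytes touched

def main():
    mem = mmap.mmap(-1, SIZE)    # anonymous memory, not file backed
    for off in range(0, SIZE, mmap.PAGESIZE):
        mem[off] = 1             # force the kernel to materialize this page
        done = off + mmap.PAGESIZE
        if done % STEP == 0:
            print("touched %d MBytes" % (done // (1024 * 1024)), flush=True)

if __name__ == "__main__":
    main()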

I'm probably not the first person to have this sort of need, so I suspect that other people have written test programs and maybe even put them up somewhere. I don't expect to be able to find them in today's ambient Internet search noise, plus this is very close to the much more popular issue of testing your RAM.

(Will I put up my little test program when I hack it up? Probably not, it's too much work to do it properly, with actual documentation and so on. And these days I'm not very enthused about putting more repositories on Github, so I'd need to find some alternate place.)

Undo in Vi and its successors, and my views on the mess

By: cks

The original Bill Joy vi famously only had a single level of undo (which is part of what makes it a product of its time). The 'u' command either undid your latest change or it redid the change, undo'ing your undo. When POSIX and the Single Unix Specification wrote vi into the standard, they required this behavior; the vi specification requires 'u' to work the same as it does in ex, where it is specified as:

Reverse the changes made by the last command that modified the contents of the edit buffer, including undo.

This is one particular piece of POSIX compliance that I think everyone should ignore.

Vim and its derivatives ignore the POSIX requirement and implement multi-level undo and redo in the usual and relatively obvious way. The vim 'u' command only undoes changes but it can undo lots of them, and to redo changes you use Ctrl-r ('r' and 'R' were already taken). Because 'u' (and Ctrl-r) are regular commands they can be used with counts, so you can undo the last 10 changes (or redo the last 10 undos). Vim can be set to vi compatible behavior if you want. I believe that vim's multi-level undo and redo is the default even when it's invoked as 'vi' in an unconfigured environment, but I can't fully test that.

Nvi has opted to remain POSIX compliant and operate in the traditional vi way, while still supporting multi-level undo. To get multi-level undo in nvi, you extend the first 'u' with '.' commands, so 'u..' undoes the most recent three changes. The 'u' command can be extended with '.' in either of its modes (undo'ing or redo'ing), so 'u..u..' is a no-op. The '.' operation doesn't appear to take a count in nvi, so there is no way to do multiple undos (or redos) in one action; you have to step through them by hand. I'm not sure how nvi reacts if you want to do things like move your cursor position during an undo or redo sequence (my limited testing suggests that it can perturb the sequence, so that '.' now doesn't continue undoing or redoing the way vim will continue if you use 'u' or Ctrl-r again).

The vi emulation package evil for GNU Emacs inherits GNU Emacs' multi-level undo and nominally binds undo and redo to 'u' and Ctrl-r respectively. However, I don't understand its actual stock undo behavior. It appears to do multi-level undo if you enter a sequence of 'u' commands and accepts a count for that, but it doesn't feel vi or vim compatible if you intersperse 'u' commands with things like cursor movement, and I don't understand redo at all (evil has some customization settings for undo behavior, especially evil-undo-system). I haven't investigated Evil extensively and this undo and redo stuff makes me less likely to try using it in the future.

The BusyBox implementation of vi is minimal but it can be built with support for 'u' and multi-level undo, which is done by repeatedly invoking 'u'. It doesn't appear to have any redo support, which makes a certain amount of sense in an environment where your biggest concern may be reverting things so they're no worse than they started out. The Ubuntu and Fedora versions of busybox appear to be built this way, but your distance may vary on other Linuxes.

My personal view is that the vim undo and redo behavior is the best and most human friendly option. Undo and redo are predictable and you can predictably intersperse undo and redo operations with other operations that don't modify the buffer, such as moving the cursor, searching, and yanking portions of text. The nvi behavior essentially creates a special additional undo mode, where you have to remember that you're in a sequence of undo or redo operations and you can't necessarily do other vi operations in the middle (such as cursor movement, searches, or yanks). This matters a lot to me because I routinely use multi-level undo when I'm writing text to rewind my buffer to a previous state and yank out some wording that I've decided I like better than its replacement.

(For additional vi versions, on the Fediverse, I was also pointed to nextvi, which appears to use vim's approach to undo and redo; I believe neatvi also does this but I can't spot any obvious documentation on it. There are vi-inspired editors such as vile and vis, but they're not things people would normally use as a direct replacement for vi. I believe that vile follows the nvi approach of 'u.' while vis follows the vim model of 'uu' and Ctrl-r.)

Moving to make many of my SSH logins not report things on login

By: cks

I've been logging in to Unix machines for what is now quite a long time. When I started, it was traditional for your login process to be noisy. The login process itself would tell you last login details and the 'message of the day' ('motd'), and people often made their shell .profile or .login report more things, so you could see things like:

Last login: Tue Feb 10 22:16:14 2026 from 128.100.X.Y
 22:22:42 up 1 day, 11:22,  3 users,  load average: 0.40, 2.95, 3.30
cks cks cks
[output from fortune elided]
: <host> ;

(There is no motd shown here but it otherwise hits the typical high points, including a quote from fortune. People didn't always use 'fortune' itself but printing a randomly selected quote on login used to be common.)

Many years ago I modified my shell environment on our servers so that it wouldn't report the currently logged in users, show the motd, or tell me my last login. But I kept the 'uptime' line:

$ ssh cs.toronto.edu
 22:26:05 up 209 days,  5:26, 167 users,  load average: 0.47, 0.51, 0.60
: apps0.cs ;

Except, I typically didn't see that. I see this only on full login sessions, and when I was in the office I typically used special tools (also, also, also) that didn't actually start a login session and so didn't show me this greeting banner. Only when I was at home did I do SSH logins (with tooling) and so see this, and I didn't do that very much (because I didn't normally work from home, so I had no reason to be routinely opening windows on our servers).

As a long term result of that 2020 thing I work from home a lot more these days and so I open up a lot more SSH logins than I used to. Recently I was thinking about how to make this feel nicer, and it struck me that one of the things I found quietly annoying was that line from 'uptime' (to the point that sometimes my first action on login was to run 'clear', so I had a clean window). It was the one last thing cluttering up 'give me a new window on host X' and making the home experience visibly different from the office experience.

So far I've taken only a small step forward. I've made it so that I skip running 'uptime' if I'm logging in from home and the load on the machine I'm logging in to is sufficiently low to be uninteresting (which is often the case). As I get used to (or really, accept) this little change, I'll probably slowly move to silence 'uptime' more often.
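(The sketch below is roughly the shape of that check as it might appear in a Bourne-style .profile; the home network range and the load threshold are placeholders, and the details of how I actually detect "from home" differ.)

# Only print uptime when it's likely to be interesting.
case "${SSH_CLIENT:-}" in
    198.51.100.*)
        # First field of /proc/loadavg is the 1-minute load average.
        load1=$(awk '{print int($1)}' /proc/loadavg)
        [ "$load1" -ge 1 ] && uptime
        ;;
    *)
        uptime
        ;;
esac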

When I think about it, making this change feels long overdue. Printing out all sorts of things on login made sense in a world where I logged in to places relatively infrequently. But that's not the case in my world any more. My terminal windows are mostly transient and I mostly work on servers that I have to start new windows on, and right from very early I made my office environment not treat them as login sessions, with the full output and everything (if I cared about routinely seeing the load on a server, that's what xload was for (cf)).

(I'm bad about admitting to myself that my usage has shifted and old settings no longer make sense.)

A fun Python puzzle with circular imports

By: cks

Baptiste Mispelon asked an interesting Python quiz (via, via @glyph):

Can someone explain this #Python import behavior?
I'm in a directory with 3 files:

a.py contains `A = 1; from b import *`
b.py contains `from a import *; A += 1`
c.py contains `from a import A; print(A)`

Can you guess and explain what happens when you run `python c.py`?

I encourage you to guess which of the options in the original post is the actual behavior before you read the rest of this entry.

There are two things going on here. The first thing is what actually happens when you do 'from module import ...'. The short version is that this copies the current bindings of names from one module to another. So when module b does 'from a import *', it copies the binding of a.A to b.A and then the += changes that binding. The behavior would be the same if we used 'from a import A' and 'from b import A' in the code, and if we did we could describe what each did in isolation as starting with 'A = 1' (in a), then 'A = a.A; A += 1' (in b), and then 'A = b.A' (back in a) successively (and then in c, 'A = a.A').

The second thing going on is that you can import incomplete modules (this is true in both Python 2 and Python 3, which return the same results here). To see how this works we need to combine the description of 'import' and 'from' and the approximation of what happens during loading a module, although neither is completely precise. To summarize, when a module is being loaded, the first thing that happens is that a module namespace is created and is added to sys.modules; then the code of the module is executed in that namespace. When Python encounters a 'from', if there is an entry for the module in sys.modules, Python immediately imports things from it; it implicitly assumes that the module is already fully loaded.

At first I was surprised by this behavior, but the more I think about it the more it seems a reasonable choice. It avoids having to explicitly detect circular imports and it makes circular imports work in the simple case (where you do 'import b' and then don't use anything from b until all imports are finished and the program is running). It has the cost that if you have circular name uses you get an unhelpful error message about 'cannot import name' (or 'NameError: name ... is not defined' if you use 'from module import *'):

$ cat a.py
from b import B; A = 10 + B
$ cat b.py
from a import A; B = 20 + A
$ cat c.py
from a import A; print(A)
$ python c.py
[...]
ImportError: cannot import name 'A' from 'a' [...]

(Python 3.13 does print a nice stack trace that points to the whole set of 'from ...' statements.)

Given all of this, here is what I believe is the sequence of execution in Baptiste Mispelon's example:

  1. c.py does 'from a import A', which initiates a load of the 'a' module.
  2. an 'a' module is created and added to sys.modules
  3. that module begins executing the code from a.py, which creates an 'a.A' name (bound to 1) and then does 'from b import *'.
  4. a 'b' module is created and added to sys.modules.
  5. that module begins executing the code from b.py. This code starts by doing 'from a import *', which finds that 'sys.modules["a"]' exists and copies the a.A name binding, creating b.A (bound to 1).
  6. b.py does 'A += 1', which mutates the b.A binding (but not the separate a.A binding) to be '2'.
  7. b.py finishes its code, returning control to the code from a.py, which is still part way through 'from b import *'. This import copies all names (and their bindings) from sys.modules["b"] into the 'a' module, which means the b.A binding (to 2) overwrites the old a.A binding (to 1).
  8. a.py finishes and returns control to c.py, where 'from a import A' can now complete by copying the a.A name and its binding into 'c', making it the equivalent of 'import a; A = a.A; del a'.
  9. c.py prints the value of this, which is 2.

At the end of things, c.A, a.A, and b.A all exist, and they are all bindings to the same object. The order of binding was 'b.A = 2; a.A = b.A; c.A = a.A'.

(There's also a bonus question, where I have untested answers.)

Sidebar: A related circular import puzzle and the answer

Let's take a slightly different version of my error message example above, that simplifies things by leaving out c.py:

$ cat a.py
from b import B; A = 10 + B
$ cat b.py
from a import A; B = 20 + A
$ python a.py
[...]
ImportError: cannot import name 'B' from 'b' [...]

When I first did this I was quite puzzled until the penny dropped. What's happening is that running 'python a.py' isn't creating an 'a' module but instead a __main__ module, so b.py doesn't find a sys.modules["a"] when it starts and instead creates one and starts loading it. That second version of a.py, now in an "a" module, is what tries to refer to b.B and finds it not there (yet).

Systemd and blocking connections to localhost, including via 'any'

By: cks

I recently discovered a surprising path to accessing localhost URLs and services, where instead of connecting to 127.0.0.1 or the IPv6 equivalent, you connected to 0.0.0.0 (or the IPv6 equivalent). In that entry I mentioned that I didn't know if systemd's IPAddressDeny would block this. I've now tested this, and the answer is that systemd's restrictions do block this. If you set 'IPAddressDeny=localhost', the service or whatever is blocked from the 0.0.0.0 variation as well (for both outbound and inbound connections). This is exactly the way it should be, so you might wonder why I was uncertain and felt I needed to test it.
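(For concreteness, the restriction I'm talking about is just a unit file setting along these lines; 'IPAddressDeny' and its symbolic 'localhost' value are standard systemd directives, and the service itself is whatever you're confining:)

[Service]
# 'localhost' covers 127.0.0.0/8 and ::1, and in my testing the
# filtering also catches connections made via 0.0.0.0.
IPAddressDeny=localhost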

There are a variety of ways at different levels that you might implement access controls on a process (or a group of processes) in Linux, for IP addresses or anything else. For example, you might create an eBPF program that filtered the system calls and system call arguments allowed and attach it to a process and all of its children using seccomp(2). Alternately, for filtering IP connections specifically, you might use a cgroup socket address eBPF program (also), which is one of the cgroup program types that are available. Or perhaps you'd prefer to use a cgroup socket buffer program.

How a program such as systemd implements filtering has implications for what sort of things it has to consider and know about when doing the filtering. For example, if we reasonably conclude that the kernel will have mapped 0.0.0.0 to 127.0.0.1 by the time it invokes cgroup socket address eBPF programs, such a program doesn't need to have any special handling to block access to localhost by people using '0.0.0.0' as the target address to connect to. On the other hand, if you're filtering at the system call level, the kernel has almost certainly not done such mapping at the time it invokes you, so your connect() filter had better know that '0.0.0.0' is equivalent to 127.0.0.1 and it should block both.

This diversity is why I felt I couldn't be completely sure about systemd's behavior without actually testing it. To be honest, I didn't know what the specific options were until I researched them for this entry. I knew systemd used eBPF for IPAddressDeny (because it mentions that in the manual page in passing), but I vaguely knew there are a lot of ways and places to use eBPF and I didn't know if systemd's way needed to know about 0.0.0.0 or if systemd did know.

Sidebar: What systemd uses

As I found out through use of 'bpftool cgroup list /sys/fs/cgroup/<relevant thing>' on a systemd service that I knew uses systemd IP address filtering, systemd uses cgroup socket buffer programs, and is presumably looking for good and bad IP addresses and netblocks in those programs. This unfortunately means that it would be hard for systemd to have different filtering for inbound connections as opposed to outgoing connections, because at the socket buffer level it's all packets.

(You'd have to go up a level to more complicated filters on socket address operations.)

The original vi is a product of its time (and its time has passed)

By: cks

Recently I saw another discussion of how some people are very attached to the original, classical vi and its behaviors (cf). I'm quite sympathetic to this view, since I too am very attached to the idiosyncratic behavior of various programs I've gotten used to (such as xterm's very specific behavior in various areas), but at the same time I had a hot take over on the Fediverse:

Hot take: basic vim (without plugins) is mostly what vi should have been in the first place, and much of the differences between vi and vim are improvements. Multi-level undo and redo in an obvious way? Windows for easier multi-file, cross-file operations? Yes please, sign me up.

Basic vi is a product of its time, namely the early 1980s, and the rather limited Unix machines of the time (yes a VAX 11/780 was limited).

(The touches of vim superintelligence, not so much, and I turn them off.)

For me, vim is a combination of genuine improvements in vi's core editing behavior (cf), frustrating (to me) bits of trying too hard to be smart (which I mostly disable when I run across them), and an extension mechanism I ignore but people use to make vim into a superintelligent editor with things like LSP integrations.

Some of the improvements and additions to vi's core editing may be things that Bill Joy either didn't think of or didn't think were important enough. However, I feel strongly that some or even many of the omitted features and differences are a product of the limited environments vi had to operate in. The poster child for this is vi's support of only a single level of undo, which drastically constrains the potential memory requirements (and implementation complexity) of undo, especially since a single editing operation in vi can make sweeping changes across a large file (consider a whole-file ':...s/../../' substitution, for example).

(The lack of split windows might be one part memory limitations and one part that splitting an 80 by 24 serial terminal screen is much less useful than splitting, say, an 80 by 50 terminal window.)

Vim isn't the only improved version of vi that has added features like multi-level undo and split windows so you can see multiple files at once (or several parts of the same file); there's also at least nvi. I'm used to vim so I'm biased, but I happen to think that a lot of vim's choices for things like multi-level undo are good ones, ones that will be relatively obvious and natural to new people and avoid various sorts of errors and accidents. But other people like nvi and I'm not going to say they're wrong.

I do feel strongly that giving stock vi to anyone who doesn't specifically ask for it is doing them a disservice, and this includes installing stock vi as 'vi' on new Unix installs. At this point, what new people are introduced to and what is the default on systems should be something better and less limited than stock vi. Time has moved on and Unix systems should move on with it.

(I have similar feelings about the default shell for new accounts for people, as opposed to system accounts. Giving people bare Bourne shell is not doing them any favours and is not likely to make a good first impression. I don't care what you give them but it should at least support cursor editing, file completion, and history, and those should be on by default.)

PS: I have complicated feelings about Unixes that install stock vi as 'vi' and something else under its full name, because on the one hand that sounds okay but on the other hand there is so much stuff out there that says to use 'vi' because that's the one name that's universal. And if you then make 'vi' the name of the default (visual) editor, well, it certainly feels like you're steering new people into it and doing them a disservice.

(I don't expect to change the mind of any Unix that is still shipping stock vi as 'vi'. They've made their cultural decisions a long time ago and they're likely happy with the results.)

How we failed to notice a power failure

By: cks

Over on the Fediverse, I mentioned that we once missed noticing that there had been a power failure. Naturally there is a story there (and this is the expanded version of what I said in the Fediverse thread). A necessary disclaimer is that this was all some time ago and I may be mangling or mis-remembering some of the details.

My department is spread across multiple buildings, one of which has my group's offices and our ancient machine room (which I believe has been there since the building burned down and was rebuilt). But for various reasons, this building doesn't have any of the department's larger meeting rooms. Once upon a time we had a weekly meeting of all the system administrators (and our manager), both my group and all of the Points of Contact, which amounted to a dozen people or so and needed one of the larger meeting rooms, which was of course in a different building than our machine room.

As I was sitting in the meeting room during one weekly meeting, fiddling around, I tried to get my Linux laptop on either our wireless network or our wired laptop network (it's been long enough that I can't remember which). This was back in the days when networking on Linux laptops wasn't a 100% reliable thing, especially wireless, so I initially assumed that my inability to get on the network was the fault of my laptop and its software. Only after a bit of time and also failing on both wired and wireless networking did I ask to see if anyone else (with a more trustworthy laptop) could get on the network. As a ripple of "no, not me" spread around the room, we realized that something was wrong.

(This was in the days before smartphones were pervasive, and also it must have been before the university-wide wireless network was available in that meeting room.)

What was wrong turned out to be a short power failure that had been isolated to the building that our machine room was in. Had people been in their offices, the problem would have been immediately obvious; we'd have seen all networking fail, and the people in the building would have seen the lights go out and so on. But because the power issue hit at exactly the time that we were all in our weekly meeting in a different building, we missed it.

(My memory is that by the time we'd reached the machine room the power was coming back, but obviously we had a variety of work to do to clean the situation up so that was it for the meeting.)

For extra irony, the building we were meeting in was right next to our machine room's building, and the meeting room had a window that literally looked across the alleyway at our building. At least that made it quick and easy to get to the machine room, because we could just walk across the bridge that connects the two buildings.

PS: In our environment, this is such a rare collection of factors that it's not worth trying to set up some sort of alerting for it, especially today in a world with pervasive smartphones (where people outside the meeting room can easily send some of us messages, even with the network down).

(Also, these days we don't normally have such big meetings any more and if we did, they'd be virtual meetings and we'd definitely notice bits of the network going down, one way or another.)

A surprising path to accessing localhost URLs and HTTP services

By: cks

One of the classic challenges in web security is DNS rebinding. The simple version is that you put some web service on localhost in order to keep outside people from accessing it, and then some joker out in the world makes 'evil.example.org' resolve to 127.0.0.1 and arranges to get you to make requests to it. Sometimes this is through JavaScript in a browser, and sometimes this is by getting you to fetch things from URLs they supply (because you're running a service that fetches and processes things from external URLs, for example).

One way people defend against this is by screening out 127.0.0.0/8, IPv6's ::1, and other dangerous areas of IP address space from DNS results (either in the DNS resolver or in your own code). And you can also block URLs with these as explicit IP addresses, or 'localhost' or the like. Sometimes you might add extra security restrictions to a process or an environment through means like Linux eBPF to screen out which IP addresses you're allowed to connect to (cf, and I don't know whether systemd's restrictions would block this).

As I discovered the other day, if you connect to INADDR_ANY, you connect to localhost (which any number of people already knew). Then in a comment Kevin Lyda reminded me that INADDR_ANY is also known as 0.0.0.0, and '0' is often accepted as a name that will turn into it, resulting in 'ssh 0' working and also (in some browsers) 'http://0:<port>/'. The IPv6 version of INADDR_ANY is also an all-zero address, and '::0' and '::' are both accepted as names for it, and then of course it's easy to create DNS records that resolve to either the IPv4 or IPv6 versions. As I said on the Fediverse:

Surprise: blocking DNS rebinding to localhost requires screening out more than 127/8 and ::1 answers. This is my face.

It turns out that this came up in mid 2024 in the browser context, as '0.0.0.0 Day' (cf). Modern versions of Chrome and Safari apparently explicitly block requests to 0.0.0.0 (and presumably also the IPv6 version), while Firefox will still accept it. And of course your URL-fetching libraries will almost certainly also accept it, especially through DNS lookups of ordinary looking but attacker controlled hostnames.
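(Part of why libraries accept it is that the obvious address checks don't flag 0.0.0.0 as loopback. A small Python illustration, using the standard ipaddress module:)

import ipaddress

a = ipaddress.ip_address("0.0.0.0")
print(a.is_loopback)     # False, so a naive 'reject loopback addresses' filter passes it
print(a.is_unspecified)  # True, which is the extra check you'd need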

In my view, it's not particularly anyone's fault that this slipped through the cracks, both in browsers and in tools that handle fetching content from potentially hostile URLs. The reality of life is that how IP behaves in practice is complicated and some of it is historical practice that's been carried forward and isn't necessarily obvious or well known (and certainly isn't standardized). Then URLs build on top of this somewhat rickety foundation and surprises happen.

(This is related to the issue of browsers being willing to talk to 'local' IPs, which Chrome once attempted to start blocking (and I believe that shipped, but I don't use Chrome any more so I don't know what the current state is).)

The meaning of connecting to INADDR_ANY in TCP and UDP

By: cks

An interesting change to IP behavior landed in FreeBSD 15, as I discovered by accident. To quote from the general networking section of the FreeBSD 15 release notes:

Making a connection to INADDR_ANY, i.e., using it as an alias for localhost, is now disabled by default. This functionality can be re-enabled by setting the net.inet.ip.connect_inaddr_wild sysctl to 1. cd240957d7ba

The change's commit message has a bit of a different description:

Previously connect() or sendto() to INADDR_ANY reached some socket bound to some host interface address. Although this was intentional it was an artifact of a different era, and is not desirable now.

This is connected to an earlier change and FreeBSD bugzilla #28075, which has some additional background and motivation for the overall change (as well as the history of this feature in 4.x BSD).

The (current) Linux default behavior matches the previous FreeBSD behavior. If you had something listening on localhost (in IPv4, specifically 127.0.0.1) or listening on INADDR_ANY, connecting to INADDR_ANY would reach it and give the source of your connection a localhost address (either 127.0.0.1 or ::1 depending on IPv4 versus IPv6). Obviously the current FreeBSD default behavior has now changed, and the Linux behavior may change at some point (or at least become something that can be changed by a sysctl).

(Linux specifically restricts you to connecting to 127.0.0.1; you can't reach a port listening on, eg, 127.0.0.10, although that is also a localhost address.)

One of the tricky API issues here is that higher level APIs can often be persuaded or tricked into using INADDR_ANY by default when they connect to something. For example, in Go's net package, if you leave the hostname blank, you currently get INADDR_ANY (which is convenient behavior for listening but not necessarily for connecting). In other APIs, your address variable may start with an initial zero value for the target IP address, which is INADDR_ANY for IPv4; if your code never sets it (perhaps because the 'host' is a blank string), you get a connection to INADDR_ANY and thus to localhost. In top of that, a blank host name to connect to may have come about through accident or through an attacker's action (perhaps they can make decoding or parsing the host name fail, leaving the 'host name' blank on you).
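(A small Go sketch of the current Linux behavior, which is my own test code and not anything from Go itself: the listener is bound to 127.0.0.1 only, but dialing with a blank host still reaches it.)

package main

import (
	"fmt"
	"net"
)

func main() {
	// Listen only on IPv4 localhost.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	_, port, _ := net.SplitHostPort(ln.Addr().String())

	// A blank host means INADDR_ANY, which on Linux (and on FreeBSD
	// before 15's change) connects you to the localhost listener.
	c, err := net.Dial("tcp", ":"+port)
	fmt.Println(c != nil, err)
}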

I believe that what's happening with Go's tests (which I gather run into trouble on FreeBSD 15 because of this change) is that the net package guarantees that things like net.Dial("tcp", ":<port>") connect to localhost, so of course the net package has tests to ensure that this stays working. Currently, Go's net package implements this behavior by mapping a blank host to INADDR_ANY, which has traditionally worked and been the easiest way to get the behavior Go wants. It also means that Go can use uniform parsing of 'host:port' for both listening, where ':port' is required to mean listening on INADDR_ANY, and for connecting, where a blank host has to mean localhost. Since this is a high level API, Go can change how the mapping works, and it pretty much has to in order to fully work as documented on FreeBSD 15 in a stock configuration.

(Because that would be a big change to land right before the release of Go 1.26, I suspect that the first bugfix that will land is to skip these tests on FreeBSD, or maybe only on FreeBSD 15+ if that's easy to detect.)

I prefer to pass secrets between programs through standard input

By: cks

There are a variety of ways to pass secrets from one program to another on Unix, and many of them may expose your secrets under some circumstances. A secret passed on the command line is visible in process listings; a secret passed in the environment can be found in the process's environment (which can usually be inspected by outside parties). When I've had to deal with this in administrative programs in our environment, I have reached for an old Unix standby: pass the secret between programs through file descriptors, specifically standard input and standard output. This can even be used and done in shell scripts. However, there are obviously some cautions, both in general and in shell scripts.

Although Bourne shell script variables look like environment variables, they aren't exported into the environment until you ask for this with 'export'. Naturally you should never do this for shell variables that hold secrets. Also, these days 'echo' is a built-in in any version of the Bourne shell you want to use, so 'echo $somesecret' does not actually run a process that has the secret visible in its command line arguments. However, you have to be careful what commands you use here, because potentially convenient ones like printf aren't guaranteed to be builtins in every shell, and an external command would expose the secret in its arguments.
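(A minimal sketch of the pattern in a shell script, using the same made-up 'consume-secret' name as below; the secret lives only in an unexported shell variable and on pipe file descriptors:)

#!/bin/sh
# Read the secret from our own standard input into an unexported variable.
IFS= read -r secret
# 'echo' is a shell builtin, so the secret never shows up in 'ps' output.
echo "$secret" | consume-secret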

As a general caution, you need to either limit the characters that are allowed in secrets or encode the secret somehow (you might as well use base64). If you need to pass more than one thing between your programs this way, you'll need to define a very tiny protocol, if only so that you write down the order that things are sent between programs (and if they are, for example, newline-delimited).

One advantage of passing secrets this way is that it's easy to pass them from machine to machine through mechanisms like SSH (if you have passwordless SSH). Instead of 'provide-secret | consume-secret', you simply change it to 'provide-secret | ssh remote consume-secret'.

In the right (Unix) environment it's possible to pass secrets this way to programs that want to read them from a file, using features like Bash's '<(...)' notation or the underlying Unix features that enable that Bash feature (specifically, /dev/fd).
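(For example, something like this, where 'wants-a-file' and its option are stand-ins for any program that insists on reading the secret from a file:)

# bash; the program sees the <(...) as a /dev/fd/<N> path
wants-a-file --password-file <(provide-secret)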

Passing secrets between programs this way can seem a little janky and improper, but I can testify that it works. We have a number of things that move secrets around this way, including across machines, and they've been doing it for years without problems.

(There are fancy ways to handle this on Linux for some sorts of secrets, generally static secrets, but I don't know of any other generally usable way of doing this for dynamic secrets that are generated on the fly, especially if some of the secrets consumers are shell scripts. But you probably could write a D-Bus based system to do this with all sorts of bells and whistles, if you had to do it a lot and wanted something more professional looking.)

The consoles of UEFI, serial and otherwise, and their discontents

By: cks

UEFI is the modern firmware standard for x86 PCs and other systems; sometimes the actual implementation is called a UEFI BIOS, but the whole area is a bit confusing. I recently wrote about getting FreeBSD to use a serial console on a UEFI system and mentioned that some UEFI BIOSes could echo console output to a serial port, which caused Greg A. Woods to ask a good question in a comment:

So, how does one get a typical UEFI-supporting system to use a serial console right from the firmware?

The mechanical answer is that you go into your UEFI BIOS settings and see if it has any options for what is usually called 'console redirection'. If you have it, you can turn it on and at that point the UEFI console will include the serial device you picked, theoretically allowing both output and input from the serial device. This is very similar to the 'console redirection' option in 'legacy' pre-UEFI BIOSes, although it's implemented rather differently. An important note here is that UEFI BIOS console redirection only applies to things using the UEFI console. Your UEFI BIOS definitely uses the UEFI console, and your UEFI operating system boot loader hopefully does. Your operating system almost certainly doesn't.

A UEFI BIOS doesn't need to have such an option and typical desktop ones probably don't. The UEFI standard provides a standard set of ways to implement console redirection (and alternate console devices in general), but UEFI doesn't require it; it's perfectly standard compliant for a UEFI BIOS to only support the video console. Even if your UEFI BIOS provides console redirection, your actual experience of trying to use it may vary. Watching boot output is likely to be fine, but trying to interact with the BIOS from your serial port may be annoying.

How all of this works is that UEFI has a notion of an EFI console, which is (to quote the documentation) "used to handle input and output of text-based information intended for the system user during the operation of code in the boot services environment". The EFI console is an abstract thing, and it's also some globally defined variables that include ConIn and ConOut, the device paths of the console input and output device or devices. Device paths can include multiple sub-devices (in generic device path structures), and one of the examples specifically mentioned is:

[...] An example of this would be the ConsoleOut environment variable that consists of both a VGA console and serial output console. This variable would describe a console output stream that is sent to both VGA and serial concurrently and thus has a Device Path that contains two complete Device Paths. [...]

(Sometimes this is 'ConsoleIn' and 'ConsoleOut', eg, and sometimes 'ConIn' and 'ConOut'. Don't ask me why.)

In theory, a UEFI BIOS can hook a wide variety of things up to ConIn, ConOut, or both, as it decides (and implements), possibly including things like IPv4 connections. In practice it's up to the UEFI BIOS to decide what it will bother to support. Server UEFI BIOSes will typically support serial console redirection, which is to say connecting some serial port to ConIn and ConOut in addition to the VGA console. Desktop motherboard UEFI BIOSes probably won't. I don't know if there are very many server UEFI BIOSes that will use only the serial console and exclude the VGA console from ConIn and ConOut.

(Also in theory I believe a UEFI BIOS could wire up ConOut to include a serial port but not connect it to ConIn. In practice I don't know of any that do.)
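(If you're curious what your firmware actually put in ConIn and ConOut, on a booted Linux machine you can usually dump the variables with efivar; the GUID here is the standard EFI global variable namespace, and the output is the variable's attributes plus a hex dump of the device path data. A sketch:)

$ efivar -p -n 8be4df61-93ca-11d2-aa0d-00e098032b8c-ConOut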

EFI also defines a protocol (a set of function calls) for console input and output. For input, what people (including the UEFI BIOS itself) get back is either or both of an EFI scan code or a Unicode character. The 'EFI scan code' is used to determine what special key you typed, for example F11 to go into some UEFI BIOS setup mode. The UEFI standard also has an appendix with examples of mapping various sorts of input to these EFI scan codes, which is very relevant for entering anything special over a serial console.

If you look at this appendix B, you'll note that it has entries for both 'ANSI X3.64 / DEC VT200-500 (8-bit mode)' and 'VT100+ (7-bit mode)'. Now you have two UEFI BIOS questions. First, does your UEFI BIOS even implement this, or does it either ignore the whole issue (leaving you with no way to enter special characters) or come up with its own answers? And second, does your BIOS restrict what it recognizes over the serial port to just whatever type it's set the serial port to, or will it recognize either sequence for something like F11? The latter question is very relevant because your terminal emulator environment may or may not generate what your UEFI BIOS wants for special keys like F11 (or it may even intercept some keys, like F11; ideally you can turn this off).

(Another question is what your UEFI BIOS may call the option that controls what serial port key mapping it's using. One machine I've tested on calls the setting "Putty KeyPad" and the correct value for the "ANSI X3.64" version is "XTERMR6", for example, which corresponds to what xterm, Gnome-Terminal and probably other modern terminal programs send.)

Another practical issue is that if you do anything fancy with a UEFI serial console, such as go into the BIOS configuration screens, your UEFI BIOS may generate output that assumes a very specific and unusual terminal resolution. For instance, the Supermicro server I've been using for my FreeBSD testing appears to require a 100x30 terminal in its BIOS configuration screens; if you have any other resolution you get various sorts of jumbled results. Many of our Dell servers take a different approach, where the moment you turn on serial console redirection they choke their BIOS configuration screens down to an ASCII 80x24 environment. OS boot environments may be more forgiving in various ways.

The good news is that your operating system's bootloader will probably limit itself to regular characters, and in practice what you care about a lot of the time is interacting with the bootloader (for example, for alternate boot and disaster recovery), not your UEFI BIOS.

As FreeBSD discusses in loader.efi(8), it's not necessarily straightforward for an operating system boot loader to decode what the UEFI ConIn and ConOut are connected to in order to pass the information to the operating system (which normally won't be using UEFI to talk to its console(s)). This means that the UEFI BIOS console(s) may not wind up being what the OS console(s) are, and you may have to configure them separately.

PS: As you may be able to tell from what I've written here, if you care significantly about UEFI BIOS access from the serial port, you should expect to do a bunch of experimentation with your specific hardware. Remember to re-check your results with new server generations and new UEFI BIOS firmware versions.

Estimating where your Prometheus Blackbox TCP query-response check failed

By: cks

As covered recently, the normal way to check simple services from outside in a Prometheus environment is with Prometheus Blackbox, which is somewhat complicated to understand. One of its abstractions is a prober, a generic way of checking some service using HTTP, DNS queries, a TCP connection, and so on. The TCP prober supports conducting a query-response dialog once you connect, but currently (as of Blackbox 0.28.0) it doesn't directly expose metrics that tell you where your TCP probe with a query-response set failed (and why), and sometimes you'd like to know.

A somewhat typical query-response probe looks like this:

  smtp_starttls:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220"
        - send: "EHLO something\r"
        - expect: "^250-STARTTLS"
        - expect: "^250 "
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - expect: "^220"
        - send: "QUIT\r"

To understand what metrics we can look for on failure, we need to both understand how each important option in a step can fail, and what metrics they either set on failure or create when they succeed.

  • starttls will fail if it can't successfully negotiate a TLS connection with the server, possibly including if the server's TLS certificate fails to verify. It sets no metrics on failure, but on success it will set various TLS related metrics such as the probe_ssl_* family and probe_tls_version_info.

  • send will fail if there is an error sending the line, such as the TCP connection closing on you. It sets no metrics on either success or failure.

  • expect reads lines from the TCP connection until either a line matches your regular expression, it hits EOF, or it hits a network error. If it hit a network error, including from the other end abruptly terminating the connection in a way that raises a local error, it sets no metrics. If it hit EOF, it sets the metric probe_failed_due_to_regex to 1; if it matched a line, it sets that metric to 0.

    One important case of 'network error' is if the check you're doing times out. This is internally implemented partly by putting a (Go) deadline on the TCP connection, which will cause an error if it runs too long. Typical Blackbox module timeouts aren't very long (how long depends on both configuration settings and how frequent your checks are; they have to be shorter than the check interval).

    If you have multiple 'expect' steps and your check fails at one of them, there's (currently) no way to find out which one it failed at unless you can determine this from other metrics, for example the presence or absence of TLS metrics.

  • expect_bytes fails if it doesn't immediately read those bytes from the TCP connection. If it failed because of an error or because it read fewer bytes than required (including no bytes, ie an EOF), it sets no metrics. If it read enough bytes it sets the probe_failed_due_to_bytes metric to either 0 (if they matched) or 1 (if they didn't).

In many protocols, the consequences of how expect works mean that if the server at the other end spits out some error response instead of the response you expect, your expect will skip over it and then wait endlessly. For instance, if the SMTP server you're probing gives you an SMTP 4xx temporary failure response in either its greeting banner or its reply to your EHLO, your 'expect' will sit there trying to read another line that might start with '220'. Eventually either your check will time out or the SMTP server will, and probably it will be your check (resulting in a 'network error' that leaves no traces in metrics). Generally this means you can only see a probe_failed_due_to_regex of 1 in a TCP probe based module if the other end cleanly closed the connection, so that you saw EOF. This tends to be pretty rare.

(We mostly see it for SSH probes against overloaded machines, where we connect but then the SSH daemon immediately closes the connection without sending the banner, giving us an EOF in our 'expect' for the banner.)

If the probe failed because of a DNS resolution failure, I believe that probe_ip_addr_hash will be 0 and I think probe_ip_protocol will also be 0.

If the check involves TLS, the presence of the TLS metrics in the result means that you got a connection and got as far as starting TLS. In the example above, this would mean that you got almost all of the way to the end.

I'm not sure if there's any good way to detect that the connection attempt failed. You might be able to reasonably guess that from an abnormally low probe_duration_seconds value. If you know the relevant timeout values, you can detect a probe that failed due to timeout by looking for a suitably high probe_duration_seconds value.

If you make use of the special labels action (sketched below), then the presence of a probe_expect_info metric means that the check got to that step. If you don't have any particular information that you want to capture from an expect line, you can use labels (once) to mark that you've succeeded at some expect step by using a constant value for your label.
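(As I understand the configuration syntax, a labels step is attached to an expect entry in the query_response list like this; 'stage' and 'tls_started' are arbitrary names I made up, and matching the expect is what creates the probe_expect_info metric carrying that label:)

    - expect: "^220"
      labels:
        - name: stage
          value: "tls_started"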

(Hopefully all of this will improve at some point and Blackbox will provide, for example, a metric that tells you the step number that a query-response block failed on. See issue #1528, and also issue #1527 where I wish for a way to make an 'expect' fail immediately and definitely if it receives known error responses, such as a SMTP 4xx code.)

Early Linux package manager history and patching upstream source releases

By: cks

One of the important roles of Linux system package managers like dpkg and RPM is providing a single interface to building programs from source even though the programs may use a wide assortment of build processes. One of the source building features that both dpkg and RPM included (I believe from the start) is patching the upstream source code, as well as providing additional files along with it. My impression is that today this is considered much less important in package managers, and some may make it at least somewhat awkward to patch the source release on the fly. Recently I realized that there may be a reason for this potential oddity in dpkg and RPM.

Both dpkg and RPM are very old (by Linux standards). As covered in Andrew Nesbitt's Package Manager Timeline, both date from the mid-1990s (dpkg in January 1994, RPM in September 1995). Linux itself was quite new at the time and the Unix world was still dominated by commercial Unixes (partly because the march of x86 PCs was only just starting). As a result, Linux was a minority target for a lot of general Unix free software (although obviously not for Linux specific software). I suspect that this was compounded by limitations in early Linux libc, where apparently it had some issues with standards (see eg this, also, also, also).

As a minority target, I suspect that Linux regularly had problems compiling upstream software, and for various reasons not all upstreams were interested in fixing (or changing) that (especially if it involved accepting patches to cope with a non standards compliant environment; one reply was to tell Linux to get standards compliant). This probably left early Linux distributions regularly patching software in order to make it build on (their) Linux, leading to first class support for patching upstream source code in early package managers.

(I don't know for sure because at that time I wasn't using Linux or x86 PCs, and I might have been vaguely in the incorrect 'Linux isn't Unix' camp. My first Linux came somewhat later.)

These days things have changed drastically. Linux is much more standards compliant and of course it's a major platform. Free software that works on non-Linux Unixes but doesn't build cleanly on Linux is a rarity, so it's much easier to imagine (or have) a package manager that is focused on building upstream source code unaltered and where patching is uncommon and not as easy (or trivial) as dpkg and RPM make it.

(You still need to be able to patch upstream releases to handle security patches and so on, since projects don't necessarily publish new releases for them. I believe some projects simply issue patches and tell you to apply them to their current release. And you may have to backport a patch yourself if you're sticking on an older release of the project that they no longer do patches for.)

Making a FreeBSD system have a serial console on its second serial port

By: cks

Over on the Fediverse I said:

Today's other work achievement: getting a UEFI booted FreeBSD 15 machine to use a serial console on its second serial port, not its first one. Why? Because the BMC's Serial over Lan stuff appears to be hardwired to the second serial port, and life is too short to wire up physical serial cables to test servers.

The basics of serial console support for your FreeBSD machine are covered in the loader.conf manual page, under the 'console' setting (in the 'Default Settings' section). But between UEFI and FreeBSD's various consoles, things get complicated, and for me the manual pages didn't do a great job of putting the pieces together clearly. So I'll start with my descriptions of all of the loader.conf variables that are relevant:

console="efi,comconsole"
Sets both the bootloader console and the kernel console to both the EFI console and the serial port, by default COM1 (ttyu0, Linux ttyS0). This is somewhat harmful if your UEFI BIOS is already echoing console output to the serial port (or at least to the serial port you want); you'll get doubled serial output from the FreeBSD bootloader, but not doubled output from the kernel.

boot_multicons="YES"
As covered in loader_simp(8), this establishes multiple low level consoles for kernel messages. It's not necessary if your UEFI BIOS is already echoing console output to the serial port (and the bootloader and kernel can recognize this), but it's harmless to set it just in case.

comconsole_speed="115200"
Sets the serial console speed (and in theory 115200 is the default). It's not necessary if the UEFI BIOS has set things up but it's harmless. See loader_simp(8) again.

comconsole_port="0x2f8"
Sets the serial port used to COM2. It's not necessary if the UEFI BIOS has set things up, but again it's harmless. You can use 0x3f8 to specify COM1, although it's the default. See loader_simp(8).

hw.uart.console="io:0x2f8,br:115200"
This tells the kernel where the serial console is and what baud rate it's at, here COM2 and 115200 baud. The loader will automatically set it for you if you set the comconsole_* variables, either because you also need a 'console=' setting or because you're being redundant. See loader.efi(8) (and then loader_simp(8) and uart(4)).

(That the loader does this even without a 'comconsole' in your nonexistent 'console=' line may some day be considered a bug and fixed.)

If they agree with each other, you can safely set both hw.uart.console and the comconsole_* variables.

On a system where the UEFI BIOS isn't echoing the UEFI console output to a serial port, the basic way to have FreeBSD use both the video console (settings for which are in vt(4)) and the serial console (on the default of COM1), with the video console as the primary, is a loader.conf setting of:

console="efi,comconsole"
boot_multicons="YES"

This will change both the bootloader console and the kernel console after boot. If your UEFI BIOS is already echoing 'console' output to the serial port, bootloader output will be doubled and you'll get to see fun bootloader output like:

LLooaaddiinngg  ccoonnffiigguurreedd  mmoodduulleess......

If you see this (or already know that your UEFI BIOS is doing this), the minimal alternate loader.conf settings (for COM1) are:

# for COM1 / ttyu0
hw.uart.console="io:0x3f8,br:115200"

(The details are covered in loader.efi(8)'s discussion of console considerations.)

If you don't need a 'console=' setting because of your UEFI BIOS, you must set either hw.uart.console or the comconsole_* settings. Technically, setting hw.uart.console is the correct approach; that setting only comconsole_* still works may be a bug.

If you don't explicitly set a serial port to use, FreeBSD will use COM1 (ttyu0, Linux ttyS0) for the bootloader and kernel. This is only possible if you're using 'console=', because otherwise you have to directly or indirectly set 'hw.uart.console', which directly tells the kernel which serial port to use (and the bootloader will use whatever UEFI tells it to). To change the serial port to COM2, you need to set the appropriate one of 'comconsole_port' and 'hw.uart.console' from 0x3f8 (COM1) to the right PC port value of 0x2f8.

So our more or less final COM2 /boot/loader.conf for a case where you can turn off or ignore the BIOS echoing to the serial console is:

console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# For the COM2 case
comconsole_port="0x2f8"

If your UEFI BIOS is already echoing 'console' output to the serial port, the minimal version of the above (again for COM2) is:

# For the COM2 case
hw.uart.console="io:0x2f8,br:115200"

(As with Linux, the FreeBSD kernel will only use one serial port as the serial console; you can't send kernel messages to two serial ports. FreeBSD at least makes this explicit in its settings.)

As covered in conscontrol and elsewhere, FreeBSD has a high level console, represented by /dev/console, and a low level console, used directly by the kernel for things like kernel messages. The high level console can only go to one device, normally the first one; this is either the first one in your 'console=' line or whatever UEFI considers the primary console. The low level console can go to multiple devices. Unlike Linux, this can be changed on the fly once the system is up through conscontrol (and also have its state checked).

Conveniently, you don't need to do anything to start a serial login on your chosen console serial port. All four possible (PC) serial ports, /dev/ttyu0 through /dev/ttyu3, come pre-set in /etc/ttys with 'onifconsole' (and 'secure'), so that if the kernel is using one of them, there's a getty started on it. I haven't tested what happens if you use conscontrol to change the console on the fly.
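(Roughly speaking, the stock entries look like the following, although the exact getty type and terminal type can vary between FreeBSD versions; 'onifconsole' is what makes the getty conditional on the port being a kernel console:)

ttyu0	"/usr/libexec/getty 3wire"	vt100	onifconsole secure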

Booting FreeBSD on a UEFI based system is covered through the manual page series of uefi(8), boot(8), loader.efi(8), and loader(8). It's not clear to me if loader.efi is the EFI specific version of loader(8), or if the one loads and starts the other in a multi-stage boot process. I suspect it's the former.

Sidebar: What we may wind up with in loader.conf

Here's what I think is a generic commented block for serial console support:

# Uncomment if the UEFI BIOS does not echo to serial port
#console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# Uncomment for COM2
#comconsole_port="0x2f8"
# change 0x3f8 (COM1) to 0x2f8 for COM2
hw.uart.console="io:0x3f8,br:115200"

All of this works for me on FreeBSD 15, but your distance may vary.

Why I'm ignoring pretty much all new Python packaging tools

By: cks

One of the things going on right now is that Python is doing a Python developer survey. On the Fediverse, I follow a number of people who do Python stuff, and they've been posting about various aspects of the survey, including a section on what tools people use for what. This gave me an interesting although very brief look into a world that I'm deliberately ignoring, and I'm doing that because I feel my needs are very simple and are well met by basic, essentially universal tools that I already know and have.

Although I do some small amount of Python programming, I'm not a Python developer; you could call me a consumer of Python things, both programs and packages. The thing I do most is use programs written in Python that aren't single-file, dependency free things, almost always for my own personal use (for example, asncounter and the Python language server). The tool I use for almost all of these is pipx, which I feel handles pretty much everything I could ask for and comes pre-packaged in most Linuxes. Admittedly I've written some tools to make my life nicer.

(One important thing pipx does is install each program separately. This allows me to remove one cleanly and also to use PyPy or CPython as I prefer on a program by program basis.)

For programs that we want to use as part of our operations (for example), the modern, convenient approach is to make a venv and then install the program into it with pip. Pip is functionally universal and the resulting venvs effectively function as self contained artifacts that can be moved or put anywhere (provided that we stick to the same Ubuntu LTS version). So far we haven't tried to upgrade these in place; if a new version of the program comes out, we build a new venv and swap which one is used.

(It's possible that package dependencies of the program could be updated even if it hasn't released a new version, but we treat these built venvs as if they were compiled binaries; once produced, they're not modified.)
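(A sketch of the venv-per-version pattern, with made-up paths and a made-up program name:)

$ python3 -m venv /opt/venvs/sometool-1.2.0
$ /opt/venvs/sometool-1.2.0/bin/pip install 'sometool==1.2.0'
$ # when a new version comes out, build a fresh venv and swap a symlink
$ ln -sfn /opt/venvs/sometool-1.2.0 /opt/venvs/sometool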

Finally, our Django based web application now uses a Django setup where Django is installed into a venv and then the production tree of our application lives outside that venv (previously we didn't use venvs at all but that stopped working). Our application isn't versioned or built into a Python artifact; it's a VCS tree and is managed through VCS operations. The Django venv is created separately, and I use pip for that because again pip is universal and familiar. This is a crude and brute force approach but it's also ensured that I haven't had to care about the Python packaging ecosystem (and how to make Python packages) for the past fifteen years. At the moment we use only standard Django without any third party packages that we'd also have to add to the venv and manage, and I expect that we're going to stay that way. A third party package would have to be very attractive (or become extremely necessary) in order for us to take it on and complicate life.

I'm broadly aware that there are a bunch of new Python package management and handling tools that go well beyond pip and pipx in both performance and features. My feeling so far is that I don't need anything more than I have and I don't do the sort of regular Python development where the extra features the newer tools have would make a meaningful difference. And to be honest, I'm wary of some or all of these turning out to be a flavour of the month. My mostly outside impression is that Python packaging and package management has had a great deal of churn over the years, and from seeing the Go ecosystem go through similar things from closer up I know that being stuck with a now abandoned tool is not particularly fun. Pip and pipx aren't the modern hot thing but they're also very unlikely to go away.

Why Linux wound up with system package managers

By: cks

Yesterday I discussed the two sorts of program package managers, system package managers that manage the whole system and application package managers that mostly or entirely manage third party programs. Commercial Unix got application package managers in the very early 1990s, but Linux's first program managers were system package managers, in dpkg and RPM (or at least those seem to be the first Linux package managers).

The abstract way to describe why is to say that Linux distributions had to assemble a whole thing from separate pieces; the kernel came from one place, libc from another, coreutils from a third, and so on. The concrete version is to think about what problems you'd have without a package manager. Suppose that you assembled a directory tree of all of the source code of the kernel, libc, coreutils, GCC, and so on. Now you need to build all of these things (or rebuild, let's ignore bootstrapping for the moment).

Building everything is complicated partly because everything goes about it differently. The kernel has its own configuration and build system, a variety of things use autoconf but not necessarily with the same set of options to control things like features, GCC has a multi-stage build process, Perl has its own configuration and bootstrapping process, X is frankly weird and vaguely terrifying, and so on. Then not everyone uses 'make install' to actually install their software, so you have another set of variations for all of this.

(The less said about the build processes for either TeX or GNU Emacs in the early to mid 1990s, the better.)

If you do this at any scale, you need to keep track of all of this information (cf) and you want a uniform interface for 'turn this piece into a compiled and ready to unpack blob'. That is, you want a source package (which encapsulates all of the 'how to do it' knowledge) and a command that takes a source package and does a build with it. Once you're building things that you can turn into blobs, it's simpler to always ship a new version of the blob whenever you change anything.

(You want the 'install' part of 'build and install' to result in a blob rather than directly installing things on your running system because until it finishes, you're not entirely sure the build and install has fully worked. Also, this gives you an easy way to split the overall system up into multiple pieces, some of which people don't have to install. And in the very early days, to split them across multiple floppy disks, as SLS did.)

Now you almost have a system package manager with source packages and binary packages. You're building all of the pieces of your Linux distribution in a standard way from something that looks a lot like source packages, and you pretty much want to create binary blobs from them rather than dump everything into a filesystem. People will obviously want a command that takes a binary blob and 'installs' it by unpacking it on their system (and possibly extra stuff), rather than having to run 'tar whatever' all the time themselves, and they'll also want to automatically keep track of which of your packages they've installed rather than having to keep their own records. Now you have all of the essential parts of a system package manager.

(Both dpkg and RPM also keep track of which package installed what files, which is important for upgrading and removing packages, along with things having versions.)

The two subtypes of one sort of package managers, the "program manager"

By: cks

I've written before that one of the complications of talking about package managers and package management is that there are two common types of package managers, program managers (which manage installed programs on a system level) and module managers (which manage package dependencies for your project within a language ecosystem or maybe a broader ecosystem). Today I realized that there is a further important division within program managers. I will call this division application (package) managers and system (package) managers.

A system package manager is what almost all Linux distributions have (in the form of Debian's dpkg and its set of higher level tools, Fedora's RPM and its set of higher level tools, Arch's pacman, and so on). It manages everything installed by the distribution on the system, from the kernel all the way up to the programs that people run to get work done, but certainly including what we think of as system components like the core C library, basic POSIX utilities, and so on. In modern usage, all updates to the system are done by shipping new package versions, rather than by trying to ship 'patches' that consist of only a few changed files or programs.

(Some Linux distributions are moving some high level programs like Chrome to an application package manager.)

An application package manager doesn't manage the base operating system; instead it only installs, manages, and updates additional (and optional) software components. Sometimes these are actual applications, but at other times, especially historically, these were things like the extra-cost C compiler from your commercial Unix vendor. On Unix, files from these application packages were almost always installed outside of the core system areas like /usr/bin; instead they might go into /opt/<something> or /usr/local or various other things.

(Sometimes vendor software comes with its own internal application package manager, because the vendor wants to ship it in pieces and let you install only some of them while managing the result. And if you want to stretch things a bit, browsers have their own internal 'application package management' for addons.)

A system package manager can also be used for 'applications' and routinely is; many Linux systems provide undeniable applications like Firefox and LibreOffice through the system package manager (not all of them, though). This can include third party packages that put themselves in non-system places like /opt (on Unix) if they want to. I think this is most common on Linux systems, where there's no common dedicated application package manager that's widely used, so third parties wind up building their own packages for the system package manager (which is sure to be there).

For relatively obvious reasons, it's very hard to have multiple system package managers in use on the same system at once; they wind up fighting over who owns what and who changes what in the operating system. It's relatively straightforward to have multiple application package managers in use at once, provided that they keep to their own area so that they aren't overwriting each other.

For the most part, the *BSDs have taken a base system plus application manager approach, with things like their 'ports' system being their application manager. Where people use third party program managers, including pkgsrc on multiple Unixes, Homebrew on macOS, and so on, these are almost always application managers that don't try to also take over and manage the core ('base') operating system programs, libraries, and so on.

(As a result, the *BSDs ship system updates as 'patches', not as new packages, cf OpenBSD's syspatch. I've heard some rumblings that FreeBSD may be working to change this.)

I believe that Microsoft Windows has some degree of system package management, in that it has components that you might or might not install and that can be updated or restored independently, but I don't have much exposure to the Windows world. I will let macOS people speak up in the comments about how that system operates (as people using macOS experience, not as how it's developed; as developed there are a bunch of different parts to macOS, as one can see from the various open source repositories that Apple publishes).

PS: The Linux flatpak movement is mostly or entirely an application manager, and so usually separate from the system package manager (Snap is the same thing but I ignore Canonical's not-invented-here pet projects as much as possible). You can also see containers as an extremely overweight application 'package' delivery model.

PPS: In my view, to count as package management a system needs to have multiple 'packages' and have some idea of what packages are installed. It's common but not absolutely required for the package manager to keep track of what files belong to what package. Generally this goes along with a way to install and remove packages. A system can be divided up into components without having package management, for example if there's no real tracking of what components you've installed and they're shipped as archives that all get unpacked in the same hierarchy with their files jumbled together.

Forcing a Go generic type to be a pointer type (and some challenges)

By: cks

Recently I saw a Go example that made me scratch my head and decode what was going on (you can see it here). Here's what I understand about what's going on. Suppose that you want to create a general interface for a generic type that requires any concrete implementation to be a pointer type. We can do this by literally requiring a pointer:

type Pointer[P any] interface {
   *P
}

That this is allowed is not entirely obvious from the specification, but it's not forbidden. We're not allowed to use just 'P' or '~P' in the interface type, because you're not allowed to directly or indirectly embed yourself as a type parameter, but '*P' isn't doing that directly; instead, it's forcing a pointer version of some underlying type. Actually using it is a bit awkward, but I'll get to that.

We can then require such a generic type to have some methods, for example:

type Index[P any] interface {
   New() *P
   *P
}

This can be implemented by, for example:

type base struct {
	i int
}

func (b *base) New() *base {
	return &base{-1}
}

But suppose we want to have a derived generic type, for example a struct containing an Index field of this Index (generic) type. We'd like to write this in the straightforward way:

type Example[P any] struct {
	Index Index[P]
}

This doesn't work (at least not today); you can't write 'Index[P]' outside of a type constraint. In order to make this work you must create the type with two related generic type constraints:

type Example[T Index[P], P any] struct {
	Index T
}

This unfortunately means that when we use this generic type to construct values of some concrete type, we have to repeat ourselves:

e := Example[*base, base]{&base{0}}

However, requiring both type constraints means that we can write generic methods that use both of them:

func (e *Example[T, P]) Do() {
	e.Index = (T)(new(P))
}

I believe that the P type would otherwise be inaccessible and you'd be unable to construct this, but I could be wrong; these are somewhat deep waters in Go generics.

You run into a similar issue with functions where you simply want to take an argument that is a Pointer (or an Index), because our Pointer (and Index) generic types are specified relative to an underlying type and can't be used without specifying that underlying type, either explicitly or through type inference. So you have to write generic functions that look like:

func Something[T Pointer[P], P any] (p T) {
   [...]
}

This generic function can successfully use type inference when invoked, but it has to be declared this way and if type inference doesn't work in your specific case you'll need to repeat yourself, as with constructing Example values.
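
As a concrete sketch (reusing the base type from earlier), the inferred call and the spell-it-out fallback look like:

Something(&base{0})
Something[*base, base](&base{0})

(In the first form, Go infers T from the argument's type and then P from the 'Pointer[P]' constraint.)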

Looking into all of this and writing it out has left me less enlightened than I hoped at the start of the process, but Go generics are a complicated thing in general (or at least I find all of their implications and dark corners to be complicated).

(Original source and background, which is slightly different from what I've done here.)

Sidebar: The type inference way out for constructing values

In the computer science tradition, we can add a layer of indirection.

func NewExample[T Index[P], P any] (p *P) Example[T,P] {
    var e Example[T,P]
    e.Index = p
    return e
}

Then you can call this as 'NewExample(&base{0})' and type inference will fill in all of the types, at least in this case. Of course this isn't an in-place construction, which might be important in some situations.

Sidebar: The mind-bending original version

The original version was like this:

type Index[P any, T any] interface {
	New() T
	*P
}

type Example[T Index[P, T], P any] struct {
	Index T
}

In this version, Example has a type parameter that refers to itself, 'T Index[P, T]'. This is legal in a type parameter declaration; what would be illegal is referring to 'Example' in the type parameters. It's also satisfiable (which isn't guaranteed).
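
For what it's worth, I believe satisfying and instantiating this version looks just the same as in my version (a sketch, reusing the earlier base type):

e := Example[*base, base]{&base{0}}

Here '*base' has to satisfy 'Index[base, *base]', which it does; its New() returns a '*base', and its type is literally '*base'.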

Scraping the FreeBSD 'mpd5' daemon to obtain L2TP VPN usage data

By: cks

We have a collection of VPN servers, some OpenVPN based and some L2TP based. They used to be based on OpenBSD, but we're moving from OpenBSD to FreeBSD and the VPN servers recently moved too. We also have a system for collecting Prometheus metrics on VPN usage, which worked by parsing the output of things. For OpenVPN, our scripts just kept working when we switched to FreeBSD because the two OSes use basically the same OpenVPN setup. This was not the case for our L2TP VPN server.

OpenBSD does L2TP using npppd, which supports a handy command line control program, npppctl, that can readily extract and report status information. On FreeBSD, we wound up using mpd5. Unfortunately, mpd5 has no equivalent of npppctl. Instead, as covered (sort of) in its user manual, you get your choice of a TCP based console that's clearly intended for interactive use and a web interface that is also sort of intended for interactive use (and isn't all that well documented).

Fortunately, one convenient thing about the web interface is that it uses HTTP Basic authentication, which means that you can easily talk to it through tools like curl. To do status scraping through the web interface, first you need to turn it on and then you need an unprivileged mpd5 user you'll use for this:

set web self 127.0.0.1 5006
set web open

set user metrics <some-password> user

At this point you can use curl to get responses from the mpd5 web server (from the local host, ie your VPN server itself):

curl -s -u metrics:... --basic 'http://localhost:5006/<something>'

There are two useful things you can ask the web server interface for. First, you can ask it for a complete dump of its status in JSON format, by asking for 'http://localhost:5006/json' (although the documentation claims that the information returned is what 'show summary' in the console would give you, it is more than that). If you understand mpd5 and like parsing and processing JSON, this is probably a good option. We did not opt to do this.

The other option is that you can ask the web interface to run console (interface) commands for you, and then give you the output in either a 'pleasant' HTML page or in a basic plain text version. This is done by requesting either '/cmd?<command>' or '/bincmd?<command>' respectively. For statistics scraping, the most useful version is the 'bincmd' one, and the command we used is 'show session':

curl -s -u metrics:... --basic 'http://localhost:5006/bincmd?show%20session'

This gets you output that looks like:

ng1  172.29.X.Y  B2-2 9375347-B2-2  L2-2  2  9375347-L2-2  someuser  A.B.C.D
RESULT: 0

(I assume 'RESULT: 0' would be something else if there was some sort of problem.)

Of these, the useful fields for us are the first, which gives the local network device, the second, which gives the internal VPN IP of this connection, and the last two, which give us the VPN user and their remote IP. The others are internal MPD things that we (hopefully) don't have to care about. The internal VPN IP isn't necessary for (our) metrics but may be useful for log correlation.
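
For illustration, here's a sketch in Go of picking out those fields from one such line (this isn't our actual code, which reuses our old npppctl parsing tools):

// parseSession pulls the interesting fields out of one 'show session'
// line: the local device, internal VPN IP, VPN user, and remote IP.
// It needs "strings" imported and skips the trailing RESULT: line.
func parseSession(line string) (dev, vpnIP, user, remoteIP string, ok bool) {
	f := strings.Fields(line)
	if len(f) < 4 || f[0] == "RESULT:" {
		return "", "", "", "", false
	}
	return f[0], f[1], f[len(f)-2], f[len(f)-1], true
}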

To get traffic volume information, you need to extract the usage information from each local network device that a L2TP session is using (ie, 'ng1' and its friends). As far as I know, the only tool for this in (base) FreeBSD is netstat. Although you can invoke it interface by interface, probably the better thing to do (and what we did) is to use 'netstat -ibn -f link' to dump everything at once and then pick through the output to get the lines that give you packet and byte counts for each L2TP interface, such as ng1 here.

(I'm not sure if dropped packets is relevant for these interfaces; if you think it might be, you want 'netstat -ibnd -f link'.)

FreeBSD has a general system, 'libxo', for producing output from many commands in a variety of handy formats. As covered in xo_options, this can be used to get this netstat output in JSON if you find that more convenient. I opted to get the plain text format and use field numbers for the information I wanted for our VPN traffic metrics.

(Partly this was because I could ultimately reuse a lot of my metrics generation tools from the OpenBSD npppctl parsing. Both environments generated two sets of line and field based information, so a significant amount of the work was merely shuffling around which field was used for what.)

PS: Because of how mpd5 behaves, my view is that you don't want to let anyone but system staff log on to the server where you're using it. It is an old C code base and I would not trust it if people can hammer on its TCP console or its web server. I certainly wouldn't expose the web server to a non-localhost network, even apart from the bit where it definitely doesn't support HTTPS.

Printing things in colour is not simple

By: cks

Recently, Verisimilitude left a comment on my entry on X11's DirectColor visual type, where they mentioned that L Peter Deutsch, the author of Ghostscript, lamented using twenty-four bit colour for Ghostscript rather than a more flexible approach, which you may need in printing things with colour. As it happens, I know a bit about this area for two or three reasons, which come at it from different angles. A long time ago I was peripherally involved in desktop publishing software, which obviously cares about printing colour, and then later I became a hobby photographer and at one point had some exposure to people who care about printing photographs (both colour and black and white).

(The actual PDF format supports much more complex colour models than basic 24-bit sRGB or sGray colour, but apparently Ghostscript turns all of that into 24-bit colour internally. See eg, which suggests that modern Ghostscript has evolved into a more complex internal colour model.)

On the surface, printing colour things out in physical media may seem simple. You convert RGB colour to CMYK colour and then send the result off to the printer, where your inkjet or laser printer uses its CMYK ink or toner to put the result on the paper. Photographic printers provide the first and lesser complication in this model, because serious photographic printers have many more colours of ink than CMYK and they put these inks on various different types of fine art paper that have different effects on how the resulting colours come out.

Photographic printers have so many ink colours because this results in more accurate and faithful colours or, for black and white photographs (where a set of grey inks may be used), in more accurate and faithful greys. Photographers who care about this will carefully profile their printer using its inks on the particular fine art paper they're going to use in order to determine how RGB colours can be most faithfully reproduced. Then as part of the printing process, the photographic print software and the printer driver will cooperate to take the RGB photograph and map its colours to what combination of inks and ink intensity can best do the job.

(Photographers use different fine art papers because the papers have different characteristics; one of the high level ones is matte versus glossy papers. But the rabbit hole of detailed paper differences goes quite deep. So does the issue of how many inks a photo printer should have and what they should be. Naturally photographers who make prints have lots of opinions on this whole area.)

Where this stops being just a print driver issue is that people editing photographs often want to see roughly how they'll look when printed out without actually making a print (which is generally moderately expensive). This requires the print subsystem to be capable of feeding colour mapping results back to the editing layer, so you can see that certain things need to be different at the RGB colour level so that they come out well in the printed photograph. This is of course all an approximation, but at the very least photo editing software like darktable wants to be able to warn you when you're creating an 'out of gamut' colour that can't be accurately printed.

(I don't have any current numbers for the cost of making prints on photographic printers, but it's not trivial, especially if you're making large prints; you'll use a decent amount of ink and the fine art paper isn't cheap either. You don't want to make more test prints than you really have to.)

All of this is still in the realm of RGB colour, though (although colour space and display profiling and management complicate the picture). To go beyond this we need to venture into the twin worlds of printing advertising, including product boxes, and fine art printing. Printed product ads and especially boxes for products not infrequently use spot colours, where part of the box will be printed with a pure ink colour rather than approximated with process colours (CMYK or other). You don't really want to manage spot colours by saying that they're a specific RGB value and then everything with that RGB value will be printed with that spot colour; ideally you want to manage them as a specific spot colour layer for each spot colour you're using. An additional complication is that product boxes for mass products aren't necessarily printed with CMYK inks at all; like photographic prints, they may use a custom ink set that's designed to do a good job with the limited colour gamut that appears on the product box.

(This leads to a fun little game you can play at home.)

Desktop publishing software that wants to do a good job with this needs a bunch of features. I believe that generally you want to handle spot colours as separate editing layers even if they're represented in RGB. You probably also want features to limit the colour space and colours that the product designer can use, because the company that will print your boxes may have told you it has certain standard ink sets and please keep your box colours to things they handle well as much as possible. Or you may want to use only pure spot colours from your set of them and not have a product designer accidentally set something to another colour.

Printing art books of fine art has similar issues. The artwork that you're trying to reproduce in the art book may use paint colours that don't reproduce well in standard CMYK colours, or in any colour set without special inks (one case is metallic colours, which are readily available for fine art paints and which some artists love). The artist whose work you're trying to print may have strong opinions about you doing a good job of it, while the more inks you use (and the more special inks) the more expensive the book will be. Some compromise is inevitable but you have to figure out where and what things will be the most mangled by various ink set options. This means your software should be able to map from something roughly like RGB scans or photographs into ink sets and let you know about where things are going to go badly.

For fine art books, my memory is that there are a variety of tricks that you can play to increase the number of inks you can use. For example, sometimes you can print different sections of the book with different inks. This requires careful grouping of the pages (and artwork) that will be printed on a single large sheet of paper with a single set of inks at the printing plant. It also means that your publishing software needs to track ink sets separately for groups of pages and understand how the printing process will group pages together, so it can warn you if you're putting an artwork onto a page that clashes with the ink set it needs.

(Not all art books run into these issues. I believe that a lot of art books for Japanese anime have relatively few problems here because the art they're reproducing was already made for an environment with a restricted colour gamut. No one animates with true metallic colours for all sorts of reasons.)

To come back to PDFs and colour representation, we can see why you might regret picking a single 24-bit RGB colour representation for everything in a program that handles things that will eventually be printed. I'm not sure there's any reasonable general format that will cover everything you need when doing colour printing, but you certainly might want to include explicit provisions for spot colours (which are very common in product boxes, ads, and so on), and apparently Ghostscript eventually gained support for them (as well as various other colour related things).

Understanding query_response in Prometheus Blackbox's tcp prober

By: cks

Prometheus Blackbox is somewhat complicated to understand. One of its fundamental abstractions is a 'prober', a generic way of probing some service (such as making HTTP requests or DNS requests). One prober is the 'tcp' prober, which makes a TCP connection and then potentially conducts a conversation with the service to verify its health. For example, here's a ClamAV daemon health check, which connects, sends a line with "PING", and expects to receive "PONG":

  clamd_pingpong:
    prober: tcp
    tcp:
      query_response:
        - send: "PING\n"
        - expect: "PONG"

The conversation with the service is detailed in the query_response configuration block (in YAML). For a long time I thought that this was what it looks like here, a series of entries with one directive per entry, such as 'send', 'expect', or 'starttls' (to switch to TLS after, for example, you send a 'STARTTLS' command to the SMTP or IMAP server).
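
(The example blackbox.yml ships an smtp_starttls module written in this one-directive-per-step style, which also shows 'starttls' in action. I'm reproducing it roughly from memory here, so treat the details as approximate:)

  smtp_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "^220 ([^ ]+) ESMTP (.+)$"
        - send: "EHLO prober\r"
        - expect: "^250-STARTTLS"
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - send: "EHLO prober\r"
        - expect: "^250-AUTH"
        - send: "QUIT\r"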

However, much like an earlier case with Alertmanager, this is not actually what the YAML syntax is. In reality each step in the query_response YAML array can have multiple things. To quote the documentation:

 [ - [ [ expect: <string> ],
       [ expect_bytes: <string> ],
       [ labels:
         - [ name: <string>
             value: <string>
           ], ...
       ],
       [ send: <string> ],
       [ starttls: <boolean | default = false> ]
     ], ...
 ]

When there are multiple keys in a single step, Blackbox handles them in almost the order listed here: first expect, then labels if the expect matched, then expect_bytes, then send, then starttls. Normally you wouldn't have both expect and expect_bytes in the same step (and combining them is tricky). This order is not currently documented, so you have to read prober/query_response.go to determine it.

One reason to combine expect and send together in a single step is that then send can use regular expression match groups from the expect in its text. There's an example of this in the example blackbox.yml file:

  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        # cks: note use of ${1}, from PING
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"

The 'labels:' key is something added in v0.26.0, in #1284. As shown in the example blackbox.yml file, it can be used to do things like extract SSH banner information into labels on a metric:

  ssh_banner_extract:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
      - expect: "^SSH-2.0-([^ -]+)(?: (.*))?$"
        labels:
        - name: ssh_version
          value: "${1}"
        - name: ssh_comments
          value: "${2}"

This creates a metric that looks like this:

probe_expect_info {ssh_comments="Ubuntu-3ubuntu13.14", ssh_version="OpenSSH_9.6p1"} 1

At the moment there are some undocumented restrictions on the 'labels' key (or action or whatever you want to call it). First, it only works if you use it in a step that has an 'expect'. Even if all you want to do is set constant label values (for example to record that you made it to a certain point in your steps), you need to expect something; you can't use 'labels' in a step that otherwise only has, say, 'send'. Second, you can only have one labels in your entire query_response section; if you have more than one, you'll currently experience a Go panic when checking reaches the second.

This is unfortunate because Blackbox is currently lacking good ways to see how far your query_response steps got if the probe fails. Sometimes it's obvious where your probe failed, or irrelevant, but sometimes it's both relevant and not obvious. If you could use multiple labels, you could progressively set fixed labels and tell how far you got by what labels were visible in the scrape metrics.

(And of course you could also record various pieces of useful information that you don't get all at once.)

Sidebar: On (not) condensing expect and send together

My personal view is that I normally don't want to condense 'expect' and 'send' together into one step entry unless I have to, because most of the time it inverts the relationship between the two. In most protocols and protocol interactions, you send something and expect a response; you don't receive something and then send a response to it. In my opinion this is more naturally written in the style:

      query_response:
      - expect: "something"
      - send: "my request"
      - expect: "reply to my request"
      - send: "something else"
      - expect: "reply to something else"

Than as:

      query_response:
      - expect: "something"
        send: "my request"
      - expect: "reply to my request"
        send: "something else"
      - expect: "reply to something else"

What look like pairs (an expect/send in the same step) are not actually pairs; the 'expect' is for a previous 'send' and then 'send' pairs with the next 'expect' in the next step. So it's clearer to write them all as separate steps, which doesn't create any expectations of pairing.

Pitfalls in using Prometheus Blackbox to monitor external SMTP

By: cks

The news of the day is that Microsoft had a significant outage inside their Microsoft 365 infrastructure. We noticed when we stopped being able to deliver email to the university's institutional email system, which was a bit mysterious in the usual way of today's Internet:

The joys of modern email: "Has Microsoft decided to put all of our email on hold or are they having a global M365 inbound SMTP email incident?"

(For about the last hour and a half, if it's an incident someone is having a bad day.)

We didn't find out immediately when this happened (and if our systems had been working right, we wouldn't have found out when I did, but that's another story). Initially I was going to write an entry about whether or not we should use our monitoring system to monitor external services that other people run, but it turns out that we do try to monitor whether we can do a SMTP conversation to the university's M365-hosted institutional email. There were several things that happened with this monitoring.

The first thing that happened is that the alerts related to it rotted. The university once had a fixed set of on-premise MX targets and we monitored our ability to talk to them and alerted on it. Then the university moved their MX targets to M365 and our old alerts stopped applying, so we commented them out and never added any new alerts for any new checking we were doing.

One of the reasons for that is that we were doing this monitoring through Prometheus Blackbox, and Blackbox is not ideal for monitoring Microsoft 365 MX targets. The way M365 does redundancy in their inbound mail servers for your domain is not by returning multiple DNS MX records, but by returning one MX record for a hostname that has multiple IP addresses (and the IP addresses may change). What a mailer will do is try all of the IP addresses until one responds. What Blackbox does is pick one IP address and then probe it; if that address fails, there is no attempt to check the other IP addresses. Failing because one of many IPs isn't responding is okay for casual checks, but you don't necessarily want to alert on it.

(I believe that Blackbox picks the first IP address in the DNS A record, but this depends on how the Go standard library and possibly your local resolver behaves. If either sort the results, you get the first A record in the sorted result.)

The final issue is that we weren't necessarily checking enough of the SMTP conversation. For various reasons, we decided that all we could safely and confidently check was that the university's mail system accepted a testing SMTP MAIL FROM from our subdomain; we didn't check that it also accepted a SMTP RCPT TO. I believe that during part of this Microsoft 365 incident, the inbound M365 SMTP servers would accept our SMTP MAIL FROM but report an error at the RCPT TO (although I can't be sure). Certainly if we want to have a more realistic check of 'is email to M365 working', we should go as far as a SMTP RCPT TO.

(During parts of the incident, DNS lookups didn't succeed for the MX target. Without detailed examination I can't be sure of what happened in the other cases.)
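
To make the 'go as far as a SMTP RCPT TO' idea concrete, here is a sketch of what a fuller check might look like as a Blackbox tcp module. The module name, hostnames, and addresses are invented; real servers may need more specific expects, and a real mailer would use EHLO, whose multi-line reply needs more careful matching:

  smtp_rcptto_check:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220 "
        - send: "HELO probe.sub.example.edu\r"
        - expect: "^250"
        - send: "MAIL FROM:<probe@sub.example.edu>\r"
        - expect: "^250"
        - send: "RCPT TO:<someone@example.edu>\r"
        - expect: "^250"
        - send: "QUIT\r"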

Overall, Blackbox is probably the wrong tool to check an external mail target like M365 if we're serious about it and want to do a good job. At the moment it's not clear to me if we should go to the effort to do better, since it is an external service and there's nothing we can do about problems (although we can let people know, which has some value, but that's another entry).

PS: You can get quite elaborate in a mail deliverability test, but to some degree the more elaborate you get the more pieces of infrastructure you're testing, and you may want a narrow test for better diagnostics.

What ZFS people usually mean when they talk about "ZFS metadata"

By: cks

Recently I read Understanding ZFS Scrubs and Data Integrity (via), which is a perfectly good article and completely accurate, bearing in mind some qualifications which I'm about to get into. One of the things this article says in the preface is:

In this article, we will walk through what scrubs do, how the Merkle tree layout lets ZFS validate metadata and data from end to end, [...]

This is both completely correct and misleading, because what ZFS people mean when they talk about "metadata" is probably not what ordinary people (who are aware of filesystems) think of as "metadata". This misunderstanding leads people (which once upon a time included me) to believe that ZFS scrubs check much more than they actually do.

Specifically, in normal use "ZFS metadata" is different from "filesystem metadata", like directories. A core ZFS concept is DMU objects (dnodes), which are a basic primitive of ZFS's structure; a DMU object stores data in a more or less generic way. As covered in more detail in my broad overview on how ZFS is structured on disk, filesystem objects like directories, files, ACLs, and so on are all DMU objects that are stored in the filesystem's (DMU) object set and are referred to (for example in filesystem directories) by object number (the equivalent of an inode number). At this level, filesystem metadata is ZFS data.

What ZFS people and ZFS scrubs mean by "ZFS metadata" are things such as each filesystem's DMU object set (which is itself a DMU object, because in ZFS it's turtles most of the way down), the various DSL (Dataset and Snapshot Layer) objects, the various DMU objects used to track and manage free space in the ZFS pool, and so on. All of this ZFS metadata is organized in a tree that's rooted in the uberblock and the pool's Meta Object Set (MOS) that the uberblock points to. It is this tree that is guarded and verified by checksums and ZFS scrubs, from the very top down to the leaves.

As far as I know, all filesystem level files, directories, symbolic links, ACLs, and so on are leaves of this tree of ZFS metadata; they are merely ZFS data. While they make up a logical filesystem tree (we hope), they aren't a tree at the level of ZFS objects; they're merely DMU objects in the filesystem's object set. Only at the ZFS filesystem layer (ZPL, the "ZFS POSIX Layer") does ZFS look inside these various filesystem objects and maintain structural relationships, such as a filesystem's directory tree or parent information (some of which is maintained using generic ZFS facilities like ZAP objects).

Scrubs must go through the tree of ZFS metadata in order to find everything that's in use in order to verify its checksum, but they don't have to go through the filesystem's directory tree. To verify the checksum of everything in a filesystem, all a scrub has to do is go through the filesystem's DMU object set, which contains every in-use object in the filesystem regardless of whether it's a regular file, a directory, a symbolic link, an ACL, or whatever.
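
One way to see this view of a filesystem directly is zdb; for example (with a made up pool and filesystem name):

zdb -dd tank/somefs

This dumps the filesystem's DMU object set as a flat list of numbered objects ('ZFS plain file', 'ZFS directory', and so on), with no directory tree in sight.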

The long painful history of (re)using login to log people in

By: cks

The news of the time interval is that Linux's usual telnetd has had a giant security vulnerability for a decade. As people on the Fediverse observed, we've been here before; Solaris apparently had a similar bug 20 or so years ago (which was CVE-2007-0882, cf, via), and AIX in the mid 1990s (CVE-1999-0113, source, also), and also apparently SGI Irix, and no doubt many others (eg). It's not necessarily telnetd at fault, either, as I believe it's sometimes been rlogind.

All of these bugs have a simple underlying cause; in a way that root cause is people using Unix correctly and according to its virtue of modularity, where each program does one thing and you string programs together to achieve your goal. Telnetd and rlogind have the already complicated job of talking a protocol to the network, setting up ptys, and so on, so obviously they should leave the also complex job of logging the user in to login, which already exists to do that. In theory this should work fine.

The problem with this is that from more or less the beginning, login has had several versions of its job. From no later than V3 in 1972, login could also be used to switch from one user to another, not just log in initially. In 4.2 BSD, login was modified and reused to become part of rlogind's authentication mechanism (really; .rhosts is checked in the 4.2BSD login.c, not in rlogind). Later, various versions of login were modified to support 'automatic' logins, without challenging for a password (see eg FreeBSD login(1), OpenBSD login(1), and Linux login(1); use of -f for this appears to date back to around 4.3 Tahoe). Sometimes this was explicitly for the use of things that were running as root and had already authenticated the login.

In theory this is all perfectly Unixy. In practice, login figured out which of these variations of its basic job it was being used for based on a combination of command line arguments and what UID it was running as, which made it absolutely critical that programs running as root that reused login never allowed login to be invoked with arguments that would shift it to a different mode than they expected. Telnetd and rlogind have traditionally run as root, creating this exposure.

People are fallible, programmers included, and attackers are very ingenious. Over the years any number of people have found any number of ways to trick network daemons running as root into running login with 'bad' arguments.

The one daemon I don't think has ever been tricked this way is OpenSSH, because from very early on sshd refused to delegate logging people in to login. Instead, sshd has its own code to log people in to the system. This has had its complexities but has also shielded sshd from all of these (login) context problems.

In my view, this is one of the unfortunate times when the ideals of Unix run up against the uncomfortable realities of the world. Network daemons delegating logging people in to login is the correct Unix answer, but in practice it has repeatedly gone wrong and the best answer is OpenSSH's.

TCP, UDP, and listening only on a specific IP address

By: cks

One of the surprises of TCP and UDP is that when your program listens for incoming TCP connections or UDP packets, you can choose to listen only on a specific IP address instead of all of the IP addresses that the current system has. This behavior started as a de-facto standard but is now explicitly required for TCP in RFC 9293 section 3.9.1.1. There are at least two uses of this feature: to restrict access to your listening daemon, and to run multiple daemons on the same port.

The classical case of restricting access to a listening daemon is a program that listens only on the loopback IP address (IPv4 or IPv6 or both). Since loopback addresses can't be reached from outside the machine, only programs running on the machine can reach the daemon. On a machine with multiple IP addresses that are accessible from different network areas, you can also listen on only one IP address (perhaps an address 'inside' a firewall) to shield your daemon from undesired connections.

(Except in the case of the loopback IP address, this shielding isn't necessarily perfect. People on any of your local networks can always throw packets at you for any of your IP addresses, if they know them. In some situations, listening only on RFC 1918 private addresses can be reasonably safe from the outside world.)
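
In Go terms, the difference is just what address you hand to net.Listen (a minimal sketch; the port is arbitrary and error handling is omitted):

lnLocal, err := net.Listen("tcp", "127.0.0.1:8080") // loopback only
lnAll, err := net.Listen("tcp", ":8080")            // every local address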

The other use is to run multiple daemons that are listening on the same port but on different IP addresses. For example, you might run a public authoritative DNS server for some zones that is listening on port 53 (TCP and UDP) on your non-localhost IPs and a private resolving DNS server that is listening on localhost:53. Or you could have a 'honeypot' IP address that is running a special SSH server to look for Internet attackers, while still running your regular SSH server (to allow regular access) on your normal IP addresses. Broadly, this can be useful any time you want to have different configurations on the same port for different IP addresses.

Using restricted listening for access control has a lot of substitutes. Your daemon can check incoming connections and drop them depending on the local or remote IPs, or your host could have some simple firewall rules, or some additional software layer could give you a hand. Also, as mentioned, if you listen on anything other than localhost, you need to be sure that your overall configuration makes that safe enough. The other options are more complex but also more sure, or at least more obviously sure (or flawed).

Using restricted listening to have different things listening on the same TCP or UDP port doesn't have any good substitutes in current systems. Even if the operating system allows multiple things to listen generally on the same port, it has no idea which instance should get which connection or packet. To do this steering today, you'd need either a central 'director' daemon that received all packets or connection attempts and then somehow passed them to the right other program, or you'd have programs listen on different ports and then use OS firewall rules to (re)direct traffic to the right instance.

You can imagine an API that allows all of the programs to tell the operating system which connections they're interested in and which ones they aren't. One simple form of that API is 'listen on a specific IP address instead of all of them', and it conveniently also allows the OS to trivially detect conflicts between programs (even if some of them initially seem artificial).

(It would be nice if OSes gave programs nice APIs for choosing what incoming connections and packets they wanted and what they didn't, but mostly we deal with the APIs we have, not the ones we want.)

Single sign on systems versus X.509 certificates for the web

By: cks

Modern single sign on specifications such as OIDC and SAML and systems built on top of them are fairly complex things with a lot of moving parts. It's possible to have a somewhat simple surface appearance for using them in web servers, but the actual behind the scenes implementation is typically complicated, and of course you need an identity provider server and its supporting environment as well (which can get complicated). One reaction to this is to suggest using X.509 certificates to authenticate people (as a recent comment on this entry did).

There are a variety of technical considerations here, like to what extent browsers (and other software) might support personal X.509 certificates and make them easy to use, but to my mind there's also an overriding broad consideration that makes the two significantly different. Namely, people can remember passwords but they have to store X.509 certificates. OIDC and SAML may pass around tokens and programs dealing with them may store tokens, but the root of everything is in passwords, and you can recover all the tokens from there. This is not true with X.509 certificates; the certificate is the thing.

(There are also challenges around issuing, managing, checking, and revoking personal X.509 certificates, but let's ignore them.)

To make using X.509 certificates practical for authenticating people, people have to be able to use them on multiple devices and move them between browsers. Many people have multiple devices, and people do change what browsers they use (for all that browser and platform vendors would prefer they didn't, or at least the currently popular ones would). Today, there is basically nothing that helps people deal with this, and as a result X.509 certificates are at best awkward for people to use (and remember, security is people).

(In common use, it's easy to move passwords between browsers and devices because they're in your head (excluding password managers, which are still not used by a lot of people).)

Of course you could develop standards and software for moving and managing X.509 certificates. In many ways, passkeys show what's possible here, and also show many of the hazards of using things for authentication that can't be memorized (or copied) by people in order to transport them between environments. However, no such standards and software exist today, and no one has ever shown much interest in developing them, even back in the days when personal X.509 certificates were close to your only game in town.

(You could also develop much better browser UIs for dealing with personal X.509 certificates, something that was extremely under-developed back in the days when they were sometimes in use. Even importing such a certificate into your browser could be awkward, never mind using it.)

In the past, people have authenticated web applications through the use of personal X.509 certificates (as a more secure form of passwords). As far as I know, pretty much everyone has given up on that and moved to better options, first passwords (sometimes plus some form of additional confirmation) and then these days trying to get people to use passkeys. One reason they gave up was that actually using X.509 certificates in practice was awkward and something that people found quite annoying.

(I had to use a personal X.509 certificate for a while in order to get free TLS certificates for our servers. It wasn't a particularly great experience and I'm not in the least bit surprised that everyone ditched it for single sign on systems.)

PS: It's no good saying that X.509 certificates would be great if all of the required technology was magically developed, because that's not going to just happen. If you want personal X.509 certificates to be a thing, you have a great deal of work ahead of you and there is no guarantee you'll be successful. No one else is going to do that work for you.

PPS: You can imagine a system where people use their passwords and other multi-factor authentication to issue themselves new personal X.509 certificates signed by your local Certificate Authority, so they can recover from losing the X.509 certificate blob (or get a new certificate for a new device). Congratulations, you have just re-invented a manual version of OIDC tokens (also, it's worse in various ways).

People cannot "just pay attention" to (boring, routine) things

By: cks

Sometimes, people in technology believe that we can solve problems by getting people to pay attention. This comes up in security, anti-virus efforts, anti-phish efforts, monitoring and alert handling, warning messages emitted by programs, warning messages emitted by compilers and interpreters, and many other specific contexts. We are basically always wrong.

One of the core, foundational results from human factors research, research into human vision, the psychology of perceptions, and other related fields, is that human brains are a mess of heuristics and have far more limited capabilities than we think (and they lie to us all the time). Anyone who takes up photography as a hobby has probably experienced this (I certainly did); you can take plenty of photographs where you literally didn't notice some element in the picture at the time but only saw it after the fact while reviewing the photograph.

(In general photography is a great education on how much our visual system lies to us. For example, daytime shadows are blue, not black.)

One of the things we have a great deal of evidence about from both experiments and practical experience is that people (which is to say, human brains) are extremely bad at noticing changes in boring, routine things. If something we see all the time quietly disappears or is a bit different, the odds are extremely high that people will literally not notice. Our minds have long since registered whatever it is as 'routine' and tuned it out in favour of paying attention to more important things. You cannot get people to pay attention to these routine things, which are almost always basically the same, by asking them to (or yelling at them to do so, or blaming them when they don't), because our minds don't work that way.

We also have a tendency to see what we expect to see and not see what we don't expect to see, unless what we don't expect shoves itself into our awareness with unusual forcefulness. There is a famous invisible gorilla experiment that shows one aspect of this, but there are many others. This is why practical warning, alerts, and so on cannot be unobtrusive. Fire alarms are blaringly loud and obtrusive so that you cannot possibly miss them despite not expecting to hear them. A fire alarm that was "pay attention to this light if it starts blinking and makes a pleasant ringing tone" would get people killed.

There are hacks to get people to pay attention anyway, such as checklists, but these hacks are what we could call "not scalable" for many of the situations that people in technology care about. We cannot get people to go through a "should you trust this" checklist every time they receive an email message, especially when phish spammers deliberately craft their messages to create a sense of urgency and short-cut people's judgment. And even checklists are subject to seeing what you expect and not paying attention, especially if you do them over and over again on a routine basis.

(I've written a lot about this in various narrower areas before, eg 1, 2, 3, 4, 5. And in general, everything comes down to people, also.)

Systemd-networkd and giving your virtual devices alternate names

By: cks

Recently I wrote about how Linux network interface names have a length limit, of 15 characters. You can work around this limit by giving network interfaces an 'altname' property, as exposed in (for example) 'ip link'. While you can't work around this at all in Canonical's Netplan, it looks like you can have this for your VLANs in systemd-networkd, since there's AlternativeName= in the systemd.link manual page.

Except, if you look at an actual VLAN configuration as materialized by Netplan (or written out by hand), you'll discover a problem. Your VLANs don't normally have .link files, only .netdev and .network files (and even your normal Ethernet links may not have .link files). The AlternativeName= setting is only valid in .link files, because networkd is like that.

(The AlternativeName= is a '[Link]' section setting and .network files also have a '[Link]' section, but they allow completely different sets of '[Link]' settings. The .netdev file, which is where you define virtual interfaces, doesn't have a '[Link]' section at all, although settings like AlternativeName= apply to them just as much as to regular devices. Alternately, .netdev files could support setting altnames for virtual devices in the '[NetDev]' section alongside the mandatory 'Name=' setting.)

You can work around this indirectly, because you can create a .link file for a virtual network device and have it work:

[Match]
Type=vlan
OriginalName=vlan22-mlab

[Link]
AlternativeNamesPolicy=
AlternativeName=vlan22-matterlab

Networkd does the right thing here even though 'vlan22-mlab' doesn't exist when it starts up; when vlan22-mlab comes into existence, it matches the .link file and has the altname stapled on.
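
(If you want to check that the altname actually got applied, I believe 'ip link show dev vlan22-mlab' will include an 'altname vlan22-matterlab' line in its output, at least with a recent enough ip.)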

Given how awkward this is (and that not everything accepts or sees altnames), I think it's probably not worth bothering with unless you have a very compelling reason to give an altname to a virtual interface. In my case, this is clearly too much work simply to give a VLAN interface its 'proper' name.

Since I tested, I can also say that this works on a Netplan-based Ubuntu server where the underlying VLAN is specified in Netplan. You have to hand write the .link file and stick it in /etc/systemd/network, but after that it cooperates reasonably well with a Netplan VLAN setup.

TCP and UDP and implicit "standard" elements of things

By: cks

Recently, Verisimilitude left a comment on this entry of mine about binding TCP and UDP ports to a specific address. That got me thinking about features that have become standard elements of things despite not being officially specified and required.

TCP and UDP are more or less officially specified in various RFCs and are implicitly specified by what happens on the wire. As far as I know, nowhere in these standards (or wire behavior) does anything require that a multi-address host machine allow you to listen for incoming TCP or UDP traffic on a specific port on only a restricted subset of those addresses. People talking to your host have to use a specific IP, obviously, and established TCP connections have specific IP addresses associated with them that can't be changed, but that's it. Hosts could have an API where you simply listened to a specific TCP or UDP port and then they provided you with the local IP when you received inbound traffic; it would be up to your program to do any filtering to reject addresses that you didn't want used.

However, I don't think anyone has such an API, and anything that did would likely be considered very odd and 'non-standard'. It's become an implicit standard feature of TCP and UDP that you can opt to listen on only one or a few IP addresses of a multi-address host, including listening only on localhost, and connections to your (TCP) port on other addresses are rejected without the TCP three-way handshake completing. This has leaked through into the behavior that TCP clients expect in practice; if a port is not available on an IP address, clients expect to get a TCP layer 'connection refused', not a successful connection and then an immediate disconnection. If a host had the latter behavior, clients would probably not report it as 'connection refused' and some of them would consider it a sign of a problem on the host.
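
A sketch of what that filtering would look like if it was the program's job (in Go, with handle() standing in for the real work):

for {
	conn, err := ln.Accept()
	if err != nil {
		break
	}
	// only take connections made to the loopback address; a rejected
	// client sees connect-then-close, not 'connection refused'
	if la, ok := conn.LocalAddr().(*net.TCPAddr); ok && !la.IP.IsLoopback() {
		conn.Close()
		continue
	}
	go handle(conn)
}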

This particular (API) feature comes from a deliberately designed element of the BSD sockets API, the bind() system call. Allowing you to bind() local addresses to your sockets means that you can set the outgoing IP address for TCP connection attempts and UDP packets, which is important in some situations, but BSD could have provided a different API for that. BSD's bind() API does allow you maximum freedom with only a single system call; you can nail down either or both of the local IP and the local port. Binding the local port (but not necessarily the local IP) was important in BSD Unix because it was part of a security mechanism.

(This created an implicit API requirement for other OSes. If you wanted your OS to have an rlogin client, you had to be able to force the use of a low local port when making TCP connections, because the BSD rlogind.c simply rejected connections from ports that were 1024 and above even in situations where it would ask you for a password anyway.)
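
(In Go, for example, pinning the local source address of an outgoing connection looks roughly like this; both addresses are made up:)

d := net.Dialer{LocalAddr: &net.TCPAddr{IP: net.ParseIP("192.0.2.10")}}
conn, err := d.Dial("tcp", "203.0.113.5:25")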

A number of people copied the BSD sockets API rather than design their own. Even when people designed their own API for handling networking (or IPv4 and later IPv6), my impression is that they copied the features and general ideas of the BSD sockets API rather than starting completely from scratch and deviating significantly from the BSD API. My usual example of a relatively divergent API is Go, which is significantly influenced by a quite different networking history inside Bell Labs and AT&T, but Go's net package still allows you to listen selectively on an IP address.

(Of course Go has to work with the underlying BSD sockets API on many of the systems it runs on; what it can offer is mostly constrained by that, and people will expect it to offer more or less all of the 'standard' BSD socket API features in some form.)

PS: The BSD TCP API doesn't allow a listening program to make a decision about whether to allow or reject an incoming connection attempt, but this turned out to be a pretty sensible design. As we found out with SYN flood attacks, TCP's design means that you want to force the initiator of a connection attempt to prove that they're present before the listening ('server') side spends much resources on the potential connection.

Linux network interface names have a length limit, and Netplan

By: cks

Over on the Fediverse, I shared a discovery:

This is my (sad) face that Linux interfaces have a maximum name length. What do you mean I can't call this VLAN interface 'vlan22-matterlab'?

Also, this is my annoyed face that Canonical Netplan doesn't check or report this problem/restriction. Instead your VLAN interface just doesn't get created, and you have to go look at system logs to find systemd-networkd telling you about it.

(This is my face about Netplan in general, of course. The sooner it gets yeeted the better.)

Based on both some Internet searches and looking at kernel headers, I believe the limit is 15 characters for the primary name of an interface. In headers, you will find this called IFNAMSIZ (the kernel) or IF_NAMESIZE (glibc), and it's defined to be 16 but that includes the trailing zero byte for C strings.

(I can be confident that the limit is 15, not 16, because 'vlan22-matterlab' is exactly 16 characters long without a trailing zero byte. Take one character off and it works.)

At the level of ip commands, the error message you get is on the unhelpful side:

# ip link add dev vlan22-matterlab type wireguard
Error: Attribute failed policy validation.

(I picked the type for illustration purposes.)

Systemd-networkd gives you a much better error message:

/run/systemd/network/10-netplan-vlan22-matterlab.netdev:2: Interface name is not valid or too long, ignoring assignment: vlan22-matterlab

(Then you get some additional errors because there's no name.)

As mentioned in my Fediverse post, Netplan tells you nothing. One direct consequence of this is that in any context where you're writing down your own network interface names, such as VLANs or WireGuard interfaces, simply having 'netplan try' or 'netplan apply' succeed without errors does not mean that your configuration actually works. You'll need to look at error logs and perhaps inventory all your network devices.

(This isn't the first time I've seen Netplan behave this way, and it remains just as dangerous.)

As covered in the ip link manual page, network interfaces can have either or both of aliases and 'altname' properties. These alternate names can be (much) longer than 16 characters, and the 'ip link property' altname property can be used in various contexts to make things convenient (I'm not sure what good aliases are, though). However this is somewhat irrelevant for people using Netplan, because the current Netplan YAML doesn't allow you to set interface altnames.
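
By hand, and assuming a recent enough iproute2, adding an altname to an interface with a shorter primary name looks like:

# ip link property add dev vlan22-mlab altname vlan22-matterlab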

You can set altnames in networkd .link files, as covered in the systemd.link manual page. The direct thing you want is AlternativeName=, but apparently you may also want to set a blank alternative names policy, AlternativeNamesPolicy=. Of course this probably only helps if you're using systemd-networkd directly, instead of through Netplan.

PS: Netplan itself has the notion of Ethernet interfaces having symbolic names, such as 'vlanif0', but this is purely internal to Netplan; it's not manifested as an actual interface altname in the 'rendered' systemd-networkd control files that Netplan writes out.

(Technically this applies to all physical device types.)
