
RuBee

22 November 2025 at 00:00

I have at least a few readers for whom the sound of a man's voice saying "government cell phone detected" will elicit a palpable reaction. In Department of Energy facilities across the country, incidents of employees accidentally carrying phones into secure areas are reduced through a sort of automated nagging. A device at the door monitors for the presence of a tag; when the tag is detected it plays an audio clip. Because this is the government, the device in question is highly specialized, fantastically expensive, and says "government cell phone" even though most of the phones in question are personal devices. Look, they already did the recording, they're not changing it now!

One of the things that I love is weird little wireless networks. Long ago I wrote about ANT+, for example, a failed personal area network standard designed mostly around fitness applications. There's tons of these, and they have a lot of similarities---so it's fun to think about the protocols that went down a completely different path. It's even better, of course, if the protocol is obscure outside of an important niche. And a terrible website, too? What more could I ask for.

The DoE's cell-phone nagging boxes, and an array of related but more critical applications, rely on an unusual personal area networking protocol called RuBee.

RuBee is a product of Visible Assets Inc., or VAI, founded in 2004[1] by John K. Stevens. Stevens seems a somewhat improbable founder, with a background in biophysics and eye health, but he's a repeat entrepreneur. He's particularly fond of companies called Visible: he founded Visible Assets after his successful tenure as CEO of Visible Genetics. Visible Genetics was an early innovator in DNA sequencing, and still provides a specialty laboratory service that sequences samples of HIV in order to detect vulnerabilities to antiretroviral medications.

Clinical trials in the early 2000s exposed Visible Genetics to one of the more frustrating parts of health care logistics: refrigeration. Samples being shipped to the lab and reagents shipped out to clinics were both temperature sensitive. Providers had to verify that these materials had stayed adequately cold throughout shipping and handling, otherwise laboratory results could be invalid or incorrect. Stevens became interested in technical solutions to these problems; he wanted some way to verify that samples were at acceptable temperatures both in storage and in transit.

Moreover, Stevens imagined that these sensors would be in continuous communication. There's a lot of overlap between this application and personal area networks (PANs), protocols like Bluetooth that provide low-power communications over short ranges. There is also clear overlap with RFID; you can buy RFID temperature sensors. VAI, though, coined the term visibility network to describe RuBee. That's visibility as in asset visibility: somewhat different from Bluetooth or RFID, RuBee as a protocol is explicitly designed for situations where you need to "keep tabs" on a number of different objects. Despite the overlap with other types of wireless communications, the set of requirements on a visibility network has led RuBee down a very different technical path.

Visibility networks have to be highly reliable. When you are trying to keep track of an asset, a failure to communicate with it represents a fundamental failure of the system. For visibility networks, the ability to actually convey a payload is secondary: the main function is just reliably detecting that endpoints exist. Visibility networks have this in common with RFID, and indeed, despite its similarities to technologies like BLE, RuBee is positioned mostly as a competitor to technologies like UHF RFID.

There are several differences between RuBee and RFID; for example, RuBee uses active (battery-powered) tags, and the tags are generally built around a complete 4-bit microcontroller. That doesn't necessarily sound like an advantage, though. While RuBee tags advertise a battery life of "5-25 years", the need for a battery seems mostly like a liability. The real feature is what active tags enable: RuBee operates in the low frequency (LF) band, typically at 131 kHz.

At that low frequency, the wavelength is very long, about 2.3 km. With such a long wavelength, RuBee communications all happen at much less than one wavelength in range. RF engineers refer to this as near-field operation, and it has some properties that are intriguingly different from more typical far-field RF communications. In the near field, the magnetic field created by the antenna is more significant than the electric field. RuBee devices are intentionally designed to emit very little electric-field RF signal. Communications within a RuBee network are achieved through magnetic, not electric, fields. That's the core of RuBee's magic.

The idea of magnetic coupling is not unique to RuBee. Speaking of the near-field, there's an obvious comparison to NFC which works much the same way. The main difference, besides the very different logical protocols, is that NFC operates at 13.56 MHz. At this higher frequency, the wavelength is only around 20 meters. The requirement that near-field devices be much closer than a full wavelength leads naturally to NFC's very short range, typically specified as 4 cm.
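The wavelength arithmetic behind those numbers is simple (standard free-space physics, nothing RuBee-specific):

```python
# Wavelength = c / f. Near-field operation requires devices to sit well
# inside one wavelength of the antenna, so a lower carrier frequency
# buys a larger near-field region.
C = 299_792_458  # speed of light, m/s

def wavelength_m(freq_hz: float) -> float:
    """Free-space wavelength in meters for a given frequency in Hz."""
    return C / freq_hz

rubee = wavelength_m(131_072)    # RuBee LF carrier
nfc = wavelength_m(13_560_000)   # NFC HF carrier

print(f"RuBee wavelength: {rubee:,.0f} m")  # ~2,287 m
print(f"NFC wavelength:   {nfc:.1f} m")     # ~22.1 m
```

The three-orders-of-magnitude gap in carrier frequency is what turns NFC's centimeters of near-field range into RuBee's tens of meters.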

At LF frequencies, RuBee can achieve magnetic coupling at ranges up to about 30 meters. That's a range comparable to, and often much better than, RFID inventory tracking technologies. Improved range isn't RuBee's only benefit over RFID. The properties of magnetic fields also make it a more robust protocol. RuBee promises significantly less vulnerability to shielding by metal or water than RFID.

There are two key scenarios where this comes up. The first is equipment stored in metal containers or on metal shelves, or equipment that is itself metallic; in that scenario, it's difficult to find a location for an RFID tag that won't suffer from shielding by the container. The second, water, might seem less important, but keep in mind that people are made mostly of water. RFID reading is often unreliable for objects carried on a person, which are likely to be shielded from the reader by the water content of the body.

These problems are not just theoretical. Walmart is a major adopter of RFID inventory technology, and in early rollouts struggled with low successful read rates. Metal, moisture (including damp cardboard boxes), antenna orientation, and multipath/interference effects could cause read failure rates as high as 33% when scanning a pallet of goods. Low read rates are mostly addressed by using RFID "portals" with multiple antennas. Eight antennas used as an array greatly increase the read rate, but at a cost of over ten thousand dollars per portal system. Even so, Walmart now seems to target a success rate of only 95% during bulk scanning.

95% might sound pretty good, but there are a lot of visibility applications where a failure rate of even a couple percent is unacceptable. These mostly go by the euphemism "high value goods," which depending on your career trajectory you may have encountered in corporate expense and property policies. High-value goods tend to be items that are both attractive to theft and where theft has particularly severe consequences. Classically, firearms and explosives. Throw in classified material for good measure.

I wonder if Stevens was surprised by RuBee's market trajectory. He came out of the healthcare industry and, it seems, originally developed RuBee for cold chain visibility... but, at least in retrospect, it's quite obvious that its most compelling application is in the armory.

Because RuBee tags are small and largely immune to shielding by metals, you can embed them directly in the frames of firearms, or as an aftermarket modification you can mill out some space under the grip. RuBee tags in weapons will read reliably when they are stored in metal cases or on metal shelving, as is often the case. They will even read reliably when a weapon is carried holstered, close to a person's body.

Since RuBee tags incorporate an active microcontroller, there are even more possibilities. Temperature logging is one thing, but firearm-embedded RuBee tags can incorporate an accelerometer (NIST-traceable, VAI likes to emphasize) and actually count the rounds fired.


Sidebar time: there is a long history of political hazard around "smart guns." The term "smart gun" is mostly used more specifically for firearms that identify their user, for example by fingerprint authentication or detection of an RFID fob. The idea has become vague enough, though, that mention of a firearm with any type of RFID technology embedded would probably raise the specter of the smart gun to gun-rights advocates.

Further, devices embedded in firearms that count the number of rounds fired have been proposed for decades, if not a century, as a means of accountability. The holder of a weapon could, in theory, be required to positively account for every round fired. That could eliminate incidents of unreported use of force by police, for example. In practice I think this is less compelling than it sounds: simple counting of rounds leaves too many opportunities to fudge the numbers and conceal real-world use of a weapon as range training, for example.

That said, the NRA has long been vehemently opposed to the incorporation of any sort of technology into weapons that could potentially be used as a means of state control or regulation. The concern isn't completely unfounded; the state of New Jersey did, for a time, have legislation that would have made user-identifying "smart guns" mandatory if they were commercially available. The result of the NRA's strident lobbying is that no such gun has ever become commercially available; "smart guns" have been such a political third rail that any firearms manufacturer that dared to introduce one would probably face a boycott by most gun stores. For better or worse, a result of the NRA's powerful political advocacy in this area is that the concept of embedding security or accountability technology into weapons has never been seriously pursued in the US. Even a tentative step in that direction can produce a huge volume of critical press for everyone involved.

I bring this up because I think it explains some of why VAI seems a bit vague and cagey about the round-counting capabilities of their tags. They position it as purely a maintenance feature, allowing the armorer to keep accurate tabs on the preventative maintenance schedule for each individual weapon (in armory environments, firearm users are often expected to report how many rounds they fired for maintenance tracking reasons). The resistance of RuBee tags to concealment is only positioned as a deterrent to theft, although the idea of RuBee-tagged firearms creates obvious potential for security screening. Probably the most profitable option for VAI would be to promote RuBee-tagged firearms as a tool for enforcement of gun control laws, but this is a political impossibility and bringing it up at all could cause significant reputational harm, especially with the government as a key customer. The result is marketing copy that is a bit odd, giving a set of capabilities that imply an application that is never mentioned.


VAI found an incredible niche with their arms-tracking application. Institutional users of firearms, like the military, police, and security forces, are relatively price-insensitive and may have strict accounting requirements. By the mid-'00s, VAI was into the long sales cycle of proposing the technology to the military. That wasn't entirely unsuccessful. RuBee shot-counting weapon inventory tags were selected by the Naval Surface Warfare Center in 2010 for installation on SCAR and M4 rifles. That contract had a five-year term; it's unclear to me if it was renewed. Military contracting opened quite a few doors to VAI, though, and created a commercial opportunity that they eagerly pursued.

Perhaps most importantly, weapons applications required an impressive round of safety and compatibility testing. RuBee tags have the fairly unique distinction of military approval for direct attachment to ordnance, something called "zero separation distance" as the tags do not require a minimum separation from high explosives. Central to that certification are findings of intrinsic safety of the tags (that they do not contain enough energy to trigger explosives) and that the magnetic fields involved cannot convey enough energy to heat anything to dangerous temperatures.

That's not the only special certification that RuBee would acquire. The military has a lot of firearms, but military procurement is infamously slow and mercurial. Improved weapon accountability is, almost notoriously, not a priority for the US military, which has often had stolen weapons go undetected until their later use in crime. The Navy's interest in RuBee does not seem to have translated to more widespread military applications.

Then you have police departments, probably the largest institutional owners of firearms and a very lucrative market for technology vendors. But here we run into the political hazard: the firearms lobby is very influential on police departments, as are police unions which generally oppose technical accountability measures. Besides, most police departments are fairly cash-poor and are not likely to make a major investment in a firearms inventory system.

That leaves us with institutional security forces. And there is one category of security force that is particularly well-funded, well-equipped, and beholden to highly R&D-driven, almost pedantic standards of performance: the protection forces of atomic energy facilities.

Protection forces at privately-operated atomic energy facilities, such as civilian nuclear power plants, are subject to licensing and scrutiny by the Nuclear Regulatory Commission. Things step up further at the many facilities operated by the National Nuclear Security Administration (NNSA). Protection forces for NNSA facilities are trained at the Department of Energy's National Training Center, at the former Manzano Base here in Albuquerque. Concern over adequate physical protection of NNSA facilities has led Sandia National Laboratories to become one of the premier centers for R&D in physical security. Teams of scientists and engineers have applied sometimes comical scientific rigor to "guns, gates, and guards," the traditional articulation of physical security in the nuclear world.

That scope includes the evaluation of new technology for the management of protection forces, which is why Oak Ridge National Laboratory launched an evaluation program for the RuBee tagging of firearms in their armory. The white paper on this evaluation is curiously undated, but citations "retrieved 2008" lead me to assume that the evaluation happened right around the middle of the '00s. At the time, VAI seems to have been involved in some ultimately unsuccessful partnership with Oracle, leading to the branding of the RuBee system as Oracle Dot-Tag Server. The term "Dot-Tag" never occurs outside of very limited materials around the Oracle partnership, so I'm not sure if it was Oracle branding for RuBee or just some passing lark. In any case, Oracle's involvement seems to have mainly just been the use of the Oracle database for tracking inventory data---which was naturally replaced by PostgreSQL at Oak Ridge.

The Oak Ridge trial apparently went well enough, and around the same time, the Pantex Plant in Texas launched an evaluation of RuBee for tracking classified tools. Classified tools are a tricky category, as they're often metallic and often stored in metallic cases. During the trial period, Pantex tagged a set of sample classified tools with RuBee tags and then transported them around the property, testing the ability of the RuBee controllers to reliably detect them entering and exiting areas of buildings. Simultaneously, Pantex evaluated the use of RuBee tags to track containers of "chemical products" through the manufacturing lifecycle. Both seem to have produced positive results.

There are quite a few interesting and strange aspects of the RuBee system, a result of its purpose-built Visibility Network nature. A RuBee controller can have multiple antennas that it cycles through. RuBee tags remain in a deep-sleep mode for power savings until they detect a RuBee carrier during their periodic wake cycle. When a carrier is detected, they fully wake and listen for traffic. A RuBee controller can send an interrogate message and any number of tags can respond, with an interesting and novel collision detection algorithm used to ensure reliable reading of a large number of tags.

The actual RuBee protocol is quite simple, and can also be referred to as IEEE 1902.1 since VAI put it through the standards process. Packets are small and contain basic addressing info, but they can also contain arbitrary payload in both directions, perfect for data loggers or sensors. RuBee tags are identified by something that VAI oddly refers to as an "IP address," causing some confusion over whether or not VAI uses IP over 1902.1. They don't, I am confident saying after reading a whole lot of documents. RuBee tags, as standard, have three different 4-byte addresses. VAI refers to these as "IP, subnet, and MAC,"[2] but these names are more like analogies. Really, the "IP address" and "subnet" are both configurable arbitrary addresses, with the former intended for unicast traffic and the latter for broadcast. For example, you would likely give each asset a unique IP address, and use subnet addresses for categories or item types. The subnet address allows a controller to interrogate for every item within that category at once. The MAC address is a fixed, non-configurable address derived from the tag's serial number. They're all written in the formats we associate with IP networks, dotted-quad notation, as a matter of convenience.
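As a sketch of that addressing scheme (the class, the address derivation, and every name below are my own illustration based on the description above, not VAI's actual API; in particular, how a real tag derives its MAC from the serial number is not public, so the hash here is invented):

```python
# Toy model of RuBee tag addressing: a configurable "IP" (unicast) and
# "subnet" (category broadcast) address, plus a fixed "MAC" derived
# from the tag serial number. Dotted-quad notation is just display
# convenience, not evidence of IP networking.
import hashlib

def dotted_quad(n: int) -> str:
    """Render a 4-byte address in familiar dotted-quad notation."""
    return ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))

class Tag:
    def __init__(self, serial: str, ip: int, subnet: int):
        self.ip = ip          # configurable, unique per asset
        self.subnet = subnet  # configurable, shared by a category
        # Fixed address derived from the serial (derivation invented here)
        self.mac = int.from_bytes(
            hashlib.sha256(serial.encode()).digest()[:4], "big")

    def answers(self, addr: int) -> bool:
        # A tag responds to its unicast address, its category broadcast,
        # or its fixed hardware address.
        return addr in (self.ip, self.subnet, self.mac)

rifle = Tag("SN-000123", ip=0x0A000001, subnet=0x0A0000FF)
print(dotted_quad(rifle.ip))      # 10.0.0.1
print(rifle.answers(0x0A0000FF))  # True: responds to category interrogation
```

The key property is that one interrogation to the "subnet" address sweeps an entire category of assets, while the "IP" address singles out one item.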

And that's about it as far as the protocol specification goes, besides of course the physical details: a 131,072 Hz carrier, a 1024 Hz data clock, and either ASK or BPSK modulation. The specification also describes an interesting mode called "clip," in which a set of multiple controllers interrogates in exact synchronization and all tags then reply in exact synchronization. Somewhat counter-intuitively, this is ideal: because RuBee controllers can separate out multiple simultaneous tag transmissions using an anti-collision algorithm based on random phase shifts by each tag, it allows a room, say an armory, full of RuBee controllers to rapidly interrogate the entire contents of the room. I think this feature may have been added after the Oak Ridge trials...
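The anti-collision idea can be illustrated with a toy simulation. To be clear about the substitution: the real 1902.1 mechanism uses random phase shifts that the controller separates in signal processing, while the sketch below uses the simpler, only loosely analogous idea of random reply slots, with all numbers invented:

```python
# Conceptual stand-in for tag anti-collision: each tag randomizes its
# reply, and the controller repeats interrogation rounds until every
# tag has been heard at least once in the clear.
import random

def inventory_round(tag_ids, n_slots=16, rng=random):
    """One interrogation round: each tag picks a random reply slot;
    only slots with a single occupant are read successfully."""
    slots = {}
    for tag in tag_ids:
        slots.setdefault(rng.randrange(n_slots), []).append(tag)
    return {tags[0] for tags in slots.values() if len(tags) == 1}

def inventory(tag_ids, n_slots=16, seed=0):
    """Repeat rounds until every tag has been read; return round count."""
    rng = random.Random(seed)
    remaining, rounds = set(tag_ids), 0
    while remaining:
        remaining -= inventory_round(remaining, n_slots, rng)
        rounds += 1
    return rounds

print(inventory(range(40)))  # rounds needed to inventory 40 tags
```

The point of the exercise: randomized replies mean collisions resolve statistically over a few rounds rather than requiring the controller to schedule each tag, which is what makes a synchronized room-wide "clip" interrogation workable.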

RuBee is quite slow, typically 1,200 baud, so inventorying a large number of assets can take a while (Oak Ridge found that their system could only collect data on 2-7 tags per second per controller). But it's so robust that it can achieve a 100% read rate in some very challenging scenarios. Evaluation by the DoE and the military produced impressive results. You can read, for example, of a military experiment in which a RuBee antenna embedded in a roadway reliably identified rifles secured in steel containers in passing Humvees.
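At those read rates, the arithmetic for a large inventory is easy to run (the 2-7 tags per second figure is Oak Ridge's; the 500-weapon armory is an invented scenario):

```python
# Time for a single controller to inventory a hypothetical 500-weapon
# armory at Oak Ridge's observed per-controller read rates.
tags = 500
for rate in (2, 7):  # tags per second, slow and fast cases
    print(f"{rate} tags/s: {tags / rate:.0f} s")
# 2 tags/s: 250 s
# 7 tags/s: 71 s
```

A few minutes per controller is slow by RFID portal standards, but acceptable when the alternative is a read failure on a tracked firearm.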

Paradoxically, then, one of the benefits of RuBee in the military/defense context is that it is also difficult to receive. Here is RuBee's most interesting trick: somewhat oversimplified, the strength of an electrical radio signal goes as 1/r, while the strength of a magnetic field goes as 1/r^3. RuBee equipment is optimized, by antenna design, to produce a minimal electrical field. The result is that RuBee tags can very reliably be contacted at short range (say, around ten feet), but are virtually impossible to contact or even detect at ranges over a few hundred feet. To the security-conscious buyer, this is a huge feature. RuBee tags are highly resistant to communications or electronic intelligence collection.
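That 1/r versus 1/r^3 difference compounds quickly with distance. A simplified free-space comparison in decibel terms (ignoring antenna geometry and absorption; the exponents are the only inputs taken from the text above):

```python
# Relative field attenuation vs. distance for a far-field electric
# signal (1/r) and a near-field magnetic signal (1/r^3).
import math

def falloff_db(r, r0=1.0, exponent=1):
    """Attenuation in dB relative to reference distance r0, for a
    field strength going as 1/r**exponent (20*log10 of the ratio)."""
    return 20 * exponent * math.log10(r / r0)

for r in (3, 30, 300):  # meters
    print(f"{r:>4} m: far-field ~{falloff_db(r, exponent=1):6.1f} dB down, "
          f"near-field ~{falloff_db(r, exponent=3):6.1f} dB down")
```

Every decade of distance costs a far-field signal 20 dB but a near-field magnetic signal 60 dB, which is why a tag that reads reliably at ten feet is effectively gone at a few hundred.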

Consider the logical implications of tagging the military's rifles. With conventional RFID, range is limited by the size and sensitivity of the antenna. Particularly when tags are incidentally powered by a nearby reader, an adversary with good equipment can detect RFID tags at very long range. VAI heavily references a 2010 DEFCON presentation, for example, that demonstrated detection of RFID tags at a range of 80 miles. One imagines that opportunistic detection by satellite is feasible for a state intelligence agency. That means that your rifle asset tracking is also revealing the movements of soldiers in the field, or at least providing a way to detect their approach.

Most RuBee tags have their transmit power reduced by configuration, so even the maximum 100' range of the protocol is not achievable. VAI suggests that typical RuBee tags cannot be detected by radio direction finding equipment at ranges beyond 20', and that this range can be made shorter by further reducing transmit power.

Once again, we have caught the attention of the Department of Energy. Because of the short range of RuBee tags, they have generally been approved as not representing a COMSEC or TEMPEST hazard to secure facilities. And that brings us back to the very beginning: why does the DoE use a specialized, technically interesting, and largely unique radio protocol to fulfill such a basic function as nagging people who are carrying their phones? Because RuBee's security properties have allowed it to be approved for use adjacent to and inside of secure facilities. A RuBee tag, it is thought, cannot be turned into a listening device because the intrinsic range limitation of magnetic coupling will make it impossible to communicate with the tag from outside of the building. It's a lot like how infrared microphones still see some use in secure facilities, but so much more interesting!

VAI has built several different product lines around RuBee, with names like Armory 20/20 and Shot Counting Allegro 20/20 and Store 20/20. The founder started his career in eye health, remember. None of them are that interesting, though. They're all pretty basic CRUD applications built around polling multiple RuBee controllers for tags in their presence.

And then there's the "Alert 20/20 DoorGuard:" a metal pedestal with a RuBee controller and audio announcement module, perfect for detecting government cell phones.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


One of the strangest things about RuBee is that it's hard to tell if it's still a going concern. VAI's website has a press release section, where nothing has been posted since 2019. The whole website feels like it was last revised even longer ago. When RuBee was newer, back in the '00s, a lot of industry journals covered it with headlines like "the new RFID." I think VAI was optimistic that RuBee could displace all kinds of asset tracking applications, but despite some special certifications in other fields (e.g. approval to use RuBee controllers and tags around pacemakers in surgical suites), I don't think RuBee has found much success outside of military applications.

RuBee's resistance to shielding is impressive, but RFID read rates have improved considerably with new DSP techniques, antenna array designs, and the generally reduced cost of modern RFID equipment. RuBee's unique advantages, its security properties and resistance to even intentional exfiltration, are interesting but not worth much money to buyers other than the military.

So that's the fate of RuBee and VAI: defense contracting. As far as I can tell, RuBee and VAI are about as vital as they have ever been, but RuBee is now installed as just one part of general defense contracts around weapons systems, armory management, and process safety and security. IEEE standardization has opened the door to use of RuBee by federal contractors under license, and indeed, Lockheed Martin is repeatedly named as a licensee, as are firearms manufacturers with military contracts like Sig Sauer.

Besides, RuBee continues to grow closer to the DoE. In 2021, VAI appointed Lisa Gordon-Hagerty to its board of directors. Gordon-Hagerty was Under Secretary of Energy and had led the NNSA until the year before. This year, the New Hampshire Small Business Development Center wrote a glowing profile of VAI. They described it as a 25-employee company with a goal of hitting $30 million in annual revenue in the next two years.

Despite the outdated website, VAI claims over 1,200 RuBee sites in service. I wonder how many of those are Alert 20/20 DoorGuards? Still, I do believe there are military weapons inventory systems currently in use. RuBee probably has a bright future, as a niche technology for a niche industry. If nothing else, they have legacy installations and intellectual property to lean on. A spreadsheet of VAI-owned patents on RuBee, with nearly 200 rows, encourages would-be magnetically coupled visibility network inventors not to go it on their own. I just wish I could get my hands on a controller....

  1. I have found some conflicting information on the date, it could have been as early as 2002. 2004 is the year I have the most confidence in.

  2. The documentation is confusing enough about these details that I am actually unclear on whether the RuBee "MAC address" is 4 bytes or 6. Examples show 6 byte addresses, but the actual 1902.1 specification only seems to allow 4 byte addresses in headers. Honestly all of the RuBee documentation is a mess like this. I suspect that part of the problem is that VAI has actually changed parts of the protocol and not all of their products are IEEE 1902.1 compliant.

Error'd: A Horse With No Name

5 December 2025 at 06:30

Scared Stanley stammered "I'm afraid of how to explain to the tax authority that I received $NaN."


Our anonymous friend Anon E. Mous wrote "I went to look some employee benefits stuff up and ... This isn't a good sign."


Regular Michael R. is not actually operating under an alias, but this (allegedly scamming?) site doesn't know that.


Graham F. gloated "I'm glad my child's school has followed our naming convention for their form groups as well!"


Adam R. is taking his anonymous children on a roadtrip to look for America. "I'm planning a trip to St. Louis. While trying to buy tickets for the Gateway Arch, I noticed that their ticketing website apparently doesn't know how to define adults or children (or any of the other categories of tickets, for that matter)."



CodeSOD: Pawn Pawn in in Game Game of of Life Life

4 December 2025 at 06:30

It feels like ages ago, when document databases like Mongo were all the rage. That isn't to say that they haven't stuck around and don't deliver value, but gone is the faddish "RDBMSes are dead, bro." The "advantage" they offer is that they turn data management problems into serialization problems.

And that's where today's anonymous submission takes us. Our submitter has a long list of bugs around managing lists of usernames. These bugs largely exist because the contract developer who wrote the code didn't write anything, and instead "vibe coded too close to the sun", according to our submitter.

Here's the offending C# code:

   [JsonPropertyName("invitedTraders")]
   [BsonElement("invitedTraders")]
   [BsonIgnoreIfNull]
   public InvitedTradersV2? InvitedTraders { get; set; }

   [JsonPropertyName("invitedTradersV2")]
   [BsonElement("invitedTradersV2")]
   [BsonIgnoreIfNull]
   public List<string>? InvitedTradersV2 { get; set; }

Let's start with the type InvitedTradersV2. This type contains a list of strings which represent usernames. The field InvitedTradersV2 is a list of strings which represent usernames. Half of our submitter's bugs exist simply because these two lists get out of sync; they should contain the same data, but without someone enforcing that correctly, problems accrue.

This is made more frustrating by the MongoDB attribute, BsonIgnoreIfNull, which simply means that the serialized object won't contain the key if the value is null. But that means the consuming application doesn't know which key it should check.

For the final bonus fun, note the use of JsonPropertyName. This comes from the built-in class library, which tells .NET how to serialize the object to JSON. The problem here is that this application doesn't use the built-in serializer, and instead uses Newtonsoft.Json, a popular third-party library for solving the same problem. While Newtonsoft does recognize some built-in attributes for serialization, JsonPropertyName is not among them. This means the attribute does nothing in this example, aside from adding some confusion to the code base.

I suspect the developer responsible, if they even read this code, decided that the duplicated data was okay, because isn't that just a normal consequence of denormalization? And document databases are all about denormalization. It makes your queries faster, bro. Just one more shard, bro.


The Thanksgiving Shakedown

3 December 2025 at 06:30

On Thanksgiving Day, Ellis had cuddled up with her sleeping cat on the couch to send holiday greetings to friends. There in her inbox, lurking between several well wishes, was an email from an unrecognized sender with the subject line, Final Account Statement. Upon opening it, she read the following:

1880s stock delivery form agreement

Dear Ellis,

Your final account statement dated -1 has been sent to you. Please log into your portal and review your balance due totaling #TOTAL_CHARGES#.

Payment must be received within 30 days of this notice to avoid collection. You may submit payment online via [Payment Portal Link] or by mail to:

Chamberlin Apartments
123 Main Street
Anytown US 12345

If you believe there is an error on your account, please contact us immediately at 212-555-1212.

Thank you for your prompt attention to this matter.

Chamberlin Apartments

Ellis had indeed rented an apartment managed by this company, but had moved out 16 years earlier. She'd never been late with a payment for anything in her life. What a time to receive such a thing, at the start of a long holiday weekend when no one would be able to do anything about it for the next 4 days!

She truly had so much to be grateful for that Thanksgiving, and here was yet more for her list: her broad technical knowledge, her experience working in multiple IT domains, and her many years of writing up just these sorts of stories for The Daily WTF. All of this added up to her laughing instead of panicking. She could just imagine the poor intern who'd hit "Send" by mistake. She also imagined she wasn't the only person who'd received this message. Rightfully scared and angry callers would soon be hammering that phone number, and Ellis was further grateful that she wasn't the one who had to pick up.

"I'll wait for the apology email!" she said out loud with a knowing smile on her face, closing out the browser tab.

Ellis moved on physically and mentally, going forward with her planned Thanksgiving festivities without giving it another thought. The next morning, she checked her inbox with curious anticipation. Had there been a retraction, a please disregard?

No. Instead, there were still more emails from the same sender. The second, sent 7 hours after the first, bore the subject line Second Notice - Outstanding Final Balance:

Dear Ellis,

Our records show that your final balance of #TOTAL_CHARGES# from your residency at your previous residence remains unpaid.

This is your second notice. Please remit payment in full or contact us to discuss the balance to prevent your account from being sent to collections.

Failure to resolve the balance within the next 15 days may result in your account being referred to a third-party collections agency, which could impact your credit rating.

To make payment or discuss your account, please contact us at 212-555-1212 or accounting@chamapts.com.

Sincerely,

Chamberlin Apartments

The third, sent 6 and a half hours later, threatened Final Notice - Account Will Be Sent to Collections.

Dear Ellis,

Despite previous notices, your final account balance remains unpaid.

This email serves as final notice before your account is forwarded to a third-party collections agency for recovery. Once transferred, we will no longer be able to accept payment directly or discuss the account.

To prevent this, payment of #TOTAL_CHARGES# must be paid in full by #CRITICALDATE#.

Please submit payment immediately. Please contact 212-555-1212 to confirm your payment.

Sincerely,

Chamberlin Apartments

It was almost certainly a mistake, but still rather spooky to someone who'd never been in such a situation. There was solace in the thought that, if they really did try to force Ellis to pay #TOTAL_CHARGES# on the basis of these messages, anyone would find it absurd that all 3 notices were sent mere hours apart, on a holiday no less. The first two had also mentioned 30 and 15 days to pay up, respectively.

Suddenly remembering that she probably wasn't the only recipient of these obvious form emails, Ellis thought to check her local subreddit. Sure enough, there was already a post revealing the range of panic and bewilderment they had wrought among hundreds, if not thousands. Current and more recent former tenants had actually seen #TOTAL_CHARGES# populated with the correct amount of monthly rent. People feared everything from phishing attempts to security breaches.

It wasn't until later that afternoon that Ellis finally received the anticipated mea culpa:

We are reaching out to sincerely apologize for the incorrect collection emails you received. These messages were sent in error due to a system malfunction that released draft messages to our entire database.

Please be assured of the following:
The recent emails do not reflect your actual account status.
If your account does have an outstanding balance, that status has not changed, and you would have already received direct and accurate communication from our office.
Please disregard all three messages sent in error. They do not require any action from you.

We understand that receiving these messages, especially over a holiday, was upsetting and confusing, and we are truly sorry for the stress this caused. The issue has now been fully resolved, and our team has worked with our software provider to stop all queued messages and ensure this does not happen again.

If you have any questions or concerns, please feel free to email leasing@chamapts.com. Thank you for your patience and understanding.

All's well that ends well. Ellis thanked the software provider's "system malfunction," whoever or whatever it may've been, that had granted the rest of us a bit of holiday magic to take forward for all time.


CodeSOD: The Destination Dir

2 December 2025 at 06:30

Darren is supporting a Delphi application in the current decade. Which is certainly a situation to be in. He writes:

I keep trying to get out of doing maintenance on legacy Delphi applications, but they keep pulling me back in.

The bit of code Darren sends us isn't the largest WTF, but it's a funny mistake, and it's a funny mistake that's been sitting in the codebase for decades at this point. And as we all know, jokes only get funnier with age.

FileName := DestDir + ExtractFileName(FileName);
if FileExists(DestDir + ExtractFileName(FileName)) then
begin
  ...
end;

This code is inside of a module that copies a file from a remote server to the local host. It starts by sanitizing the FileName, using ExtractFileName to strip off any path components, replacing them with DestDir, and storing the result in the FileName variable.

And they liked doing that so much, they go ahead and do it again in the if statement, repeating the exact same process- even though FileName already holds the sanitized result, so the check could simply have been if FileExists(FileName).

Darren writes:

As Homer Simpson said "Lather, rinse, and repeat. Always repeat."


CodeSOD: Formula Length

1 December 2025 at 06:30

Remy's Law of Requirements Gathering states "No matter what the requirements document says, what your users really wanted was Excel." This has a corollary: "Any sufficiently advanced Excel file is indistinguishable from software."

Given enough time, any Excel file whipped up by any user can transition from "useful" to "mission critical software" before anyone notices. That's why Nemecsek was tasked with taking a pile of Excel spreadsheets and converting them into "real" software, which could be maintained and supported by software engineers.

Nemecsek writes:

This is just one of the formulas they asked me to work on, and not the longest one.

Nemecsek says this is a "formula", but I suspect it's a VBA macro. In reality, it doesn't matter.

InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).Losses = 
calcLossesInPart(InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).RatedFrequency, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).RadialPositionToMainDuct, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).IsTransposed, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).ParallelRadialCount, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).
ParallelAxialCount, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).Type, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
DimensionRadialElectric, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
DimensionAxialElectric + InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).InsulThickness, 
getElectricConductivityAtTemperatureT1(InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
InitechNeoDTActivePartPartConductorRawMaterial(0).ElectricConductivityT0, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).
InitechNeoDTActivePartPartConductor(0).InitechNeoDTActivePartPartConductorRawMaterial(0).MaterialFactor, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
InitechNeoDTActivePartPartConductorRawMaterial(0).ReferenceTemperatureT0, InitechNeoDTMachineDevice.
ReferenceTemperature), LayerNumberRatedVoltage, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartLayerContainer(0),InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).RFactor)

Line breaks added to try and keep horizontal scrolling sane. This arguably hurts readability, in the same way that beating a dead horse arguably hurts the horse.

This may not be the longest one, but it's certainly painful. I do not know exactly what this is doing, and frankly, I do not want to.


Error'd: On the Dark Side

28 November 2025 at 06:30

...matter of fact, it's all dark.

Gitter Hubber checks in on the holidays: "This is the spirit of the Black Friday on GitHub. That's because I'm using dark mode. Otherwise, it would have a different name… You know what? Let's just call it Error Friday!"


"Best get typing!" self-admonishes. Jason G. Suffering a surfeit of snark, he proposes "Not sure my battery will last long enough.
Finally, quantum resistant security.
I can't remember my number after the 5000th digit. " Any of those will do just fine.


Don't count Calle L. out. "This is for a calorie tracking app, on Thanksgiving. Offer was so delicious it wasn't even a number any more! Sadly it did not slim the price down more than expected."


"Snow and rain and rain and snow!" exclaims Paul N. "Weather so astounding, they just had to trigger three separate notifications at the same time."


It's not a holiday for everyone though, is it? Certainly not for Michael R., who is back with a customer service complaint about custom deliveries. "I am unlucky with my deliveries. This time it's DPD."



Classic WTF: Teleported Release

27 November 2025 at 06:30
It's a holiday in the US today, one where we give thanks. And today, we give thanks to not have this boss. Original. --Remy

Matt works at an accounting firm, as a data engineer. He makes reports for people who don’t read said reports. Accounting firms specialize in different areas of accountancy, and Matt’s firm is a general firm with mid-size clients.

The CEO of the firm is a legacy from the last century. The most advanced technology on his desk is a business calculator and a pencil sharpener. He still doesn’t use a cellphone. But he does have a son, who is “tech savvy”, which gives the CEO a horrible idea of how things work.

Usually, the CEO's requests are pretty light, in that they're sorting Excel files or sorting the output of an existing report. Sometimes the requests are bizarre or utter nonsense. And, because the boss doesn't know what the technical folks are doing, some of the IT staff may be a bit lazy about following best practices.

This means that most of Matt’s morning is spent doing what is essentially Tier 1 support before he gets into doing his real job. Recently, there was a worse crunch, as actual support person Lucinda was out on maternity leave, and Jackie, the one other developer, was off on vacation on a foreign island with no Internet. Matt was in the middle of eating a delicious lunch of take-out lo mein when his phone rang. He sighed when he saw the number.

“Matt!” the CEO exclaimed. “Matt! We need to do a build of the flagship app! And a deploy!”

The app was rather large, and a build could take upwards of 45 minutes, depending on the day and how the IT gods were feeling. But the process was automated: the latest changes all got built and deployed each night. Anything approved was released within 24 hours. With everyone out of the office, there hadn’t been any approved changes for a few weeks.

Matt checked GitHub to see if something went wrong with the automated build. Everything was fine.

“Okay, so I’m seeing that everything built on GitHub and everything is available in production,” Matt said.

“I want you to do a manual build, like you used to.”

“If I were to compile right now, it could take quite a while, and redeploying runs the risk of taking our clients offline- and nothing would be any different.”

“Yes, but I want a build that has the changes which Jackie was working on before she left for vacation.”

Matt checked the commit history, and sure enough, Jackie hadn’t committed any changes since two weeks before leaving on vacation. “It doesn’t look like she pushed those changes to GitHub.”

“Githoob? I thought everything was automated. You told me the process was automated,” the CEO said.

“It’s kind of like…” Matt paused to think of an analogy that could explain this to a golden retriever. “Your dishwasher, you could put a timer on it to run it every night, but if you don’t load the dishwasher first, nothing gets cleaned.”

There was a long pause as the CEO failed to understand this. “I want Jackie’s front-page changes to be in the demo I’m about to do. This is for Initech, and there’s millions of dollars riding on their account.”

“Well,” Matt said, “Jackie hasn’t pushed- hasn’t loaded her metaphorical dishes into the dishwasher, so I can’t really build them.”

“I don’t understand, it’s on her computer. I thought these computers were on the cloud. Why am I spending all this money on clouds?”

“If Jackie doesn’t put it on the cloud, it’s not there. It’s uh… like a fax machine, and she hasn’t sent us the fax.”

“Can’t you get it off her laptop?”

“I think she took it home with her,” Matt said.

“So?”

“Have you ever seen Star Trek? Unless Scotty can teleport us to Jackie’s laptop, we can’t get at her files.”

The CEO locked up on that metaphor. “Can’t you just hack into it? I thought the NSA could do that.”

“No-” Matt paused. Maybe Matt could try and recreate the changes quickly? “How long before this meeting?” he asked.

“Twenty minutes.”

“Just to be clear, you want me to do a local build with files I don’t have by hacking them from a computer which may or may not be on and connected to the Internet, and then complete a build process which usually takes 45 minutes- at least- deploy to production, so you can do a demo in twenty minutes?”

“Why is that so difficult?” the CEO demanded.

“I can call Jackie, and if she answers, maybe we can figure something out.”

The CEO sighed. “Fine.”

Matt called Jackie. She didn’t answer. Matt left a voicemail and then went back to eating his now-cold lo mein.


Announcements: We Want Your Holiday Horrors

26 November 2025 at 10:00

As we enter the latter portion of the year, folks are traveling to visit family, logging off of work in hopes that everything can look after itself for a month, and somewhere, someone is going to make the choice "yes, I can push to prod on Christmas Eve, and it'll totally work out for me!"

Over the next few weeks, I'm hoping to get some holiday support horrors up on the site, in keeping with the season. Whether it's the absurd challenges of providing family tech support, last-minute pushes to production, or five-alarm fires caused by a pointy-haired boss's incompetence, we want your tales of holiday IT woe.

So hit that submit button on the side bar, and tell us who's on Santa's naughty list this year.


Tales from the Interview: Interview Smack-Talk

26 November 2025 at 06:30

In today's Tales from the Interview, our Anonymous submitter relates their experience with an anonymous company:

I had made it through the onsite, but along the way I had picked up some toxic work environment red flags. Since I had been laid off a couple months prior, I figured I wasn't in a position to be picky, so I decided I would still give it my best shot and take the job if I got it, but I'd continue looking for something better.

Then they brought me back onsite a second time for one final interview with 2 senior managers. I went in and they were each holding a printout of my resume. They proceeded to go through everything on it. First they asked why I chose the university I went to, then the same for grad school, which was fine.

[Image: WWF SmackDown logo (1999-2001)]

Then they got to my first internship. I believe the conversation went something like this:

Manager: "How did you like it?"

Me: "Oh, I loved it!"

Manager: "Were there any negatives?"

Me: "No, not that I can think of."

Manager: "So it was 100% positive?"

Me: "Yep!"

And then they got to my first full-time job, where the same manager repeated the same line of questioning but pushed even harder for me to say something negative, at one point saying "Well, you left for (2nd company on my resume), so there must have been something negative."

I knew better than to bad-mouth a previous employer in an interview; it's like going into a first date and talking smack about your ex. But what do you do when your date relentlessly asks you to talk smack about all your exes and refuses to let the subject turn to anything else? This not only confirmed my suspicions of a toxic work environment; I also figured *they* probably knew it was toxic and were relentlessly testing every candidate to make sure they wouldn't blow the whistle on them.

That was the most excruciatingly awkward interview I've ever had. I didn't get the job, but at that point I didn't care anymore, because I was very, very sure I didn't want to work there in the long term.

I'm glad Subby dodged that bullet, and I hope they're in a better place now.

It seems like this might be some stupid new trend. I recently bombed an interview where I could tell I wasn't giving the person the answer on their checklist, no matter how many times I tried. It was a question about how I handled it when someone opposed what I was doing at work or gave me negative feedback. It felt like they wanted me to admit to more fur-flying drama and fireworks than had ever actually occurred.

I actively ask for and welcome critique on my writing, it makes my work so much better. And if my work is incorrect and needs to be redone, or someone has objections to a project I'm part of, I seek clarification and (A) implement the requested changes, (B) explain why things are as they are and offer alternate suggestions/solutions, (C) seek compromise, depending on the situation. I don't get personal about it.

So, why this trend? Subby believed it was a way to test whether the candidate would someday badmouth the employer. That's certainly feasible, though if that were the goal, you'd think Subby would've passed their ordeal with flying colors. I'm not sure myself, but I have a sneaking suspicion that the nefarious combination of AI and techbro startup culture has something to do with it.

So perhaps I also dodged a bullet: one of the many things I'm grateful for this Thanksgiving.

Feel free to share your ideas, and any and all bullets you have dodged, in the comments.


CodeSOD: The Map to Your Confession

25 November 2025 at 06:30

Today, Reginald approaches us for a confession.

He writes:

I've no idea where I "copied" this code from five years ago. The purpose of this code was to filter out Maps and Collections. Maybe the intention was to avoid a recursive implementation by an endless loop? I am shocked that I wrote such code.

Well, that doesn't bode well, Reginald. Let's take a look at this Java snippet:

/**
 * 
 * @param input
 * @return
 */
protected Map rearrangeMap(Map input) {
	Map retMap = new HashMap();

	if (input != null && !input.isEmpty()) {

		Iterator it = input.keySet().iterator();
		while (true) {
			String key;
			Object obj;
			do {
				do {
					if (!it.hasNext()) {
					}
					key = (String) it.next();

				} while (input.get(key) instanceof Map);

				obj = input.get(key);

			} while (obj instanceof Boolean && ((Boolean) obj).equals(Boolean.FALSE));

			if (obj != null) {
				retMap.put(key, obj);
				return retMap;
			}
		}
	} else {
		return retMap;
	}
}

The first thing that leaps out is that this is a non-generic Map, which is always a code smell, but I suspect that's the least of our problems.

We start by verifying that the input Map exists and contains data. If the input is null or empty, we return an empty map. In our main branch, we create an iterator across the keys before entering a while(true) loop. So far, so bad.

Then we enter a pair of nested do loops, which definitely hints that we've gone off the edge of the map here. In the innermost loop, we do a check- if there isn't a next element in the iterator, we… do absolutely nothing. Whether there is or isn't an element, we advance to the next element, risking a NoSuchElementException. We do this while the key points to an instance of Map. As always, an instanceof check is a nauseating code stench.

Okay, so the inner loop skips across any keys that point to maps, and throws an exception when it gets to the end of the list.

The surrounding loop skips over every key that points to a Boolean false value.

If we find anything which isn't a Map and isn't a false Boolean and isn't null, we put it in our retMap and return it.

This function finds the first key that points to a non-map, non-false, non-null value and creates a new map that contains only that key/value pair. It's hard to understand why I'd want that, especially since some Map implementations make no guarantee about ordering, so "first" isn't even well-defined. And even if I did want that, I definitely wouldn't want to do it this way. A single for loop could have solved this problem.
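For what it's worth, that single loop might look something like this sketch- the class name is mine, and I'm assuming the goal really is "return a one-entry map holding the first value that isn't a Map, isn't Boolean.FALSE, and isn't null":

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the single-loop version. Generic types added, since the raw
// Map was itself a code smell in the original.
public class Rearranger {
    protected static Map<String, Object> rearrangeMap(Map<String, Object> input) {
        Map<String, Object> retMap = new HashMap<>();
        if (input == null) {
            return retMap;
        }
        for (Map.Entry<String, Object> entry : input.entrySet()) {
            Object obj = entry.getValue();
            // Skip nested maps, explicit Boolean.FALSE values, and nulls.
            if (obj instanceof Map || Boolean.FALSE.equals(obj) || obj == null) {
                continue;
            }
            retMap.put(entry.getKey(), obj);
            return retMap; // first acceptable entry wins, as in the original
        }
        return retMap;
    }

    public static void main(String[] args) {
        Map<String, Object> input = new HashMap<>();
        input.put("skipMe", new HashMap<String, Object>());
        input.put("alsoSkip", Boolean.FALSE);
        input.put("keep", "value");
        System.out.println(rearrangeMap(input)); // {keep=value}
    }
}
```

Iterating entrySet also avoids the original's repeated input.get(key) lookups, and nothing here can throw NoSuchElementException.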

Reginald, I don't think there's any absolution for this. Instead, my advice would be to install a carbon monoxide detector in your office, because I have some serious concerns about whether or not your brain is getting enough oxygen.


CodeSOD: Copied Homework

24 November 2025 at 06:30

Part of the "fun" of JavaScript is dealing with code which comes from before sensible features existed. For example, if you wanted to clone an object in JavaScript, circa 2013, that was a wheel you needed to invent for yourself, as this StackOverflow thread highlights.

There are now better options, and you'd think that people would use them. However, the only thing more "fun" than dealing with code that hasn't caught up with the times is dealing with developers who haven't, and still insist on writing their own versions of standard methods.

  const objectReplace = (oldObject, newObject) => {
    let keys = Object.keys(newObject)
    try {
      for (let key of keys) {
        oldObject[key] = newObject[key]
      }
    } catch (err) {
      console.log(err, oldObject)
    }     

    return oldObject
  }

It's worth noting that Object.entries returns an array containing both the keys and values, which would be a more sensible choice for this operation- but then again, if we're talking about using correct functions, Object.assign would replace this function entirely.

There's no need to handle errors here, as nothing about this assignment should throw an exception.

The thing that really irks me about this though is that it pretends to be functional (in the programming idiom sense) by returning the newly modified value, but it's also just changing that value in place because it's a reference. So it has side effects, in a technical sense (changing the value of its input parameters) while pretending not to. Now, I probably shouldn't get too hung up on that, because that's also exactly how Object.assign behaves, but dammit, I'm going to be bothered by it anyway. If you're going to reinvent the wheel, either make one that's substantially worse, or fix the problems with the existing wheel.
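To be concrete about that complaint, here's a quick sketch (variable names are mine) showing that Object.assign has exactly the same mutate-and-return contract as objectReplace:

```javascript
// Object.assign copies each enumerable own property of the sources onto
// the target, mutating the target in place and returning that same target-
// the contract objectReplace reimplements by hand.
const oldObject = { a: 1, b: 2 };
const newObject = { b: 3, c: 4 };

const result = Object.assign(oldObject, newObject);

console.log(result);               // { a: 1, b: 3, c: 4 }
console.log(result === oldObject); // true: same reference, modified in place
```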

In any case, the real WTF here is that this function is buried deep in a 15,000 line file, written by an offshore contract team, and there are at least 5 other versions of this function, all with slightly different names, but all basically doing the same thing, because everyone on the team is just copy/pasting until they get enough code to submit a pull request.

Our submitter wonders, "Is there a way to train an AI to not let people type this?"

No, there isn't. You can try rolling that boulder up a hill, but it'll always roll right back down. Always and forever, people are going to write bad code.


Error'd: Untimely

21 November 2025 at 06:30

Sometimes, it's hard to know just when you are. This morning, I woke up to a Macbook that thinks it's in Paris, four hours ago. Pining for pain chocolate. A bevy of anonyms have had similar difficulties.

First up, an unarabian anonym observes "They say that visiting Oman feels like traveling back in time to before the rapid modernization of the Arab states. I just think their eVisa application system is taking this "time travel" thing a bit too far... "


Snecod, an unretired (anteretired?) anonym finds it hard to plan when the calendar is unfixed. "The company's retirement plan was having a rough time prior to Second June." Looks like the first wtf was second March.


And an unamerican anonym sent us this (uh, back in first March) "Was looking to change the cable package I have from them. Apparently my discounts are all good until 9th October 1930, and a second one looking good until 9th January 2024."


On a different theme, researcher Jennifer E. exclaimed "Those must have been BIG divorces! Guy was so baller Wikipedia couldn’t figure out when he divorced either of these women." Or so awful they divorced him continuously.


Finally, parsimonious Greg L. saved this for us. "I don't remember much about #Error!, but I guess it was an interesting day."



CodeSOD: Invalid Route and Invalid Route

20 November 2025 at 06:30

Someone wanted to make sure that invalid routes logged an error in their Go web application. Artem found this when looking at production code.

if (requestUriPath != "/config:system") &&
    (requestUriPath != "/config:system/ntp") &&
    (requestUriPath != "/config:system/ntp/servers") &&
    (requestUriPath != "/config:system/ntp/servers/server") &&
    (requestUriPath != "/config:system/ntp/servers/server/config") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/address") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/key-id") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/minpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/maxpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/version") &&
    (requestUriPath != "/config:system/ntp/servers/server/state") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/address") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/key-id") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/minpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/maxpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/version") {
    log.Info("ProcessGetNtpServer: no return of ntp server state for ", requestUriPath)
    return nil
}

The most disturbing part of this, for Artem, isn't that someone wrote this code and pushed it to production. It's that, according to git blame, two people wrote this code, because the first developer didn't include all the cases.

For the record, the application does have an actual router module, which can trigger logging on invalid routes.
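Even without the router, a single allow-list lookup would have done the job. A sketch of that alternative- the names here are hypothetical, not from Artem's codebase:

```go
package main

import "fmt"

// validNtpPaths holds every path the original chain accepted; one map
// membership test replaces sixteen hand-typed inequalities.
var validNtpPaths = map[string]bool{
	"/config:system":                                   true,
	"/config:system/ntp":                               true,
	"/config:system/ntp/servers":                       true,
	"/config:system/ntp/servers/server":                true,
	"/config:system/ntp/servers/server/config":         true,
	"/config:system/ntp/servers/server/config/address": true,
	"/config:system/ntp/servers/server/config/key-id":  true,
	"/config:system/ntp/servers/server/config/minpoll": true,
	"/config:system/ntp/servers/server/config/maxpoll": true,
	"/config:system/ntp/servers/server/config/version": true,
	"/config:system/ntp/servers/server/state":          true,
	"/config:system/ntp/servers/server/state/address":  true,
	"/config:system/ntp/servers/server/state/key-id":   true,
	"/config:system/ntp/servers/server/state/minpoll":  true,
	"/config:system/ntp/servers/server/state/maxpoll":  true,
	"/config:system/ntp/servers/server/state/version":  true,
}

// isValidNtpPath reports whether the request path is on the allow-list.
// Indexing a map with a missing key yields the zero value, i.e. false.
func isValidNtpPath(p string) bool {
	return validNtpPaths[p]
}

func main() {
	fmt.Println(isValidNtpPath("/config:system/ntp")) // true
	fmt.Println(isValidNtpPath("/bogus"))             // false
}
```

As a bonus, adding route number seventeen becomes a one-line diff instead of another inequality pasted into the condition.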


CodeSOD: Are You Mocking Me?

19 November 2025 at 06:30

Today's representative line comes from Capybara James (most recently previously). It's representative, not just of the code base, but of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Or, "you get what you measure".

If, for example, you decide that code coverage metrics are how you're going to judge developers, then your developers are going to ensure that the code coverage looks great. If you measure code coverage, then you will get code coverage- and nothing else.

That's how you get tests like this:

Mockito.verify(exportRequest, VerificationModeFactory.atLeast(0)).failedRequest(any(), any(), any());

This test passes if the function exportRequest.failedRequest is called at least zero times, with any input parameters.

Which, as you might imagine, is a somewhat useless thing to test. But what's important is that there is a test. The standards for code coverage are met, the metric is satisfied, and Goodhart marks up another win on the board.


Using an ADE: Ancient Development Environment

18 November 2025 at 06:30

One of the things that makes legacy code legacy is that code, over time, rots. Some of that rot comes from the gradual accumulation of fixes, hacks, and cruft. But much of the rot also comes from the tooling going unsupported or entirely out of support.

For example, many years ago, I worked in a Visual Basic 6 shop. The VB6 IDE went out of support in April, 2008, but we continued to use it well into the next decade. This made it challenging to support the existing software, as the IDE frequently broke in response to OS updates. Even when we started running it inside of a VM running an antique version of Windows 2000, we kept running into endless issues getting projects to compile and build.

A fun side effect of that: the VB6 runtime remains supported. So you can run VB6 software on modern Windows. You just can't modify that software.

Greta has inherited an even more antique tech stack. She writes, "I often wonder if I'm the last person on Earth encumbered with this particular stack." She adds, "The IDE is long-deprecated from a vendor that no longer exists- since 2002." Given that the project started in the mid-2010s, it may have been a bad choice to use that tech stack.

It's not as bad as it sounds- while the technology and tooling is crumbling ruins, the team culture is healthy and the C-suite has given Greta wide leeway to solve problems. But that doesn't mean that the tooling isn't a cause of anguish, and even worse than the tooling- the code itself.

"Some things," Greta writes, "are 'typical bad'" and some things "are 'delightfully unique' bad."

For example, the IDE has a concept of "designer" files, for the UI, and "code behind" files, for the logic powering the UI. The IDE frequently corrupts its own internal state, and loses the ability to properly update the designer files. When this happens, if you attempt to open, save, or close a designer file, the IDE pops up a modal dialog box complaining about the corruption, with a "Yes" and "No" option. If you click "No", the modal box goes away- and then promptly reappears, because the designer file is still broken. If you click "Yes", the IDE "helpfully" deletes pretty much everything in your designer file.

Nothing about the error message indicates that this might happen.

The language used is a dialect of C++. I say "dialect" because the vendor-supplied compiler implements some cursed feature set between C++98 and C++11 standards, but doesn't fully conform to either. It's only capable of outputting 32-bit x86 code up to a Pentium Pro. Using certain C++ classes, like std::fstream, causes the resulting executable to throw a memory protection fault on exit.

Worse, the vendor supplied class library is C++ wrappers on top of an even more antique Pascal library. The "class" library is less an object-oriented wrapper and more a collection of macros and weird syntax hacks. No source for the Pascal library exists, so forget about ever updating that.

Because the last release of the IDE was circa 2002, running it on any vaguely modern environment is prone to failures, but it also doesn't play nicely inside of a VM. At this point, the IDE works for one session. If you exit it, reboot your computer, or try to close and re-open the project, it breaks. The only fix is to reinstall it. But the reinstall requires you to know which set of magic options actually lets the install proceed. If you make a mistake and accidentally install, say, CORBA support, attempting to open the project in the IDE leads to a cascade of modal error boxes, including one that simply says, "ABSTRACT ERROR" ("My favourite", writes Greta). And these errors don't limit themselves to the IDE; attempting to run the compiler directly also fails.

But, if anything, it's the code that makes the whole thing really challenging to work with. While the UI is made up of many forms, the "main" form is 18,000 lines of code, with absolutely no separation of concerns. The individual forms don't fare much better: data is shared between forms via global variables declared in one master file, and then externed into other places. Even better, the various sub-forms are never destroyed, just hidden and shown, which means they remember their state whether you want that or not. And since much of the state is global, you have to be cautious about which parts of the state you reset.

Greta adds:

There are two files called main.cpp, a Station.cpp, and a Station1.cpp. If you were to guess which one owns the software's entry point, you would probably be wrong.

But, as stated, it's not all as bad as it sounds. Greta writes: "I'm genuinely happy to be here, which is perhaps odd given how terrible the software is." It's honestly not that odd; a good culture can go a long way to making wrangling a difficult tech stack happy work.

Finally, Greta has this to say:

We are actively working on a .NET replacement. A nostalgic, perhaps masochistic part of me will miss the old stack and its daily delights.


Meta Plans Deep Cuts to Metaverse Efforts

By: Nick Heer
5 December 2025 at 23:52

Kurt Wagner, Bloomberg:

Meta Platforms Inc.’s Mark Zuckerberg is expected to meaningfully cut resources for building the so-called metaverse, an effort that he once framed as the future of the company and the reason for changing its name from Facebook Inc.

Executives are considering potential budget cuts as high as 30% for the metaverse group next year, which includes the virtual worlds product Meta Horizon Worlds and its Quest virtual reality unit, according to people familiar with the talks, who asked not to be named while discussing private company plans. Cuts that high would most likely include layoffs as early as January, according to the people, though a final decision has not yet been made.

Wagner’s reporting was independently confirmed by Mike Isaac, of the New York Times, and Meghan Bobrowsky and Georgia Wells, of the Wall Street Journal, albeit in slightly different ways. While Wagner wrote it “would most likely include layoffs as early as January”, Isaac apparently confirmed the budget cuts are likely large-scale personnel cuts, which makes sense:

The cuts could come as soon as next month and amount to 10 to 30 percent of employees in the Metaverse unit, which works on virtual reality headsets and a V.R.-based social network, the people said. The numbers of potential layoffs are still in flux, they said. Other parts of the Reality Labs division develop smart glasses, wristbands and other wearable devices. The total number of employees in Reality Labs could not be learned.

Alan Dye is just about to join Reality Labs. I wonder if this news comes as a fun surprise for him.

At Meta Connect a few months ago, the company spent basically the entire time on augmented reality glasses, but it swore up and down it was all related to its metaverse initiatives:

We’re hard at work advancing the state of the art in augmented and virtual reality, too, and where those technologies meet AI — that’s where you’ll find the metaverse.

The metaverse is whatever Meta needs it to be in order to justify its 2021 rebrand.

Our vision for the future is a world where anyone anywhere can imagine a character, a scene, or an entire world and create it from scratch. There’s still a lot of work to do, but we’re making progress. In fact, we’re not far off from being able to create compelling 3D content as easily as you can ask Meta AI a question today. And that stands to transform not just the imagery and videos we see on platforms like Instagram and Facebook, but also the possibilities of VR and AR, too.

You know, whenever I am unwinding and chatting with friends after a long day at work, I always get this sudden urge to create compelling 3D content.


Lisa Jackson and Kate Adams Out at Apple, Jennifer Newstead to Join

By: Nick Heer
4 December 2025 at 22:53

Apple:

Apple today announced that Jennifer Newstead will become Apple’s general counsel on March 1, 2026, following a transition of duties from Kate Adams, who has served as Apple’s general counsel since 2017. She will join Apple as senior vice president in January, reporting to CEO Tim Cook and serving on Apple’s executive team.

In addition, Lisa Jackson, vice president for Environment, Policy, and Social Initiatives, will retire in late January 2026. The Government Affairs organization will transition to Adams, who will oversee the team until her retirement late next year, after which it will be led by Newstead. Newstead’s title will become senior vice president, General Counsel and Government Affairs, reflecting the combining of the two organizations. The Environment and Social Initiatives teams will report to Apple chief operating officer Sabih Khan.

What will tomorrow bring, I wonder?

Newstead has spent the past year working closely with Joel Kaplan, and fighting the FTC’s case against Meta — successfully, I should add. Before that, she was a Trump appointee at the U.S. State Department. Well positioned, then, to fight Apple’s U.S. antitrust lawsuit against a second-term Trump government that has successfully solicited Apple’s money.

John Voorhees, MacStories:

Although Apple doesn’t say so in its press release, it’s pretty clear that a few things are playing out among its executive ranks. First, a large number of them are approaching retirement age, and Apple is transitioning and changing roles internally to account for those who are retiring. Second, the company is dealing with departures like Alan Dye’s and what appears to be the less-than-voluntary retirement of John Giannandrea. Finally, the company is reducing the number of Tim Cook’s direct reports, which is undoubtedly to simplify the transition to a new CEO in the relatively near future.

A careful reader will notice Apple’s newsroom page currently has press releases for these departures and, from earlier this week, John Giannandrea’s, but there is nothing about Alan Dye’s. In fact, even in the statement quoted by Bloomberg, Dye is not mentioned. In fairness, Adams, Giannandrea, and Jackson all have bios on Apple’s leadership page. Dye’s was removed between 2017 and 2018.

Starting to think Mark Gurman might be wrong about that FT report.


Waymo Data Indicates Dramatic Safety Improvements Over Human Drivers, So It Is Making Its Cars More Human

By: Nick Heer
4 December 2025 at 05:44

Jonathan Slotkin, a surgeon and venture capital investor, wrote for the New York Times about data released by Waymo indicating impressive safety improvements over human drivers through June 2025:

If Waymo’s results are indicative of the broader future of autonomous vehicles, we may be on the path to eliminating traffic deaths as a leading cause of mortality in the United States. While many see this as a tech story, I view it as a public health breakthrough.

[…]

There’s a public health imperative to quickly expand the adoption of autonomous vehicles. […]

We should be skeptical of all self-reported stats, but these figures look downright impressive.

Slotkin responsibly notes several caveats, though he neglects to mention the specific cities in which Waymo operates: Austin, Los Angeles, Phoenix, and San Francisco. These are warm cities with relatively low annual precipitation, almost none of which is ever snow. Slotkin’s enthusiasm for widespread adoption should be tempered somewhat by this narrow range of climates. Still, Waymo’s data is compelling. These cars seem to crash less often than those driven by people in the same cities and, in particular, avoid causing serious injuries at an impressive rate.

It is therefore baffling to me that Waymo appears to be treating this as a cushion for experimentation.

Katherine Bindley, in a Wall Street Journal article published the very same day as Slotkin’s Times piece:

The training wheels are off. Like the rule-following nice guy who’s tired of being taken advantage of, Waymos are putting their own needs first. They’re bending traffic laws, getting impatient with pedestrians and embracing the idea that when it comes to city driving, politeness doesn’t pay: It’s every car for itself.

[…]

Waymo has been trying to make its cars “confidently assertive,” says Chris Ludwick, a senior director of product management with Waymo, which is owned by Google parent Alphabet. “That was really necessary for us to actually scale this up in San Francisco, especially because of how busy it gets.”

A couple years ago, Tesla’s erroneously named “Full Self-Driving” feature began cruising through crosswalks if it judged it could pass a crossing pedestrian in time, and I wrote:

Advocates of autonomous vehicles often say increased safety is one of its biggest advantages over human drivers. Compliance with the law may not be the most accurate proxy for what constitutes safe driving, but not to a disqualifying extent. Right now, it is the best framework we have, and autonomous vehicles should follow the law. That should not be a controversial statement.

I stand by that. A likely reason for Waymo’s impressive data is that its cars behave with caution and deference. Substituting that with “confidently assertive” driving is a move in entirely the wrong direction. It should not roll through stop signs, even if its systems understand nobody is around. It should not mess up the order of an all-way stop intersection. I have problems with the way traffic laws are written, but it is not up to one company in California to develop a proprietary interpretation. Just follow the law.

Slotkin:

This is not a call to replace every vehicle tomorrow. For one thing, self-driving technology is still expensive. Each car’s equipment costs $100,000 beyond the base price, and Waymo doesn’t yet sell cars for personal use. Even once that changes, many Americans love driving; some will resist any change that seems to alter that freedom.

[…]

There is likely to be some initial public trepidation. We do not need everyone to use self-driving cars to realize profound safety gains, however. If 30 percent of cars were fully automated, it might prevent 40 percent of crashes, as autonomous vehicles both avoid causing crashes and respond better when human drivers err. Insurance markets will accelerate this transition, as premiums start to favor autonomous vehicles.

Slotkin is entirely correct in writing that “Americans love driving” — the U.S. National Household Travel Survey, last conducted in 2022, found 90.5% of commuters said they primarily used a car of some kind (table 7-2, page 50). 4.1% said they used public transit, 2.9% said they walked, and just 2.5% said they chose another mode of transportation, a category in which taxicabs are grouped along with bikes and motorcycles. Those figures were about the same in 2017, though with an unfortunate decline in the number of transit commuters. Commuting is not the only reason for travelling, of course, but this suggests to me that even if every taxicab ride was in an autonomous Waymo, there would still be a massive gap to achieve that 30% adoption rate Slotkin wants. And, if insurance companies begin incentivizing autonomous vehicles, it really means rich people will reap the reward of being able to buy a new car.
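Slotkin's better-than-proportional claim (30 percent of cars preventing 40 percent of crashes) is at least arithmetically plausible. Under a crude mixing model of my own, not something from the article: most crashes involve two vehicles, so a 30 percent fleet share puts an autonomous car into roughly half of all two-car interactions.

```python
# Back-of-envelope sketch (my assumptions, not Slotkin's model): AVs mix
# uniformly into traffic, crashes arise from two-vehicle interactions, and
# an AV can defuse a crash whether it would have caused the error or been
# on the receiving end of someone else's.
def crashes_prevented(av_share, defusal_rate):
    """Fraction of two-car crashes avoided, given the autonomous share of
    the fleet and the fraction of AV-involved crashes that get defused."""
    involves_an_av = 1 - (1 - av_share) ** 2
    return involves_an_av * defusal_rate

# A 30% fleet share touches 1 - 0.7**2 = 51% of two-car interactions, so
# defusing most of those lands in the ~40% range Slotkin cites.
print(round(crashes_prevented(0.30, 0.785), 2))
```

The point is only that superlinear prevention is not inherently suspicious; real crash data would depend on exposure and crash types that this toy model ignores.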

Any argument about road safety has to be more comprehensive than what Slotkin is presenting in this article. Regardless of how impressive Waymo’s stats are, it is a vision of the future that is an individualized solution to a systemic problem. I have no specialized knowledge in this area, but I am fascinated by it. I read about this stuff obsessively. The things I want to see are things everyone can benefit from: improvements to street design that encourage drivers to travel at lower speeds, wider sidewalks making walking more comfortable, and generous wheeling infrastructure for bicycles, wheelchairs, and scooters. We can encourage the adoption of technological solutions, too; if this data holds up, it would seem welcome. But we can do so much better for everyone, and on a more predictable timeline.

This is, as Slotkin writes, a public health matter. Where I live, record numbers of people are dying, in part because more people than ever are driving bigger and heavier vehicles with taller fronts while they are distracted. Many of those vehicles will still be on the road in twenty years’ time, even if we accelerate the adoption pace of more autonomous vehicles. We do not need to wait for a headline-friendly technological upgrade. There are boring things cities can start doing tomorrow that would save lives.


Alan Dye Out at Apple

By: Nick Heer
3 December 2025 at 19:58

Mark Gurman, Bloomberg:

Meta Platforms Inc. has poached Apple Inc.’s most prominent design executive in a major coup that underscores a push by the social networking giant into AI-equipped consumer devices.

The company is hiring Alan Dye, who has served as the head of Apple’s user interface design team since 2015, according to people with knowledge of the matter. Apple is replacing Dye with longtime designer Stephen Lemay, according to the people, who asked not to be identified because the personnel changes haven’t been announced.

Big week for changes in Apple leadership.

I am sure more will trickle out about this, but one thing notable to me is that Lemay has been a software designer for over 25 years at Apple. Dye, on the other hand, came from marketing and print design. I do not want to put too much weight on that — someone can be a sufficiently talented multidisciplinary designer — but I am curious to see what Lemay might do in a more senior role.

Admittedly I also have some (perhaps morbid) curiosity about what Dye will do at Meta.

One more note from Gurman’s report:

Dye had taken on a more significant role at Apple after Ive left, helping define how the company’s latest operating systems, apps and devices look and feel. The executive informed Apple this week that he’d decided to leave, though top management had already been bracing for his departure, the people said. Dye will join Meta as chief design officer on Dec. 31.

Let me get this straight: Dye personally launches an overhaul of Apple’s entire visual interface language, then leaves. Is that a good sign for its reception, either internally or externally?


Microsoft Lowers A.I. Software Growth Targets

By: Nick Heer
3 December 2025 at 19:01

Benj Edwards, Ars Technica:

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

Based on Edwards’ summary — I still have no interest in paying for the Information — it sounds like this mostly affects sales of A.I. “agents”, a riskier technology proposition for businesses. This sounds to me like more concrete evidence of a plateau in corporate interest than the surveys reported on by the Economist.


‘Mad Men’ on HBO Max, in 4K, Somehow Lacking VFX

By: Nick Heer
3 December 2025 at 01:11

Todd Vaziri:

As far as I can tell, Paul Haine was the first to notice something weird going on with HBO Max’ presentation. In one of season one’s most memorable moments, Roger Sterling barfs in front of clients after climbing many flights of stairs. As a surprise to Paul, you can clearly see the pretend puke hose (that is ultimately strapped to the back side of John Slattery’s face) in the background, along with two techs who are modulating the flow. Yeah, you’re not supposed to see that.

It appears as though this represents the original photography, unaltered before digital visual effects got involved. Somehow, this episode (along with many others) does not include all the digital visual effects that were in the original broadcasts and home video releases. It’s a bizarro mistake for Lionsgate and HBO Max to make and not discover until after the show was streaming to customers.

Eric Vilas-Boas, Vulture:

How did this happen? Apparently, this wasn’t actually HBO Max’s fault — the streamer received incorrect files from Lionsgate Television, a source familiar with the exchange tells Vulture. Lionsgate is now in the process of getting HBO Max the correct files, and the episodes will be updated as soon as possible.

It just feels clumsy and silly for Lionsgate to supply the wrong files in the first place, and for nobody at HBO Max to verify the files were correct. An amateur mistake, frankly, for an ostensibly premium service costing U.S. $11–$23 per month. If I were king for a day, it would be illegal to sell or stream a remastered version of something — a show, an album, whatever — without the original being available alongside it.


John Giannandrea Out at Apple

By: Nick Heer
2 December 2025 at 20:16

Apple:

Apple today announced John Giannandrea, Apple’s senior vice president for Machine Learning and AI Strategy, is stepping down from his position and will serve as an advisor to the company before retiring in the spring of 2026. Apple also announced that renowned AI researcher Amar Subramanya has joined Apple as vice president of AI, reporting to Craig Federighi. Subramanya will be leading critical areas, including Apple Foundation Models, ML research, and AI Safety and Evaluation. The balance of Giannandrea’s organization will shift to Sabih Khan and Eddy Cue to align closer with similar organizations.

When Apple hired Giannandrea from Google in 2018, the New York Times called it a “major coup”, given that Siri was “less effective than its counterparts at Google and Amazon”. The world has changed a lot in the years since, though: Siri is now also worse than a bunch of A.I. products. Of course, Giannandrea’s role at Apple was not limited to Siri. He spent time on the Project Titan autonomous car, which was cancelled early last year, before moving to generative A.I. projects. The first results of that effort were shown at WWDC last year; the most impressive features have yet to ship.

I feel embarrassed and dumb for hoping Giannandrea would help shake the company out of its bizarre Siri stupor. Alas, he is now on the Graceful Executive Exit Express, where he gets to spend a few more months at Apple in a kind of transitional capacity — you know the drill. Maybe Subramanya will help move the needle. Maybe this ex-Googler will make it so. Maybe I, Charlie Brown, will get to kick that football.


⌥ A Questionable A.I. Plateau

By: Nick Heer
2 December 2025 at 05:33

The Economist:

On November 20th American statisticians released the results of a survey. Buried in the data is a trend with implications for trillions of dollars of spending. Researchers at the Census Bureau ask firms if they have used artificial intelligence “in producing goods and services” in the past two weeks. Recently, we estimate, the employment-weighted share of Americans using AI at work has fallen by a percentage point, and now sits at 11% (see chart 1). Adoption has fallen sharply at the largest businesses, those employing over 250 people. Three years into the generative-AI wave, demand for the technology looks surprisingly flimsy.

[…]

Even unofficial surveys point to stagnating corporate adoption. Jon Hartley of Stanford University and colleagues found that in September 37% of Americans used generative AI at work, down from 46% in June. A tracker by Alex Bick of the Federal Reserve Bank of St Louis and colleagues revealed that, in August 2024, 12.1% of working-age adults used generative AI every day at work. A year later 12.6% did. Ramp, a fintech firm, finds that in early 2025 AI use soared at American firms to 40%, before levelling off. The growth in adoption really does seem to be slowing.

I am skeptical of the metrics used by the Economist to produce this summary, in part because they are all over the place, and also because they are mostly surveys. I am not sure people always know they are using a generative A.I. product, especially when those features are increasingly just part of the modern office software stack.

While the Economist has an unfortunate allergy to linking to its sources, I wanted to track them down because a fuller context is sometimes more revealing. I believe the U.S. Census data is the Business Trends and Outlook Survey though I am not certain because its charts are just plain, non-interactive images. In any case, it is the Economist’s own estimate of falling — not stalling — adoption by workers, not an estimate produced by the Census Bureau, which is curious given two of its other sources indicate more of a plateau instead of a decline.

The Hartley, et al. survey is available here and contains some fascinating results other than the specific figures highlighted by the Economist — in particular, that the construction industry has the fourth-highest adoption of generative A.I., that Gemini is shown in Figure 9 as more popular than ChatGPT even though the text on page 7 indicates the opposite, and that the word “Microsoft” does not appear once in the entire document. I have some admittedly uninformed and amateur questions about its validity. At any rate, this is the only source the Economist cites which indicates a decline.

The data point attributed to the tracker operated by the Federal Reserve Bank of St. Louis is curious. The Economist notes “in August 2024, 12.1% of working-age adults used generative A.I. every day at work. A year later 12.6% did”, but I am looking at the dashboard right now, and it says the share using generative A.I. daily at work is 13.8%, not 12.6%. In the same time period, the share of people using it “at least once last week” jumped from 36.1% to 46.9%. I have no idea where that 12.6% number came from.

Finally, Ramp’s data is easy enough to find. Again, I have to wonder about the Economist’s selective presentation. If you switch the chart from an overall view to a sector-based view, you can see adoption of paid subscriptions has more than doubled in many industries compared to October last year. This is true even in “accommodation and food services”, where I have to imagine use cases are few and far between.

After finding the actual source of the Economist’s data, it has left me skeptical of the premise of this article. However, plateauing interest — at least for now — makes sense to me on a gut level. There is a ceiling to work one can entrust to interns or entry-level employees, and that is approximately similar for many of today’s A.I. tools. There are also sector-level limits. Consider Ramp’s data showing high adoption in the tech and finance industries, with considerably less in sectors like healthcare and food services. (Curiously, Ramp says only 29% of the U.S. construction industry has a subscription to generative A.I. products, while Hartley, et al. says over 40% of the construction industry is using it.)

I commend any attempt to figure out how useful generative A.I. is in the real world. One of the problems with this industry right now is that its biggest purveyors are not public companies and, therefore, have fewer disclosure requirements. Like any company, they are incentivized to inflate their importance, but we have little understanding of how much they are exaggerating. If you want to hear some corporate gibberish, OpenAI interviewed executives at companies like Philips and Scania about their use of ChatGPT, but I do not know what I gleaned from either interview — something about experimentation and vague stuff about people being excited to use it, I suppose. It is not very compelling to me. I am not in the C-suite, though.

The biggest public A.I. firm is arguably Microsoft. It has rolled out Copilot to Windows and Office users around the world. Again, however, its press releases leave much to be desired. Levi Strauss employees, Microsoft says, “report the devices and operating system have led to significant improvements in speed, reliability and data handling, with features like the Copilot key helping reduce the time employees spend searching and free up more time for creating”. Sure. In another case study, Microsoft and Pantone brag about the integration of a colour palette generator that you can use with words instead of your eyes.

Microsoft has every incentive to pretend Copilot is a revolutionary technology. For people actually doing the work, however, its ever-nagging presence might be one of many nuisances getting in the way of the job that person actually knows how to do. A few months ago, the company replaced the familiar Office portal with a Copilot prompt box. It is still little more than a thing I need to bypass to get to my work.

All the stats and apparent enthusiasm about A.I. in the workplace are, as far as I can tell, a giant mess. A problem with this technology is that the ways in which it is revolutionary are often not very useful, its practical application in a work context is a mixed bag that depends on industry and role, and its hype encourages otherwise respectable organizations to suggest their proximity to its promised future.

The Economist being what it is, much of this article revolves around the insufficiently realized efficiency and productivity gains, and that is certainly something for business-minded people to think about. But there are more fundamental issues with generative A.I. to struggle with. It is a technology built on a shaky foundation. It shrinks the already-scant field of entry-level jobs. Its results are unpredictable and can validate harm. The list goes on, yet it is being loudly inserted into our SaaS-dominated world as a top-down mandate.

It turns out A.I. is not magic dust you can sprinkle on a workforce to double their productivity. CEOs might be thrilled by having all their email summarized, but the rest of us do not need that. We need things like better balance of work and real life, good benefits, and adequate compensation. Those are things a team leader cannot buy with a $25-per-month-per-seat ChatGPT business license.

An App Named Alan

By: Nick Heer
2 December 2025 at 04:20

Tyler Hall:

Maybe it’s because my eyes are getting old or maybe it’s because the contrast between windows on macOS keeps getting worse. Either way, I built a tiny Mac app last night that draws a border around the active window. I named it “Alan”.

A good, cheeky name. The results are not what I would call beautiful, but that is not the point, is it? It works well. I wish it did not feel understandable for there to be an app that draws a big border around the currently active window. That should be something made sufficiently obvious by the system.

Unfortunately, this is a problem plaguing the latest versions of MacOS and Windows alike, which is baffling to me. The bar for what constitutes acceptable user interface design seems to have fallen low enough that it is tripping everyone at the two major desktop operating system vendors.


Threads Continues to Reward Rage Bait

By: Nick Heer
2 December 2025 at 01:07

Hank Green was not getting a lot of traction on a promotional post on Threads about a sale on his store. He got just over thirty likes, which does not sound awful, until you learn that was over the span of seven hours and across Green’s following of 806,000 accounts on Threads.

So he tried replying to rage bait with basically the same post, and that was far more successful. But, also, it has some pretty crappy implications:

That’s the signal that Threads is taking from this: Threads is like oh, there’s a discussion going on.

It’s 2025! Meta knows that “lots of discussion” is not a surrogate for “good things happening”!

I assume the home feed ranking systems are similar for Threads and Instagram — though they might not be — and I cannot tell you how many times my feed is packed with posts from many days to a week prior. So many businesses I frequent use it as a promotional tool for time-bound things I learn about only afterward. The same thing is true of Stories, since they are sorted based on how frequently you interact with an account.

Everyone is allowed one conspiracy theory, right? Mine is that a primary reason Meta is hostile to reverse-chronological feeds is because it requires businesses to buy advertising. I have no proof to support this, but it seems entirely plausible.


⌥ Moraine Luck

By: Nick Heer
1 December 2025 at 04:11

You have seen Moraine Lake. Maybe it was on a postcard or in a travel brochure, or it was on Reddit, or in Windows Vista, or as part of a “Best of California” demo on Apple’s website. Perhaps you were doing laundry in Lucerne. But I am sure you have seen it somewhere.

Moraine Lake is not in California — or Switzerland, for that matter. It is right here in Alberta, between Banff and Lake Louise, and I have been lucky enough to visit many times. One time I was particularly lucky, in a way I only knew in hindsight. I am not sure the confluence of events occurring in October 2019 is likely to be repeated for me.

In 2019, the road up to the lake would be open to the public from May until about mid-October, though the closing day would depend on when it was safe to travel. This is one reason why so many pictures of it have only the faintest hint of snow capping the mountains behind — it is only really accessible in summer.

I am not sure why we decided to head up to Lake Louise and Moraine Lake that Saturday. Perhaps it was just an excuse to get out of the house. It was just a few days before the road was shut for the season.

We visited Lake Louise first and it was, you know, just fine. Then we headed to Moraine.

I posted a higher-quality version of this on my Glass profile.
A photo of Moraine Lake, Alberta, frozen with chunks of ice and rocks on its surface.

Walking from the car to the lakeshore, we could see its surface was that familiar blue-turquoise, but it was entirely frozen. I took a few images from the shore. Then we realized we could just walk on it, as did the handful of other people who were there. This is one of several photos I took from the surface of the lake, the glassy ice reflecting that famous mountain range in the background.

I am not sure I would be able to capture a similar image today. Banff and Lake Louise have received more visitors than ever in recent years, to the extent private vehicles are no longer allowed to travel up to Moraine Lake. A shuttle bus is now required. The lake also does not reliably freeze at an accessible time and, when it does, it can be covered in snow or the water line may have receded. I am not arguing this is an impossible image to create going forward. I just do not think I am likely to see it this way again.

I am very glad I remembered to bring my camera.

OpenAI’s House Counsel to Be Deposed Over Deleted Pirated Material

By: Nick Heer
29 November 2025 at 19:27

Winston Cho, the Hollywood Reporter:

To rewind, authors and publishers have gained access to Slack messages between OpenAI’s employees discussing the erasure of the datasets, named “books 1 and books 2.” But the court held off on whether plaintiffs should get other communications that the company argued were protected by attorney-client privilege.

In a controversial decision that was appealed by OpenAI on Wednesday, U.S. District Judge Ona Wang found that OpenAI must hand over documents revealing the company’s motivations for deleting the datasets. OpenAI’s in-house legal team will be deposed.

Wang’s decision (PDF), to the extent I can read it as a layperson, examines OpenAI’s shifting story about why it erased the books1 and books2 data sets — apparently, the only time possible training materials were deleted.

I am not sure it has yet been proven OpenAI trained its models on pirated books. Anthropic settled a similar suit in September, and Meta and Apple are facing similar accusations. For practical purposes, however, it is trivial to show it used pirated data in general: if you have access to its Sora app, enter any prompt followed by the word “camrip”.

What is a camrip?, a strictly law-abiding person might ask. It is a label added to a movie pirated in the old-fashioned way: by pointing a video camera at the screen in a theatre. As a result, these videos have a distinctive look and sound which is reproduced perfectly by Sora. It is very difficult for me to see a way in which OpenAI could have trained this model to understand what a camrip is without feeding it a bunch of them, and I do not know of a legitimate source for such videos.

⌥ Permalink

Internet Archive Wayback Machine Link Fixer

By: Nick Heer
28 November 2025 at 05:05

The Internet Archive released a WordPress plugin not too long ago:

Internet Archive Wayback Machine Link Fixer is a WordPress plugin designed to combat link rot—the gradual decay of web links as pages are moved, changed, or taken down. It automatically scans your post content — on save and across existing posts — to detect outbound links. For each one, it checks the Internet Archive’s Wayback Machine for an archived version and creates a snapshot if one isn’t available.

Via Michael Tsai:

The part where it replaces broken links with archive links is implemented in JavaScript. I like that it doesn’t modify the post content in your database. It seems safe to install the plug-in without worrying about it messing anything up. However, I had kind of hoped that it would fix the links as part of the PHP rendering process. Doing it in JavaScript means that the fixed links are not available in the actual HTML tags on the page. And the data that the JavaScript uses is stored in an invisible <div> under the attribute data-iawmlf-post-links, which makes the page fail validation.

I love the idea of this plugin, but I do not love this implementation. I think I understand why it works this way: for the nondestructive property mentioned by Tsai, and also to account for its dependence on a third-party service of varying reliability. I would love to see a demo of this plugin in action.
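
The “checks the Internet Archive’s Wayback Machine for an archived version” step can be sketched against the Archive’s public availability API. To be clear, this is a minimal sketch of how such a check could work, not the plugin’s actual code; the endpoint is the documented archive.org/wayback/available one, and the response shape shown in the test is the documented one as I understand it:

```python
import json
import urllib.parse

AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_query(url: str) -> str:
    # Build a query URL for the Wayback Machine availability endpoint.
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": url})

def closest_snapshot(response_json: str):
    # Parse an availability response; return the closest snapshot's URL,
    # or None when no archived copy is reported as available.
    data = json.loads(response_json)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap.get("url")
    return None
```

Creating a snapshot when none exists would be a separate request to the Save Page Now service, which this sketch leaves out.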

⌥ Permalink

Investigating a Possible Scammer in Journalism’s A.I. Era

By: Nick Heer
27 November 2025 at 05:14

Nicholas Hune-Brown, the Local:

Every media era gets the fabulists it deserves. If Stephen Glass, Jayson Blair and the other late 20th century fakers were looking for the prestige and power that came with journalism in that moment, then this generation’s internet scammers are scavenging in the wreckage of a degraded media environment. They’re taking advantage of an ecosystem uniquely susceptible to fraud—where publications with prestigious names publish rickety journalism under their brands, where fact-checkers have been axed and editors are overworked, where technology has made falsifying pitches and entire articles trivially easy, and where decades of devaluing journalism as simply more “content” have blurred the lines so much it can be difficult to remember where they were to begin with.

This is likely not the first story you have read about a freelancer managing to land bylines in prestigious publications thanks to a dependency on A.I. tools, but it is one told very well.

⌥ Permalink

Web Development Tip: Disable Pointer Events on Link Images

By: Nick Heer
27 November 2025 at 04:45

Good tip from Jeff Johnson:

My business website has a number of “Download on the App Store” links for my App Store apps. Here’s an example of what that looks like:

[…]

The problem is that Live Text, “Select text in images to copy or take action,” is enabled by default on iOS devices (Settings → General → Language & Region), which can interfere with the contextual menu in Safari. Pressing down on the above link may select the text inside the image instead of selecting the link URL.

I love the Live Text feature, but it often conflicts with graphics like these. There is a good, simple, two-line CSS trick for web developers that should cover most situations. Also, if you rock a user stylesheet — and I think you should — it seems to work fine as a universal solution. Any issues I have found have been minor and not worth noting. I say give it a shot.
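
For reference, here is a hedged reconstruction of what the two-line trick looks like. The second selector is the one quoted in the update note, and the pointer-events rule is what the post title describes; check Johnson’s post for his exact rules:

```css
/* A sketch of the two-line trick: keep image links clickable while
   stopping presses from landing on the image itself (and Live Text). */
a:has(> img) { display: inline-block; }
a > img { pointer-events: none; }
```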

Update: Adding Johnson’s CSS to a user stylesheet mucks up the layout of Techmeme a little bit. You can exclude it by adding “div:not(.ii) >” before “a:has(> img) { display: inline-block; }”.

⌥ Permalink

‘The iPad’s Software Problem Is Permanent’

By: Nick Heer
26 November 2025 at 05:05

Quinn Nelson:

[…] at a moment when the Mac has roared back to the centre of Apple’s universe, the iPad feels closer than ever to fulfilling its original promise. Except it doesn’t, not really, because while the iPad has gained windowing and external display support, pro apps, all the trappings of a “real computer”, underneath it all, iPadOS is still a fundamentally mobile operating system with mobile constraints baked into its very DNA.

Meanwhile, the Mac is rumoured to be getting everything the iPad does best: touchscreens, OLED displays, thinner designs.

There are things I quibble with in Nelson’s video, including the above-quoted comparison to mere rumours about the Mac. The rest of the video is more compelling as it presents comparisons with the same or similar software on each platform in real-world head-to-head matches.

Via Federico Viticci, MacStories:

I’m so happy that Apple seems to be taking iPadOS more seriously than ever this year. But now I can’t help but wonder if the iPad’s problems run deeper than windowing when it comes to getting serious work done on it.

Apple’s post-iPhone platforms are only as good as Apple will allow them to be. I am not saying it needs to be possible to swap out Bluetooth drivers or monkey around with low-level code, but without more flexibility, platforms like the iPad and Vision Pro are destined to progress only at the rate Apple says is acceptable, and with the third-party apps it says are permissible. These are apparently the operating systems for the future of computers. They are not required to have similar limitations to the iPhone, but they do anyway. Those restrictions are holding back the potential of these platforms.

⌥ Permalink

Polarization in the United States Has Become the World’s Side Hustle

By: Nick Heer
25 November 2025 at 04:59

Marina Dunbar, the Guardian:

Many of the most influential personalities in the “Make America great again” (Maga) movement on X are based outside of the US, including Russia, Nigeria and India, a new transparency feature on the social media site has revealed.

The new tool, called “about this account”, became available on Friday to users of the Elon Musk-owned platform. It allows anyone to see where an account is located, when it joined the platform, how often its username has been changed, and how the X app was downloaded.

This is a similar approach to adding labels or notes to tweets containing misinformation in that it is adding more speech and context. It is more automatic, but the function and intent are comparable, which means Musk’s hobbyist P.R. team must be all worked up. But I checked, and none seem particularly bothered. Maybe they actually care about trust and safety now, or maybe they are lying hacks.

Mike Masnick, Techdirt:

For years, Matt Taibbi, Michael Shellenberger, and their allies have insisted that anyone working on these [trust and safety] problems was part of a “censorship industrial complex” designed to silence political speech. Politicians like Ted Cruz and Jim Jordan repeated these lies. They treated trust & safety work as a threat to democracy itself.

Then Musk rolled out one basic feature, and within hours proved exactly why trust & safety work existed in the first place.

Jason Koebler, 404 Media, has been covering the monetization of social media:

This has created an ecosystem of side hustlers trying to gain access to these programs and YouTube and Instagram creators teaching people how to gain access to them. It is possible to find these guide videos easily if you search for things like “monetized X account” on YouTube. Translating that phrase and searching in other languages (such as Hindi, Portuguese, Vietnamese, etc) will bring up guides in those languages. Within seconds, I was able to find a handful of YouTubers explaining in Hindi how to create monetized X accounts; other videos on the creators’ pages explain how to fill these accounts with AI-generated content. These guides also exist in English, and it is increasingly popular to sell guides to make “AI influencers,” and AI newsletters, Reels accounts, and TikTok accounts regardless of the country that you’re from.

[…]

Americans are being targeted because advertisers pay higher ad rates to reach American internet users, who are among the wealthiest in the world. In turn, social media companies pay more money if the people engaging with the content are American. This has created a system where it makes financial sense for people from the entire world to specifically target Americans with highly engaging, divisive content. It pays more.

The U.S. market is a larger audience, too. But those of us in rich countries outside the U.S. should not get too comfortable; I found plenty of guides similar to the ones shown by Koebler for targeting Australia, Canada, Germany, New Zealand, and more. Worrisome — especially if you, say, are somewhere with an electorate trying to drive the place you live off a cliff.

Update: Several X accounts purporting to be Albertans supporting separatism appear to be from outside Canada, including a “Concerned 🍁 Mum”, “Samantha”, “Canada the Illusion”, and this “Albertan” all from the United States, and a smaller account from Laos. I tried to check more, but X’s fragile servers are aggressively rate-limited.

I do not think people from outside a country are forbidden from offering an opinion on what is happening within it. I would be a pretty staggering hypocrite if I thought that. Nor do I think we should automatically assume people who are stoking hostile politics on social media are necessarily external or bots. It is more like a reflection of who we are now, and how easily that can be exploited.

⌥ Permalink

Meta’s Accounting of Its Louisiana Data Centre ‘Strains Credibility’

By: Nick Heer
25 November 2025 at 03:54

Jonathan Weil, Wall Street Journal:

It seems like a marvel of financial engineering: Meta Platforms is building a $27 billion data center in Louisiana, financed with debt, and neither the data center nor the debt will be on its own balance sheet.

That outcome looks too good to be true, and it probably is.

The phrase “marvel of financial engineering” does not seem like a compliment. In addition to the evidence from Weil’s article, Meta is taking advantage of a tax exemption created by Louisiana’s state legislature. But, by its own argument, it is merely a user of this data centre.

Also, colour me skeptical this data centre will truly be “the size of Manhattan” before the bubble bursts, despite the disruption to life in the area.

Update: Paris Martineau points to Weil’s bio noting he was “the first reporter to challenge Enron’s accounting practices”.

⌥ Permalink

A.I. Mania Looks and Feels Bigger Than the .Com Bubble

By: Nick Heer
25 November 2025 at 03:41

Fred Vogelstein, Crazy Stupid Tech — which, again, is a compliment:

We’re not only in a bubble but one that is arguably the biggest technology mania any of us have ever witnessed. We’re even back reinventing time. Back in 1999 we talked about internet time, where every year in the new economy was like a dog year – equivalent to seven years in the old.

Now VCs, investors and executives are talking about AI dog years – let’s just call them mouse years – which is internet time divided by five? Or is it by 11? Or 12? Sure, things move way faster than they did a generation ago. But by that math one year today now equals 35 years in 1995. Really?

A sobering piece that, unfortunately, is somewhat undercut since it lacks a single mention of layoffs, jobs, employment, or any other indication that this bubble will wreck the lives of people far outside its immediate orbit. In fairness, few of the related articles linked at the bottom mention that, either. Articles in Stratechery, the Brookings Institution, and the New York Times want you to think a bubble is just a sign of building something new and wonderful. A Bloomberg newsletter mentions layoffs only in the context of changing odds in prediction markets — I chuckled — while M.G. Siegler notes all the people who are being laid off while new A.I. hires get multimillion-dollar employment packages. Maybe all the pain and suffering that is likely to result from the implosion of this massive sector is too obvious to mention for the MBA and finance types. I think it is worth stating, though, not least because it acknowledges other people are worth caring about at least as much as innovation and growth and all that stuff.

⌥ Permalink

Our mixed assortment of DNS server software (as of December 2025)

By: cks
7 December 2025 at 04:12

Without deliberately planning it, we've wound up running an assortment of DNS server software on an assortment of DNS servers. A lot of this involves history, so I might as well tell the story of that history in the process. This starts with our three sets of DNS servers: our internal DNS master (with a duplicate) that holds both the internal and external views of our zones, our resolving DNS servers (which use our internal zones), and our public authoritative DNS server (carrying our external zones, along with various relics of the past). These days we also have an additional resolving DNS server that resolves from outside our networks and so gives the people who can use it an external view of our zones.

In the beginning we ran Bind on everything, as was the custom in those days (and I suspect we started out without a separation between the three types of DNS servers, but that predates my time here), and I believe all of the DNS servers were Solaris. Eventually we moved the resolving DNS servers and the public authoritative DNS server to OpenBSD (and the internal DNS master to Ubuntu), still using Bind. Then OpenBSD switched which nameservers they liked from Bind to Unbound and NSD, so we went along with that. Our authoritative DNS server had a relatively easy NSD configuration, but our resolving DNS servers presented some challenges and we wound up with a complex Unbound plus NSD setup. Recently we switched our internal resolvers to using Bind on Ubuntu, and then we switched our public authoritative DNS server from OpenBSD to Ubuntu but kept it still with NSD, since we already had a working NSD configuration for it.

This has wound up with us running the following setups:

  • Our internal DNS masters run Bind in a somewhat complex split horizon configuration.

  • Our internal DNS resolvers run Bind in a simpler configuration where they act as internal authoritative secondary DNS servers for our own zones and as general resolvers.

  • Our public authoritative DNS server (and its hot spare) run NSD as an authoritative secondary, doing zone transfers from our internal DNS masters.

  • We have an external DNS resolver machine that runs Unbound in an extremely simple configuration. We opted to build this machine with Unbound because we didn't need it to act as anything other than a pure resolver, and Unbound is simple to set up for that.

At one level, this is splitting our knowledge and resources among three pieces of DNS server software rather than focusing on one. At another level, two out of the three are being used in quite simple setups (and we already had the NSD setup written from prior use). Our only complex configurations are all Bind based, and we've explicitly picked Bind for complex setups because we feel we understand it fairly well from long experience with it.

(Specifically, I can configure a simple Unbound resolver faster and easier than I can do the same with Bind. I'm sure there's a simple resolver-only Bind configuration, it's just that I've never built one and I have built several simple and not so simple Unbound setups.)
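
For concreteness, a pure resolver really can be this small in Unbound. This is a hypothetical sketch rather than our actual configuration; the addresses are placeholders, though the option names are standard unbound.conf ones:

```
# unbound.conf -- minimal pure resolver (placeholder addresses)
server:
    interface: 192.0.2.53
    access-control: 192.0.2.0/24 allow
    access-control: 0.0.0.0/0 refuse
    hide-identity: yes
    hide-version: yes
```

Caching and the root hints come from Unbound's built-in defaults, which is much of why a resolver-only setup takes so little work.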

Getting out of being people's secondary authoritative DNS server is hard

By: cks
6 December 2025 at 03:28

Many, many years ago, my department operated one of the university's secondary authoritative DNS servers, which was used by most everyone with a university subdomain and as a result was listed as one of their DNS NS records. This DNS server was also the authoritative DNS server for our own domains, because this was in the era where servers were expensive and it made perfect sense to do this. At the time, departments who wanted a subdomain pretty much needed to have a Unix system administrator and probably run their own primary DNS server and so on. Over time, the university's DNS infrastructure shifted drastically, with central IT offering more and more support, and more than half a decade ago our authoritative DNS server stopped being a university secondary, after a lot of notice to everyone.

Experienced system administrators can guess what happened next. Or rather, what didn't happen next. References to our DNS server lingered in various places for years, both in the university's root zones as DNS glue records and in people's own DNS zone files as theoretically authoritative records. As late as the middle of last year, when I started grinding away on this, I believe that roughly half of our authoritative DNS server's traffic was for old zones we didn't serve and was getting DNS 'Refused' responses. The situation is much better today, after several rounds of finding other people's zones that were still pointing to us, but it's still not quite over and it took a bunch of tedious work to get this far.

(Why I care about this is that it's hard to see if your authoritative DNS server is correctly answering everything it should if things like tcpdumps of DNS traffic are absolutely flooded with bad traffic that your DNS server is (correctly) rejecting.)

In theory, what we should have done when we stopped being a university secondary authoritative DNS server was to switch the authoritative DNS server for our own domains to another name and another IP address; this would have completely cut off everyone else when we turned the old server off and removed its name from our DNS. In practice the transition was not clear-cut, because for a while we kept on being a secondary for some other university zones that have long-standing associations with the department. Also, I think we were optimistic about how responsive people would be (and how many of them we could reach).

(Also, there's a great deal of history tied up in the specific name and IP address of our current authoritative DNS server. It's been there for a very long time.)

PS: Even when no one is incorrectly pointing to us, there's clearly a background Internet radiation of external machines throwing random DNS queries at us. But that's another entry.

In Linux, filesystems can and do have things with inode number zero

By: cks
5 December 2025 at 04:19

A while back I wrote about how in POSIX you could theoretically use inode (number) zero. Not all Unixes consider inode zero to be valid; prominently, OpenBSD's getdents(2) doesn't return valid entries with an inode number of 0, and by extension, OpenBSD's filesystems won't have anything that uses inode zero. However, Linux is a different beast.

Recently, I saw a Go commit message with the interesting description of:

os: allow direntries to have zero inodes on Linux

Some Linux filesystems have been known to return valid entries with zero inodes. This new behavior also puts Go in agreement with recent glibc.

This fixes issue #76428, and the issue has a simple reproduction to create something with inode numbers of zero. According to the bug report:

[...] On a Linux system with libfuse 3.17.1 or later, you can do this easily with GVFS:

# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big

The resulting filesystem mount is in /run/user/$UID/gvfs (see the issue for the exact long path) and can be experimentally verified to have entries with inode numbers of zero (well, as reported by reading the directory). On systems using glibc 2.37 and later, you can look at this directory with 'ls' and see the zero inode numbers.

(Interested parties can try their favorite non-C or non-glibc bindings to see if those environments correctly handle this case.)
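
As one example of such a check, Python's os.scandir() exposes the inode number each directory entry reports, so a quick scan for this situation might look like the following. This is my own sketch, not anything from the issue:

```python
import os

def zero_inode_entries(path: str) -> list[str]:
    # Return the names of directory entries whose reported inode is 0.
    # On ordinary filesystems this list is empty; on an affected
    # FUSE-backed mount (e.g. gvfs with libfuse 3.17.1+) it would not be.
    with os.scandir(path) as it:
        return [entry.name for entry in it if entry.inode() == 0]
```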

That this requires glibc 2.37 is due to this glibc bug, first opened in 2010 (but rejected at the time for reasons you can read in the glibc bug) and then resurfaced in 2016 and eventually fixed in 2022 (and then again in 2024 for the thread safe version of readdir). The 2016 glibc issue has a bit of a discussion about the kernel side. As covered in the Go issue, libfuse returning a zero inode number may be a bug itself, but there are (many) versions of libfuse out in the wild that actually do this today.

Of course, libfuse (and gvfs) may not be the only Linux filesystems and filesystem environments that can create this effect. I believe there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might have the same bug as libfuse does.

(Both Go and Rust have at least one native binding to the kernel FUSE driver. I haven't looked at either to see what they do about inode numbers.)

PS: My understanding of the Linux (kernel) situation is that if you have something inside the kernel that needs an inode number and you ask the kernel to give you one (through get_next_ino(), an internal function for this), the kernel will carefully avoid giving you inode number 0. A lot of things get inode numbers this way, so this makes life easier for everyone. However, a filesystem can decide on inode numbers itself, and when it does it can use inode number 0 (either explicitly or by zeroing out the d_ino field in the getdents(2) dirent structs that it returns, which I believe is what's happening in the libfuse situation).

Some things on X11's obscure DirectColor visual type

By: cks
4 December 2025 at 03:21

The X Window System has a long-standing concept called 'visuals'; to simplify, an X visual determines how pixel values turn into colors on your screen. As I wrote about a number of years ago, these days X11 mostly uses 'TrueColor' visuals, which directly supply 8-bit values for red, green, and blue ('24-bit color'). However, X11 has a number of visual types, such as the straightforward PseudoColor indirect colormap (where every pixel value is an index into an RGB colormap; typically you'd get 8-bit pixels and 24-bit colormaps, so you could have 256 colors out of a full 24-bit gamut). One of the (now) obscure visual types is DirectColor. To quote:

For DirectColor, a pixel value is decomposed into separate RGB subfields, and each subfield separately indexes the colormap for the corresponding value. The RGB values can be changed dynamically.

(This is specific to X11; X10 had a different display color model.)

In a PseudoColor visual, each pixel's value is taken as a whole and used as an index into a colormap that gives the RGB values for that entry. In DirectColor, the pixel value is split apart into three values, one each for red, green, and blue, and each value indexes a separate colormap for that color component. Compared to a PseudoColor visual of the same pixel depth (size, eg each pixel is an 8-bit byte), you get less possible variety within a single color component and (I believe) no more colors in total.
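
A sketch of the decomposition in Python (the 8/8/8 subfield layout here is my assumption for illustration; a real X client derives the masks and shifts from the visual's red, green, and blue masks):

```python
def direct_color(pixel, red_lut, green_lut, blue_lut):
    # Split a 24-bit pixel into 8-bit R, G, B subfields, then push each
    # subfield through its own per-channel lookup table, the way a
    # DirectColor visual does.
    r = (pixel >> 16) & 0xFF
    g = (pixel >> 8) & 0xFF
    b = pixel & 0xFF
    return red_lut[r], green_lut[g], blue_lut[b]

# A TrueColor visual is the fixed special case: identity lookup tables.
IDENTITY = list(range(256))

# Changing colors by rewriting the small LUTs instead of the framebuffer,
# e.g. brightening everything by shifting each channel's table up.
BRIGHTER = [min(255, v + 32) for v in range(256)]
```

With identity tables, direct_color(0x102030, IDENTITY, IDENTITY, IDENTITY) gives back (0x10, 0x20, 0x30), which is exactly the TrueColor behavior.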

When this came up in my old entry about TrueColor and PseudoColor visuals, in a comment Aristotle Pagaltzis speculated:

[...] maybe it can be implemented as three LUTs in front of a DAC’s inputs or something where the performance impact is minimal? (I’m not a hardware person.) [...]

I was recently reminded of this old entry and when I reread that comment, an obvious realization struck me about why DirectColor might make hardware sense. Back in the days of analog video, essentially every serious sort of video connection between your computer and your display carried the red, green, and blue components separately; you can see this in the VGA connector pinouts, and on old Unix workstations these might literally be separate wires connected to separate BNC connectors on your CRT display.

If you're sending the red, green, and blue signals separately you might also be generating them separately, with one DAC per color channel. If you have separate DACs, it might be easier to feed them from separate LUTs and separate pixel data, especially back in the days when much of a Unix workstation's graphics system was implemented in relatively basic, non-custom chips and components. You can split off the bits from the raw pixel value with basic hardware and then route each color channel to its own LUT, DAC, and associated circuits (although presumably you need to drive them with a common clock).

The other way to look at DirectColor is that it's a more flexible version of TrueColor. A TrueColor visual is effectively a 24-bit DirectColor visual where the color mappings for red, green, and blue are fixed rather than variable (this is in fact how it's described in the X documentation). Making these mappings variable costs you only a tiny bit of extra memory (you need 256 bytes for each color) and might require only a bit of extra hardware in the color generation process, and it enables the program using the display to change colors on the fly with small writes to the colormap rather than large writes to the framebuffer (which, back in the days, were not necessarily very fast). For instance, if you're looking at a full screen image and you want to brighten it, you could simply shift the color values in the colormaps to raise the low values, rather than recompute and redraw all the pixels.

(Apparently DirectColor was often used with 24-bit pixels, split into one byte for each color, which is the same pixel layout as a 24-bit TrueColor visual; see eg this section of the Starlink Project's Graphics Cookbook. Also, this seems to be how the A/UX X server worked. If you were going to do 8-bit pixels, I suspect people preferred PseudoColor to DirectColor.)

These days this is mostly irrelevant and the basic simplicity of the TrueColor visual has won out. Well, what won out is PC graphics systems that followed the same basic approach of fixed 24-bit RGB color, and then X went along with it on PC hardware, which became more or less the only hardware.

(There probably was hardware with DirectColor support. While X on PC Unixes will probably still claim to support DirectColor visuals, as reported in things like xdpyinfo, I suspect that it involves software emulation. Although these days you could probably implement DirectColor with GPU shaders at basically no cost.)

Sending DMARC reports is somewhat hazardous

By: cks
3 December 2025 at 03:10

DMARC has a feature where you can request that other mail systems send you aggregate reports about the DMARC results that they observed for email claiming to be from you. If you're a large institution with a sprawling, complex, multi-party mail environment and you're considering trying to make your DMARC policy stricter, it's very useful to get as many DMARC reports from as many people as possible. Especially, 'you' (in a broad sense) probably want to get as much information from mail systems run by sub-units as possible, and if you're a sub-unit, you want to report DMARC information up to the organization so they have as much visibility into what's going on as possible.
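
The request mechanism itself is just DNS: a domain asks for aggregate reports with a rua= tag in its DMARC TXT record. A hypothetical example (the domain and mailbox here are placeholders):

```
; hypothetical DMARC record asking receivers to mail aggregate reports
_dmarc.example.edu.  IN  TXT  "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.edu"
```

Every mail system that honors rua= and sees mail claiming to be from example.edu will periodically email a report to that address, which is where the volume problems discussed below come from.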

In related news, I've been looking into making our mail system send out DMARC reports, and I had what was in retrospect a predictable learning experience:

Today's discovery: if you want to helpfully send out DMARC reports to people who ask for them and you operate even a moderate sized email system, you're going to need to use a dedicated sending server and you probably don't want to. Because a) you'll be sending a lot of email messages and b) a lot of them will bounce because people's DMARC records are inaccurate and c) a decent number of them will camp out in your mail queue because see b, they're trying to go to non-responsive hosts.

Really, all of this DMARC reporting nonsense was predictable from first (Internet) principles, but I didn't think about it and was just optimistic when I turned our reporting on for local reasons. Of course people are going to screw up their DMARC reporting information (or for spammers, just make it up), they screw everything up and DMARC data will be no exception.

(Or they take systems and email addresses out of service without updating their DMARC records.)

If you operate even a somewhat modest email system that gets a wide variety of email, as we do, it doesn't take very long to receive email from hundreds of From: domains that have DMARC records in DNS that request reports. When you generate your DMARC reports (whether once a day or more often), you'll send out hundreds of email messages to those report addresses. If you send them through your regular outgoing email system, you'll have a sudden influx of a lot of messages and you may trigger any anti-flood ratelimits you have.

Once your reporting system has upended those hundreds of reports into your mail system, your mail system has to process through them; some of them will be delivered promptly, some of them will bounce (either directly or inside the remote mail system you hand them off to), and some of them will be theoretically destined for (currently) non-responsive hosts and thus will clog up your mail queue with repeated delivery attempts.

If you're sending these reports through a general purpose mail system, your mail queue probably has a long timeout for stalled email, which is not really what you want in this case; your DMARC reports are more like 'best effort one time delivery attempt and then throw the message away' email. If this report doesn't get through and the issue is transient, you'll keep getting email with that From: domain and eventually one of your reports will go through. DMARC reports are definitely not 'gotta deliver them all' email.

So in my view, you're almost certainly going to have to be selective about what domains you send DMARC reports for. If you're considering this and you can, it may help to trawl your logs to see what domains are failing DMARC checks and pick out the ones you care about (such as, say, your organization's overall domain or domains). It's somewhat useful to report even successful DMARC results (where the email passes DMARC checks), but if you're considering acting on DMARC results, it's important to get false negatives fixed. If you want to send DMARC reports to everyone, you'll want to set up a custom mail system, perhaps on the machine that does your DMARC processing, which blasts everything out, efficiently handles potentially large queues and fast submission rates, and discards queued messages quickly (and obviously doesn't send you any bounces).

(Sending through a completely separate mail system also avoids the possibility that someone will decide to put your regular system on a blocklist because of your high rate of DMARC report email.)
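For concreteness, the addresses you send reports to come from the 'rua=' tag in the sender's published DMARC record. Here's a minimal sketch of extracting them from an already-fetched TXT record string; real code would first look up the record at _dmarc.<domain>, and this simple parsing assumes no whitespace around the '=' signs (the '!' size-limit suffix handling follows RFC 7489):

```python
# Sketch: pull the aggregate report addresses out of a DMARC TXT record.
# This is plain string parsing of an already-fetched record string.
def dmarc_rua_addresses(txt_record):
    """Return the mailto: report addresses from a DMARC record string."""
    addrs = []
    for tag in txt_record.split(";"):
        tag = tag.strip()
        if tag.startswith("rua="):
            # rua= can list several comma-separated URIs
            for uri in tag[len("rua="):].split(","):
                uri = uri.strip()
                if uri.startswith("mailto:"):
                    # drop any RFC 7489 report size limit, eg '!10m'
                    addrs.append(uri[len("mailto:"):].split("!")[0])
    return addrs
```

A record with no 'rua=' tag simply yields an empty list, which is one quick way to filter down the domains you'd even consider reporting to.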

PS: Some of those hundreds of From: domains with DMARC records that request reports will be spammer domains; I assume that putting a 'rua=' into your DMARC record makes it look more legitimate to (some) receiving systems. Spammers sending from their own domains can DKIM sign their messages, but having working reporting addresses requires extra work and extra exposure. And of course spammers often rotate through domains rapidly.

Password fields should usually have an option to show the text

By: cks
2 December 2025 at 03:46

I recently had to abruptly replace my smartphone, and because of how it happened I couldn't directly transfer data from the old phone to the new one; instead, I had to have the new phone restore itself from a cloud backup of the old phone (made on an OS version several years older than the new phone's OS). In the process, a number of passwords and other secrets fell off and I had to re-enter them. As I mentioned on the Fediverse, this didn't always go well:

I did get our work L2TP VPN to work with my new phone. Apparently the problem was a typo in one bit of one password secret, which is hard to see because of course there's no 'show the whole thing' option and you have to enter things character by character on a virtual phone keyboard I find slow and error-prone.

(Phone natives are probably laughing at my typing.)

(Some of the issue was that these passwords were generally not good ones for software keyboards.)

There are reasonable security reasons not to show passwords when you're entering them. In the old days, the traditional reason was shoulder surfing; today, we have to worry about various things that might capture the screen with a password visible. But at the same time, entering passwords and other secrets blindly is error prone, and especially these days the diagnostics of a failed password may be obscure and you might only get so many tries before bad things start happening.

(The smartphone approach of temporarily showing the last character you entered is a help but not a complete cure, especially if you're going back and forth three ways between the form field, the on-screen keyboard, and your saved or looked up copy of the password or secret.)

Partly as a result of my recent experiences, I've definitely come around to viewing those 'reveal the plain text of the password' options that some applications have as a good thing. I think a lot of applications should at least consider whether and how to do this, and how to make password entry less error prone in general. This especially applies if your application (and overall environment) doesn't allow pasting into the field (either from a password manager or by the person involved simply copying and pasting it from elsewhere, such as support site instructions).

In some cases, you might want to not even treat a 'password' field as a password (with hidden text) by default. Often things like wireless network 'passwords' or L2TP pre-shared keys are broadly known and perhaps don't need to be carefully guarded during input the way genuine account passwords do. If possible I'd still offer an option to hide the input text in whatever way is usual on your platform, but you could reasonably start the field out as not hidden.

Unfortunately, as of December 2025 I think there's no general way to do this in HTML forms in pure CSS, without JavaScript (there may be some browser-specific CSS attributes). I believe support for this is on the CSS roadmap somewhere, but that probably means at least several years before it starts being common.

(The good news is that a pure CSS system will presumably degrade harmlessly if the CSS isn't supported; the password will just stay hidden, which is no worse than today's situation with a basic form.)
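In the meantime, the JavaScript version is small: you flip the input element's type between 'password' and 'text'. A minimal sketch (the function and element names here are my own inventions, not any standard API):

```javascript
// Toggle an <input> between hidden and visible text by flipping its
// "type" attribute; returns the input for convenience.
function toggleReveal(input) {
  input.type = (input.type === "password") ? "text" : "password";
  return input;
}

// In a page you'd wire this to a "show password" button, for example:
//   document.getElementById("reveal").onclick =
//       () => toggleReveal(document.getElementById("pw"));
```

(The usual caveat applies that some password managers and browsers react to type changes in their own ways, so you'd want to test this in your target environments.)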

Go still supports building non-module programs with GOPATH

By: cks
1 December 2025 at 02:52

When Go 1.18 was released, I said that it made module mode mandatory, which I wasn't a fan of because it can break backward compatibility in practice (and switching a program to Go modules can be non-trivial). Recently on the Fediverse, @thepudds very helpfully taught me that I wasn't entirely correct and Go still sort of supports non-module GOPATH usage, and in fact according to issue 60915, the current support is going to be preserved indefinitely.

Specifically, what's preserved today (and into the future) is support for using 'go build' and 'go install' in non-module mode (with 'GO111MODULE=off'). This inherits all of the behavior of Go 1.17 and earlier, including the use of things in the program's /vendor/ area (which can be important if you made local hacks). This allows you to rebuild and modify programs that you already have a complete GOPATH environment for (with all of their direct and indirect dependencies fetched). Since Go 1.22 and later don't support the non-module version of 'go get', assembling such an environment from scratch is up to you (if, for example, you need to modify an old non-module program). If you have a saved version of a suitable earlier version of Go, using that is probably the easiest way.

(Initially I thought Go 1.17 was the latest version you could use for this, but that was wrong; you can use anything up through Go 1.21. Go 1.17 is merely the latest version where you can do this without explicitly setting 'GO111MODULE=off'.)

Of course you could just build your old non-module programs with your saved copy of Go 1.21 (if it still runs in your current OS and hardware environment), but rebuilding things with a modern version of Go has various advantages and may be required to support modern architectures and operating system versions that you're targeting. The latest versions of Go have compiler and runtime improvements and optimizations, standard library improvements, support for various more modern things in TLS and so on, and a certain number of security fixes; you'll also get better support for using 'go version -m' on your built binaries (which is useful for tracking things later).

Learning this is probably going to get me to change how I handle some of our old programs. Even if I don't update their code, rebuilding them periodically on the latest Go version to update their binaries is probably a good thing, especially if they deal with cryptography (including SSH) or HTTP things.
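Concretely, such a rebuild looks something like this (the paths and the import path are made up for illustration, and this assumes the GOPATH tree already has all dependencies and any /vendor/ contents in place):

```sh
# point the toolchain at the pre-existing GOPATH tree
export GOPATH=$HOME/old-gopath
export GO111MODULE=off

# either build in the program's source directory ...
cd "$GOPATH/src/example.org/ourtool" && go build

# ... or install by import path, as in the pre-module days
go install example.org/ourtool
```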

(In retrospect this was implied by what the Go 1.18 release notes said. In fact even at the time I didn't read enough of the release notes; in forced 'Go modules off' mode, the Go 1.18 'go get' will still get things for you. That ability was removed later, in Go 1.22. Right up through Go 1.21, 'GO111MODULE=off go get [-u]' will do the traditional dependency fetching and so on for you.)

Discovering that my smartphone had infiltrated my life

By: cks
30 November 2025 at 02:45

While I have a smartphone, I think of myself as not particularly using it all that much. I got a smartphone quite late, it spends a lot of its life merely sitting there (not even necessarily in the same room as me, especially at home), and while I installed various apps (such as a SSH client) I rarely use them; they're mostly for weird emergencies. Then I suddenly couldn't use my current smartphone any more and all sorts of things came out of the woodwork, both things I sort of knew about but hadn't realized how much they'd affect me and things that I didn't even think about until I had a dead phone.

The really obvious and somewhat nerve-wracking thing I expected from the start is that plenty of things want to send you text messages (both for SMS authentication codes and to tell you what steps to take to, for example, get your new replacement smartphone). With no operating smartphone I couldn't receive them. I found myself on tenterhooks all through the replacement process, hoping very much that my bank wouldn't decide it needed to authenticate my credit card usage through either its smartphone app or a text message (and I was lucky that I could authenticate some things through another device). Had I been without a smartphone for a more extended time, I could see a number of things where I'd probably have had to make in-person visits to a bank branch.

(Another obvious thing I knew about is that my bike computer wants to talk to a smartphone app (also). At a different time of year this would have been a real issue, but fortunately my bike club's recreational riding season is over so all it did was delay me uploading one commute ride.)

In less obvious things, I use my smartphone as my alarm clock. With my smartphone unavailable I discovered that I had no good alternative (although I had some not so good ones that are too quiet). I've also become used to using my phone for a quick check of the weather on the way out the door, and to check the arrival time of TTC buses, neither of which were available. Nor could I check email (or text messages) on the way to pick up my new phone because with no smartphone I had no data coverage. I was lucky enough to have another wifi-enabled device available that I took with me, which turned out to be critical for the pickup process.

(It also felt weird and wrong to walk out of the door without the weight of my phone in my pocket, as if I was forgetting my keys or something equally important. And there were times on the trip to get the replacement phone when I found myself realizing that if I'd had an operating smartphone, I'd have taken it out for a quick look at this or that or whatever.)

On the level of mere inconveniences, over time I've gotten pulled into using my smartphone's payment setup for things like grocery purchases. I could still do that in several other ways even without a smartphone, but none of them would have been as nice an experience. There would also have been paper cuts in things like checking the balance on my public transit fare card and topping it up.

Having gone through this experience with my smartphone, I'm now wondering what other bits of technology have quietly infiltrated both my personal life and things at work without me noticing their actual importance. I suspect that there are some more and I'll only realize it when they break.

PS: The smartphone I had to replace is the same one I got back in late 2016, so I got a bit over nine years of usage out of it. This is pretty good by smartphone standards (although for the past few years I was carefully ignoring that its support for fixing security bugs was questionable; there were some updates, but also some known issues that weren't being fixed).

Do you care about (all) HTTP requests from cloud provider IP address space?

By: cks
29 November 2025 at 04:21

About a month ago Mike Hoye wrote Raised Shields, in which Hoye said, about defending small websites from crawler abuse in this day and age:

If you only care about humans I strongly advise you to block every cloudhost subnet you can find, pretty easy given the effort they put into finding you. Most of the worst actors out there are living comfortably on Azure, GCP, Yandex and sometimes Huawei’s servers.

(As usual, there's no point in complaining about abusive crawlers to the cloud providers.)

I've said something similar on the Fediverse:

Today's idle thought: how many small web servers actually have any reason to accept requests from AWS or Google Cloud IP address space? If you search through your logs with (eg) grepcidr, you may find that there's little or nothing of value coming from there, and they sure are popular with LLM crawlers these days.

You definitely want to search your logs before doing this, and you may find that you want to make some exceptions even if you do opt for it. For example, you might want or need to let cloud-hosted things fetch your syndication feeds, because there are a fair number of people and feed readers that do their fetching from the cloud. Possibly you'll find that you have a significant number of real visitors that are using do it yourself personal VPN setups that have cloud exit points.
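If grepcidr isn't handy, the same sort of log search can be sketched with Python's standard ipaddress module. The CIDR ranges below are documentation/example ranges, stand-ins for a real cloud provider's published list:

```python
# Sketch of a grepcidr-style log filter using only the standard library.
# Substitute the cloud providers' real published ranges for these.
import ipaddress

CLOUD_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def from_cloud(ip_str):
    """True if ip_str parses as an IP inside one of the ranges."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return False
    return any(ip in net for net in CLOUD_RANGES)

def cloud_log_lines(lines):
    """Yield log lines whose first field (the client IP) is in-range."""
    for line in lines:
        fields = line.split()
        if fields and from_cloud(fields[0]):
            yield line
```

(This assumes the common log format convention of the client IP being the first field; adjust the field index for your own logs.)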

(How many exceptions you want to make may depend on how much of a hard line you want to take. I suspect that Mike Hoye's line is much harder than mine.)

However, I think that for a lot of small, personal web servers and web sites you'll find that almost nothing of genuine value comes from the big cloud provider networks, from AWS, Google Cloud, Azure, Oracle, and so on. You're probably not getting real visitors from these clouds, people who are interested in reading your work and engaging with it. Instead you'll most likely see an ever-growing horde of obvious crawlers, increasingly suspicious user agents, claims to be things that they aren't, and so on.

On the one hand, it's in some sense morally pure to not block these cloud areas unless they're causing your site active harm; it's certainly what the ethos was on the older Internet, and it was a good and useful ethos for those times. On the other hand, that view is part of what got us here. More and more, these days are the days of Raised Shields, as we react to the new environment (much as email had to react to the new environment of ever increasing spam).

If you're doing this, one useful trick you can play if you have the right web server environment is to do your blocking with HTTP 429 Too Many Requests responses. Using this HTTP code is in some sense inaccurate, but it has the useful effect that very few things will take it as a permanent error the way they may take, for example, HTTP 403 (or HTTP 404). This gives you a chance to monitor your web server logs and add a suitable exemption for traffic that you turn out to want after all, without your error responses doing anything permanent (like potentially removing your pages from search engine indexes). You can also arrange to serve up a custom error page for this case, with an explanation or a link to an explanation.

(My view is that serving a 400-series HTTP error response is better than a HTTP 302 temporary redirect to your explanation, for various reasons. Possibly there are clever things you can do with error pages in general.)
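As an illustration, the HTTP 429 approach might look like this in nginx terms (a sketch only; the CIDRs are placeholders for real cloud ranges, and the explanation page location is my own invention):

```nginx
# map client IPs to a flag; these ranges are illustrative only
geo $from_cloud {
    default          0;
    192.0.2.0/24     1;    # stand-in for a cloud provider range
    198.51.100.0/24  1;
}

server {
    # serve a custom explanation page with the 429 response
    error_page 429 /why-429.html;

    if ($from_cloud) {
        return 429;
    }

    location = /why-429.html {
        internal;
    }
}
```

If you later find legitimate traffic caught in this, you can carve out exceptions in the geo block (a more specific CIDR mapped back to 0 wins) without touching the rest of the configuration.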

We can't fund our way out of the free and open source maintenance problem

By: cks
28 November 2025 at 04:18

It's in the tech news a lot these days that there are 'problems' with free and open source maintenance. I put 'problems' in quotes because the issue is mostly that FOSS maintenance isn't happening as fast or as much as the people who've come to depend on it would like, and the people who maintain FOSS are increasingly saying 'no' when corporations turn up (cf, also). But even with all the corporate presence, there are still a reasonable number of people who use non-corporate FOSS operating systems like Debian Linux, FreeBSD, and so on, and they too suffer when parts of the FOSS software stack struggle with maintenance. Every so often, people will suggest that the problem would be solved if only corporations would properly fund this maintenance work. However, I don't believe this can actually work even in a world where corporations are willing to properly fund such things (in this world, they're very clearly not).

One big problem with 'funding' as a solution to the FOSS maintenance problems is that for many FOSS maintainers, there isn't enough work available to support them. Many FOSS people write and support only a small number of things that don't necessarily need much active development and bug fixing (people have done studies on this), and so can't feasibly provide full time employment (especially at something equivalent to a competitive salary). Certainly, there's plenty of large projects that are underfunded and could support one or more people working on them full time, but there's also a long tail of smaller, less obvious dependencies that are just as important to keep maintained.

(In a way, the lack of funding pushes people toward small projects. With no funding, you have to do your projects in your spare time and the easiest way to make that work is to choose some small area or modest project that simply doesn't need that much time to develop or maintain.)

There are models where people who work on FOSS can be funded to do a bit of work on a lot of projects. But that's not the same as having funding to work full time on your own little project (or set of little projects). It's much more like regular work, in that you're being paid to do development work on other people's stuff (and I suspect that it will be much more time consuming than one might expect, since anyone doing this will have to come up to speed on a whole bunch of projects).

(I'm assuming the FOSS funding equivalent of a perfectly spherical frictionless object from physics examples, so we can wave away all other issues except that there is not enough work on individual projects. In the real world there are a huge host of additional problems with funding people for FOSS work that create significant extra friction (eg, potential liabilities).)

PS: Even though we can't solve the whole problem with funding, companies absolutely should be trying to use funding to solve as much of it as possible. That they manifestly aren't is one of many things that is probably going to bring everything down as pressure builds to do something.

(I'm sure I'm far from the first person to write about this issue with funding FOSS work. I just feel like writing it down myself, partly as elaboration on some parts of past Fediverse posts.)

Sidebar: It's full time work that matters

If someone is already working a regular full time job, their spare time is a limited resource and there are many claims on it. For various reasons, not everyone will take money to spend (potentially) most of their spare time maintaining their FOSS work. Many people will only be willing to spend a limited amount of their spare time on FOSS stuff, even if you could fund them at reasonable rates for all of their spare time. The only way to really get 'enough' time is to fund people to work full time, so their FOSS work replaces their regular full time job.

One of the reasons I suspect some people won't take money for their extra time is that they already have one job and they don't want to effectively get a second one. They do FOSS work deliberately because it's a break from 'job' style work.

(This points to another, bigger issue; there are plenty of people doing all sorts of hobbies, such as photography, who have no desire to 'go pro' in their hobby no matter how avid and good they are. I suspect there are people writing and maintaining important FOSS software who similarly have no desire to 'go pro' with their software maintenance.)

Duplicate metric labels and group_*() operations in Prometheus

By: cks
27 November 2025 at 02:44

Suppose that you have an internal master DNS server and a backup for that master server. The two servers are theoretically fed from the same data and so should have the same DNS zone contents, and especially they should have the same DNS zone SOAs for all zones in both of their internal and external views. They both run Bind and you use the Bind exporter, which provides the SOA values for every zone Bind is configured to be a primary or a secondary for. So you can write an alert with an expression like this:

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    bind_zone_serial{host="primary"}

This is a perfectly good alert (well, alert rule), but it has lost all of the additional labels you might want in your alert. Especially, it has lost both host names. You could hard-code the host name in your message about the alert, but it would be nice to do better and propagate your standard labels into the alert. To do this you want to use one of group_left() and group_right(), but which one you want depends on where you want the labels to come from.

(Normally you have to choose between the two depending on which side has multiple matches, but in this case we have a one-to-one matching.)

For labels that are duplicated between both sides, the group_*() operators pick which side's labels you get, but backwards from their names. If you use group_right(), the duplicate label values come from the left; if you use group_left(), the duplicate label values come from the right. Here, we might change the backup host's name but we're probably not going to change the primary host's name, so we likely want to preserve the 'host' label from the left side and thus we use group_right():

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    group_right (job,host,instance)
      bind_zone_serial{host="primary"}
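
Equivalently, you can swap the two sides and use group_left() with the same label list; since group_left() takes the duplicated labels from the right-hand side, the result still carries the backup host's 'host' label:

```
bind_zone_serial{host="primary"}
  != on (view,zone_name)
    group_left (job,host,instance)
      bind_zone_serial{host="backup"}
```

(One difference: with a filtering operator like '!=', the sample value in the result comes from the left-hand side, so this version reports the primary's serial number instead of the backup's, which may matter for your alert message.)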

One reason this little peculiarity is on my mind at the moment is that Cloudflare's excellent pint Prometheus rule linter recently picked up a new 'redundant label' lint rule that complains about this for custom labels such as 'host':

Query is trying to join the 'host' label that is already present on the other side of the query.

(It doesn't complain about job or instance, presumably because it understands why you might do this for those labels. As the pint message will tell you, to silence this you need to disable 'promql/impossible' for this rule.)

When I first saw pint's warning I didn't think about it and removed the 'host' label from the group_right(), but fortunately I actually tested what the result would be and saw that I was now getting the wrong host name.

(This is different from pulling in labels from other metrics, where the labels aren't duplicated.)

PS: I clearly knew this at some point, when I wrote the original alert rule, but then I forgot it by the time I was looking at pint's warning message. PromQL is the kind of complex thing where the details can fall out of my mind if I don't use it often enough, which I don't these days since our alert rules are relatively stable.

BSD PF versus Linux nftables for firewalls for us

By: cks
26 November 2025 at 03:48

One of the reactions I saw to our move from OpenBSD to FreeBSD for firewalls was to wonder why we weren't moving all the way to nftables based Linux firewalls. It's true that this would reduce the number of different Unixes we have to operate and probably get us more or less state of the art 10G network performance. However, I have some negative views on the choice of PF versus nftables, both in our specific situation and in general.

(I've written about this before but it was in the implicit context of Linux iptables.)

In our specific situation:

  • We have a lot of existing, relatively complex PF firewall rules; for example, our perimeter firewall has over 400 non-comment lines of rules, definitions, and so on. Translating these from OpenBSD PF to FreeBSD PF is easy, if it's necessary at all. Translating everything to nftables is a lot more work, and as far as I know there's no translation tool, especially not one that we could really trust. We'd probably have to basically rebuild each firewall from the ground up, which is both a lot of work and a high-stakes thing. We'd have to be extremely convinced that we had to do this in order to undertake it.

  • We have a lot of well developed tooling around operating, monitoring, and gathering metrics from PF-based firewalls, most of it locally created. Much or all of this tooling ports straight over from OpenBSD to FreeBSD, while we have no equivalent tooling for nftables and would have to develop (or find) equivalents.

  • We already know PF and almost all of that knowledge transfers over from OpenBSD PF to FreeBSD PF (and more will transfer with FreeBSD 15, which has some PF and PF syntax updates from modern OpenBSD).

In general (much of which also applies to our specific situation):

  • There are a number of important PF features that nftables at best has in incomplete, awkward versions. For example, nftables' version of pflog is awkward and half-baked compared to the real thing (also). While you may be able to put together some nftables based rough equivalent of BSD pfsync, casual reading suggests that it's a lot more involved and complex (and maybe less integrated with nftables).

  • The BSD PF firewall system is straightforward and easy to understand and predict. The Linux firewall system is much more complex and harder to understand, and this complexity bleeds through into nftables configuration, where you need to know chains and tables and so on. Much of this Linux complexity is not documented in ways that are particularly accessible.

  • Nftables documentation is opaque compared to the BSD pf.conf manual page (also). Partly this is because there is no 'nftables.conf' manual page; instead, your entry point is the nft manual page, which is both a command line tool and the documentation of the format of nftables rules. I find that these are two tastes that don't go well together.

    (This is somewhat forced by the nftables decision to retain compatibility with adding and removing rules on the fly. PF doesn't give you a choice, you load your entire ruleset from a file.)

  • nftables is already the third firewall rule format and system that the Linux kernel has had over the time that I've been writing Linux firewall rules (ipchains, iptables, nftables). I have no confidence that there won't be a fourth before too long. PF has been quite stable by comparison.

What I mostly care about is what I have to write and read to get the IP filtering and firewall setup that we want (and then understand it later), not how it gets compiled down and represented in the kernel (this has come up before). Assuming that the nftables backend is capable enough and the result performs sufficiently well, I'd be reasonably happy with a PF like syntax (and semantics) on top of kernel nftables (although we'd still have things like the pflog and pfsync issues).
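To make the register difference concrete, here's a hedged sketch of roughly the same policy in both syntaxes (the interface name and address are made up, and neither fragment is anything like a complete ruleset):

```
# pf.conf: default-deny inbound on the outside interface,
# with (automatically stateful) SSH access to one host
ext_if = "em0"
block in log on $ext_if all
pass in on $ext_if proto tcp to 192.0.2.10 port 22
```

In nftables you need the table, chain, and hook plumbing, and since rules aren't stateful by default you also handle established traffic explicitly:

```
table inet fw {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        tcp dport 22 ip daddr 192.0.2.10 accept
        log prefix "blocked: "
    }
}
```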

Can I get things done in nftables? Certainly, nftables is relatively inoffensive. Do I want to write nftables rules? No, not really, no more than I want to write iptables rules. I do write nftables and iptables rules when I need to do firewall and IP filtering things on a Linux machine, but for a dedicated machine for this purpose I'd rather use a PF-based environment (which is now FreeBSD).

As far as I can tell, the state of Linux IP filtering documentation is partly a result of the fact that Linux doesn't have a unified IP filtering system and environment the way that OpenBSD does and FreeBSD mostly does (or at least successfully appears to so far). When the IP filtering system is multiple more or less separate pieces and subsystems, you naturally tend to get documentation that looks at each piece in isolation and assumes you already know all of the rest.

(Let's also acknowledge that writing good documentation for a complex system is hard, and the Linux IP filtering system has evolved to be very complex.)

PS: There's no real comparison between PF and the older iptables system; PF is clearly far more high level than you can reasonably do in iptables, which by comparison is basically an IP filtering assembly language. I'm willing to tentatively assume that nftables can be used in a higher level way than iptables can (I haven't used it for enough to have a well informed view either way); if it can't, then there's again no real comparison between PF and nftables.

Making Polkit authenticate people like su does (with group wheel)

By: cks
25 November 2025 at 03:46

Polkit is how a lot of things on modern Linux systems decide whether or not to let people do privileged operations, including systemd's run0, which effectively functions as another su or sudo. Polkit normally has a significantly different authentication model than su or sudo, where an arbitrary login can authenticate for privileged operations by giving the password of any 'administrator' account (accounts in group wheel or group admin, depending on your Linux distribution).

Suppose, not hypothetically, that you want a su like model in Polkit, one where people in group 'wheel' can authenticate by providing the root password, while people not in group 'wheel' cannot authenticate for privileged operations at all. In my earlier entry on learning about Polkit and adjusting it I put forward an untested Polkit stanza to do this. Now I've tested it and I can provide an actual working version.

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // must exist but have a locked password
        return ["unix-user:nobody"];
    }
});

(This goes in /etc/polkit-1/rules.d/50-default.rules, and the filename is important because it has to replace the standard version in /usr/share/polkit-1/rules.d.)

This doesn't quite work the way 'su' does, where it will just refuse to work for people not in group wheel. Instead, if you're not in group wheel you'll be prompted for the password of 'nobody' (or whatever other login you're using), which you can never successfully supply because the password is locked.

As I've experimentally determined, it doesn't work to return an empty list ('[]'), or a Unix group that doesn't exist ('unix-group:nosuchgroup'), or a Unix group that exists but has no members. In all cases my Fedora 42 system falls back to asking for the root password, which I assume is a built-in default for privileged authentication. Instead you apparently have to return something that Polkit thinks it can plausibly use to authenticate the person, even if that authentication can't succeed. Hopefully Polkit will never get smart enough to work that out and stop accepting accounts with locked passwords.

(If you want to be friendly and you expect people on your servers to run into this a lot, you should probably create a login with a more useful name and GECOS field, perhaps 'not-allowed' and 'You cannot authenticate for this operation', that has a locked password. People may or may not realize what's going on, but at least they have a chance.)

PS: This is with the Fedora 42 version of Polkit, which is version 126. This appears to be the most recent version from the upstream project.

Sidebar: Disabling Polkit entirely

Initially I assumed that Polkit had explicit rules somewhere that authorized the 'root' user. However, as far as I can tell this isn't true; there are no normal rules that specifically authorize root or any other UID 0 login name, and despite that root can perform actions that are restricted to groups that root isn't in. I believe this means that you can explicitly disable all discretionary Polkit authorization with an '00-disable.rules' file that contains:

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Based on experimentation, this disables absolutely everything, even actions that are considered generally harmless (like libvirt's 'virsh list', which I think normally anyone can do).

A slightly more friendly version can be had by creating a situation where there are no allowed administrative users. I think this would be done with a 50-default.rules file that contained:

polkit.addAdminRule(function(action, subject) {
    // must exist but have a locked password
    return ["unix-user:nobody"];
});

You'd also want to make sure that the 'nobody' login isn't in any special groups that rules in /usr/share/polkit-1/rules.d use to allow automatic access. You can look for these by grep'ing for 'isInGroup'.

The (early) good and bad parts of Polkit for a system administrator

By: cks
24 November 2025 at 03:46

At a high level, Polkit is how a lot of things on modern Linux systems decide whether or not to let you do privileged operations. After looking into it a bit, I've wound up feeling that Polkit has both good and bad aspects from the perspective of a system administrator (especially a system administrator with multi-user Linux systems, where most of the people using them aren't supposed to have any special privileges). While I've used (desktop) Linuxes with Polkit for a while and relied on it for a certain amount of what I was doing, I've done so blindly, effectively as a normal person. This is the first time I've looked at the details of Polkit, which is why I'm calling these my early reactions.

On the good side, Polkit is a single source of authorization decisions, much like PAM. On a modern Linux system, there are a steadily increasing number of programs that do privileged things, even on servers (such as systemd's run0). These could all have their own bespoke authorization systems, much as sudo has its own custom one, but instead most of them have centralized on Polkit. In theory Polkit gives you a single thing to look at and a single thing to learn, rather than learning systemd's authentication system, NetworkManager's authentication system, etc. It also means that programs have less of a temptation to hard-code (some of) their authentication rules, because Polkit is very flexible.

(In many cases programs couldn't feasibly use PAM instead, because they want certain actions to be automatically authorized. For example, in its standard configuration libvirt wants everyone in group 'libvirt' to be able to issue libvirt VM management commands without constantly having to authenticate. PAM could probably be extended to do this but it would start to get complicated, partly because PAM configuration files aren't a programming language and so implementing logic in PAM gets awkward in a hurry.)

On the bad side, Polkit is a non-declarative authorization system, and a complex one with its rules not in any single place (instead they're distributed through multiple files in two different formats). Authorization decisions are normally made in (JavaScript) code, which means that they can encode essentially arbitrary logic (although there are standard forms of things). This means that the only way to know who is authorized to do a particular thing is to read its XML 'action' file and then look through all of the JavaScript code to find and then understand things that apply to it.

(Even 'who is authorized' is imprecise by default. Polkit normally allows anyone to authenticate as any administrative account, provided that they know its password and possibly other authentication information. This makes the passwords of people in group wheel or group admin very dangerous things, since anyone who can get their hands on one can probably execute any Polkit-protected action.)

This creates a situation where there's no way in Polkit to get a global overview of who is authorized to do what, or what a particular person has authorization for, since this doesn't exist in a declarative form and instead has to be determined on the fly by evaluating code. Instead you have to know what's customary, like the group that's 'administrative' for your Linux distribution (wheel or admin, typically) and what special groups (like 'libvirt') do what, or you have to read and understand all of the JavaScript and XML involved.

In other words, there's no feasible way to audit what Polkit is allowing people to do on your system. You have to trust that programs have made sensible decisions in their Polkit configuration (ones that you agree with), or run the risk of system malfunctions by turning everything off (or allowing only root to be authorized to do things).

(Not even Polkit itself can give you visibility into why a decision was made or fully predict it in advance, because the JavaScript rules have no pre-filtering to narrow down what they apply to. The only way you find out what a rule really does is invoking it. Well, invoking the function that the addRule() or addAdminRule() added to the rule stack.)

This complexity (and the resulting opacity of authorization) is probably intrinsic in Polkit's goals. I even think they made the right decision by having you write logic in JavaScript rather than try to create their own language for it. However, I do wish Polkit had a declarative subset that could express all of the simple cases, reserving JavaScript rules only for complex ones. I think this would make the overall system much easier for system administrators to understand and analyze, so that we'd have a much better idea of (and much better control over) who was authorized for what.

Brief notes on learning and adjusting Polkit on modern Linuxes

By: cks
23 November 2025 at 04:07

Polkit is a multi-faceted user level thing used to control access to privileged operations. It's probably used by various D-Bus services on your system, which you can more or less get a list of with pkaction, and there's a pkexec program that's like su and sudo. There are two reasons that you might care about Polkit on your system. First, there might be tools you want to use that use Polkit, such as systemd's run0 (which is developing some interesting options). The other is that Polkit gives people an alternate way to get access to root or other privileges on your servers and you may have opinions about that and what authentication should be required.

Unfortunately, Polkit configuration is arcane and as far as I know, there aren't really any readily accessible options for it. For instance, if you want to force people to authenticate for root-level things using the root password instead of their password, as far as I know you're going to have to write some JavaScript yourself to define a suitable Administrator identity rule. The polkit manual page seems to document what you can put in the code reasonably well, but I'm not sure how you test your new rules and some areas seem underdocumented (for example, it's not clear how 'addAdminRule()' can be used to say that the current user cannot authenticate as an administrative user at all).

(If and when I wind up needing to test rules, I will probably try to do it in a scratch virtual machine that I can blow up. Fortunately Polkit is never likely to be my only way to authenticate things.)

Polkit also has some paper cuts in its current setup. For example, as far as I can see there's no easy way to tell Polkit-using programs that you want to immediately authenticate for administrative access as yourself, rather than be offered a menu of people in group wheel (yourself included) and having to pick yourself. It's also not clear to me (and I lack a test system) if the default setup blocks people who aren't in group wheel (or group admin, depending on your Linux distribution flavour) from administrative authentication or if instead they get to pick authenticating using one of your passwords. I suspect it's the latter.

(All of this makes Polkit seem like it's not really built for multi-user Linux systems, or at least multi-user systems where not everyone is an administrator.)

PS: Now that I've looked at it, I have some issues with Polkit from the perspective of a system administrator, but those are going to be for another entry.

Sidebar: Some options for Polkit (root) authentication

If you want everyone to authenticate as root for administrative actions, I think what you want is:

polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});

If you want to restrict this to people in group wheel, I think you want something like:

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // might not work to say 'no'?
        return [];
    }
});

If you want people in group wheel to authenticate as themselves, not root, I think you return 'unix-user:' + subject.user instead of 'unix-user:0'. I don't know if people still get prompted by Polkit to pick a user if there's only one possible user.

You can't (easily) ignore errors in Python

By: cks
22 November 2025 at 04:18

Yesterday I wrote about how there's always going to be a way to not write code for error handling. When I wrote that entry I deliberately didn't phrase it as 'ignoring errors', because in some languages it's either not possible to do that or at least very difficult, and one of them is Python.

As every Python programmer knows, errors raise exceptions in Python and you can catch those exceptions, either narrowly or (very) broadly (possibly by accident). If you don't handle an exception, it bubbles up and terminates your program (which is nice if that's what you want and does mean that errors can't be casually ignored). On the surface it seems like you can ignore errors by simply surrounding all of your code with a try:/except: block that catches everything. But if you do this, you're not ignoring errors in the same way as you do in a language where errors are return values. In a language where you can genuinely ignore errors, all of your code keeps on running when errors happen. But in Python, if you put a broad try block around your code, your code stops executing at the first exception that gets raised, rather than continuing on to the other code within the try block.

(If there's further code outside the try block, it will run but probably not work very well because there will likely be a lot that simply didn't happen inside the try block. Your code skipped right from the statement that raised the exception to the first statement outside the try block.)

To get the C or Go style of experience, where your program keeps running its code even after an exception, you need to effectively catch and ignore exceptions separately for each statement. You can write this out by hand, putting each statement in its own try: block, but you'll probably get tired of this very fast, the result will be hard to read, and it's very obviously not regular Python. This is the sign that Python doesn't really let you ignore errors in any easy way. All Python lets you do easily is suppress messages about errors and potentially keep them from terminating your program. The closer you want to get to actually ignoring all errors, the more work you'll have to do.
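A small sketch of the contrast (the 'step' functions are made up for illustration): a single broad try block stops at the first exception, while per-statement try blocks let the rest of the code keep running.

```python
# A broad try block versus per-statement try blocks. The step()
# functions here are purely illustrative.
results = []

def step(n, fail=False):
    if fail:
        raise ValueError("step %d failed" % n)
    results.append(n)

# One broad try block: step(3) never runs after step(2) raises.
try:
    step(1)
    step(2, fail=True)
    step(3)
except Exception:
    pass
print(results)  # [1]

# Per-statement try blocks: execution continues past the failure,
# which is the closest Python gets to 'ignoring' the error.
results = []
for n, fail in [(1, False), (2, True), (3, False)]:
    try:
        step(n, fail)
    except Exception:
        pass
print(results)  # [1, 3]
```

Writing every statement this way is exactly the tedium described above, which is why nobody does it.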

(There are probably clever things you can do with Python debugging hooks since I believe that Python debuggers can intercept exceptions, although I'm not sure if they can resume execution after unhandled ones. But this is not going to really be easy.)

There's always going to be a way to not code error handling

By: cks
21 November 2025 at 03:55

Over on the Fediverse, I said something:

My hot take on Rust .unwrap(): no matter what you do, people want convenient shortcut ways of not explicitly handling errors in programming languages. And then people will use them in what turn out to be inappropriate places, because people aren't always right and sometimes make mistakes.

Every popular programming language lets your code not handle errors in some way, taking an optimistic approach. If you're lucky, your program notices at runtime when there actually is an error.

The subtext for this is that Cloudflare had a global outage where one contributing factor was using Rust's .unwrap(), which will panic your program if an error actually happens.

Every popular programming language has something like this. In Python you can ignore the possibility of exceptions, in C and Go you can ignore or explicitly discard error returns, in Java you can catch and ignore all exceptions, and so on. What varies from language to language is what the consequences are. In Python and Rust, your program dies (with an uncaught exception or a panic, respectively). In Go, your program either sails on making an increasingly big mess or panics (for example, if another return value is nil when there's an error and you try to do something with it that requires a non-nil value).

(Some languages let you have it either way. The default state of the Bourne shell is to sail onward in the face of failures, but you can change that with 'set -e' (mostly) and even get good error reports sometimes.)
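In Python, the standard library even provides a polite version of the shortcut: explicitly suppressing an error you believe doesn't matter. A sketch (the file path is made up):

```python
import contextlib
import os

# Explicitly deciding not to handle an error. This is the same decision
# as using Rust's .unwrap() on a Result you believe is always Ok, except
# that nothing panics here; the error is simply swallowed.
with contextlib.suppress(FileNotFoundError):
    os.remove("/tmp/probably-not-there")  # hypothetical scratch file
```

The difference in consequences is the point: suppress() quietly sails on, while .unwrap() ends the program, but both are ways of not writing real error handling.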

These features don't exist because language designers are idiots (especially since error handling isn't a solved problem). They ultimately exist because people want a way to not so much ignore errors as not write code to 'handle' them. These people don't expect errors, they think in practice errors will either be extremely infrequent or not happen, and they don't want to write code that will deal with them anyway (if they're forced to write code that does something, often their choice will be to end the program).

You could probably create a programming language that didn't allow you to do this (possibly Haskell and other monad-using functional languages are close to it). I suspect it would be unpopular. If it wasn't unpopular, I suspect people would write their own functions or whatever to ignore the possibility of errors (either with or without ending the program if an error actually happens). People want to not have to write error handling, and they'll make it happen one way or another.

(Then, as I mentioned, some of the time they'll turn out to be wrong about errors not happening.)

Automatically scrubbing ZFS pools periodically on FreeBSD

By: cks
20 November 2025 at 03:17

We've been moving from OpenBSD to FreeBSD for firewalls. One advantage of this is giving us a mirrored ZFS pool for the machine's filesystems; we have a lot of experience operating ZFS and it's a simple, reliable, and fully supported way of getting mirrored system disks on important machines. ZFS has checksums and you want to periodically 'scrub' your ZFS pools to verify all of your data (in all of its copies) through these checksums (ideally relatively frequently). All of this is part of basic ZFS knowledge, so I was a little bit surprised to discover that none of our FreeBSD machines had ever scrubbed their root pools, despite some of them having been running for months.

It turns out that while FreeBSD comes with a configuration option to do periodic ZFS scrubs, the option isn't enabled by default (as of FreeBSD 14.3). Instead you have to know to enable it, which admittedly isn't too hard to find once you start looking.

FreeBSD has a general periodic(8) system for triggering things on a daily, weekly, monthly, or other basis. As covered in the manual page, the default configuration for this is in /etc/defaults/periodic.conf and you can override things by creating or modifying /etc/periodic.conf. ZFS scrubs are a 'daily' periodic setting, and as of 14.3 the basic thing you want is an /etc/periodic.conf with:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"

FreeBSD will normally scrub each pool a certain number of days after its previous scrub (either a manual scrub or an automatic scrub through the periodic system). The default number of days is 35, which is a bit high for my tastes, so I suggest that you shorten it, making your periodic.conf stanza be:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="14"

There are other options you can set that are covered in /etc/defaults/periodic.conf.

(That the daily automatic scrubs happen some number of days after the pool was last scrubbed means that you can adjust their timing by doing a manual scrub. If you have a bunch of machines that you set up at the same time, you can get them to space out their scrubs by scrubbing one a day by hand, and so on.)

Looking at the other ZFS periodic options, I might also enable the daily ZFS status report, because I'm not certain if there's anything else that will alert you if or when ZFS starts reporting errors:

# Find out about ZFS errors?
daily_status_zfs_enable="YES"

You can also tell ZFS to TRIM your SSDs every day. As far as I can see there's no option to do the TRIM less often than once a day; I guess if you want that you have to create your own weekly or monthly periodic script (perhaps by copying the 801.trim-zfs daily script and modifying it appropriately). Or you can just do 'zpool trim ...' every so often by hand.

We're (now) moving from OpenBSD to FreeBSD for firewalls

By: cks
19 November 2025 at 04:17

A bit over a year ago I wrote about why we'd become interested in FreeBSD; to summarize, FreeBSD appeared promising as a better, easier to manage host operating system for PF-based things. Since then we've done enough with FreeBSD to have decided that we actively prefer it to OpenBSD. It's been relatively straightforward to convert our firewall OpenBSD PF rulesets to FreeBSD PF and the resulting firewalls have clearly better performance on our 10G network than our older OpenBSD ones did (with less tuning).

(It's possible that the very latest OpenBSD has significantly improved bridging and routing firewall performance so that it no longer requires the fastest single-core CPU performance you can get to go decently. But pragmatically it's too late; FreeBSD had that performance earlier and we now have more confidence in FreeBSD's performance in the firewall role than OpenBSD's.)

There are some nice things about FreeBSD, like root on ZFS, and broadly I feel that it's more friendly than OpenBSD. But those are secondary to its firewall network performance (and PF compatibility); if its network performance was no better than OpenBSD (or worse), we wouldn't be interested. Since it is better, it's now displacing OpenBSD for our firewalls and our latest VPN servers. We've stopped building new OpenBSD machines, so as firewalls come up for replacement they get rebuilt as FreeBSD machines.

(We have a couple of non-firewall OpenBSD machines that will likely turn into Ubuntu machines when we replace them, although we can't be sure until it actually happens.)

Would we consider going back to OpenBSD? Maybe, but probably not. Now that we've migrated a significant number of firewalls, moving the remaining ones to FreeBSD is the easiest approach, even if new OpenBSD firewalls would equal their performance. And the FreeBSD 10G firewall performance we're getting is sufficiently good that it leaves OpenBSD relatively little ground to exceed it.

(There are some things about FreeBSD that we're not entirely enthused about. We're going to be doing more firewall upgrades than we used to with OpenBSD, for one.)

PS: As before, I don't think there's anything wrong with OpenBSD if it meets your needs. We used it happily for years until we started being less happy with its performance on 10G Ethernet. A lot of people don't have that issue.

A surprise with how '#!' handles its program argument in practice

By: cks
18 November 2025 at 03:54

Every so often I get to be surprised about some Unix thing. Today's surprise is the actual behavior of '#!' in practice on at least Linux, FreeBSD, and OpenBSD, which I learned about from a comment by Aristotle Pagaltzis on my entry on (not) using '#!/usr/bin/env'. I'll quote the starting part here:

In fact the shebang line doesn’t require absolute paths, you can use relative paths too. The path is simply resolved from your current directory, just as any other path would be – the kernel simply doesn’t do anything special for shebang line paths at all. [...]

I found this so surprising that I tested it on our Linux servers as well as a FreeBSD and an OpenBSD machine. On the Linux servers (and probably on the others too), the kernel really does accept the full collection of relative paths in '#!'. You can write '#!python3', '#!bin/python3', '#!../python3', '#!../../../usr/bin/python3', and so on, and provided that your current directory is in the right place in the filesystem, they all worked.

(On FreeBSD and OpenBSD I only tested the '#!python3' case.)
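A sketch of the sort of test involved, done from Python for convenience (the directory choices are mine, and this assumes a typical Linux layout with /usr/bin/python3):

```python
import os
import subprocess
import tempfile

# Write a script whose '#!' line uses a relative path. The kernel
# resolves the interpreter path relative to the current directory of
# the process doing the exec(), not the script's own directory.
d = tempfile.mkdtemp()
script = os.path.join(d, "demo")
with open(script, "w") as f:
    f.write("#!../usr/bin/python3\nprint('ok')\n")
os.chmod(script, 0o755)

# Run with the current directory set to /tmp, so '../usr/bin/python3'
# resolves to /usr/bin/python3. From most other directories this exact
# relative path would fail with ENOENT.
out = subprocess.run([script], cwd="/tmp", capture_output=True, text=True)
print(out.stdout.strip())  # 'ok' on a system with this layout
```

The same script run from a directory where '../usr/bin/python3' doesn't resolve will fail outright, which is why relative '#!' paths are a curiosity rather than something you'd want to rely on.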

As far as I can tell, this behavior goes all the way back to 4.2 BSD (which isn't quite the origin point of '#!' support in the Unix kernel but is about as close as we can get). The execve() kernel implementation in sys/kern_exec.c finds the program from your '#!' line with a namei() call that uses the same arguments (apart from the name) as it did to find the initial executable, and that initial executable can definitely be a relative path.

Although this is probably the easiest way to implement '#!' inside the kernel, I'm a little bit surprised that it survived in Linux (in a completely independent implementation) and in OpenBSD (where the security people might have done a double-take at some point). But given Hyrum's Law there are probably people out there who are depending on this behavior, so we're now stuck with it.

(In the kernel, you'd have to go at least a little bit out of your way to check that the new path starts with a '/' or use a kernel name lookup function that only resolves absolute paths. Using a general name lookup function that accepts both absolute and relative paths is the simplest approach.)

PS: I don't have access to Illumos based systems, other BSDs (NetBSD, etc), or macOS, but I'd be surprised if they had different behavior. People with access to less mainstream Unixes (including commercial ones like AIX) can give it a try to see if there are any Unixes that don't support relative paths in '#!'.

People are sending HTTP requests with X-Forwarded-For across the Internet

By: cks
17 November 2025 at 03:49

Over on the Fediverse, I shared a discovery that came from turning over some rocks here on Wandering Thoughts:

This is my face when some people out there on the Internet send out HTTP requests with X-Forwarded-For headers, and maybe even not maliciously or lying. Take a bow, ZScaler.

The HTTP X-Forwarded-For header is something that I normally expect to see only on something behind a reverse proxy, where the reverse proxy frontend uses it to tell the backend the real originating IP (which is otherwise not available when the HTTP requests are forwarded over HTTP). As a corollary of this usage, if you're operating a reverse proxy frontend you want to remove or rename any X-Forwarded-For headers that you receive from the HTTP client, because the client may be trying to fool your backend about who it is. You can use another X- header name for this purpose if you want, but using X-Forwarded-For has the advantage that it's a de-facto standard, so random reverse proxy aware software is likely to have an option to look at X-Forwarded-For.

(See, for example, the security and privacy concerns section of the MDN page.)
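As a sketch of that frontend hygiene (the helper name and header representation are made up for illustration), a reverse proxy's handling looks roughly like:

```python
# Hypothetical sketch: before forwarding a request to the backend, drop
# any client-supplied X-Forwarded-For and substitute the connection's
# actual source IP, so the backend can't be lied to about the origin.
def clean_forwarded_for(headers, client_ip):
    cleaned = {k: v for k, v in headers.items()
               if k.lower() != "x-forwarded-for"}
    cleaned["X-Forwarded-For"] = client_ip
    return cleaned
```

Real reverse proxies do this (or append to the existing header) in their own configuration language; the point is only that the client's version of the header must not pass through untouched.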

Wandering Thoughts doesn't run behind a reverse proxy, and so I assumed that I wouldn't see X-Forwarded-For headers if I looked for them. More exactly, I assumed that I could take the presence of an X-Forwarded-For header as an indication of a bad request. As I found out, this doesn't seem to be the case; one source of apparently legitimate traffic to Wandering Thoughts appears to attach what are probably legitimate X-Forwarded-For headers to requests going through it. I believe this particular place operates partly as a (forward) HTTP proxy; if they aren't making up the X-Forwarded-For IP addresses, they're willing to leak the origin IPs of people using them to third parties.

All of this makes me more curious than usual to know what HTTP headers and header values show up on requests to Wandering Thoughts. But not curious enough to stick in logging, because that would be quite verbose unless I could narrow things down to only some requests. Possibly I should stick in logging that can be quickly turned on and off, so I can dump header information only briefly.
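A minimal sketch of that quickly-togglable logging for a WSGI app (the toggle file path and helper name are made up):

```python
import os

# Hypothetical sketch: dump a request's HTTP headers, but only while a
# toggle file exists, so the (verbose) logging can be flipped on and
# off quickly without restarting anything.
TOGGLE = "/tmp/dump-headers"  # made-up toggle path

def maybe_dump_headers(environ, log):
    if not os.path.exists(TOGGLE):
        return
    for key, val in sorted(environ.items()):
        if key.startswith("HTTP_"):
            # Turn WSGI's HTTP_X_FORWARDED_FOR back into X-Forwarded-For.
            name = key[5:].replace("_", "-").title()
            log("%s: %s" % (name, val))
```

A WSGI application would call this near the start of request handling with its usual logging function; removing the toggle file turns the logging off again.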

(These days I've periodically wound up in a mood to hack on DWiki, the underlying engine behind Wandering Thoughts. It reminds me that I enjoy programming.)

We haven't seen ZFS checksum failures for a couple of years

By: cks
16 November 2025 at 04:04

Over on the Fediverse I mentioned something about our regular ZFS scrubs:

Another weekend, another set of ZFS scrubs of work's multiple terabytes of data sitting on a collection of consumer 4 TB SSDs (mirrored, we aren't crazy, and also we have backups). As usual there is not a checksum error to be seen. I think it's been years since any came up.

I accept that SSDs decay (we've had some die, of course) and random read errors happen, but our ZFS-based experience across both HDDs and SSDs has been that the rate is really low for us. Probably we're not big enough.

We regularly scrub our pools through automation, currently once every few weeks. Back in 2022 I wrote about us seeing only a few errors since we moved to SSDs in 2018, and then I had the impression that everything had been quiet since then. Hand-checking our records tells me that I'm slightly wrong about this and we had some errors on our fileservers in 2023, but none since then.

  • Starting in January of 2023, one particular SSD began experiencing infrequent read and checksum errors that persisted (off and on) through early March of 2023, when we gave in and replaced it. This was a relatively new 4 TB SSD that had only been in service for a few months at the time.

  • In late March of 2023 we saw a checksum error on a disk that later in the year (in November) experienced some read errors, and then in late February of 2024 had read and write errors. We replaced the disk at that point.

I believe these two SSDs are the only ones that we've replaced since 2022, although I'm not certain and we've gone through a significant amount of SSD shuffling since then for reasons outside the scope of this entry. That shuffling means that I'm not going to try to give any number for what percentage of our fileserver SSDs have had problems.

In the first case, the checksum errors were effectively a lesser form of the read errors we saw at the same time, so it was obvious the SSD had problems. In the second case the checksum error may have been a very early warning sign of what later became an obvious slow SSD failure. Or it could be coincidence.

(It also could be that modern SSDs have so much internal error checking and correction that if there is some sort of data rot or mis-read it's most likely to be noticed inside the SSD and create a read failure at the protocol level (SAS, SATA, NVMe, etc).)

I definitely believe that disk read errors and slow disk failures happen from time to time, and if you have a large enough population of disks (SSDs or HDDs or both) you definitely need to worry about these problems. We get all sorts of benefits from ZFS checksums and ZFS scrubs, and the peace of mind about this is one of them. But it looks like we're not big enough to have run into this across our fileserver population.

(At the moment we have 114 4 TB SSDs in use across our production fileservers.)

OIDC, Identity Providers, and avoiding some obvious security exposures

By: cks
15 November 2025 at 04:40

OIDC (and OAuth2) has some frustrating elements that make it harder for programs to support arbitrary identity providers (as discussed in my entry on the problems facing MFA-enabled IMAP in early 2025). However, my view is that these elements exist for good reason, and the ultimate reason is that an OIDC-like environment is by default an obvious security exposure (or several of them). I'm not sure there's any easy way around the entire set of problems that push towards these elements or something quite like them.

Let's imagine a platonically ideal OIDC-like identity provider for clients to use, something that's probably much like the original vision of OpenID. In this version, people (with accounts) can authenticate to the identity provider from all over the Internet, and it will provide them with a signed identity token. The first problem is that we've just asked identity providers to set up an Internet-exposed account and password guessing system. Anyone can show up, try it out, and best of all if it works they don't just get current access to something, they get an identity token.

(Within a trusted network, such as an organization's intranet, this exposed authentication endpoint is less of a concern.)

The second problem is the identity token itself, because the IdP doesn't actually provide the identity token to the person; it provides the token to something that asked for it. One of the uses of that identity token is to present it to other things to demonstrate that you're acting on the person's behalf; for example, your IMAP client presents it to your IMAP server. If what the identity token is valid for is not restricted in some way, a malicious party could get you to 'sign up with your <X> ID' for their website, take the identity token it got from the IdP, and reuse it with your IMAP server.

To avoid issues, this identity token must have a limited scope (and everything that uses identity tokens needs to check that the token is for them). This implies that you can't just ask for an identity token in general; you have to ask for it for use with something specific. As a further safety measure, the identity provider doesn't want to give such a scoped token to anything except the thing that's supposed to get it. You (an attacker) should not be able to tell the identity provider 'please create a token for webserver X, and give it to me, not webserver X' (this is part of the restrictions on OIDC redirect URIs).
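The 'is this token for me' check corresponds to validating the token's audience; a minimal sketch (the helper name is mine, but the shape of the 'aud' claim, a string or a list of strings, is from the OIDC spec):

```python
# Minimal sketch of the scope check described above: a service should
# only accept an identity token whose 'aud' (audience) claim names it.
# In OIDC ID Tokens, 'aud' may be a single string or a list of strings.
def token_is_for_us(claims, our_client_id):
    aud = claims.get("aud")
    if isinstance(aud, str):
        aud = [aud]
    return bool(aud) and our_client_id in aud
```

A real implementation would also verify the token's signature, issuer, and expiry, but the audience check is what stops a token minted for one site from being replayed against another.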

In OIDC, what deals with much of these risks is client IDs, optionally client secrets, and redirect URIs. Client IDs are used to limit what an identity token can be used for and where it can be sent to (in combination with redirect URIs), and a client secret can be used by something getting a token to prove that it is the client ID it claims to be. If you don't have the right information, the OIDC IdP won't even talk to you. However, this means that all of this information has to be given to the client, or at least obtained by the client and stored by it.

(These days OIDC has a specification for Dynamic Client Registration and can support 'open' dynamic registration of clients, if desired (although it's apparently not widely implemented). But clients do have to register to get the risk-mitigating information for the main IdP endpoint, and I don't know how this is supposed to handle the IMAP situation if the IMAP server wants to verify that the OIDC token it receives was intended for it, since each dynamic client will have a different client ID.)

My script to 'activate' Python virtual environments

By: cks
14 November 2025 at 03:27

After I wrote about Python virtual environments and source code trees, I impulsively decided to set up the development tree of our Django application to use a Django venv instead of a 'pip install --user' version of Django. Once I started doing this, I quickly decided that I wanted a general script that would switch me into a venv. This sounds a little bit peculiar if you know Python virtual environments so let me explain.

Activating a Python virtual environment mostly means making sure that its 'bin' directory is first on your $PATH, so that 'python3' and 'pip' and so on come from it. Venvs come with files that can be sourced into common shells in order to do this (with the one for Bourne shells called 'activate'), but for me this has three limits. You have to use the full path to the script, they change your current shell environment instead of giving you a new one that you can just exit to discard this 'activation', and I use a non-standard shell that they don't work in. My 'venv' script is designed to work around all three of those limitations. As a script, it starts a new shell (or runs a command) instead of changing my current shell environment, and I set it up so that it knows my standard place to keep virtual environments (and then I made it so that I can use symbolic links to create 'django' as the name of 'whatever my current Django venv is').

(One of the reasons I want my 'venv' command to default to running a shell for me is that I'm putting the Python LSP server into my Django venvs, so I want to start GNU Emacs from an environment with $PATH set properly to get the right LSP server.)

My initial version only looked for venvs in my standard location for development related venvs. But almost immediately after starting to use it, I found that I wanted to be able to activate pipx venvs too, so I added ~/.local/pipx/venvs to what I really should consider to be a 'venv search path' and formalize into an environment variable with a default value.

I've stuffed a few other features into the venv script. It will print out the full path to the venv if I ask it to (in addition to running a command, which can be just 'true'), or something to set $PATH. I also found I sometimes wanted it to change directory to the root of the venv. Right now I'm still experimenting with how I want to build other scripts on top of this one, so some of this will probably change in time.
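A stripped-down sketch of the idea looks like the following; the search path, defaults, and names here are illustrative, and my real script has more to it:

```shell
#!/bin/sh
# venv: run a command (by default, a shell) with a named virtual
# environment's bin directory first on $PATH.
VENVPATH="${VENVPATH:-$HOME/lib/venvs:$HOME/.local/pipx/venvs}"

# find_venv NAME: print the directory of the first venv called NAME
# found on the (colon-separated) venv search path.
find_venv() {
    _oldifs="$IFS"; IFS=:
    for _dir in $VENVPATH; do
        if [ -x "$_dir/$1/bin/python3" ]; then
            IFS="$_oldifs"; echo "$_dir/$1"; return 0
        fi
    done
    IFS="$_oldifs"; return 1
}

# venv_run NAME [CMD ARGS...]: put the venv first on $PATH and run
# CMD, or start an interactive shell if no command was given.
venv_run() {
    _vdir=$(find_venv "$1") || { echo "venv: no venv '$1'" 1>&2; return 1; }
    shift
    PATH="$_vdir/bin:$PATH"; export PATH
    if [ "$#" -eq 0 ]; then
        exec "${SHELL:-/bin/sh}"
    else
        exec "$@"
    fi
}
```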

One of my surprises about writing the script is how much nicer it's made working with venvs (or working with things in venvs). There's nothing it does that wasn't possible before, but the script has removed friction (more friction than I realized was there, which is traditional for me).

PS: This feels like a sufficiently obvious idea that I suspect that a lot of people have written 'activate a venv somewhere along a venv search path' scripts. There's unlikely to be anything special about mine, but it works with my specific shell.

Getting feedback as a small web crawler operator

By: cks
13 November 2025 at 04:17

Suppose, hypothetically, that you're trying to set up a small web crawler for a good purpose. These days you might be focused on web search for text focused sites, or small human written sites, or similar things, and certainly, given the bad things that are happening with the major crawlers, we could use more of them. As a small crawler, you might want to get feedback and problem reports from web site operators about what your crawler is doing (or not doing). As it happens, I have some advice and views on this.

  • Above all, remember that you are not Google or even Bing. Web site operators need Google to crawl them, and they have no choice but to bend over backward for Google and to send out plaintive signals into the void if Googlebot is doing something undesirable. Since you're not Google and you need websites much more than they need you, the simplest thing for website operators to do with and about your crawler is to ignore the issue, potentially block you if you're causing problems, and move on.

    You cannot expect people to routinely reach out to you. Anyone who does reach out to you is axiomatically doing you a favour, at the expense of some amount of their limited time and at some risk to themselves.

  • Website operators have no reason to trust you or trust that problem reports will be well received. This is a lesson plenty of people have painfully learned from reporting spam (email or otherwise) and other abuse; a lot of the time your reports can wind up in the hands of people who aren't well intentioned toward you (either going directly to them or 'helpfully' being passed on by the ISP). At best you confirm that your email address is alive and get added to more spam address lists; at worst you get abused in various ways.

    The consequence of this is that if you want to get feedback, you should make it as low-risk as possible for people. The lowest risk way (to website operators) is for you to have a feedback form on your site that doesn't require email or other contact methods. If you require that website operators reveal their email addresses, social media handles, or whatever, you will get much less feedback (this includes VCS forge handles if you force them to make issue reports on some VCS forge).

    (This feedback form should be easy to find, for example being directly linked from the web crawler information URL in your User-Agent.)

  • As far as feedback goes, both your intentions and your views on the reasonableness of what your web crawler is doing (and how someone's website behaves) are irrelevant. What matters is the views of website operators, who are generally doing you a favour by not simply blocking or ignoring your crawler and moving on. If you disagree with their feedback, the best thing to do is be quiet (and maybe say something neutral if they ask for a reply). This is probably most important if your feedback happens through a public VCS forge issue tracker, where future people who are thinking about filing an issue the way you asked may skim over past issues to see how they went.

    (You may or may not ignore website operator feedback that you disagree with depending on how much you want to crawl (all of) their site.)

At the moment, most website operators who notice a previously unknown crawler will likely assume that it's an (abusive) LLM crawler. One way to lower the chances of this is to follow social conventions around crawlers for things like crawler User-Agents and not setting the Referer header. I don't think you have to completely imitate how Googlebot, bingbot, Applebot, the archive.org bot and so on format their User-Agent strings, but it's going to help to generally look like them and clearly put the same sort of information into yours. Similarly, if you can it will help to crawl from clearly identified IPs with reverse DNS. The more that people think you're legitimate and honest, the more likely they are to spend the time and take the risk to give you feedback; the more sketchy or even uncertain you look, the less likely you are to get feedback.
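For example, a User-Agent in the same general style as the established crawlers (with a placeholder bot name and URL) might look like:

```
ExampleBot/1.2 (+https://example.org/bot.html)
```

The information URL is where people should be able to find things like your feedback form and your crawling policies.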

(In general, any time you make website operators uncertain about an aspect of your web crawler, some number of them will not be charitable in their guess. The more explicit and unambiguous you are in the more places, the better.)

Building and running a web crawler is not an easy thing on today's web. It requires both technical knowledge of various details of HTTP and how you're supposed to react to things (eg), and current social knowledge of what is customary and expected of web crawlers, as well as what you may need to avoid (for example, you may not want to start your User-Agent with 'Mozilla/5.0' any more, and in general the whole anti-crawling area is rapidly changing and evolving right now). Many website operators revisit blocks and other reactions to 'bad' web crawlers only infrequently, so you may only get one chance to get things right. This expertise can't be outsourced to a random web crawling library because many of them don't have it either.

(While this entry was sparked by a conversation I had on the Fediverse, I want to be explicit that it is in no way intended as a subtoot of that conversation. I just realized that I had some general views that didn't fit within the margins of Fediverse posts.)

Firefox's sudden weird font choice and fixing it

By: cks
12 November 2025 at 04:03

Today, while I was in the middle of using my normal browser instance, it decided to switch from DejaVu Sans to Noto Sans as my default font:

Dear Firefox: why are you using Noto Sans all of a sudden? I have you set to DejaVu Sans (and DejaVu everything), and fc-match 'sans' and fc-match serif both say they're DejaVu (and give the DejaVu TTF files). This is my angry face.

This is a quite noticeable change for me because it changes the font I see on Wandering Thoughts, my start page, and other things that don't set any sort of explicit font. I don't like how Noto Sans looks and I want DejaVu Sans.

(I found out that it was specifically Noto Sans that Firefox was using all of a sudden through the Web Developer tools 'Font' information, and confirmed that Firefox should still be using DejaVu through the way to see this in Settings.)

After some flailing around, it appears that what I needed to do to fix this was explicitly set about:config's font.name.serif.x-western, font.name.sans-serif.x-western, and font.name.monospace.x-western to specific values instead of leaving them set to nothing; leaving them unset seems to have caused Firefox to arrive at Noto Sans through some mysterious process (since the generic system font name 'sans' was still mapping to DejaVu Sans). I don't know if these are exposed through the Fonts advanced options in Settings → General, which are (still) confusing in general. It's possible that these are what are used for 'Latin'.

(I used to be using the default 'sans', 'serif', and 'monospace' font names that cascaded through to the DejaVu family. Now I've specifically set everything to the DejaVu set, because if something in Fedora or Firefox decides that the default mapping should be different, I don't want Firefox to follow it, I want it to stay with DejaVu.)
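In 'user.js' terms (I set mine through about:config, so this is just a sketch of the same preferences), this looks like:

```javascript
// Pin the x-western font preferences to the DejaVu family instead of
// leaving them unset.
user_pref("font.name.serif.x-western", "DejaVu Serif");
user_pref("font.name.sans-serif.x-western", "DejaVu Sans");
user_pref("font.name.monospace.x-western", "DejaVu Sans Mono");
```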

I don't know why Firefox would suddenly decide these pages are 'western' instead of 'unicode'; all of them are served as or labeled as UTF-8, and nothing about that has changed recently. Unfortunately, as far as I know there's no way to get Firefox to tell you what font.name preference name it used to pick (default) fonts for a HTML document. When it sends HTTP 304 Not Modified responses, Wandering Thoughts doesn't include a Content-Type header (with the UTF-8 character set), but as far as I know that's a standard behavior and browsers presumably cope with it.

(Firefox does see 'Noto Sans' as a system UI font, which it uses on things like HTML form buttons, so it didn't come from nowhere.)

It makes me sad that Firefox continues to have no global default font choice. You can set 'Unicode' but as I've just seen, this doesn't make what you set there the default for unset font preferences, and the only way to find out what unset font preferences you have is to inspect about:config.

PS: For people who aren't aware of this, it's possible for Firefox to forget some of your about:config preferences. Working around this probably requires using Firefox policies (via), which can force-set arbitrary about:config preferences (among other things).

Discovering orphaned binaries in /usr/sbin on Fedora 42

By: cks
11 November 2025 at 04:10

Over on the Fediverse, I shared a somewhat unwelcome discovery I made after upgrading to Fedora 42:

This is my face when I have quite a few binaries in /usr/sbin on my office Fedora desktop that aren't owned by any package. Presumably they were once owned by packages, but the packages got removed without the files being removed with them, which isn't supposed to happen.

(My office Fedora install has been around for almost 20 years now without being reinstalled, so things have had time to happen. But some of these binaries date from 2021.)
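Finding such files is straightforward, because 'rpm -qf' exits with a non-zero status for files that no package owns; a sketch:

```shell
# unowned DIR: print files under DIR that no RPM package owns.
# 'rpm -qf FILE' exits non-zero for files not owned by any package.
unowned() {
    for _f in "$1"/*; do
        [ -e "$_f" ] || continue
        rpm -qf "$_f" >/dev/null 2>&1 || echo "$_f"
    done
}

# Usage: unowned /usr/sbin
```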

There seem to be two sorts of these lingering, unowned /usr/sbin programs. One sort, such as /usr/sbin/getcaps, seems to have been left behind when its package moved things to /usr/bin, possibly due to this RPM bug (via). The other sort is genuinely unowned programs dating to anywhere from 2007 (at the oldest) to 2021 (at the newest), which have nothing else left of them sitting around. The newest programs are what I believe are wireless management programs: iwconfig, iwevent, iwgetid, iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was also part of a 'wireless-tools' package). I had the wireless-tools package installed on my office desktop until recently, but I removed it some time during Fedora 40, probably sparked by the /sbin to /usr/sbin migration, and it's possible that binaries didn't get cleaned up properly due to that migration.

The most interesting orphan is /usr/sbin/sln, dating from 2018, when apparently various people discovered it as an orphan on their system. Unlike all the other orphan programs, the sln manual page is still shipped as part of the standard 'man-pages' package and so you can read sln(8) online. Based on the manual page, it sounds like it may have been part of glibc at one point.

(Another orphaned program from 2018 is pam_tally, although it's coupled to pam_tally2.so, which did get removed.)

I don't know if there's any good way to get mappings from files to RPM packages for old Fedora versions. If there is, I'd certainly pick through it to try to find where various of these files came from originally. Unfortunately I suspect that for sufficiently old Fedora versions, much of this information is either offline or can't be processed by modern versions of things like dnf.

(The basic information is used by eg 'dnf provides' and can be built by hand from the raw RPMs, but I have no desire to download all of the RPMs for decade-old Fedora versions even if they're still available somewhere. I'm curious but not that curious.)

PS: At the moment I'm inclined to leave everything as it is until at least Fedora 43, since RPM bugs are still being sorted out here. I'll have to clean up genuinely orphaned files at some point but I don't think there's any rush. And I'm not removing any more old packages that use '/sbin/<whatever>', since that seems like it has some bugs.

Python virtual environments and source code trees

By: cks
10 November 2025 at 04:22

Python virtual environments are mostly great for actually deploying software. Provided that you're using the same version of Python (3) everywhere (including CPU architecture), you can make a single directory tree (a venv) and then copy and move it around freely as a self-contained artifact. It's also relatively easy to use venvs to switch the version of packages or programs you're using, for example Django. However, venvs have their frictions, at least for me, and often I prefer to do Python development outside of them, especially for our Django web application.

(This means using 'pip install --user' to install things like Django, to the extent that it's still possible.)

One point of friction is in their interaction with working on the source code of our Django web application. As is probably common, this source code lives in its own version control system controlled directory tree (we use Mercurial for this for reasons). If Django is installed as a user package, the native 'python3' will properly see it and be able to import Django modules, so I can directly or indirectly run Django commands with the standard Python and my standard $PATH.

If Django is installed in a venv, I have two options. The manual way is to always make sure that this Django venv is first on my $PATH before the system Python, so that 'python3' is always from the venv and not from the system. This has a little bit of a challenge with Python scripts, and is one of the few places where '#!/usr/bin/env python3' makes sense. In my particular environment it requires extra work because I don't use a standard Unix shell and so I can't use any of the venv bin/activate things to do all the work for me.

The automatic way is to make all of the convenience scripts that I use to interact with Django explicitly specify the venv python3 (including for things like running a test HTTP server and invoking local management commands), which works fine since a program can be outside the venv it uses. This leaves me with the question of where the Django venv should be, and especially if it should be outside the source tree or in a non-VCS-controlled path inside the tree. Outside the source tree is the pure option but leaves me with a naming problem that has various solutions. Inside the source tree (but not VCS controlled) is appealingly simple but puts a big blob of otherwise unrelated data into the source tree.

(Of course I could do both at once by having a 'venv' symlink in the source tree, ignored by Mercurial, that points to wherever the Django venv is today.)
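A sketch of what one of those convenience scripts amounts to, written as a shell function with illustrative paths:

```shell
# run_manage: invoke Django's manage.py with the venv's own python3,
# explicitly, so it works regardless of what's on $PATH. The
# $DJANGO_VENV and $MANAGE_PY locations here are illustrative.
run_manage() {
    "${DJANGO_VENV:-$HOME/venvs/django}/bin/python3" \
        "${MANAGE_PY:-$HOME/src/app/manage.py}" "$@"
}

# eg: run_manage runserver 127.0.0.1:8000
```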

Since 'pip install --user' seems more and more deprecated as time goes by, I should probably move to developing with a Django venv sooner or later. I will probably use a venv outside the source tree, and I haven't decided about an in-tree symlink.

(I'll still have the LSP server problem but I have that today. Probably I'll install the LSP server into the Django venv.)

PS: Since this isn't a new problem, the Python community has probably come up with some best practices for dealing with it. But in today's Internet search environment I have no idea how to find reliable sources.

A HTTP User-Agent that claims to be Googlebot is now a bad idea

By: cks
9 November 2025 at 04:04

Once upon a time, people seem to have had a little thing for mentioning Googlebot in their HTTP User-Agent header, much like browsers threw in claims to make them look like Firefox or whatever (the ultimate source of the now-ritual 'Mozilla/5.0' at the start of almost every browser's User-Agent). People might put in 'allow like Googlebot' or just say 'Googlebot' in their User-Agent. Some people are still doing this today, for example:

Gwene/1.0 (The gwene.org rss-to-news gateway) Googlebot

This is now an increasingly bad idea on the web and if you're doing it, you should stop. The problem is that there are various malicious crawlers out there claiming to be Googlebot, and Google publishes their crawler IP address ranges. Anything claiming to be Googlebot that is not from a listed Google IP is extremely suspicious and in this day and age of increasing anti-crawler defenses, blocking all 'Googlebot' activity that isn't from one of their listed IP ranges is an obvious thing to do. Web sites may go even further and immediately taint the IP address or IP address range involved in impersonating Googlebot, blocking or degrading further requests regardless of the User-Agent.

(Gwene is not exactly claiming to be Googlebot but they're trying to get simple Googlebot-recognizers to match them against Googlebot allowances. This is questionable at best. These days such attempts may do more harm than good as they get swept up in precautions against Googlebot forgery, or rules that block Googlebot from things it shouldn't be fetching, like syndication feeds.)

A similar thing applies to bingbot and the User-Agent of any other prominent web search engines, and Bing does publish their IP address ranges. However, I don't think I've ever seen someone impersonate bingbot (which probably doesn't surprise anyone). I don't know if anyone ever impersonates Archive.org (no one has in the past week here), but it's possible that crawler operators will fish to see if people give special allowances to them that can be exploited.

(The corollary of this is that if you have a website, an extremely good signal of bad stuff is someone impersonating Googlebot and maybe you could easily block that. I think this would be fairly easy to do in an Apache <If> clause that then Allow's from Googlebot's listed IP addresses and Denies everything else, but I haven't actually tested it.)
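In concrete (and, as I said, untested) terms, that Apache 2.4 sketch might look like the following; only one of Google's published ranges is shown here, and the real list from Google's JSON file is longer:

```apache
# Anything claiming to be Googlebot must come from one of Google's
# published crawler IP ranges; deny everything else. Untested sketch
# with an abbreviated range list.
<If "%{HTTP_USER_AGENT} =~ /Googlebot/">
    Require ip 66.249.64.0/19
</If>
```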

Containers and giving up on expecting good software installation practices

By: cks
8 November 2025 at 03:58

Over on the Fediverse, I mentioned a grump I have about containers:

As a sysadmin, containers irritate me because they amount to abandoning the idea of well done, well organized, well understood, etc installation of software. Can't make your software install in a sensible way that people can control and limit? Throw it into a container, who cares what it sprays where across the filesystem and how much it wants to be the exclusive owner and controller of everything in sight.

(This is a somewhat irrational grump.)

To be specific, it's by and large abandoning the idea of well done installs of software on shared servers. If you're only installing software inside a container, your software can spray itself all over the (container) filesystem, put itself in hard-coded paths wherever it feels like, and so on, even if you have completely automated instructions for how to get it to do that inside a container image that's being built. Some software doesn't do this and is well mannered when installed outside a container, but some software does and you'll find notes to the effect that the only supported way of installing it is 'here is this container image', or 'here is the automated instructions for building a container image'.

To be fair to containers, some of this is due to missing Unix APIs (or APIs that theoretically exist but aren't standardized). Do you want multiple Unix logins for your software so that it can isolate different pieces of itself? There's no automated way to do that. Do you run on specific ports? There's generally no machine-readable way to advertise that, and people may want you to build in mechanisms to vary those ports and then specify the new ports to other pieces of your software (that would all be bundled into a container image). And so on. A container allows you to put yourself in an isolated space of Unix UIDs, network ports, and so on, one where you won't conflict with anyone else and won't have to try to get the people who want to use your software to create and manage the various details (because you've supplied either a pre-built image or reliable image building instructions).

But I don't have to be happy that software doesn't necessarily even try, that we seem to be increasingly abandoning much of the idea of running services in shared environments. Shared environments are convenient. A shared Unix environment gives you a lot of power and avoids a lot of complexity that containers create. Fortunately there's still plenty of software that is willing to be installed on shared systems.

(Then there is the related grump that the modern Linux software distribution model seems to be moving toward container-like things, which has a whole collection of issues associated with it.)

Go's runtime may someday start explicitly freeing some internal memory

By: cks
7 November 2025 at 03:30

One of my peculiar hobbies is that I read every commit message for the Go (development) repository. Often this is boring, but sometimes I discover things I find amusing:

This is my amused face when Go is adding explicit, non-GC freeing of memory from within the runtime and compiler-generated code under some circumstances. It's perfectly sensible, but still.

It turns out that right now, the only thing that's been added is a 'GOEXPERIMENT=runtimefree' Go experiment, which you can set without build errors. There's no actual use of it in the current development tree.

The proposal that led to this doesn't seem to currently be visible in a mainline commit in the Go proposal repository, but until it surfaces you can access Directly freeing user memory to reduce GC work from the (proposed?) change (update: see below for the final version), and also Go issue 74299: runtime, cmd/compile: add runtime.free, runtime.freetracked and GOEXPERIMENT=runtimefree and the commit itself, which only adds the Go experiment flag. A preview of performance results (from a link in issue 74299) is in the message of slices: free intermediate memory in Collect via runtime.freeSlice.

(Looking into this has caused me to find the Go Release Dashboard, and see eg the pending proposals section, where you can find multiple things for this proposal.)

Update: The accepted proposal is now merged in the Go proposals repository, Directly freeing user memory to reduce GC work.

I feel the overall idea is perfectly sensible, for all that it feels a bit peculiar in a language with a mark and sweep garbage collector. As the proposal points out, there are situations where the runtime knows that something doesn't escape but it has to allocate it on the heap instead of the stack, and also situations where the runtime knows that some value is dead but the compiler can't prove it. In both situations we can reduce pressure on memory allocation and to some extent garbage collection by explicitly marking the objects as free right away. A runtime example cited in the proposal is when maps grow and split, which is safe since map values are unaddressable so no one can have (validly formed) pointers to them.

(Because unused objects aren't traversed by the garbage collector, this doesn't directly reduce the amount of work GC has to do but it does mean GC might not have to run as much.)

Sadly, so far only the GOEXPERIMENT setting has landed in the Go development tree so there's nothing to actually play with (and no code to easily read). We have to look from afar and anticipate, and at this point it's possible no actual code will land until after Go 1.26, since based on the usual schedule there will be a release freeze soon, leaving not very much time to land all of these changes.

(The whole situation turns out to be less exciting than I thought when I read the commit message and made my Fediverse post, but that's one reason to write these entries.)

PS: In general, garbage collected languages can also have immediate freeing of memory, for example if they use reference counting. CPython is an example and CPython people can be quite used to deterministic, immediate collection of unreferenced objects along with side effects such as closing file descriptors. Sometimes this can mask bugs.

A problem for downloading things with curl

By: cks
6 November 2025 at 04:24

For various reasons, I'm working to switch from wget to curl, and generally this has been going okay. However, I've now run into one situation where I don't know how to make curl do what I want. It is, of course, a project that doesn't bother to do easily-fetched downloads, but in a very specific way. In fact it's Django (again).

The Django URLs for downloads look like this:

https://www.djangoproject.com/download/5.2.8/tarball/

The way the websites of many projects turn these into actual files is to provide a filename in the HTTP Content-Disposition header in the reply. In curl, these websites can be handled with the -J (--remote-header-name) option, which uses the filename from the Content-Disposition if there is one.

Unfortunately, Django's current website does not operate this way. Instead, the URL above is a HTTP redirection to the actual .tar.gz file (on media.djangoproject.com). The .tar.gz file is then served without a Content-Disposition header as an application/octet-stream. Wget will handle this with --trust-server-names, but as far as I can tell from searching through the curl manpage, there is no option that will do this in curl.

(In optimistic hope I even tried --location-trusted, but no luck.)

If curl is directed straight to the final URL, 'curl -O' alone is enough to get the right file name. However, if curl goes through a redirection, there seems to be no option that will cause it to re-evaluate the 'remote name' based on the new URL; the initial URL and the name derived from it sticks, and you get a file unhelpfully called 'tarball' (in this case). If you try to be clever by running the initial curl without -O but capturing any potential redirection with "-w '%{redirect_url}\n'" so you can manually follow it in a second curl command, this works (for one level of redirections) but leaves you with a zero-length file called 'tarball' from the first curl.

It's possible that this means curl is the wrong tool for the kind of file downloads I want to do from websites like this, and I should get something else entirely. However, that something else should at least be a completely self contained binary so that I can easily drag it around to all of the assorted systems where I need to do this.

(I could always try to write my own in Go, or even take this as an opportunity to learn Rust, but that way lies madness and a lot of exciting discoveries about HTTP downloads in the wild. The more likely answer is that I hold my nose and keep using wget for this specific case.)

PS: I think it's possible to write a complex script using curl that more or less works here, but one of the costs is that you have to make first a HEAD and then a GET request to the final target, and that irritates me.
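Such a script can use curl's '%{url_effective}' write-out variable to discover the final URL with a redirect-following HEAD request, and then fetch that URL directly with -O; a sketch:

```shell
# resolve_url URL: follow HTTP redirections with a HEAD request and
# print the final URL, without downloading anything.
resolve_url() {
    curl -sIL -o /dev/null -w '%{url_effective}' "$1"
}

# fetch URL: resolve the final URL first, then download it with -O so
# curl derives the local file name from the final URL instead of the
# original one. This costs an extra HEAD request to the final target.
fetch() {
    _final=$(resolve_url "$1") || return 1
    curl -sO "$_final"
}
```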

Some notes on duplicating xterm windows

By: cks
5 November 2025 at 03:45

Recently on the Fediverse, Dave Fischer mentioned a neat hack:

In the decades-long process of getting my fvwm config JUST RIGHT, my xterm right-click menu now has a "duplicate" command, which opens a new xterm with the same geometry, on the same node, IN THE SAME DIRECTORY. (Directory info aquired via /proc.)

[...]

(See also a followup note.)

This led to @grawity sharing an xterm-native approach to this, using xterm's spawn-new-terminal() internal function that's available through xterm's keybindings facility.
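For reference, wiring spawn-new-terminal() up is a matter of an xterm translations resource; a sketch, with a purely illustrative key choice:

```
! Bind Ctrl+Shift+N to xterm's spawn-new-terminal() action.
XTerm*vt100.translations: #override \
    Ctrl Shift <Key>N: spawn-new-terminal()
```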

I have a long-standing shell function in my shell that attempts to do this (imaginatively called 'spawn'), but this is only available in environments where my shell is set up, so I was quite interested in the whole area and did some experiments. The good news is that xterm's 'spawn-new-terminal' works, in that it will start a new xterm and the new xterm will be in the right directory. The bad news for me is that that's about all that it will do, and in my environment this has two limitations that will probably make it not something I use a lot.

The first limitation is that this starts an xterm that doesn't copy the command line state or settings of the parent xterm. If you've set special options on the parent xterm (for example, you like your root xterms to have a red foreground), this won't be carried over to the new xterm. Similarly, if you've increased (or decreased) the font size in your current xterm or otherwise changed its settings, spawn-new-terminal doesn't duplicate these; you get a default xterm. This is reasonable but disappointing.

(While spawn-new-terminal takes arguments that I believe it will pass to the new xterm, as far as I know there's no way to retrieve the current xterm's command line arguments to insert them here.)

The larger limitation for me is that when I'm at home, I'm often running SSH inside of an xterm in order to log in to some other system (I have a 'sshterm' script to automate all the aspects of this). What I really want when I 'duplicate' such an xterm is not a copy of the local xterm running a local shell (or even starting another SSH to the remote system), but the remote (shell) context, with the same (remote) current directory and so on. This is impossible to get in general and difficult to set up even for situations where it's theoretically possible. To use spawn-new-terminal effectively, you basically need either all local xterms or copious use of remote X forwarded over SSH (where the xterm is running on the remote system, so a duplicate of it will be as well and can get the right current directory).

Going through this experience has given me some ideas on how to improve the situation overall. Probably I should write a 'spawn' shell script to replace or augment my 'spawn' shell function so I can readily have it in more places. Then when I'm ssh'd in to a system, I can make the 'spawn' script at least print out a command line or two for me to copy and paste to get set up again.

(Two command lines is the easiest approach, with one command that starts the right xterm plus SSH combination and the other a 'cd' to the right place that I'd execute in the new logged in window. It's probably possible to combine these into an all-in-one script but that starts to get too clever in various ways, especially as SSH has no straightforward way to pass extra information to a login shell.)
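A minimal sketch of that two-command approach, assuming the 'sshterm' wrapper from above (everything else here is illustrative):

```shell
# Sketch only: inside an SSH login, print the two command lines to copy
# and paste; locally, just start a new xterm in the current directory.
# 'sshterm' is the wrapper script mentioned above; the rest is assumed.
spawn() {
    if [ -n "$SSH_CONNECTION" ]; then
        printf 'sshterm %s\n' "$(hostname)"
        printf "cd '%s'\n" "$PWD"
    else
        xterm &
    fi
}
```

The printed 'cd' quotes the directory, which still breaks on paths containing single quotes; a real script would need to be more careful about that.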

My GPS bike computer is less distracting than the non-computer option

By: cks
4 November 2025 at 02:44

I have a GPS bike computer primarily for following pre-planned routes, because it became a better supported option than our old paper cue sheets. After I made the switch, I found that the GPS unit was also less distracting than paper cue sheets. On the surface this might sound paradoxical, since people often say that computer screens are more distracting. It's true that a GPS bike computer has a lot that you can look at, but for route following, it also has features that let me not pay attention to it.

When I used paper cue sheets, I always had to pay a certain amount of attention to following the route. I needed to keep track of where we were on the cue sheet's route, and either remember what the next turn was or look at the cue sheet frequently enough that I could be sure I wouldn't miss it. I also needed to devote a certain amount of effort to scanning street signs to recognize the street we'd be turning on to. All of this distracted me from looking around and enjoying the ride; I could never check out completely from route following.

When I follow a route on my GPS bike computer, it's much easier to not pay attention to route following most of the time. My GPS bike computer will beep at me and display a turn alert when we get close to a turn, and I always have it display the distance to the next turn so I can take a quick glance to reassure myself that we're nowhere near the turn. If there's any ambiguity about where to turn, I can look at the route's trace on a map and see that the turn is, for example, two streets ahead, and of course the GPS bike computer is always keeping track of where in the route I am.

Because the GPS bike computer can tell me when I need to pay attention to following the route, I'm free to not pay attention at other times. I can stop thinking about the route at all and look around at the scenery, talk with my fellow club riders, and so on.

(When I look around there are similar situations at work, with some of our systems. Our metrics, monitoring, and alerting system often has the net effect that I don't even look at how things are going because I assume that silence means all is okay. And if I want to do the equivalent of glancing at my GPS bike computer to check the distance to the next turn, I can look at our dashboards.)

How I handle URLs in my unusual X desktop

By: cks
3 November 2025 at 04:34

I have an unusual X desktop environment that has evolved over a long period, and as part of that I have an equally unusual and slowly evolved set of ways to handle URLs. By 'handle URLs', what I mean is going from an URL somewhere (email, text in a terminal, etc) to having the URL open in one of my several browser environments. Tied into this is handling non-URL things that I also want to open in a browser, for example searching for various sorts of things in various web places.

The simplest place to start is at the end. I have several browser environments, and to go along with them I have a script for each that opens URLs provided as command line arguments in a new window of that browser. If there are no command line arguments, the scripts open a default page (usually a blank page, but for my main browser it's a special start page of links). For most browsers this works by running 'firefox <whatever>' and so will start the browser if it's not already running, but for my main browser I use a lightweight program that uses Firefox's X-based remote control protocol, which means I have to start the browser outside of it.

Layered on top of these browser specific scripts is a general script to open URLs that I call 'openurl'. The purpose of openurl is to pick a browser environment based on the particular site I'm going to. For example, if I'm opening the URL of a site where I know I need JavaScript, the script opens the URL in my special 'just make it work' JavaScript enabled Firefox. Most URLs open in my normal, locked down Firefox. I configure programs like Thunderbird to open URLs through this openurl script, sometimes directly and sometimes indirectly.

(I haven't tried to hook openurl into the complex mechanisms that xdg-open uses to decide how to open URLs. Probably I should but the whole xdg-open thing irritates me.)
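The dispatching itself can be quite small; a sketch, where the wrapper script names ('ff-js' and 'ff-main') and the site patterns are invented for illustration:

```shell
# Sketch of an openurl-style dispatcher. pick_browser echoes which
# per-browser wrapper script would handle a given URL; the real script
# would then exec that wrapper with the URL.
pick_browser() {
    case "$1" in
        *//*.youtube.com/*|*//mastodon.*/*) echo ff-js ;;  # sites needing JavaScript
        *) echo ff-main ;;                                 # locked-down default
    esac
}
# e.g.: "$(pick_browser "$url")" "$url"
```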

Layered on top of openurl and the specific browser scripts is a collection of scripts that read the X selection and do a collection of URL-related things with it. One script reads the X selection, looks for it being a URL, and either feeds the URL to openurl or just runs openurl to open my start page. Other scripts feed the URL to alternate browser environments or do an Internet search for the selection. Then I have a fvwm menu with all of these scripts in it and one of my fvwm mouse button bindings brings up this menu. This lets me select a URL in a terminal window, bring up the menu, and open it in either the default browser choice or a specific browser choice.

(I also have a menu entry for 'open the selection in my main browser' in one of my main fvwm menus, the one attached to the middle mouse button, which makes it basically reflexive to open a new browser window or open some URL in my normal browser.)
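One of those selection scripts might look roughly like this (assuming xsel for reading the X selection; here the helper takes the selection as an argument and prints the command it would run, to keep the sketch self-contained):

```shell
# Sketch: decide what to do with the current X selection. A real
# script would use sel=$(xsel -o) and exec the command instead of
# printing it.
handle_selection() {
    sel=$1
    case "$sel" in
        http://*|https://*) echo "openurl $sel" ;;
        *) echo "openurl" ;;  # not a URL: just open the start page
    esac
}
```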

The other way I handle URLs is through dmenu. One of the things my dmenu environment does is recognize URLs and open them in my default browser environment. I also have short dmenu commands to open URLs in my other browser environments, or open URLs based on the parameters I pass the command (such as a 'pd' script that opens Python documentation for a standard library module). Dmenu itself can paste in the current X selection with a keystroke, which makes it convenient to move URLs around. Dmenu is also how I typically open a URL if I'm typing it in instead of copying it from the X selection, rather than opening a new browser window, focusing the URL bar, and entering the URL there.

(I have dmenu set up to also recognize 'about:*' as URLs and have various Firefox about: things pre-configured as hidden completions in dmenu, along with some commonly used website URLs.)

As mentioned, dmenu specifically opens plain URLs in my default browser environment rather than going through openurl. I may change this someday but in practice there aren't enough special sites that it's an issue. Also, I've made dedicated little dmenu-specific scripts that open up the various sites I care about in the appropriate browser, so I can type 'mastodon' in dmenu to open up my Fediverse account in the JavaScript-enabled Firefox instance.
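The 'pd' script can be tiny; a sketch that just builds the documentation URL (the real script would hand the URL to a browser wrapper rather than print it):

```shell
# Sketch of the 'pd' helper: map a standard library module name to its
# Python documentation URL.
pd_url() {
    echo "https://docs.python.org/3/library/$1.html"
}
# e.g. pd_url os.path -> https://docs.python.org/3/library/os.path.html
```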

Trying to understand Firefox's approaches to tracking cookie isolation

By: cks
2 November 2025 at 02:50

As I learned recently, modern versions of Firefox have two different techniques that try to defeat (unknown) tracking cookies. As covered in the browser addon JavaScript API documentation, in Tracking protection, these are called first-party isolation and dynamic partitioning (or storage partitioning; the documentation seems to use both). Of these two, first-party isolation is the easier to describe and understand. To quote the documentation:

When first-party isolation is on, cookies are qualified by the domain of the original page the user visited (essentially, the domain shown to the user in the URL bar, also known as the "first-party domain").

(In practice, this appears to be the top level domain of the site, not necessarily the site's domain itself. For example, Cookie Manager reports that a cookie set from '<...>.cs.toronto.edu' has the first party domain 'toronto.edu'.)

Storage partitioning is harder to understand, and again I'll quote the Storage partitioning section of the cookie API documentation:

When using dynamic partitioning, Firefox partitions the storage accessible to JavaScript APIs by top-level site while providing appropriate access to unpartitioned storage to enable common use cases. [...]

Generally, top-level documents are in unpartitioned storage, while third-party iframes are in partitioned storage. If a partition key cannot be determined, the default (unpartitioned storage) is used. [...]

If you read non-technical writeups like Firefox rolling out Total Cookie Protection (from 2022), it certainly sounds like they're describing first-party isolation. However, if you check things like Status of partitioning in Firefox and the cookies API documentation on first-party isolation, as far as I can tell what Firefox actually normally uses for "Total Cookie Protection" is storage partitioning.

Based on what I can decode from the two descriptions and from the fact that Tor Browser defaults to first-party isolation, it appears that first-party isolation is better and stricter than storage partitioning. Presumably it also causes problems on more websites, enough so that Firefox either no longer uses it for Total Cookie Protection or never did, despite their description sounding like first-party isolation.

(So far I haven't run into any issues with first-party isolation in my cookie-heavy browser environment. It's possible that websites have switched how they do things to avoid problems.)

First-party isolation can be enabled in about:config by setting privacy.firstparty.isolate to true. If and when you do this, the normal Settings → Privacy and Security will show a warning banner at the top to the effect of:

You are using First Party Isolation (FPI), which overrides some of Firefox’s cookie settings.

All of this is relevant to me because one of my add-ons, Cookie AutoDelete, probably works with first-party isolation but almost certainly doesn't work with storage partitioning (ie, it will fail to delete some cookies under storage partitioning, although I believe it can still delete unpartitioned cookies). Given what I've learned, I'm likely to turn on first-party isolation in my main browser environment soon.

If Cookie Manager is reporting correct information to me, it's possible to have cookies that are both first-party isolated and partitioned; the one I've seen so far is from Youtube. Cookie Manager can't seem to remove these cookies. Based on what I've read about (storage or dynamic) partitioned cookies, I suspect that these are created by embedded iframes.

(Turning on or off first-party isolation effectively drops all of the cookies you currently have, so it's probably best to do it when you restart your browser.)

My mistake with swallowing EnvironmentError errors in our Django application

By: cks
1 November 2025 at 02:50

We have a little Django application to handle request for Unix accounts. Once upon a time it was genuinely little, but it's slowly accreted features over the years. One of the features it grew over the years was a command line program (a Django management command) to bulk-load account request information from files. We use this to handle things like each year's new group of incoming graduate students; rather than force the new graduate students to find the web form on their own, we get information on all of them from the graduate program people and load them into the system in bulk.

One of the things that regularly happens with new graduate students is that they were already involved on the research side of the department. For example, as an undergraduate you might work on a research project with a professor, and then you get admitted as a graduate student (maybe with that professor, or maybe with someone else). When this happens, the new graduate student already has an account and we don't want to give them another one (for various reasons). To detect situations where someone already has an existing account, the bulk loader reads some historical data out of a couple of files and looks through it to match any existing accounts to the new graduate students.

When I originally wrote the code to load data from files, for some reason I decided that it wasn't particularly bad if the files didn't exist or couldn't be read, so I wrote code that looked more or less like this:

try:
  fp = open(fname, "r")
  [process file]
  fp.close()
except EnvironmentError:
  pass

Of course, for testing purposes (and other reasons, for example to suppress this check) we should be able to change where the data files were read from, so I made the file names of the data files be argparse options, set the default values to the standard locations that the production application recorded things, and called it all good.
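The option setup was roughly this shape (the option name and default path here are invented stand-ins for the real ones):

```python
# Sketch of the bulk loader's file-name options; the option name and
# default path are illustrative, not the real ones.
import argparse

def make_parser():
    p = argparse.ArgumentParser()
    p.add_argument("--history-file",
                   default="/var/app/accounts/history-data",
                   help="historical account data (default: %(default)s)")
    return p
```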

Except that for the past two years, one of the default file names was wrong; when I added this specific file, I made a typo in the file name. Using the command line option to change the file name worked so this passed my initial testing when I added the specific type of historical data, but in production, using my typo'd default file name, we silently never detected existing Unix logins for new graduate students (and others) through this particular type of historical data.

All of this happened because I made a deliberate design decision to silently swallow all EnvironmentError exceptions when trying to open and read these files, instead of either failing or at least reporting a warning. When I made the decision (back in 2013, it turns out), I was probably thinking that the only source of errors was if you ran it as the wrong user or deliberately supplied nonexistent files; I doubt it ever occurred to me that I could make an embarrassing typo in the name of any of the production files. One of the lessons I draw from this is that I don't always even understand the possible sources of errors, which makes it all the more dangerous to casually ignore them.

(Even silently ignoring nonexistent files is rather questionable in retrospect. I don't really know what I was thinking in 2013.)
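A sketch of a safer shape, reporting the failure instead of swallowing it (the names here are illustrative, not the application's actual code):

```python
# Report problems reading a data file instead of silently ignoring
# them; a missing production file now at least produces a warning.
import sys

def load_history(fname):
    try:
        with open(fname, "r") as fp:
            return fp.read().splitlines()
    except EnvironmentError as e:
        print("warning: cannot read %s: %s" % (fname, e), file=sys.stderr)
        return []
```

Whether to warn or to fail outright depends on how critical the data is; for the existing-account check described here, even a warning would have caught the typo.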

Removing Fedora's selinux-policy-targeted package is mostly harmless so far

By: cks
1 November 2025 at 01:32

A while back I discussed why I might want to remove the selinux-policy-targeted RPM package for a Fedora 42 upgrade. Today, I upgraded my office workstation from Fedora 41 to Fedora 42, and as part of preparing for that upgrade I removed the selinux-policy-targeted package (and all of the packages that depended on it). The result appears to work, although there were a few things that came up during the upgrade, and I may reinstall at least selinux-policy-targeted itself to get rid of them (for now).

The root issue appears to be that when I removed the selinux-policy-targeted package, I probably should have edited /etc/selinux/config to set SELINUXTYPE to some bogus value, not left it set to "targeted". For entirely sensible reasons, various packages have postinstall scripts that assume that if your SELinux configuration says your SELinux type is 'targeted', they can do things that implicitly or explicitly require things from the package or from the selinux-policy package, which got removed when I removed selinux-policy-targeted.

I'm not sure if my change to SELINUXTYPE will completely fix things, because I suspect that there are other assumptions about SELinux policy programs and data files being present lurking in standard, still-installed package tools and so on. Some of these standard SELinux related packages definitely can't be removed without gutting Fedora of things that are important to me, so I'll either have to live with periodic failures of postinstall scripts or put selinux-policy-targeted and some other bits back. On the whole, reinstalling selinux-policy-targeted is probably the safest course; the issue that caused me to remove it only applies during Fedora version upgrades, and it may be fixed in Fedora 42 anyway.

What this illustrates to me is that regardless of package dependencies, SELinux is not really optional on Fedora. The Fedora environment assumes that a functioning SELinux environment is there and if it isn't, things are likely to go wrong. I can't blame Fedora for this, or for not fully capturing this in package dependencies (and Fedora did protect the selinux-policy-targeted package from being removed; I overrode that by hand, so what happens afterward is on me).

(Although I haven't checked modern versions of Fedora, I suspect that there's no official way to install Fedora without getting a SELinux policy package installed, and possibly selinux-policy-targeted specifically.)

PS: I still plan to temporarily remove selinux-policy-targeted when I upgrade my home desktop to Fedora 42. A few package postinstall glitches is better than not being able to read DNF output due to the package's spam.

Firefox, the Cookie AutoDelete add-on, and "Total Cookie Protection"

By: cks
31 October 2025 at 03:15

In a comment on my entry on flailing around with Firefox's Multi-Account Containers, Ian Z aka nobrowser asked a good question:

The Cookie Autodelete instructions with respect to Total Cookie Protection mode are very confusing. Reading them makes me think this extension is not for me, as I have Strict Mode on in all windows, private or not. [...]

This is an interesting question (and, it turns out, relevant to my usage too) so I did some digging. The short answer is that I suspect the warning on Cookie AutoDelete's add-on page is out of date and it works fine. The long answer starts with the history of HTTP cookies.

Back in the old days, HTTP cookies were global, which is to say that browsers kept a global pool of HTTP cookies (both first party, from the website you were on, and third-party cookies), and they would send any appropriate cookie on any HTTP request to its site. This enabled third-party tracking cookies and a certain class of CSRF attacks, since the browser would happily send your login cookies along with that request initiated by the JavaScript on some sketchy website you'd accidentally wound up on (or JavaScript injected through an ad network).

This was obviously less than ideal and people wound up working to limit the scope of HTTP cookies, starting with things like Firefox's containers and eventually escalating to first-party cookie isolation, where a cookie is restricted to whatever the first-party domain was when it was set. If you're browsing example.org and the page loads google.com/tracker, which sets a tracker cookie, that cookie will not be sent when you browse example.com and the page also loads google.com/tracker; the first tracking cookie is isolated to example.org.

(There is also storage isolation for cookies, but I think that's been displaced by first-party cookie isolation.)

However, first-party isolation has the possibility to break things you expect to work, as covered in this Firefox FAQ. As a result of this, my impression is that browsers have been cautious and slow to roll out first-party isolation by default. However, they have made it available as an option or part of an option. Firefox calls this Total Cookie Protection (also, also).

(Firefox is working to go even further, blocking all third-party cookies.)

Firefox add-ons have special APIs that allow them to do privileged things, and these include an API for dealing with cookies. When first-party cookie isolation came to pass, these APIs needed to be updated to deal with such isolated cookies (and cookie tracking protection in general). For instance, cookies.remove() has to be passed a special parameter to remove a first-party isolated cookie. As covered in the documentation, an add-on using the cookies APIs without the necessary updates would only see non-isolated cookies, if there were any. So at the time the message on Cookie AutoDelete's add-on page was written, I suspect that it hadn't been updated for first-party isolation. However, based on checking the source code of Cookie AutoDelete, I believe that it currently supports first-party isolation for cookies, and in fact may have done so for some time, perhaps v3.5.0, or v3.4.0 or even earlier.

(It's also possible that this support is incomplete or buggy, or that there are still some things that you can't easily do through it that matter to Cookie AutoDelete.)
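For illustration, the shape of a removal call that copes with first-party isolation might be the following (a sketch against the WebExtensions cookies API; the helper name is mine):

```javascript
// Build the details object for browser.cookies.remove(). The key point
// is echoing firstPartyDomain back from the cookie: without it, a
// first-party-isolated cookie is silently not matched.
function removalDetails(cookie) {
  const host = cookie.domain.replace(/^\./, "");
  return {
    url: (cookie.secure ? "https://" : "http://") + host + cookie.path,
    name: cookie.name,
    firstPartyDomain: cookie.firstPartyDomain,
  };
}
```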

Cookie AutoDelete itself is potentially useful even if you have Firefox set to block all third-party cookies, because it will also clean up unwanted first-party cookies (assuming that it truly works with first-party isolation). Part of my uncertainty is that I'm not sure how you reliably find out what cookies you have in a browser world with first-party isolation. There's theoretically some information about this in Settings → Privacy & Security → Cookies and Site Data → "Manage Data...", but since that's part of the normal Settings UI that normal people use, I'm not sure if it's simplifying things.

PS: Now that I've discovered all of this, I'm not certain if my standard Cookie Quick Manager add-on properly supports first-party isolated cookies. There's this comment on an issue that suggests it does support first-party isolation but not storage partitioning (also). The available Firefox documentation and Settings UI is not entirely clear about whether first-party isolation is now on more or less by default.

(That comment points to Cookie Manager as a potential partition-aware cookie manager.)

Finally, run Docker containers natively in Proxmox 9.1 (OCI images)

20 November 2025 at 22:34
Proxmox VE is a virtualization platform, like VMWare, but open source, based on Debian. It can run KVM virtual machines and Linux Containers (LXC). I've been using it for over 10 years; the [first article I wrote mentioning it was in 2012](/s/tags/proxmox.html). At home I have a 2 node Proxmox VE cluster consisting of 2 HP EliteDesk Mini machines, each with 16 GB RAM and both an NVMe and a SATA SSD with ZFS on root (256 GB). It's small enough (physically) and is just enough for my homelab needs, specs wise. Proxmox VE 9.1 was released [recently](https://www.proxmox.com/en/about/company-details/press-releases/proxmox-virtual-environment-9-1) and this new version is able to run Docker containers / OCI images natively; no more hacks or VMs required to run Docker. This post shows you how to run a simple container from a Docker image.

Sparkling Network

12 January 2019 at 00:00
This is an overview of all the servers in the Sparkling Network, mostly as an overview for myself, but it might be interesting for others. It also has a status overview of the nodes. Prices are monthly, excluding VAT.

Recently

1 December 2025 at 00:00

This’ll be the last Recently in 2025. It’s been a decent year for me, a pretty rough year for the rest of the world. I hope, for everyone, that 2026 sees the reversal of some of the current trends.

Watching

This video from Daniel Yang, who makes spectacular bikes of his own, covers a lot of the economics of being a bike-builder, which are all pretty rough. I felt a lot of resonance with Whit: when I ran a business I always felt like it was tough to be commercial about it, and had to fight my own instincts to over-engineer parts of it. It's also heartbreaking to think about how many jobs are so straightforwardly good for the maker and good for the buyer but economically unviable because of the world we live in. I feel like the world would be a lot different if the cost of living was lower.

Yes, the fan video! It’s a solid 48 minutes of learning how fans work. I finally watched it. Man, if every company did their advertising this way it would be so fun. I learned a lot watching this.

I have trouble finding videos about how things work that are actually about how things work. Titles like “how it’s made” or “how it works” or “how we did it” perform well in A/B tests and SEO so they get used for media that actually doesn’t explain how it’s made and how it works, greatly frustrating people like me. But the fan video delivers.

Reading

But then, you realize that the goal post has shifted. As the tech industry has become dramatically more navigable, YC became much less focused on making the world understandable, revolving, instead, around feeding consensus. “Give the ecosystem what they want.”

I have extremely mixed feelings about Build What’s Fundable, this article from Kyle Harrison. Some of it I think is bravely truth-telling in an industry that usually doesn’t do public infighting - Harrison is a General Partner at a big VC firm, and he’s critiquing a lot of firms directly on matters both financial and ethical.

But on the other hand, there’s this section about “Breaking the Normative Chains”:

When you look at successful contrarian examples, many of them have been built by existing billionaires (Tesla, SpaceX, Palantir, Anduril). The lesson from that, I think, isn’t “be a billionaire first then you can have independent thoughts.” It is, instead, to reflect on what other characteristics often lead to those outcomes. And, in my opinion, the other commonality that a lot of those companies have is that they’re led by ideological purists. People that believe in a mission.

And then, in the next section he pulls up an example of a portfolio company that encapsulates the idea of true believers, and he names Base, which has a job ad saying “Don’t tell your grandkids all you did was B2B SaaS.”

Now, Base’s mission is cool: they’re doing power storage at scale. I like the website. But I have to vent here that the founder is Zach Dell. Michael Dell’s son. Of Dell Computer, and a 151 billion dollar fortune.

I just think that if we’re going to talk about how the lesson isn’t that you should be a billionaire first before having independent thoughts and building a big tech company, it should be easy to find someone who is not the son of the 10th wealthiest person in the world to prove that point. I have nothing against Zach in particular: he is probably a talented person. But in the random distribution of talented, hardworking people, very few of them are going to be the son of the 10th wealthiest person in the world.


Like so many other bits of Times coverage, the whole of the piece is structured as an orchestrated encounter. Some people say this; however, others say this. It’s so offhand you can think you’re gazing through a pane of glass. Only when you stand a little closer, or when circumstances make you a little less blinkered, do you notice the fact which then becomes blinding and finally crazymaking, which is just that there is zero, less than zero, stress put on the relation between those two “sides,” or their histories, or their sponsors, or their relative evidentiary authority, or any of it.

I love this article on maybe don’t talk to the New York Times about Zohran Mamdani. It describes the way in which the paper launders its biases, which overlaps with one of my favorite rules from Wikipedia editing about weasel words.

I don’t want you to hate this guy. Yes, he actively promotes poisonous rhetoric – ignore that for now. This is about you. Reflect on all your setbacks, your unmet potential, and the raw unfairness of it all. It sucks, and you mustn’t let that bitterness engulf you. You can forgive history itself; you can practice gratitude towards an unjust world. You need no credentials, nor awards, nor secrets, nor skills to do so. You are allowed to like yourself.

Taylor Troesh on IQ is exactly what I needed that day.

The React team knows this makes React complicated. But the bet is clear: React falls on the sword of complexity so developers don’t have to. That’s admirable, but it asks developers to trust React’s invisible machinery more than ever.

React and Remix Choose Different Futures is perfect tech writing: it unpacks the story and philosophy behind a technical decision without cramming it into a right-versus-wrong framework.

When you consider quitting, try to find a different scoreboard. Score yourself on something else: on how many times you dust yourself off and get up, or how much incremental progress you make. Almost always, in your business or life, there are things you can make daily progress on that can make you feel like you’re still winning. Start compounding.

“Why a language? Because I believe that the core of computing is not based on operating system or processor technologies but on language capability. Language is both a tool of thought and a means of communication. Just as our minds are shaped by human language, so are operating systems shaped by programming languages. We implement what we can express. If it cannot be expressed, it will not be implemented.” – Carl Sassenrath

Alexis Sellier, whose work and aesthetic I’ve admired since the early days of Node.js, is working on a new operating system. A real new operating system, like playb.it or SerenityOS (bad politics warning). I’m totally into it: we need more from-scratch efforts like this!

Yes, the funds available for any good cause are scarce, but that’s not because of some natural law, some implacable truth about human society. It’s because oligarchic power has waged war on benign state spending, leading to the destruction of USAID and drastic cuts to the aid budgets of other countries, including the UK. Austerity is a political choice. The decision to impose it is driven by governments bowing to the wishes of the ultra-rich.

The Guardian on Bill Gates is a good read. I’ve had The Bill Gates Problem on my reading list for a long time. Maybe it’s next after I finish The Fort Bragg Cartel.

Contrast this with the rhetorical shock and awe campaign that has been waged by technology companies for the last fifteen years championing the notion of ephemerality.

Implicit, but unspoken, in this worldview is the idea of transience leading to an understanding of a world awash in ephemeral moments that, if not seized on and immediately capitalized to maximum effect, will be forever lost to the mists of time and people’s distracted lifestyles.

Another incredible article by Aaron Straup Cope about AI, the web, ephemerality, and culture. (via Perfect Sentences)


Also, no exact quote, but I’ve been subscribed to Roma’s Unpolished Posts and they have been pretty incredible: mostly technical articles about CSS, which have been ‘new to me’ almost every day, and the author is producing them once a day. Feels like a cheat code to absorb so much new information so quickly.

Listening

I didn’t add any major new albums to my collection this month. I did see Tortoise play a show, which was something I never expected to do. So in lieu of new albums, here’s a theme.

There are a bunch of songs in my library that use a meter change as a way to add or resolve tension. I’m not a big fan of key changes but I love a good rhythm or production shift.

First off: Kissing the Beehive. Yes, it’s nearly 11 minutes long. Magically feeling “off kilter” and “in the pocket” at the same time. I’m no drummer but I think the first part is something like three measures of 4/4 and one of 6/4. But then at 3:26, brand new connected song, and by the time we get to 7 minutes in, we’re in beautiful breezy easy 4/4!

An Andrew Bird classic: about a minute of smooth 4/4, and then over to 7/4 in the second half or so.

I adore Akron/Family’s Running, Returning. Starts in classic 5/4, then transitions to 4/4, then 6/8. For me it all feels very cohesive. Notably, the band is not from Akron Ohio but formed in Williamsburg and were from other East Coast places. If you’re looking for the band from Akron, it’s The Black Keys.

Slow Mass’s Schemes might be my song of the year. Everything they write just sounds so cool. The switchup happens around 2:40 when the vocals move to 6/4. Astounding.

Predictions

Back in January, I made some predictions about 2025. Let’s see how they turned out!

1: The web becomes adversarial to AI

I am marking this one as an absolute win: more and more websites are using Anubis, which was released in March, to block LLM scrapers. Cloudflare is rolling out more LLM bot protections. At Val Town, I have started to turn on those protections to keep LLM bots from eating up all of our bandwidth and CPU. The LLM bots are being assholes and everyone hates them.

2: Copyright nihilism breeds a return to physical-only media

This was at most going to be a moderate win because physical-only media will be niche, but I think there are good signs that this is right. The Internet Phone Book, in which this site is featured, started publishing this year. Gen Z seems to be buying more vinyl and printing out more photos.

3: American tech companies will pull out of Europe because they want to do acquisitions

Middling at best: there are threats and there is speculation, but nothing major to report.

4: The tech industry’s ‘DEI backlash’ will run up against reality

Ugh, probably the opposite has happened. Andreessen Horowitz shut down their fund that focused on women, minorities, and people underrepresented in VC funding. We’ll know more about startups themselves when Carta releases their annual report, which looked pretty bad last year.

5: Local-first will have a breakthrough moment

Sadly, no. Lots and lots of promising projects, but the ecosystem really struggles to produce something production-ready that offers good tradeoffs. Tanstack DB might be the next contender.

6: Local, small AI models will be a big deal

Not yet. Big honkin’ models are still grabbing most of the headlines. LLMs still really thrive at vague tasks with a wide range of acceptable outcomes, like chatbots, and are pretty middling at tasks that require strict, quantifiable outputs.

For my mini predictions:

  • Substack will re-bundle news. Sort of! Multi-editor newsletters like The Argument are all the rage right now.
  • TypeScript gets a Zeitwerk equivalent and lots of people use it. Sadly, no.
  • Node.js will fend off its competitors. Mostly! Bun keeps making headway but Node.js keeps implementing the necessary featureset, and doesn’t seem to be losing that much marketshare.
  • Another US city starts seriously considering congestion pricing. There are rumblings from Boston and Chicago!
  • Stripe will IPO. Nope. Unlimited tender offers from Sequoia are just as good as IPOing, so why would they.

Val Town 2023-2025 Retrospective

11 November 2025 at 00:00

It’s the end of 2025, which means that I’m closing in on three years at Val Town. I haven’t written much about the company or what it’s really been like. The real story of companies is usually told years after the dust has settled; founders usually tell a heroic story of success while they’re building.

Reading startup news really warps your perspective, especially when you’re building a startup yourself. Everyone else is getting fabulously rich! It makes me less eager to write about anything.

But I’m incurably honest and like working with people who are too. Steve, the first founder of Val Town (I joined shortly after as cofounder/CTO) is a shining example of this. He is a master of saying the truth in situations when other people are afraid to. I’ve seen it defuse tension and clear paths. It’s a big part of ‘the culture’ of the company.

So here’s some of the story so far.

Delivering on existing expectations and promises

Here’s what the Val Town interface looked like fairly early on:

Val Town user interface in mid-2023

When I initially joined, we had a prototype and a bit of hype. The interface was heavily inspired by Twitter - every time that you ran code, it would save a new ‘val’ and add it to an infinite-scrolling list.

Steve and Dan had really noticed the utter exhaustion in the world of JavaScript: runaway complexity. A lot of frameworks and infrastructure were designed for huge enterprises and were really, really bad at scaling down. Just writing a little server that does one thing should be easy, but if you do it with AWS and modern frameworks, it can be a mess of connected services and boilerplate.

Val Town scaled down to 1 + 1. You could type 1 + 1 in the text field and get 2. That’s the way it should work.

It was a breath of fresh air. And a bunch of people who encountered it even in this prototype-phase state were inspired and engaged.

The arrows marketing page

One of the pivotal moments of this stage was creating this graphic for our marketing site: the arrows graphic. It really just tied it all together: look how much power there was in this little val! And no boilerplate either. Where there otherwise is a big ritual of making something public or connecting an email API, there’s just a toggle and a few lines of code.

I kind of call this early stage, for me, the era of delivering on existing expectations and promises. The core cool idea of the product was there, but it was extremely easy to break.

Security was one of the top priorities. We weren’t going to be a SOC2 certified bank-grade platform, but we also couldn’t stay where we were. Basically, it was trivially easy to hack: we were using the vm2 NPM module to run user code. I appreciate that vm2 exists, but it really, truly, is a trap. There are so many ways to get out of its sandbox and access other people’s code and data. We had a series of embarrassing security vulnerabilities.

For example: we supported web handlers so you could easily implement a little server endpoint, and the API for this was based on express, the Node.js server framework. You got a request object and a response object from express, and in this case they were literally our server’s own objects. Unfortunately, there’s a method response.download(path: string) which sends an arbitrary file from your server to the internet. You can see how this one ends: not ideal.

So, we had to deliver on a basic level of security. Thankfully, in the way that it sometimes does, the road rose to meet us. The right technology appeared just in time: Deno. Deno’s sandboxing made it possible to run people’s code securely without having to build a mess of Kubernetes and Docker sandbox optimizations. It delivered on being secure, fast, and simple to implement: we haven’t identified a single security bug caused by Deno.

That said, the context around JavaScript runtimes has been tough. Node.js is still dominant and Bun has attracted most of the attention as an alternative, with Deno in a distant third place, vibes-wise. The three are frustratingly incompatible - Bun keeps adding built-ins like an S3 client which would have seemed unthinkable in the recent past. Node added an SQLite client in version 22. Contrary to what I hoped in 2022, JavaScript has gotten more splintered and inconsistent as an ecosystem.

Stability was the other problem. The application was going down constantly for a number of reasons, but most of all was the database, which was Supabase. I wrote about switching away from Supabase, which they responded to in a pretty classy way, and I think they’ve since improved. But Render has been a huge step up in maintainability and maturity for how we host Val Town.

Adding Max was a big advance in our devops-chops too: he was not only able to but excited to work on the hard server capacity and performance problems. We quietly made a bunch of big improvements like allowing vals to stay alive after serving requests - before that, every run was a cold start.

What to do about AI

Townie

Townie, the Val Town chatbot, in early 2024

Believe it or not, in early 2023 there were startups that didn’t say “AI” on the front page of their marketing websites. The last few years have been a dizzying shift in priorities and vibes, which I have mixed feelings about and have written about a lot.

At some point it became imperative to figure out what Val Town was supposed to do about all that. Writing code is undeniably one of the sweet spots of what LLMs can do, and over the last few years the fastest-growing, most-hyped startups have emerged from that ability.

This is where JP Posma comes in. He was Steve’s cofounder at a previous startup, Zaplib, and was our ‘summer intern’ - the quotes because he’s hilariously overqualified for that title. He injected some AI abilities into Val Town: RAG-powered search, and the first version of Townie, a chatbot that can write code.

Townie has been really interesting. Basically it lets you write vals (our word for apps) with plain English. This development happened around the same time as a lot of the ‘vibe-coding’ applications, like Bolt and Lovable. But Townie was attached to a platform that runs code and has community elements and a lot more. It’s an entry point to the rest of the product, while a lot of other vibe-coding tools were the core product that would eventually expand to include stuff like what Val Town provides.

Ethan Ding has written a few things about this: it’s maybe preferable to sell compute instead of being the frontend for LLM-vibe-coding. But that’s sort of a long-run prediction about where value accrues rather than an observation about what companies are getting hype and funding in the present.

Vibe coding companies

There are way too many companies providing vibe-coding tools without having a moat or even a pathway to positive margins. But having made a vibe-coding tool, I completely see why: it makes charts look amazing. Townie was a huge growth driver for a while, and a lot of people were hearing about Townie first, and only later realizing that Val Town could run code, act as a lightweight GitHub alternative, and power a community.

Unlike a lot of AI startups, we didn’t burn a ton of money running Townie. We did have negative margins on it, but to the tune of a few thousand dollars a month during the most costly months.

Introducing a pro plan made it profitable pretty quickly and today Townie is pay-as-you-go, so it doesn’t really burn money at all. But on the flip side, we learned a lot about the users of vibe-coding tools. In particular, they use the tools a lot, and they really don’t want to pay for them. This kind of makes sense: vibe-coding actual completed apps without ever dropping down to write or read code is Zeno’s paradox: every prompt gets you halfway there, so you inch closer and closer but never really get to your destination.

So you end up chatting for eight hours, typically getting angrier and angrier, and using a lot of tokens. This would be great for business in theory, but in practice it doesn’t work for obvious reasons: people like to pay for results, not the process. Vibe-coding is a tough industry - it’s simultaneously one of the most expensive products to run, and one of the most flighty and cost-sensitive user-bases I’ve encountered.

So AI has been complicated. On one hand, it’s amazing for growth and obviously has spawned wildly successful startups. On the other, it can be a victim of its own expectations: every company seems to promise perfect applications generated from a single prompt and that just isn’t the reality. And that results in practically every tool falling short of those expectations and thus getting the rough end of user sentiment.

We’re about to launch MCP support, which will make it possible to use Val Town via existing LLM interfaces like Claude Code. It’s a lot better than previous efforts - more powerful and flexible, plus it requires us to reinvent less of the wheel. The churn in the ‘state of the art’ feels tremendous: first we had tool-calling, then MCPs, then tool calling writing code to call MCPs: it’s hard to tell if this is fast progress or just churn.

As a business

When is a company supposed to make money? It’s a question that I’ve thought about a lot. When I was running a bootstrapped startup, the answer was obviously as soon as possible, because I’d like to stop paying my rent from my bank account. Venture funding lets you put that off for a while, sometimes a very long while, and then when companies start making real revenue they at best achieve break-even. There are tax and finance reasons for all of this – I don’t make the rules!

Anyway, Val Town is far from break-even. But that’s the goal for 2026, and it’s optimistically possible.

One thing I’ve thought for a long time is that people building startups are building complicated machines. They carry out a bunch of functions, maybe they proofread your documents or produce widgets, or whatever, but the machine also has a button on it that says “make money.” And everything kind of relates to that button as you’re building it, but you don’t really press it.

The nightmare is if the rest of the machine works, you press the button, and it doesn’t do anything. You’ve built something useful but not valuable. This hearkens back to the last section about AI: you can get a lot of people using the platform, but if you ask them for money and they’re mostly teenagers or hobbyists, they’re not going to open their wallets. They might not even have wallets.

So we pressed the button. It kind of works.

But what I’ve learned is that making revenue is a lot like engineering: it requires a lot of attempts, testing, and hard work. It’s not something that just results from a good product. Here’s where I really saw Charmaine and Steve at work, on calls, making it happen.

The angle right now is to sell tools for ‘Go To Market’ - stuff like capturing user signups on your website, figuring out which users are from interesting companies or have interesting use-cases, and forwarding that to Slack, pushing it to dashboards, and generally making the sales pipeline work. It’s something Val Town can do really well: most other tools for this kind of task have some sort of limit in how complicated and custom they can get, and Val Town doesn’t.

Expanding and managing the complexity

Product-wise, the big thing about Val Town that has evolved is that it can do more stuff and it’s more normal. When we started out, a Val was a single JavaScript expression - this was part of what made Val Town scale down so beautifully and be so minimal, but it was painfully limited. Basically people would type into the text box

const x = 10;
function hi() {};
console.log(1);

And we couldn’t handle that at all: if you ran the Val, did it run that function? Export the x variable? It was magic but too confusing. The other tricky niche choice was that we had a custom import syntax like this:

@tmcw.helper(10);

In which @tmcw.helper was the name of another val, and this would automatically import and use it. Extremely slick, but really tricky to build on, because this was non-standard syntax, and it overlapped with the proposed syntax for decorators in JavaScript. Boy, I do not love decorators: they have been under development for basically a decade and haven’t landed, just hogging up that part of the Unicode plane.

But regardless this syntax wasn’t worth it. I have some experience with this problem and have landed squarely on the side of normality is good.

So, in October 2023, we ditched it, adopted standard ESM import syntax, and became normal. This was a big technical undertaking, in large part because we tried to keep all existing code running by migrating it. Thankfully JavaScript has a very rich ecosystem of tools that can parse & produce code and manipulate syntax trees, but it was still a big, dramatic shift.

This is one of the core tensions of Val Town as well as practically every startup: where do you spend your user-facing innovation energy?

I’m a follower of the use boring technology movement when it comes to how products are built: Val Town intentionally uses some boring established parts like Postgres and React Router, but what about when it comes to the product itself? I’ve learned the hard way that most of what people call intuition is really familiarity: it’s good when an interface behaves like other interfaces. A product that has ten new concepts and a bunch of new UI paradigms is going to be hard to learn and probably will lose out to one that follows some familiar patterns.

Moving to standard JavaScript made Val Town more learnable for a lot of people while also removing some of its innovation. Now you can copy code into & out of Val Town without having to adjust it. LLMs can write code that targets Val Town without knowing everything about its quirks. It’s good to go with the flow when it comes to syntax.

Hiring and the team

Office Sign

Val Town has an office. I feel like COVID made everything remote by default and the lower-COVID environment that we now inhabit (it’s still not gone!) has led to a swing-back, but the company was founded in the latter era and has never been remote. So, we work from home roughly every other Friday.

This means that we basically try to hire people in New York. It hasn’t been too hard in the past. About 6% of America lives in the New York City metro area and the Northeast captures about 23% of venture funding, so there are lots of people who live here or want to.

Stuff on the window sill in the office

Here’s something hard to publish: we’re currently at three people. It was five pretty recently. Charmaine got poached by Anthropic, where she’ll definitely kick ass, and Max is now at Cloudflare, where he’s writing C++, which will be even more intimidating than his chess ranking. The company’s really weirdly good at people leaving: we had parties and everyone exchanged hand-written cards. How people handle hard things says a lot.

But those three are pretty rad: Jackson was a personal hero of mine before we hired him (he still is). He’s one of the best designers I’ve worked with, and an incredibly good engineer to boot. He’s worked at a bunch of startups you’ve heard of, had a DJ career, gotten to the highest echelons of tech without acquiring an ego. He recently beat me to the top spot in our GitHub repo’s lines-changed statistic.

Steve has what it takes for this job: grit, optimism, curiosity. The job of founding a company and being a CEO is a different thing every few months - selling, hiring, managing, promoting. Val Town is a very developer-consumer oriented product and that kind of thing requires a ton of promotion. Steve has done so much, in podcasts, spreading the word in person, writing, talking to customers. He has really put everything into this. A lot of the voice and the attitude of the company flows down from the founder, and Steve is that.

Did I mention that we’re hiring?

In particular, for someone to be a customer-facing technical promoter type - now called a “GTM” hire. Basically, someone who can write a bit of code but has the attitude of someone in sales. Someone who can see potential and handle rejection. Not necessarily the world’s best programmer, but someone who can probably code, and definitely someone who can write. Blogging and writing online is a huge green flag for this position.

And the other role that we really need is an “application engineer.” These terms keep shifting, so if full-stack engineer means more, sure, that too. Basically someone who can write code across boundaries. This is more or less what Jackson and I do - writing queries, frontend code, fixing servers, the whole deal. Yeah, it sounds like a lot but this is how all small companies operate, and I’ve made a lot of decisions to make this possible: we’ve avoided complexity like the plague in Val Town’s stack, so it should all be learnable. I’ve written a bunch of documentation for everything, and constantly tried to keep the codebase clean.

Sidenote, but even though I think that the codebase is kind of messy, I’ve heard from very good engineers (even the aforementioned JP Posma) that it’s one of the neatest and most rational codebases they’ve seen. Maybe it is, maybe it isn’t, see for yourself!

What we’re really looking for in hires

Tech hiring has been broken the whole time I’ve been in the industry, for reasons that would take a whole extra article to ponder. But one thing that makes it hard is vagueness, both on the part of applicants and companies. I get it - cast a wide net, don’t put people off. But I can say that:

  • For the GTM position, you should be able to write for the internet. This can be harder than it looks: there are basically three types of writing: academic, corporate, and internet, and they are incompatible.
  • You should also be kind of entrepreneurial: which means optimistic, resilient, and opportunistic.
  • For the application engineering role, you should be a good engineer who understands the code you write and is good at both writing and reading code. Using LLM tools is great, but relying on them exclusively is a dealbreaker. LLMs are not that good at writing code.

What the job is like

The company’s pretty low drama. Our office is super nice. We work hard but not 996. We haven’t had dinner in the office. But we all do use PagerDuty so when the servers go down, we wake up and it sucks. Thankfully the servers go down less than they used to.

We all get paid the same: $175k. Lower than FAANG, but pretty livable for Brooklyn. Both of the jobs listed - the Product Engineer, and Growth Engineer - are set at 1% equity. $175k is kind of high-average for where we’re at, but 1% in my opinion is pretty damn good. Startups say that equity is “meaningful” at all kinds of numbers but it’s definitely meaningful at that one. If Val Town really succeeds, you can get pretty rich off of that.

Of course, will it succeed? It’s something I think about all the time. I was born to ruminate. We have a lot going for us, and a real runway to make things happen. Some of the charts in our last investor update looked great. Some days felt amazing. Other days were a slog. But it’s a good team, with a real shot of making it.

Recently

2 November 2025 at 00:00

Hello! Only a day late this time. October was another busy month but it didn’t yield much content. I ran a second half-marathon, this time with much less training, but only finished a few minutes slower than I did earlier this year. Next year I’m thinking about training at a normal mileage for the kinds of races I’m running - 25 miles or so per week instead of this year’s roughly 15.

And speaking of running, I just wrote up this opinion I have about how fewer people should run marathons.

Reading

I enjoyed reading Why Functor Doesn’t Matter, but I don’t really agree. The problem that I had with functional programming jargon isn’t that the particular terms are strange or uncommon, but that their definitions rely on a series of other jargon terms, and the discipline tends to omit good examples, metaphors, or plain-language explanations. It’s not that the strict definition is bad, but when a functor is defined as “a mapping that associates a morphism f: X -> Y in category C to a morphism F(f): F(X) -> F(Y) in category D,” you now have to define morphisms, categories, and objects, all of which have domain-specific definitions.

I am loving Sherif’s posting about building a bike from the frame up.

Maximizers are biased to speed, optionality, breadth, momentum, opportunism, parallel bets, hype, luck exposure, momentum, “Why not both?”, “Better to move fast than wait for perfect”. Maximizers want to see concrete examples before they’ll make tradeoffs. They anchor decisions in the tangible. “Stop making things so complicated.” “Stop overthinking.”

Focusers are biased to focus, coherence, depth, meaningful constraints, doing less for more, sequential experiments, intentionality, sustainability, “What matters most?”, compounding clarity. Focusers are comfortable with abstraction. A clear constraint or principle is enough to guide them. “Stop mistaking chaos for progress.” “Stop overdoing.”

John Cutler’s post about maximizers vs. focusers matches my experience in tech. Like many young engineers, I think I started out as a focuser and have tried to drift toward the center over time, but that tension, both internal and interpersonal, is present at every job.

I recently remarked to a friend that traveling abroad after the advent of the smartphone feels like studying biology after the advent of microplastics. It has touched every aspect of life. No matter where you point your microscope you will see its impact.

Josh Erb’s blog about living in India is great, personal, a classic blog’s blog.

For me, the only reason to keep going is to try and make AI a wonderful technology for the world. Some feel the same. Others are going because they’re locked in on a path to generational wealth. Plenty don’t have either of these alignments, and the wall of effort comes sooner.

This article about AI researchers working all the time and burning out is interesting, in part because I find the intention of AI researchers so confusing. I can see the economic intention: these guys are making bank! Congrats to all of them. But it’s so rare to talk to anyone who has a concrete idea about how they are making the world better by doing what they’re doing, and that’s the reason why they’re working so hard. OpenAI seems to keep getting distracted from that cancer cure, and their restructuring into a for-profit company kind of indicates that there’s more greed than altruism in the mix.

every vc who bet on the modern data stack watched their investments get acquired for pennies or go to zero. the only survivors: the warehouses themselves, or the companies the warehouses bought to strengthen their moats.

It’s niche, but this article about Snowflake, dbt, fivetran, and other ‘data lake’ architecture is really enlightening.

Listening

Totorro’s new album was the only one I picked up this month. It’s pretty good math-rock, very energetic and precise.

Watching

  • One Battle After Another was incredible.
  • eXistenZ is so gloriously weird, I really highly recommend it. It came out the same year as The Matrix and explores similar themes, but the treatment of futuristic technology is something you won’t see anywhere else: instead of retro-steampunk metal or fully dystopian grimness, it’s colorful, slimy, squelchy, organic, and weird.

Speaking of weird, Ben Levin’s gesamtkunstwerk videos are wild and glorious.

OpenAI employees… are you okay?

8 November 2025 at 00:00

You might have seen an article making the rounds this week, about a young man who ended his life after ChatGPT encouraged him to do so. The chat logs are really upsetting.

Someone two degrees removed from me took their life a few weeks ago. A close friend related the story to me, about how this person had approached their neighbor one evening to catch up, make small talk, and casually discussed their suicidal ideation at some length. At the end of the conversation, they asked to borrow a rope, and their neighbor agreed without giving the request any critical thought. The neighbor found them the next morning.

I didn’t know the deceased, nor their neighbor, but I’m close friends with someone who knew both. I found their story deeply chilling – ice runs through my veins when I imagine how the neighbor must have felt. I had a similar feeling upon reading this article, wondering how the people behind ChatGPT and tools like it are feeling right now.

Two years ago, someone I knew personally took their life as well. I was not friendly with this person – in fact, we were on very poor terms. I remember at the time, I had called a crisis hotline just to ask an expert for advice on how to break this news to other people in my life, many of whom were also on poor terms with a person whose struggles to cope with their mental health issues caused a lot of harm to others.

None of us had to come to terms with any decisions with the same gravity as what that unfortunate neighbor had to face. None of us were ultimately responsible for this person’s troubles or were the impetus for what happened. Nonetheless, the uncomfortable and confronting feelings I experienced in the wake of that event perhaps give me some basis for empathy and understanding towards the neighbor, or for OpenAI employees, and others who find themselves in similar situations.

If you work on LLMs, well… listen, I’ve made my position as an opponent of this technology clear. I feel that these tools are being developed and deployed recklessly, and I believe tragedy is the inevitable result of that recklessness. If you confide in me, I’m not going to validate your career choice. But maybe that’s not necessarily a bad quality to have in a confidant? I still feel empathy towards you and I recognize your humanity and our need to acknowledge each other as people.

If you feel that I can help, I encourage you to reach out. I will keep our conversation in confidence, and you can reach out anonymously if that makes you feel safer. I’m a good listener and I want to know how you’re doing. Email me.


If you’re experiencing a crisis, 24-hour support is available from real people who are experts in getting you the help you need. Please consider reaching out. All you need to do is follow the link.

More tales about outages and numeric limits

18 November 2025 at 21:35

Outages, you say? Of course I have stories about outages, and limits, and some limits causing outages, and other things just screwing life up. Here are some random thoughts which sprang to mind upon reading this morning's popcorn-fest.

...

I was brand new at a company that "everybody knew" had AMAZING infrastructure. They could do things with Linux boxes that nobody else could. As part of the new employee process, I had to get accounts in a bunch of systems, and one of them was this database used to track the states of machines. It was where you could look to see if a machine was (supposed to be) serving, or under repair, or whatever. You could also see (to some degree) what services were supposed to be running on it: which servers (that is, actual programs), the port numbers, and whether all of that stuff was synced to the files on the box or not.

My request didn't go through for a while, and I found out that it had something to do with my employee ID being a bit over 32767. And yeah, for those of you who didn't just facepalm at seeing that number, that's one of those "magic numbers" which pops up a bunch when talking about limits. That one is what you get when you try to store numbers as 16-bit values... with a sign bit to allow negative values. Why you'd want a negative employee number is anyone's guess, but that's how they configured it.
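If you want to see the wraparound for yourself, a couple of lines of JavaScript (just an illustration, nothing from that actual system) will do it:

```javascript
// Why 32767 is a magic number: a signed 16-bit integer spends one bit on
// the sign, leaving 15 bits of magnitude, so the biggest value is 2^15 - 1.
const INT16_MAX = 2 ** 15 - 1;
console.log(INT16_MAX); // 32767

// Storing one past the max in a 16-bit cell wraps around to the most
// negative representable value.
const cell = new Int16Array(1);
cell[0] = INT16_MAX + 1;
console.log(cell[0]); // -32768
```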

I assume they fixed the database schema at some point to allow more than ~15 bits of employee numbers, but they did an interesting workaround to get me going before then. They just shaved off the last digit and gave me that ID in their system instead. I ended up as 34xx instead of 34xxx, more or less.

This was probably my first hint that their "amazing infra" was in fact the same kind of random crazytown as everywhere else once you got to see behind the curtain.

...

Then there was the time that someone decided that a log storage system that had something like a quarter of a million machines (and growing fast) feeding it needed a static configuration. The situation unfolded like this:

(person 1) Hey, why is this thing crashing so much?

(person 2) Oh yeah, it's dumping cores constantly! Wow!

(person 1) It's running but there's nothing in the log?

(person 2) Huh, "runtime error ... bad id mapping?"

(person 2) It's been doing this for a month... and wait, other machines are doing it, too!

(person 1) Guess I'll dig into this.

(person 2) "range name webserv_log.building1.phase3 range [1-20000]"

(person 2) But this machine is named webserv20680...

(person 2) Yeah, that's enough for me. Bye!

The machines were named with a ratcheting counter: any time they were assigned to be a web server, they got names like "webserv1", "webserv2", ... and so on up the line. That had been the case all along.

Whoever designed this log system, years after that naming scheme was in place, decided to put a hard-coded limit into it. I don't know if they did it because they wanted to feel useful every time it broke so they could race in and fix it, or if they didn't care, or if they truly had no idea that numbers could in fact grow beyond 20000.

Incidentally, that particular "building1.phase3" location didn't even have 20000 machines at that specific moment. It had maybe 15000 of them, but as things went away and came back, the ever-incrementing counter just went up and up and up. So, there _had been_ north of 20K machines in that spot overall, and that wasn't even close to a surprising number.
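My guess at the shape of the check that was killing those processes, with the range and names taken from the story (a reconstruction, not their code):

```python
import re

# Hypothetical static config: "range name webserv_log.building1.phase3
# range [1-20000]" becomes a hard-coded bound.
VALID_IDS = range(1, 20001)

def host_number(hostname: str) -> int:
    num = int(re.search(r"(\d+)$", hostname).group(1))
    if num not in VALID_IDS:
        # The ratcheting name counter walks right past the limit.
        raise RuntimeError(f"bad id mapping: {hostname}")
    return num

print(host_number("webserv19999"))  # 19999, fine
try:
    print(host_number("webserv20680"))
except RuntimeError as err:
    print(err)  # bad id mapping: webserv20680
```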

...

There was a single line that would catch obvious badness at a particular gig where we had far too many Apache web servers running on various crusty Linux distributions:

locate access_log | xargs ls -la | grep 2147

It was what I'd send in chat to someone who said "hey, the customer's web server won't stay up". The odds were very good that they had a log file that had grown to 2.1 GB, and had hit a hard limit which was present in that particular system. Apache would try to write to it, that write would fail, and the whole process would abort.

"2147", of course, is the first 4 digits of the expected file size: 2147483647 ... or (2^31)-1.

Yep, that's another one of those "not enough bits" problems like the earlier story, but this one is 32 bits with one of them being for the sign, not 16 like before. It's the same problem, though: the counter maxes out and you're done.
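For the record, the arithmetic behind the grep (nothing Apache-specific here, just the number):

```python
# The largest offset a signed 32-bit file API can represent:
MAX_OFF32 = 2**31 - 1
print(MAX_OFF32)  # 2147483647

# The first four digits are the grep target from the one-liner above,
# and the size works out to the familiar "2.1 GB".
assert str(MAX_OFF32).startswith("2147")
print(round(MAX_OFF32 / 10**9, 1))  # 2.1
```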

These days, files can get quite a bit bigger... but you should still rotate your damn log files once in a while. You should probably also figure out what's pooping in them so much and try to clean that up, too!

...

As the last one for now, there was an outage where someone reported that something like half of their machines were down. They had tried to do a kernel update, and wound up hitting half of them at once. I suspect they wanted to do a much smaller quantity, but messed up and hit fully half of them somehow. Or, maybe they pointed it at all of them, and only half succeeded at it. Whatever the cause, they now had 1000 freshly-rebooted machines.

The new kernel was fine, and the usual service manager stuff came back up, and it went to start the workload for those systems, and then it would immediately crash. It would try to start it again. It would crash again. Crash crash crash. This is why we call it "crashlooping".

Finally, the person in question showed up in the usual place where we discussed outages, and started talking about what was going on.

(person 1) Our stuff isn't coming back.

(person 2) Oh yeah, that's bad, they're all trying to start.

(person 1) Start, abort, start, abort, ...

(person 2) Yep, aborting... right about here: company::project::client::BlahClient::loadConfig ... which is this code: <paste>

(person 2) It's calling "get or throw" on a map for an ID number...

(person 1) My guess is the config provider service isn't running.

(person 2) It's there... it's been up for 30 minutes...

(person 1) Restarting the jobs.

(person 2) Nooooooooooo...

<time passes>

(person 2) Why is there no entry for number 86 in the map in the config?

(person 1) Oh, I bet it's problems with port takeover.

(person 3) I think entry 86 is missing from <file>.

(person 2) Definitely is missing.

(person 4) Hey everyone, we removed that a while back. Why would it only be failing now?

(person 2) It's only loaded at startup, right?

(person 4) Right.

(person 2) So if they were running for a long time, then it changed, then they're toast after a restart...

(person 3) Hey, this change looks related.

(person 4) I'm going to back that out.

This is a common situation: program A reads config C. When it starts up, config C is on version C1, and everything is fine. While A is running, the config is updated from C1 to C2, but nothing notices. Later, A tries to restart and it chokes on the C2 config, and refuses to start.
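In miniature, with names invented for illustration (the real thing was C++ calling a get-or-throw on a map):

```python
# Config C1 has entry 86; the later C2 does not.
C1 = {86: "clientA", 87: "clientB"}
C2 = {87: "clientB"}

class BlahClient:
    def __init__(self, config):
        # Read once at startup and never re-checked: the crux of the bug.
        self.mapping = dict(config)

    def lookup(self, entry_id):
        # "get or throw": a missing key aborts the process.
        if entry_id not in self.mapping:
            raise RuntimeError(f"bad id mapping: {entry_id}")
        return self.mapping[entry_id]

a = BlahClient(C1)
a.lookup(86)        # running happily for weeks on the old config
a = BlahClient(C2)  # ...until the restart picks up the new one
try:
    a.lookup(86)
except RuntimeError as err:
    print(err)      # bad id mapping: 86
```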

Normally, you'd only restart a few things to get started, and you'd notice that your program can't consume the new config at that point. You'd still have a few instances down, but that's it - a *few* instances. Your service should keep running on whatever's left over that you purposely didn't touch.

This is why you strive to release things in increments.

Also, it helps when programs notice config changes while they're running, so this doesn't sneak up on you much later when you're trying to restart. If the programs notice the bad config right after the change is made, it's *far* easier to correlate it to the change just by looking at the timeline.

Tuesday, 11:23:51: someone applies change.

Tuesday, 11:23:55: first 1% of machines which subscribe to the change start complaining.

... easy, right? Now compare it to this:

Tuesday, November 18: someone applies a change

Wednesday, January 7: 50% of machines fail to start "for some reason"

That's a lot harder to nail down.

...

Random aside: restarting the jobs did not help. They were already restarting themselves. "Retry, reboot, reinstall, repeat" is NOT a strategy for success.

It was not the config system being down. It was up the whole time.

It was nothing to do with "port takeover". What does that have to do with a config file being bad?

The evidence was there: the processes were crashing. They were logging a message about WHY they were killing themselves. It included a number they wanted to see, but couldn't find. It also said what part of the code was blowing up.

*That* is where you start looking. You don't just start hammering random things.

Unusual circuits in the Intel 386's standard cell logic

22 November 2025 at 16:15

I've been studying the standard cell circuitry in the Intel 386 processor recently. The 386, introduced in 1985, was Intel's most complex processor at the time, containing 285,000 transistors. Intel's existing design techniques couldn't handle this complexity and the chip began to fall behind schedule. To meet the schedule, the 386 team started using a technique called standard cell logic. Instead of laying out each transistor manually, the layout process was performed by a computer.

The idea behind standard cell logic is to create standardized circuits (standard cells) for each type of logic element, such as an inverter, NAND gate, or latch. You feed your circuit description into software that selects the necessary cells, positions these cells into columns, and then routes the wiring between the cells. This "automatic place and route" process creates the chip layout much faster than manual layout. However, switching to standard cells was a risky decision since if the software couldn't create a dense enough layout, the chip couldn't be manufactured. But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.1

The 386's standard cell circuitry contains a few circuits that I didn't expect. In this blog post, I'll take a quick look at some of these circuits: surprisingly large multiplexers, a transistor that doesn't fit into the standard cell layout, and inverters that turned out not to be inverters. (If you want more background on standard cells in the 386, see my earlier post, "Reverse engineering standard cell logic in the Intel 386 processor".)

The photo below shows the 386 die with the automatic-place-and-route regions highlighted; I'm focusing on the red region in the lower right. These blocks of logic have cells arranged in rows, giving them a characteristic striped appearance. The dark stripes are the transistors that make up the logic gates, while the lighter regions between the stripes are the "routing channels" that hold the wiring that connects the cells. In comparison, functional blocks such as the datapath on the left and the microcode ROM in the lower right were designed manually to optimize density and performance, giving them a more solid appearance.

The 386 die with the standard-cell regions highlighted.


As for other features on the chip, the black circles around the border are bond wire connections that go to the chip's external pins. The chip has two metal layers, a small number by modern standards, but a jump from the single metal layer of earlier processors such as the 286. (Providing two layers of metal made automated routing practical: one layer can hold horizontal wires while the other layer can hold vertical wires.) The metal appears white in larger areas, but purplish where circuitry underneath roughens its surface. The underlying silicon and the polysilicon wiring are obscured by the metal layers.

The giant multiplexers

The standard cell circuitry that I'm examining (red box above) is part of the control logic that selects registers while executing an instruction. You might think that it is easy to select which registers take part in an instruction, but due to the complexity of the x86 architecture, it is more difficult. One problem is that a 32-bit register such as EAX can also be treated as the 16-bit register AX, or two 8-bit registers AH and AL. A second problem is that some instructions include a "direction" bit that switches the source and destination registers. Moreover, sometimes the register is specified by bits in the instruction, but in other cases, the register is specified by the microcode. Due to these factors, selecting the registers for an operation is a complicated process with many cases, using control bits from the instruction, from the microcode, and from other sources.

Three registers need to be selected for an operation—two source registers and a destination register—and there are about 17 cases that need to be handled. Registers are specified with 7-bit control signals that select one of the 30 registers and control which part of the register is accessed. With three control signals, each 7 bits wide, and about 17 cases for each, you can see that the register control logic is large and complicated. (I wrote more about the 386's registers here.)

I'm still reverse engineering the register control logic, so I won't go into details. Instead, I'll discuss how the register control circuit uses multiplexers, implemented with standard cells. A multiplexer is a circuit that combines multiple input signals into a single output by selecting one of the inputs.2 A multiplexer can be implemented with logic gates, for instance, by ANDing each input with the corresponding control line, and then ORing the results together. However, the 386 uses a different approach—CMOS switches—that avoids a large AND/OR gate.

Schematic of a CMOS switch.


The schematic above shows how a CMOS switch is constructed from two MOS transistors. When the two transistors are on, the output is connected to the input, but when the two transistors are off, the output is isolated. An NMOS transistor is turned on when its input is high, but a PMOS transistor is turned on when its input is low. Thus, the switch uses two control inputs, one inverted. The motivation for using two transistors is that an NMOS transistor is better at pulling the output low, while a PMOS transistor is better at pulling the output high, so combining them yields the best performance.3 Unlike a logic gate, the CMOS switch has no amplification, so a signal is weakened as it passes through the switch. As will be seen below, inverters can be used to amplify the signal.
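A behavioral sketch of the switch (a toy Python model of the logic, not a circuit simulation):

```python
def cmos_switch(data, select):
    """Transmission-gate behavior: the NMOS gate sees `select`, the
    PMOS gate sees its complement, so both transistors conduct
    together.  None stands in for a floating (isolated) output."""
    nmos_conducts = bool(select)     # NMOS is on when its gate is high
    pmos_gate = not select           # the cell feeds PMOS the inverted select
    pmos_conducts = not pmos_gate    # PMOS is on when its gate is low
    if nmos_conducts and pmos_conducts:
        return data                  # output connected to input
    return None                      # output isolated

for d in (0, 1):
    for s in (0, 1):
        print(f"in={d} select={s} out={cmos_switch(d, s)}")
```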

The image below shows how CMOS switches appear under the microscope. This image is very hard to interpret because the two layers of metal on the 386 are packed together densely, but you can see that some wires run horizontally and others run vertically. The bottom layer of metal (called M1) runs vertically in the routing area, as well as providing internal wiring for a cell. The top layer of metal (M2) runs horizontally; unlike M1, the M2 wires can cross a cell. The large circles are vias that connect the M1 and M2 layers, while the small circles are connections between M1 and polysilicon or M1 and silicon. The central third of the image is a column of standard cells with two CMOS switches outlined in green. The cells are bordered by the vertical ground rail and +5V rail that power the cells. The routing areas are on either side of the cells, holding the wiring that connects the cells.

Two CMOS switches, highlighted in green. The lower switch is flipped vertically compared to the upper switch.


Removing the metal layers reveals the underlying silicon with a layer of polysilicon wiring on top. The doped silicon regions show up as dark outlines. I've drawn the polysilicon in green; it forms a transistor (brighter green) when it crosses doped silicon. The metal ground and power lines are shown in blue and red, respectively, with other metal wiring in purple. The black dots are vias between layers. Note how metal wiring (purple) and polysilicon wiring (green) are combined to route signals within the cell. Although this standard cell is complicated, the important thing is that it only needs to be designed once. The standard cells for different functions are all designed to have the same width, so the cells can be arranged in columns, snapped together like Lego bricks.

A diagram showing the silicon for a standard-cell switch. The polysilicon is shown in green. The bottom metal is shown in blue, red, and purple.


To summarize, this switch circuit allows the input to be connected to the output or disconnected, controlled by the select signal. This switch is more complicated than the earlier schematic because it includes two inverters to amplify the signal. The data input and the two select lines are connected to the polysilicon (green); the cell is designed so these connections can be made on either side. At the top, the input goes through a standard two-transistor inverter. The lower left has two transistors, combining the NMOS half of an inverter with the NMOS half of the switch. A similar circuit on the right combines the PMOS part of an inverter and switch. However, because PMOS transistors are weaker, this part of the circuit is duplicated.

A multiplexer is constructed by combining multiple switches, one for each input. Turning on one switch will select the corresponding input. For instance, a four-to-one multiplexer has four switches, so it can select one of the four inputs.

A four-way multiplexer constructed from CMOS switches and individual transistors.


The schematic above shows a hypothetical multiplexer with four inputs. One optimization is that if an input is always 0, the PMOS transistor can be omitted. Likewise, if an input is always 1, the NMOS transistor can be omitted. One set of select lines is activated at a time to select the corresponding input. The pink circuit selects 1, green selects input A, yellow selects input B, and blue selects 0. The multiplexers in the 386 are similar, but have more inputs.
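The behavior of this four-input, one-hot multiplexer can be sketched the same way (toy model; `None` again stands for a floating output):

```python
def one_hot_mux(inputs, selects):
    # At most one select line is active at a time (one-hot encoding).
    assert sum(selects) <= 1, "one-hot violation"
    for value, sel in zip(inputs, selects):
        if sel:
            return value   # this switch connects its input to the output
    return None            # no switch on: output floats

A, B = 1, 0
inputs = [1, A, B, 0]   # constant 1, input A, input B, constant 0
print(one_hot_mux(inputs, [0, 1, 0, 0]))  # green select: passes A -> 1
print(one_hot_mux(inputs, [0, 0, 0, 1]))  # blue select: passes constant 0 -> 0
```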

The diagram below shows how much circuitry is devoted to multiplexers in this block of standard cells. The green, purple, and red cells correspond to the multiplexers driving the three register control outputs. The yellow cells are inverters that generate the inverted control signals for the CMOS switches. This diagram also shows how the automatic layout of cells results in a layout that appears random.

A block of standard-cell logic with multiplexers highlighted. The metal and polysilicon layers were removed for this photo, revealing the silicon transistors.


The misplaced transistor

The idea of standard-cell logic is that standardized cells are arranged in columns. The space between the cells is the "routing channel", holding the wiring that links the cells. The 386 circuitry follows this layout, except for one single transistor, sitting between two columns of cells.

The "misplaced" transistor, indicated by the arrow. The irregular green regions are oxide that was incompletely removed.


I wrote some software tools to help me analyze the standard cells. Unfortunately, my tools assumed that all the cells were in columns, so this one wayward transistor caused me considerable inconvenience.

The transistor turns out to be a PMOS transistor, pulling a signal high as part of a multiplexer. But why is this transistor out of place? My hypothesis is that the transistor is a bug fix. Regenerating the cell layout was very costly, taking many hours on an IBM mainframe computer. Presumably, someone found that they could just stick the necessary transistor into an unused spot in the routing channel, manually add the necessary wiring, and avoid the delay of regenerating all the cells.

The fake inverter

The simplest CMOS gate is the inverter, with an NMOS transistor to pull the output low and a PMOS transistor to pull the output high. The standard cell circuitry that I examined contains over a hundred inverters of various sizes. (Performance is improved by using inverters that aren't too small but also aren't larger than necessary for a particular circuit. Thus, the standard cell library includes inverters of multiple sizes.)

The image below shows a medium-sized standard-cell inverter under the microscope. For this image, I removed the two metal layers with acid to show the underlying polysilicon (bright green) and silicon (gray). The quality of this image is poor—it is difficult to remove the metal without destroying the polysilicon—but the diagram below should clarify the circuit. The inverter has two transistors: a PMOS transistor connected to +5 volts to pull the output high when the input is 0, and an NMOS transistor connected to ground to pull the output low when the input is 1. (The PMOS transistor needs to be larger because PMOS transistors don't function as well as NMOS transistors due to silicon physics.)

An inverter as seen on the die. The corresponding standard cell is shown below.


The polysilicon input line plays a key role: where it crosses the doped silicon, a transistor gate is formed. To make the standard cell more flexible, the input to the inverter can be connected on either the left or the right; in this case, the input is connected on the right and there is no connection on the left. The inverter's output can be taken from the polysilicon on the upper left or the right, but in this case, it is taken from the upper metal layer (not shown). The power, ground, and output lines are in the lower metal layer, which I have represented by the thin red, blue, and yellow lines. The black circles are connections between the metal layer and the underlying silicon.

This inverter appears dozens of times in the circuitry. However, I came across a few inverters that didn't make sense. The problem was that the inverter's output was connected to the output of a multiplexer. Since an inverter always drives its output either high or low, its value would clobber the output of the multiplexer.4 This didn't make any sense. I double- and triple-checked the wiring to make sure I hadn't messed up. After more investigation, I found another problem: the input to a "bad" inverter didn't make sense either. The input consisted of two signals shorted together, which doesn't work.

Finally, I realized what was going on. A "bad inverter" has the exact silicon layout of an inverter, but it wasn't an inverter: it was independent NMOS and PMOS transistors with separate inputs. Now it all made sense. With two inputs, the input signals were independent, not shorted together. And since the transistors were controlled separately, the NMOS transistor could pull the output low in some circumstances, the PMOS transistor could pull the output high in other circumstances, or both transistors could be off, allowing the multiplexer's output to be used undisturbed. In other words, the "inverter" was just two more cases for the multiplexer.
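Treating the cell as two independent transistors, its behavior looks like this (toy model; tying both gates to the same input recovers a real inverter):

```python
def transistor_pair(nmos_gate, pmos_gate):
    """The "fake inverter": an NMOS and a PMOS sharing an output node
    but with separate gate inputs.  None means neither transistor is
    driving, so the multiplexer's output wins."""
    if nmos_gate and not pmos_gate:
        raise ValueError("contention: both transistors on")
    if nmos_gate:
        return 0        # NMOS pulls the shared output low
    if not pmos_gate:
        return 1        # PMOS pulls the shared output high
    return None         # both off: output left to the multiplexer

# Gates tied together = a real inverter:
print(transistor_pair(1, 1), transistor_pair(0, 0))  # 0 1
# Independent gates add a "do nothing" case:
print(transistor_pair(0, 1))  # None
```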

The "bad" inverter. (Image is flipped vertically for comparison with the previous inverter.)


If you compare the "bad inverter" cell above with the real inverter shown earlier, they look almost the same, but there are subtle differences. First, the gates of the two transistors are connected in the real inverter, but separated by a small gap in the transistor pair. I've indicated this gap in the photo above; it is hard to tell if the gap is real or just an imaging artifact, which is why I didn't spot it at first. The second difference is that the "fake" inverter has two input connections, one to each transistor, while the real inverter has a single input connection. Unfortunately, I assumed that the two connections were just a trick to route the signal across the inverter without requiring an extra wire. In total, this cell was used 32 times as a real inverter and 9 times as independent transistors.

Conclusions

Standard cell logic and automatic place and route have a long history before the 386, dating back to the early 1970s, so this isn't an Intel invention.5 Nonetheless, the 386 team deserves credit for deciding to use this technology at a time when it was a risky decision. They needed to develop custom software for their placing and routing needs, so this wasn't a trivial undertaking. This choice paid off and they completed the 386 ahead of schedule. The 386 ended up being a huge success for Intel, moving the x86 architecture to 32 bits and defining the dominant computer architecture for the rest of the 20th century.

If you're interested in standard cell logic, I also wrote about standard cell logic in an IBM chip. I plan to write more about the 386, so follow me on Mastodon, Bluesky, or RSS for updates. Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.

For more on the 386 and other chips, follow me on Mastodon (@kenshirriff@oldbytes.space), Bluesky (@righto.com), or RSS. (I've given up on Twitter.) If you want to read more about the 386, I've written about the clock pin, prefetch queue, die versions, packaging, and I/O circuits.

Notes and references

  1. The decision to use automatic place and route is described on page 13 of the Intel 386 Microprocessor Design and Development Oral History Panel, a very interesting document on the 386 with discussion from some of the people involved in its development. 

  2. Multiplexers often take a binary control signal to select the desired input. For instance, an 8-to-1 multiplexer selects one of 8 inputs, so a 3-bit control signal can specify the desired input. The 386's multiplexers use a different approach with one control signal per input. One of the 8 control signals is activated to select the desired input. This approach is called a "one-hot encoding" since one control line is activated (hot) at a time. 

  3. Some chips, such as the MOS Technology 6502 processor, are built with NMOS technology, without PMOS transistors. Multiplexers in the 6502 use a single NMOS transistor, rather than the two transistors in the CMOS switch. However, the performance of the switch is worse. 

  4. One very common circuit in the 386 is a latch constructed from an inverter loop and a switch/multiplexer. The inverter's output and the switch's output are connected together. The trick, however, is that the inverter is constructed from special weak transistors. When the switch is disabled, the inverter's weak output is sufficient to drive the loop. But to write a value into the latch, the switch is enabled and its output overpowers the weak inverter.

    The point of this is that there are circuits where an inverter and a multiplexer have their outputs connected. However, the inverter must be constructed with special weak transistors, which is not the situation that I'm discussing. 

  5. I'll provide more history on standard cells in this footnote. RCA patented a bipolar standard cell in 1971, but this was a fixed arrangement of transistors and resistors, more of a gate array than a modern standard cell. Bell Labs researched standard cell layout techniques in the early 1970s, calling them Polycells, including a 1973 paper by Brian Kernighan. By 1979, A Guide to LSI Implementation discussed the standard cell approach and it was described as well-known in this patent application. Even so, Electronics called these design methods "futuristic" in 1980.

    Standard cells became popular in the mid-1980s as faster computers and improved design software made it practical to produce semi-custom designs that used standard cells. Standard cells made it to the cover of Digital Design in August 1985, and the article inside described numerous vendors and products. Companies like Zymos and VLSI Technology (VTI) focused on standard cells. Traditional companies such as Texas Instruments, NCR, GE/RCA, Fairchild, Harris, ITT, and Thomson introduced lines of standard cell products in the mid-1980s.  

My mental model of the AI race

5 December 2025 at 17:41

I left a loose end the other day when I said that AI is about intent and context.

That was when I said "what’s context at inference time is valuable training data if it’s recorded."

But I left it at that, and didn’t really get into why training data is valuable.

I think we often just draw a straight arrow from “collect training data,” like ingesting pages from Wikipedia or seeing what people are saying to the chatbot, to “now the AI model is better and therefore it wins.”

But I think it’s worth thinking about what that arrow actually means. Like, what is the mechanism here?

Now all of this is just my mental model for what’s going on.

With that caveat:

To my mind, the era-defining AI company is the one that is the first to close two self-accelerating loops.

Both are to do with training data. The first is the general theory; the second is specific.


Training data for platform capitalism

When I say era-defining companies, to me there’s an era-defining idea, or at least era-describing, and that’s Nick Srnicek’s concept of Platform Capitalism (Amazon).

It is the logic that underpins the success of Uber, Facebook, Amazon, Google search (and in the future, Waymo).

I’ve gone on about platform capitalism before (2020) but in a nutshell Srnicek describes a process whereby

  • these companies create a marketplace that brings together buyers and sellers
  • they gather data about what buyers want, what sellers have, how they decide on each other (marketing costs) and how decisions are finalised (transaction costs)
  • then use that data to (a) increase the velocity of marketplace activity and (b) grow the marketplace overall
  • thereby gathering data faster, increasing marketplace efficiency and size faster, gathering data faster… and so on, a runaway loop.

Even to the point that in 2012 Amazon filed a patent on anticipatory shipping (TechCrunch) in which, if you display a strong intent to buy laundry tabs, they’ll put them on a truck and move them towards your door, only aborting delivery if you end up not hitting the Buy Now button.

And this is also kinda how Uber works right?

Uber has a better matching algorithm than you keeping the local minicab company on speed dial on your phone, which only works when you’re in your home location, and surge pricing moves drivers to hotspots in anticipation of matching with passengers.

And it’s how Google search works.

They see what people click on, and use that to improve the algo which drives marketplace activity, and AdSense keyword cost incentivises new entrants which increases marketplace size.

So how do marketplace efficiency and marketplace size translate to, say, ChatGPT?

ChatGPT can see what success looks like for a “buyer” (a ChatGPT user).

They generate an answer; do users respond well to it or not? (However that is measured.)

So that usage data becomes training data to improve the model to close the gap between user intent and transaction.

Right now, ChatGPT itself is the “seller”. To fully close the loop, they’ll need to open up to other sellers, with ChatGPT itself transitioning to being the market-maker (and taking a cut of transactions).

And you can see that process with the new OpenAI shopping feature right?

This is the template for all kinds of AI app products: anything that people want, any activity, if there’s a transaction at the end, the model will bring buyers and sellers closer together – marketplace efficiency.

Also there is marketplace size.

Product discovery: OpenAI can see what people type into ChatGPT. Which means they know how to target their research way better than the next company which doesn’t have access to latent user needs like that.

So here, training data for the model mainly comes from usage data. It’s a closed loop.

But how does OpenAI (or whoever) get the loop going in the first place?

With some use cases, like (say) writing a poem, the “seed” training data was in the initial web scrape; with shopping the seed training data came as a result of adding web search to chat and watching users click on links.

But there are more interesting products…

How do product managers triage tickets?

How do plumbers do their work?

You can get seed training data for those products in a couple of ways, but I think there's an assumption that the AI companies need to trick people out of their data by being present in their file system, or by adding an AI agent to their SaaS software at work, then hiding something in the terms of service that says the data can be used to train future models.

I just don’t feel like that assumption holds, at least not for the biggest companies.

Alternate access to seed training data method #1: just buy it.

I’ll take one example which is multiplayer chat. OpenAI just launched group chat in ChatGPT:

We’ve also taught ChatGPT new social behaviors for group chats. It follows the flow of the conversation and decides when to respond and when to stay quiet based on the context of the group conversation.

Back in May I did a deep dive into multiplayer AI chat. It’s really complicated. I outlined all the different parts of conversational turn taking theory that you need to account for to have a satisfying multiplayer conversation.

What I didn’t say at the end of that post was that, if I was building it, the whole complicated breakdown that I provided is not what I would do.

Instead I would find a big corpus of group chats for seed data and just train the model against that.

And it wouldn’t be perfect but it would be good enough to launch a product, and then you have actual live usage data coming in and you can iteratively train from there.

Where did that seed data come from for OpenAI? I don’t know. There was that reddit deal last year, maybe it was part of the bundle.

So they can buy data.

Or they can make it.

Alternate access to seed training data #2: cosplay it.

Every so often you hear gossip about how seed training data can be manufactured… I remember seeing a tweet about this a few months ago and now there’s a report:

AI agents are being trained on clones of SaaS products.

According to a new @theinformation report, Anthropic and OpenAI are building internal clones of popular SaaS apps so that they can train AI agents how to use them.

Internal researchers are giving the agents cloned, fake versions of products like Zendesk and Salesforce to teach the agents how to perform the tasks that white collar workers currently do.

The tweet I ran across was from a developer saying that cloning business apps for the purpose of being used in training was a sure-fire path to a quick acquisition, but that it felt maybe not ok.

My point is that AI companies don't need to sneak onto computers to watch product managers triaging tickets in Linear. Instead, given that the future value is evident, it's worth it to simply build a simulation of Linear, stuff it with synthetic data, then pay fake product managers to cosplay managing product inside fake Linear, and train off that.

Incidentally, the reason I keep saying seed training data is that the requirement for it is one-off. Once the product loop has started, the product creates its own. Which is why I don't believe that revenue from licensing social network data or scientific papers is real. There will be a different pay-per-access model in the future.

I’m interested in whether this model extends to physical AI.

Will they need lanyards around the necks of plumbers in order to observe plumbing and to train the humanoid robots of the future?

Or will it be more straightforward to scrape YouTube plumbing tutorials to get started, and then build a simulation of a house (physical or virtual, in Unreal Engine) and let the AI teach itself?

What I mean is that AI companies need access to seed training data, but where it comes from is product-dependent and there are many ways to skin a cat.


That’s loop #1 – a LLM-mediated marketplace loop that (a) closes on transactions and (b) throws off usage data that improves market efficiency and reveals other products.

Per-product seed training data is a one-off investment for the AI company and can be found in many ways.

This loop produces cash.


Coding is the special loop that changes everything

Loop #2 starts with a specific product from loop #1.

A coding product isn’t just a model which is good at understanding and writing code. It has to be wrapped in an agent for planning, and ultimately needs access to collaboration tools, AI PMs, AI user researchers, and all the rest.

I think it’s pretty clear now that coding with an agent is vastly quicker than a human coding on their own. And not just quicker but, from my own experience, I can achieve goals that were previously beyond my grasp.

The loop closes when coding agents accelerate the engineers who are building the coding agents and also, as a side effect, working on the underlying general purpose large language model.

There’s an interesting kind of paperclip maximisation problem here which is, if you’re choosing where to put your resources, do you build paperclip machines or do you build the machines to build the paperclip machines?

Well it seems like all the big AI companies have made the same call right now which is to pile their efforts into accelerating coding, because doing that accelerates everything else.


So those are the two big loops.

Whoever gets those first will win, that’s how I think about it.

I want to add two notes on this.


On training data feeding the marketplace loop:

Running the platform capitalism/marketplace loop is not the only way for a company to participate in the AI product economy.

Another way is to enable it.

Stripe is doing this. They’re working hard to be the default transaction rails for AI agents.

Apple has done this for the last decade or so of the previous platform capitalism loop. iPhone is the place to reach people for all of Facebook, Google, Amazon, Uber and more.

When I said before that AI companies are trying to get closer to the point of intent, part of what I mean is that they are trying to figure out a way that a single hardware company like Apple can't insert itself into the loop and take its 30%.

Maybe, in the future, device interactions will be super commoditised. iPhone's power is that it bundles together an interaction surface, connectivity, compute, identity and payment, and we have one each. It's interesting to imagine what might break that scarcity.


On coding tools that improve coding tools:

How much do you believe in this accelerating, self-improving loop?

The big AI research labs all believe – or at least, if they don’t believe, they believe that the risk of being wrong is worse.

But, if true, “tools that make better tools that allow grabbing bigger marketplaces” is an Industrial Revolution-like driver: technology went from the steam engine to the transistor in less than 200 years. Who knows what will happen this time around.

Because there’s a third loop to be found, and that’s when the models get so good that they can be used for novel R&D, and the AI labs (who have the cash and access to the cheapest compute) start commercialising wheels with weird new physics or whatever.

Or maybe it’ll stall out. Hard to know where the top of the S-curve is.


Auto-detected kinda similar posts:

Context plumbing

29 November 2025 at 18:05

These past few weeks I’ve been deep in code and doing what I think about as context plumbing.

I’ve been building an AI system and that’s what it feels like.

Let me unpack.


Intent

Loosely AI interfaces are about intent and context.

Intent is the user’s goal, big or small, explicit or implicit.

Uniquely for computers, AI can understand intent and respond in a really human way. This is a new capability! Like the user can type "I want to buy a camera" or point at a keylight and subvocalise "I’ve got a call in 20 minutes" or hit a button labeled "remove clouds" and job done.

Companies care about this because computers that are closer to intent tend to win.

e.g. the smartphone displaced the desktop. On a phone, you see something and then you touch it directly. With a desktop that intent is mediated through a pointer – you see something on-screen but to interact you tell your arm to move the mouse that moves the pointer. Although it doesn’t seem like much, your monkey brain doesn’t like it.

So the same applies to user interfaces in general: picking commands from menus or navigating and collating web pages to plan a holiday or remembering how the control panel on your HVAC works. All of that is bureaucracy. Figuring out the sequence for yourself is administrative burden between intent and result.

Now as an AI company, you can overcome that burden. And you want to be present at the very millisecond and in the very location where the user’s intent - desire - arises. You don’t want the user to have the burden of even taking a phone out of their pocket, or of having to formulate an unconscious intent into words. Being closest to the origin of intent will crowd out competitor companies.

That explains the push for devices like AI-enabled glasses or lanyards or mics or cameras that read your body language.

This is why I think the future of interfaces is Do What I Mean: it’s not just a new capability enabled by AI, there’s a whole attentional economics imperative to it.


Context

What makes an AI able to handle intent really, really well is context.

Sure there’s the world knowledge in the large language model itself, which it gets from vast amounts of training data.

But let’s say an AI agent is taking some user intent and hill-climbing towards that goal using a sequence of tool calls (which is how agents work) then it’s going to do way better when the prompt is filled with all kinds of useful context:

For example:

  • Background knowledge from sources like Wikipedia or Google about what others have done in this situation.
  • Documentation about the tools the agent will use to satisfy the intent.
  • The user’s context such as what they’ve done before, the time of day, etc.
  • Tacit knowledge and common ground shared between the user and the AI, i.e. what we’re all assuming we’re here to do.
  • The shared “whiteboard”: the document we’re working on.
  • For the agent itself, session context: whether this task is a subtask of a larger goal, what’s worked before and what hasn’t, and so on.

This has given rise to the idea of context engineering (LangChain blog):

Context engineering is building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task.
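Mechanically, that definition boils down to assembling a prompt from whatever sources are live right now. Here's a minimal sketch in Python; every source name and string is invented for illustration, not any real product's API:

```python
# Minimal sketch of "context engineering": gather dynamic context into one prompt.
# All source names and contents below are hypothetical.

def build_prompt(intent, sources):
    """Assemble background, tools, user state etc. into a single prompt string."""
    sections = []
    for name, fetch in sources.items():
        content = fetch()  # each source is a zero-arg callable, fetched fresh
        if content:        # empty sections are dropped
            sections.append(f"## {name}\n{content}")
    sections.append(f"## Intent\n{intent}")
    return "\n\n".join(sections)

prompt = build_prompt(
    "I want to buy a camera",
    {
        "Background": lambda: "Mirrorless cameras outsell DSLRs since 2020.",
        "Tools": lambda: "search_products(query), checkout(item_id)",
        "User context": lambda: "Browsing on mobile, 9pm, past purchases: tripod",
        "Session": lambda: "",
    },
)
print(prompt)
```

The interesting design question is all in the `sources` dict: which sources, fetched when, formatted how.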

btw access to context also explains some behaviour of the big AI companies:

If you want to best answer user intent, then you need to be where the user context is, and that’s why being on a lanyard with an always-on camera is preferred over a regular on-demand camera, and why an AI agent that lives in your email archive is going to be more effective than one that doesn’t. So they really wanna get in there, really cosy up.

(And what’s context at inference time is valuable training data if it’s recorded, so there’s that too.)


Plumbing?

What’s missing in the idea of context engineering is that context is dynamic. It changes, it is timely.

Context appears at disparate sources, by user activity or changes in the user’s environment: what they’re working on changes, emails appear, documents are edited, it’s no longer sunny outside, the available tools have been updated.

This context is not always where the AI runs (and the AI runs as close as possible to the point of user intent).

So the job of making an agent run really well is to move the context to where it needs to be.

Essentially copying data out of one database and putting it into another one – but as a continuous process.

You often don’t want your AI agent to have to look up context every single time it answers intent. That’s slow. If you want an agent to act quickly then you have to plan ahead: build pipes that flow potential context from where it is created to where it’s going to be used.

How can that happen continuously behind the scenes without wasting bandwidth or cycles or the data going stale?

So I’ve been thinking of AI system technical architecture as plumbing the sources and sinks of context.
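A toy sketch of that plumbing, with all names hypothetical: sources push updates into subscribed sinks as they happen, so an agent's context cache is already warm when intent arrives rather than being fetched on demand:

```python
# Toy context plumbing: sources publish updates to subscribed sinks as they occur,
# so the agent's prompt-ready context is already warm when intent arrives.

class ContextBus:
    def __init__(self):
        self.sinks = {}  # topic -> list of sink dicts

    def subscribe(self, topic, sink):
        self.sinks.setdefault(topic, []).append(sink)

    def publish(self, topic, value):
        # push the fresh value into every interested sink immediately
        for sink in self.sinks.get(topic, []):
            sink[topic] = value

bus = ContextBus()
agent_context = {}  # the sink: the agent reads this when answering intent
bus.subscribe("calendar", agent_context)
bus.subscribe("weather", agent_context)

bus.publish("calendar", "call in 20 minutes")
bus.publish("weather", "no longer sunny")
bus.publish("weather", "raining")  # later update replaces the stale value

print(agent_context)  # -> {'calendar': 'call in 20 minutes', 'weather': 'raining'}
```

The real versions of this involve queues, delivery guarantees and staleness policies, but the shape is the same: context flows continuously from where it's created to where it will be used.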


In the old days of Web 2.0 the go-to technical architecture was a “CRUD” app: a web app wrapping a database where you would have entities and operations to create, read, update, and delete (which map onto the HTTP verbs POST, GET, PUT, and DELETE).

This was also the user experience, so the user entity would have a webpage (a profile) and the object entity, say a photo, would have a webpage, and then dynamic webpages would index the entities in different ways (a stream or a feed). And you could decompose webapps like this; the technology and the user understanding aligned.
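A minimal sketch of that CRUD shape, stripped of any real web framework or database (the entity and its fields are invented; the HTTP mapping lives in the comments):

```python
# Toy Web 2.0 "CRUD" store: one entity type, four operations.
# A real app would wrap these in routes and a database.

photos = {}
next_id = 1

def create(data):            # POST /photos
    global next_id
    photos[next_id] = data
    next_id += 1
    return next_id - 1

def read(photo_id):          # GET /photos/<id>
    return photos.get(photo_id)

def update(photo_id, data):  # PUT /photos/<id>
    photos[photo_id] = data

def delete(photo_id):        # DELETE /photos/<id>
    photos.pop(photo_id, None)

pid = create({"title": "sunset"})
update(pid, {"title": "sunrise"})
print(read(pid))  # -> {'title': 'sunrise'}
delete(pid)
print(read(pid))  # -> None
```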

With AI systems, you want the user to have an intuition about what context is available to it. The plumbing of context flow isn’t just what is technically possible or efficient, but what matches user expectation.


Anyway.

I am aware this is getting - for you, dear reader - impossibly abstract.

But for me, I’m building the platform I’ve been trying to build for the last 2 years only this time it’s working.

I’m building on Cloudflare and I have context flowing between all kinds of entities and AI agents and sub-agents running where they need to run, and none of it feels tangled or confusing because it is plumbed just right.

And I wanted to make a note about that even if I can’t talk specifically, yet, about what it is.



Spinning up a new thing: Inanimate

19 November 2025 at 16:30

I’m spinning up something new with a buddy and you can guess what it is by what I’ve been writing about recently.

Big picture, there are two visions for the future of computing: cyborgs and rooms. I’m Team Augmented Environments. Mainly because so much of what I care about happens in small groups in physical space: family time, team collaboration, all the rest.

Then what happens when we’re together with AI in these environments? Interfaces will be intent-first so what’s the room-scale OS for all of this? What are the products you find there?

And where do you start?

Anyway we’ve been designing and coding and planning.

“We” is me and Daniel Fogg. We’ve known each other for ages, both done hardware, and he’s been scaling and running Big Things the last few years.

We’re at the point where it’s more straightforward than not to give this thing a name and a landing page…

Yes early days but we need a logo for renders, software protos and the raise deck haha

(Drop me a note if you’d like to chat.)

So say hello to Inanimate and you can find us over here.



3 books with Samuel Arbesman

14 November 2025 at 17:16

I had a look to see when I first mentioned Samuel Arbesman here. It was 2011: the average size of scientific discoveries is getting smaller.

Anyway I’ve been reading his new book, The Magic of Code (official site).

There’s computing history, magic, the simulation hypothesis, and a friendly unpacking of everything from procedural generation to Unix.

And through it all, an enthusiastic appeal to look again at computation, as if to say, look, isn’t it WEIRD! Isn’t it COOL! Because we’ve forgotten that code and computation deserve our wonder. And although this book isn’t an apology for technology ("computing is meant to be for the humans", says Arbesman), it is a reminder - demonstrated chapter by chapter - that wonder, delight and curiosity are there to be found.

(And if we look at computation afresh then we’ll have new ideas about what to do with it.)

Now I’m decently well-read in this kind of stuff.

Yet The Magic of Code is bringing me new-to-me computing lore, which I’m loving.

So, in the spirit of a virtual book tour - an old idea from the internet where book authors would tour blogs instead of book stores, as previously mentioned - I asked Samuel Arbesman for a reading list: 3 books from the Magic of Code bibliography.

(I’ve collected a couple dozen 3 Books reading lists over the years.)

I’ll ask him to introduce himself first…


Samuel! Tell us about yourself?

I’m a scientist and writer playing in the world of venture capital as Lux Capital‘s Scientist in Residence, where I help Lux explore the ever-changing landscape of science and technology, and also host a podcast called The Orthogonal Bet where I get to speak with some of the most interesting thinkers and authors I can find. I also write books about science and tech, most recently The Magic of Code, as well as The Half-Life of Facts and Overcomplicated. The themes in my work are often related to radical interdisciplinarity, intellectual humility in the face of complex technologies and our changing knowledge, and how to use tech to allow us to be the best version of ourselves.

The best way to follow me and what I’m thinking about is my newsletter: Cabinet of Wonders.

I asked for three fave books from the bibliography…

#1. Ideas That Created the Future: Classic Papers of Computer Science, edited by Harry R. Lewis

I love the history of computing. It’s weird and full of strange turns and dead ends, things worth rediscovering and understanding. But it’s far too easy to forget the historically contingent reasons why we have the technologies that we have (or simply not know the paths not taken), and understanding this history - including the history of the ideas that undergird this world - is vital. More broadly, I want everyone in tech to have a “historical sense” and this book is a good place to start: it’s a handbook to seminal ideas and developments in computing, from the ELIZA chatbot and Licklider’s vision of “man-computer symbiosis” to Dijkstra’s hatred of the “go to” command. Because the ideas we are currently grappling with are not necessarily new and they have a deep intellectual pedigree. Want to know the grand mages of computing history and what they thought about? Read this book.

Ideas That Created the Future: Classic Papers of Computer Science: Amazon

#2. In the Beginning… Was the Command Line, Neal Stephenson

I’m pretty sure that I first read this entire book - it’s short - in a single sitting at the library after stumbling upon it. It’s ornery and opinionated about so many computing ideas, from Linux and GUIs to open source and even the Be operating system (it was written in the 1990s and is very much of its time). Want to think about these ideas in the context of bizarre metaphors or a comparison to the Epic of Gilgamesh? Stephenson is your guy. This expanded my mind as to what computing is and what it can mean (the image of a demiurge using a command line to generate our universe has long stuck with me).

In the Beginning… Was the Command Line: Amazon / Wikipedia

#3. Building SimCity: How to Put the World in a Machine, Chaim Gingold

Chaim Gingold worked with Will Wright while at Maxis and has thought a lot about the history of SimCity. And when I say history, I don’t just mean the way that Maxis came about and how SimCity was created and published, though there’s that too; I mean the winding intellectual origins of SimCity: cellular automata, system dynamics, and more. SimCity and its foundation is a window into the smashing-together of so many ideas - analog computers, toys, the nature of simulation - that is indicative of the proper way to view computing: computers are weirder and far more interdisciplinary than we give them credit for and we all need to know that. Computing is a liberal art and this book takes this idea seriously.

Building SimCity: How to Put the World in a Machine: Amazon


Amazing.

Hey here’s a deep cut ref for you: in 2010 Arbesman coined the term mesofact, "facts which we tend to view as fixed, but which shift over the course of a lifetime," or too slowly for us to notice. I think we all carry around a bunch of outdated priors and that means we often don’t see what’s right in front of us. I use this term a whole bunch in trying to think about and identify what I’m not seeing but should be.

Thank you Sam!


More posts tagged: 3-books (34).


Oedipus is about the act of figuring out what Oedipus is about

7 November 2025 at 18:58

Ok spoilers ahead.

But Oedipus Rex a.k.a. Oedipus Tyrannus by Sophocles is almost 2,500 years old at this point so it’s fair game imo.

The Oedipus story in a nutshell:

Oedipus, who was secretly adopted, receives a prophecy that he will kill his dad. So to thwart fate he leaves his dad and winds up in a city with a missing king (btw killing an argumentative guy on the way). Many years after becoming the new king and marrying the widow, he discovers that the dude he long-ago killed on the road was the missing king. Uh oh. And that the missing king was actually his birth dad, prophecy fulfilled. Double uh oh. And that his now-wife is therefore his birth mom. Uh oh for the third time, wife/mom suicides, stabs out own eyes, exiles self. End.

So the Sophocles play is a re-telling of this already well-worn story, at a time when Athenian culture was oriented around annual drama competitions (it came second).

The new spin on the old tale is that it’s told as a whodunnit set over a single day, sunrise to sunset.

In a combination of flashbacks and new action, Oedipus himself acts as detective solving the mystery of the old king’s murder.

We’re already well into Oedipus’ reign over Thebes when the play opens, so his arrival is all backstory, then it’s tragic revelation after tragic revelation as his investigations bear fruit, and–

Oedipus discovers the identity of the mysterious murderer, and it’s him.

What a twist!


I mean, this is “he was dead all along” levels of whoa, right?

So I’ve been trying to think of other whodunnits in which the detective finds out that they themselves are the killer.

I can only think of one and a half, plus one I’m not sure about?

SPOILERS

SPOILERS

SPOILERS

So there’s Fight Club (1999) which, if you see it as a whodunnit in which the protagonist is trying to catch up with Tyler Durden, they discover that yes indeed etc

A clearer fit is Angel Heart (1987) in which Mickey Rourke plays PI Harry Angel who is commissioned by Robert De Niro to dig into a murder, and well you can guess who did it by my topic, and also it turns out that De Niro is the devil.

There is also Memento (2000), maybe, because ironically I can’t remember what happened.

You would have thought that detective-catching-up-with-their-quarry-and-it’s-them would be a common twist.

But yeah, 3.5 auto-whodunnits in 2.5 thousand years is not so many.

There must be more?


In literature:

I can’t think of any Agatha Christies that do this but admittedly I’ve not read too many.

There’s a sci-fi time-loop element to the auto-whodunnit - the investigating time traveller from the future turns out to be the instigator in the past - but although the concept feels tantalisingly familiar, no specific stories come to mind.


I enjoy a Straussian reading and I would like to dowse the hidden, esoteric meaning of Oedipus, Angel Heart and the rest. What is the meaning behind the story?

Freud has his famous interpretation of course but although I am taken with his take on Medusa I don’t think he goes deep enough with Oedipus.

BECAUSE:

My go-to razor for deciphering creative works is that all creative works are fundamentally about the act of creation (2017).

That’s true of Star Wars (the Force is narrative), Blade Runner (the replicants are actors), Hamlet (Shakespeare himself played the ghost), and in a follow-up post I added Groundhog Day (the experience of script iteration) and 1984 (the real omniscient Big Brother is the reader).

Many such cases, as they say.

I call it the Narcissist Creator Razor. They can’t help themselves, those creators, it’s all they know.

So I believe that Oedipus Tyrannus, the original auto-whodunnit, is the ur-exemplar of this razor: what Oedipus tells us is that we can search and search and search for the meaning of a story, and search some more, and ultimately what we’ll find is ourselves, searching.

(Even as an author, part of what you do is try to fully understand what you’re saying in your own creation, so both author and reader are engaged in working to interpret the work.)

i.e. when you interpret Oedipus, you learn that what Oedipus is really about is the act of trying to interpret what Oedipus is really about.

Which makes you want to stab your eyes out, perhaps.

Honestly I’m wasted in the technology world, I should be a philosopher working to understand the nature of reality working to understand itself over an overflowing ashtray in a smoke-filled cafe in 50s Paris.


More posts tagged: inner-and-outer-realities (6), the-ancient-world-is-now (16).

Filtered for wobbly tables and other facts

30 October 2025 at 12:48

1.

When you sit with friends at a wobbly table,
Simply rotate till it becomes stable.
No need to find a wedge for one of its four feet.
Math will ensure nothing spills while you eat.

The Wobbly Table Theorem (Department of Mathematics, Harvard University).
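The theorem's proof is an intermediate value argument: rotating a square table by 90° swaps its diagonals, which flips the sign of the wobble, so somewhere in between the wobble must pass through zero. A small numeric sketch, with an invented bumpy-but-continuous floor standing in for your patio:

```python
import math

def ground(x, y):
    # hypothetical bumpy-but-continuous floor (no special symmetry)
    return (0.1 * math.sin(1.7 * x + 0.3) * math.cos(2.3 * y - 0.5)
            + 0.05 * math.sin(x - 2 * y + 1.0))

def feet(theta, r=1.0):
    # corners of a square table, rotated by theta about its centre
    corners = [math.pi / 4 + k * math.pi / 2 for k in range(4)]
    return [(r * math.cos(theta + a), r * math.sin(theta + a)) for a in corners]

def wobble(theta):
    # difference of the two diagonal height sums; zero means all four feet touch
    h = [ground(x, y) for x, y in feet(theta)]
    return (h[0] + h[2]) - (h[1] + h[3])

def find_stable_angle(lo=0.0, hi=math.pi / 2, tol=1e-10):
    # Rotating 90 degrees swaps the diagonals, so wobble(hi) == -wobble(lo):
    # a sign change is guaranteed and bisection must find a zero.
    if wobble(lo) == 0:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if wobble(lo) * wobble(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

theta = find_stable_angle()
print(theta, wobble(theta))  # wobble at the found angle is ~0
```

So within a quarter turn there is always an angle where the wobble vanishes - no wedge required.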

2.

David Ogilvy changed advertising in 1951.
Shirts sold. Job done.
He used a surprise black eyepatch in the magazine spot:
“story appeal” makes consumers think a lot.

History of advertising: No 110: The Hathaway man’s eyepatch (Campaign, 2014).

3.

Frogs live in ponds.
These massive ones too.
But they dig their own ponds
when nothing else will do.

The world’s biggest frogs build their own ponds (Science, 2019).

4.

Rhyming poems have been going away,
from 70% in 1900 to almost zero today.
You know, I feel like we should all be doing our bit
to reverse the decline. But my poems are terrible.

Can you tell AI from human genius? (James Marriott).


More posts tagged: filtered-for (119).

Some wholesome media

24 October 2025 at 14:23

I’m on my hols. Some recommendations.


Watch The Ballad of Wallis Island.

Charming, poignant comedy about a washed-up folk musician and loss. By Tim Key and Tom Basden.

Now I knew Basden can write - the first episode of Party is the tightest wittiest 25 minutes of ensemble radio you’ll hear - and I love everything Tim Key does as a comedian. But Key really is the revelation. Who knew he could act like that.

Watch on streaming and then listen to the soundtrack.


Play A Short Hike (I played it on Switch).

Indie video game about a cartoon bird hiking and climbing. A play-through will take you about 3 hours.

It’s cute and gentle and fun with a dozen subplots, and by the time I achieved the ostensible goal of the game I had forgotten what the purpose was and it totally took me by surprise. (Which made me cry, for personal reasons, another surprise.)

Also my kid just played this, her first self-guided video game experience. A Short Hike is deftly designed to nudge you forward through lo-fi NPC interactions, and invisibly gates levels of difficulty using geography.

Once you’ve played, watch Adam Robinson-Yu discussing A Short Hike’s design (GDC, 2020).


New daily puzzle: Clues by Sam.

A logic game that’ll take you 10 minutes each day. Follow the clues to figure out who is guilty and who is innocent. It’s easiest on Mondays so maybe begin then.


Meanwhile I’m running woodland trails in Ibiza and the scent of wild rosemary and sage fills the air in the morning. Right now I’m on a cove beach listening to the surf and the others are variously exploring, snacking and sunbathing. See you on the other side.

One more week to the Logic for Programmers Food Drive

24 November 2025 at 18:21

A couple of weeks ago I started a fundraiser for the Greater Chicago Food Depository: get Logic for Programmers 50% off and all the royalties will go to charity.1 Since then, we've raised a bit over $1600. Y'all are great!

The fundraiser is going on until the end of November, so you still have one more week to get the book real cheap.

I feel a bit weird about doing two newsletter adverts without raw content, so here's a teaser from an old project I really need to get back to. Notes on structured concurrency argues that old languages had an "old-testament fire-and-brimstone goto" that could send control flow anywhere, like from the body of one function into the body of another function. This "wild goto", the article claims, is what Dijkstra was railing against in Go To Statement Considered Harmful; modern goto statements are much more limited, "tame" if you will, and wouldn't invoke Dijkstra's ire.

I've shared this historical fact about Dijkstra many times, but recently two separate people have told me it doesn't make sense: Dijkstra used ALGOL-60, which already had tame gotos. All of the problems he raises with goto hold even for tame ones; none are exclusive to wild gotos. So

This got me looking to see which languages, if any, ever had the wild goto. I define this as any goto which lets you jump from outside a loop or function scope to inside it. Turns out, FORTRAN had tame gotos from the start, BASIC has wild gotos, and COBOL is a nonsense language intentionally designed to horrify me. I mean, look at this:

The COBOL ALTER statement, which redefines a goto target

The COBOL ALTER statement changes a goto's target at runtime.

(Early COBOL has tame gotos but only on a technicality: there are no nested scopes in COBOL so no jumping from outside and into a nested scope.)

Anyway I need to write up the full story (and complain about COBOL more) but this is pretty neat! Reminder, fundraiser here. Let's get it to 2k.


  1. Royalties are 80% so if you already have the book you get a bit more bang for your buck by donating to the GCFD directly 

I'm taking a break

27 October 2025 at 21:02

Hi everyone,

I've been getting burnt out on writing a weekly software essay. It's gone from taking me an afternoon to write a post to taking two or three days, and that's made it really difficult to get other writing done. That, plus some short-term work and life priorities, means now feels like a good time for a break.

So I'm taking off from Computer Things for the rest of the year. There might be some announcements and/or one or two short newsletters in the meantime but I won't be attempting a weekly cadence until 2026.

Thanks again for reading!

Hillel

Modal editing is a weird historical contingency we have through sheer happenstance

21 October 2025 at 16:46

A while back my friend Pablo Meier was reviewing some 2024 videogames and wrote this:

I feel like some artists, if they didn't exist, would have the resulting void filled in by someone similar (e.g. if Katy Perry didn't exist, someone like her would have). But others don't have successful imitators or comparisons (thinking Jackie Chan, or Weird Al): they are irreplaceable.

He was using it to describe auteurs but I see this as a property of opportunity, in that "replaceable" artists are those who work in bigger markets. Katy Perry's market is large, visible and obviously (but not easily) exploitable, so there are a lot of people who'd compete in her niche. Weird Al's market is unclear: while there were successful parody songs in the past, it wasn't clear there was enough opportunity there to support a superstar.

I think that modal editing is in the latter category. Vim is now very popular and has spawned numerous successors. But its key feature, modes, is not obviously-beneficial, to the point that if Bill Joy didn't make vi (vim's direct predecessor) fifty years ago I don't think we'd have any modal editors today.

A quick overview of "modal editing"

In a non-modal editor, pressing the "u" key adds a "u" to your text, as you'd expect. In a modal editor, pressing "u" does something different depending on the "mode" you are in. In Vim's default "normal" mode, "u" undoes the last change to the text, while in the "visual" mode it lowercases all selected text. It only inserts the character in "insert" mode. All other keys, as well as chorded shortcuts (ctrl-x), work the same way.

The clearest benefit to this is you can densely pack the keyboard with advanced commands. The standard US keyboard has 48ish keys dedicated to inserting characters. With the ctrl and shift modifiers that becomes at least ~150 extra shortcuts for each other mode. This is also what IMO "spiritually" distinguishes modal editing from contextual shortcuts. Even if a unimodal editor lets you change a keyboard shortcut's behavior based on languages or focused panel, without global user-controlled modes it simply can't achieve that density of shortcuts.
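As a toy illustration of that mode-dependent dispatch (nothing like Vim's actual implementation), the same key routing to different actions depending on the current mode might look like:

```python
# Toy modal dispatcher: the same key maps to different actions per mode.
class Editor:
    def __init__(self):
        self.mode = "normal"
        self.text = []
        self.history = []

    def key(self, k):
        if self.mode == "normal":
            if k == "i":
                self.mode = "insert"            # enter insert mode
            elif k == "u" and self.history:
                self.text = self.history.pop()  # "u" means undo here
        elif self.mode == "insert":
            if k == "\x1b":                     # Esc returns to normal mode
                self.mode = "normal"
            else:
                self.history.append(list(self.text))
                self.text.append(k)             # only here does "u" insert a "u"

ed = Editor()
for k in "iup\x1bu":  # enter insert mode, type "up", Esc, then undo
    ed.key(k)
print("".join(ed.text))  # -> "u"
```

Every key in normal mode is free to carry a command, which is exactly the shortcut density described above.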

Now while modal editing today is widely beloved (the Vim plugin for VSCode has at least eight million downloads), I suspect it was "carried" by the popularity of vi, as opposed to driving vi's popularity.

Modal editing is an unusual idea

Pre-vi editors weren't modal. Some, like EDT/KED, used chorded commands, while others like ed or TECO were basically REPLs for text-editing DSLs. Both of these ideas widely reappear in modern editors.

As far as I can tell, the first modal editor was Butler Lampson's Bravo in 1974. Bill Joy admits he used it for inspiration:

A lot of the ideas for the screen editing mode were stolen from a Bravo manual I surreptitiously looked at and copied. Dot is really the double-escape from Bravo, the redo command. Most of the stuff was stolen.

Bill Joy probably took the idea because he was working on dumb terminals that were slow to register keystrokes, which created pressure to minimize the number of keystrokes needed for complex operations.

Why did Bravo have modal editing? Looking at the Alto handbook, I get the impression that Xerox was trying to figure out the best mouse and GUI workflows. Bravo was an experiment with modes, one hand on the mouse and one issuing commands on the keyboard. Other experiments included context menus (the Markup program) and toolbars (Draw).

Xerox very quickly decided against modes, as the successors Gypsy and BravoX were modeless. Commands originally assigned to English letters were moved to graphical menus, special keys, and chords.

It seems to me that modes started as an unsuccessful experiment to deal with a specific constraint and were later successfully adopted to deal with a different constraint. It was a specialized feature as opposed to a generally useful feature like chords.

Modal editing didn't popularize vi

While vi was popular with Bill Joy's coworkers, he doesn't attribute its success to its features:

I think the wonderful thing about vi is that it has such a good market share because we gave it away. Everybody has it now. So it actually had a chance to become part of what is perceived as basic UNIX. EMACS is a nice editor too, but because it costs hundreds of dollars, there will always be people who won't buy it.

Vi was distributed for free with the popular BSD Unix and was standardized in POSIX Issue 2, meaning all Unix OSes had to have vi. That arguably is what made it popular, and why so many people ended up learning a modal editor.

Modal editing doesn't really spread outside of vim

I think by the 90s, people started believing that modal editing was a Good Idea, if not an obvious one. That's why we see direct descendants of vi, most famously vim. It's also why extensible editors like Emacs and VSCode have vim-mode extensions, but these are always simple emulation layers on top of a unimodal baseline. This was good for getting people used to the vim keybindings (I learned on Kile), but it means people weren't really doing anything with modal editing. It was always "The Vim Gimmick".

Modes also didn't take off anywhere else. There's no modal word processor, spreadsheet editor, or email client.1 Visidata is an extremely cool modal data exploration tool but it's pretty niche. Firefox used to have vimperator (which was inspired by Vim) but that's defunct now. Modal software means modal editing which means vi.

This has been changing a little, though! Nowadays we do see new modal text editors, like kakoune and Helix, that don't just try to emulate vi but do entirely new things. These were made, though, in response to perceived shortcomings in vi's editing model. I think they are still classifiable as descendants. If vi never existed, would the developers of kak and helix have still made modal editors, or would they have explored different ideas?

People aren't clamouring for more experiments

Not too related to the overall picture, but a gripe of mine. Vi and vim have a set of hardcoded modes, and adding an entirely new mode is impossible. Like if a plugin (like vim's default netrw) adds a file explorer it should be able to add a filesystem mode, right? But it can't, so instead it waits for you to open the filesystem and then adds 60 new mappings to normal mode. There's no way to properly add a "filesystem" mode, a "diff" mode, a "git" mode, etc, so plugin developers have to mimic them.

I don't think people see this as a problem, though! Neovim, which aims to fix all of the baggage in vim's legacy, didn't consider creating modes an important feature. Kak and Helix, which reimagine modal editing from the ground up, don't support creating modes either.2 People aren't clamouring for new modes!

Modes are a niche power user feature

So far I've been trying to show that vi is, in Pablo's words, "irreplaceable". Editors weren't doing modal editing before Bravo, and even after vi became incredibly popular, unrelated editors did not adopt modal editing. At most, they got a vi emulation layer. Kak and Helix complicate this story but I don't think they refute it; they appear much later and arguably count as descendants (so are related).

I think the best explanation is that in a vacuum modal editing sounds like a bad idea. The mode is global state that users always have to know, which makes it dangerous. To use new modes well you have to memorize all of the keybindings, which makes it difficult. Modal editing has a brutal skill floor before it becomes more efficient than a unimodal, chorded editor like VSCode.

That's why it originally appears in very specific circumstances, as early experiments in mouse UX and as a way of dealing with modem latencies. The fact we have vim today is a historical accident.

And I'm glad for it! You can pry Neovim from my cold dead hands, you monsters.


P99 talk this Thursday!

My talk, "Designing Low-Latency Systems with TLA+", is happening 10/23 at 11:40 central time. Tickets are free, the conf is online, and the talk's only 16 minutes, so come check it out!


  1. I guess if you squint gmail kinda counts but it's basically an antifeature 

  2. It looks like Helix supports creating minor modes, but these are only active for one keystroke, making them akin to a better, more ergonomic version of vim multikey mappings. 

Refueling a NUCLEAR REACTOR - Smarter Every Day 311

3 November 2025 at 17:35

You can try AnyDesk for free. It's good. https://anydesk.com/smarter
http://www.patreon.com/smartereveryday
Get Email Updates: https://www.smartereveryday.com/email

Click here if you're interested in subscribing: http://bit.ly/Subscribe2SED
⇊ Click below for more links! ⇊
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GET SMARTER SECTION

Learn more about the Browns Ferry Plant at the Tennessee Valley Authority:
https://www.tva.com/energy/our-power-system/nuclear/browns-ferry-nuclear-plant
https://en.wikipedia.org/wiki/Tennessee_Valley_Authority

Here's a link to an article about my visit:
https://www.tva.com/the-powerhouse/stories/inside-browns-ferry

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tweet Ideas to me at:
http://twitter.com/smartereveryday

Smarter Every Day on Facebook
https://www.facebook.com/SmarterEveryDay

Smarter Every Day on Patreon
http://www.patreon.com/smartereveryday

Smarter Every Day On Instagram
http://www.instagram.com/smartereveryday

Smarter Every Day SubReddit
http://www.reddit.com/r/smartereveryday

Ambiance, audio and musicy things by: Gordon McGladdery
https://www.ashellinthepit.com/
http://ashellinthepit.bandcamp.com/

If you feel like this video was worth your time and added value to your life, please SHARE THE VIDEO!

If you REALLY liked it, feel free to pitch a few dollars to Smarter Every Day by becoming a Patron.

Warm Regards,

Destin