The natural home for AI agents is your Reminders app

AI agents do things for you, semi-autonomously, and one question is how we coordinate with them.

By "do things for you" I mean

(btw I use the heck out of Claude Code, despite there being better pure coding models available, proving that the difference is in the quality of the agent harness, i.e. how it approaches problems, and Anthropic has nailed that.)

By "coordinate" what I mean is: once you’ve stated your intent and the agent is doing what you mean (2025), or it’s listened to you and made a suggestion, and it has actioned the tasks for that intent then how do you

  • have visibility
  • repair misunderstandings
  • jump in when you’re needed
  • etc.

Hey did you know that

15 billion hours of time is spent every year by UK citizens dealing with administration in their personal lives.

Such as: "private bills, pensions, debt, services, savings and investments" (and also public services like healthcare and taxes).

It’s called the time tax.

I was chatting with someone last week who has a small collection of home-brew agents to do things like translating medical letters into plain language, and monitoring comms from his kid’s school.

It feels like small beans, agents that do this kind of admin, but it adds up.


Every agent could have its own bespoke interface, all isolated in their own windows, but long-term that doesn’t seem likely.

See, agents have common interface requirements. Apps need buttons and lists and notifications; agents need… what?

What is particular to agents is that they need progress bars not notifications (2023): after decades of making human-computer interaction almost instantaneous, suddenly we have long-running processes again. "Thinking…"

Agents sequence tasks into steps, and they pause on some steps where clarification is needed or, for trust reasons, you want a human in the loop: like, to approve an email which will be sent in the user’s name, or make a payment over a certain threshold, or simply read files in a directory that hasn’t been read before.

Claude Code has nailed how this works re: trust and approvals. Let’s say you’re approving a file edit operation. The permission is cleverly scoped to this time only, this session, or forever; and to individual files and directories.

Claude Code has also nailed plans, an emerging pattern for reliability and tracking progress: structured lists of tasks as text files.

Ahead of doing the work, Claude creates a detailed plan and stores it in one of its internal directories.

You can already entice Claude to make these plans more visible - that’s what I do, structuring the work into phases and testing for each phase - and there’s discussion about making this a built-in behaviour.

Want to see a sample plan file? It’s just some context and a list of to-dos. Check out My Claude Code Workflow And Personal Tips by Zhu Liang.

So… if agents all use plans, put all those plans in one place?


Another quality that is particular to agents: when you’re running multiple agents, each working down its personal plan, and you have a bunch of windows open all asking for permissions or clarifications or next instructions, it feels like plate spinning, and it is a ton of fun.


Task management software is a great way to interact with many plans at once.

Visually, think of a kanban board: columns that show tasks that are upcoming, in progress, for review and done (and the tasks can have subtasks).

Last week on X, Geoffrey Litt (now at Notion) showed a kanban board for managing coding agents: "When an agent needs your input, it turns the task red to alert you that it’s blocked!"

There’s something in the air. Weft (open source, self-hosted) is

a personal task board where AI agents work on your tasks. Create a task, assign it to an agent, and it gets to work. Agents can read your emails, draft responses, update spreadsheets, create PRs, and write code.

It is wild. Write something like "Create a cleaned up Google Doc with notes from yesterday’s standup and then send me an email with the doc link" and then an agent will write actual code to make the doc, summarise the notes, connect to your Gmail etc.

This is great in that you can instruct and then track progress in the same place, you can run many tasks simultaneously and async, and when you jump in to give an approval then you can immediately see all the relevant context.

Ok, great: self-driving to-do lists.

But wouldn’t it be great if all my agents used the same task manager?


Is it really worth special-casing the AI agent here?

Linear is a work to-do list. Sorry, a team collaboration tool oriented around tickets.

Linear for Agents is smart in that they didn’t launch any agents themselves, they simply built hooks to allow AI agents to appear like other users, i.e. the agent has an avatar; you can tag it etc:

Agents are full members of your Linear workspace. You can assign them to issues, add them to projects, or @mention them in comment threads.

(Agents are best seen as teammates (2023).)

In the general case what we’re talking about is a multiplayer to-do list which AI agents can use too.


Really this is just the Reminders app on my iPhone?

Long term, long term, the Schelling point for me, my family, and future AI agents is a task manager with well-scoped, shared to-do lists that I already look at every day.

Apple is incredibly well placed here.

Not only do they have access to all my personal context on the phone, but it turns out they have a great coordination surface too.

So Apple should extend Reminders to work with agents, Linear for Agents style. Let any agent ask for permission to read and write to a list. Let agents pick up tasks; let them add sub-tasks and show when something is blocked; let me delegate tasks to my installed agents.

Then add a new marketplace tab to discover (and pay for) other agents to, I don’t know, plan a wedding, figure out my savings, help with meal planning, chip away at some of those billions of hours of time tax.

The Reminders app is a powerful and emerging app runtime (2021) – if Apple choose to grab the opportunity.



Real like ghosts or real like celebrities?

This chart from the back of Ursula Le Guin’s Always Coming Home lives forever in my head:

Always Coming Home is a collection of texts from the Kesh, a society in far future Northern California which is also, I suppose, a utopian new Bronze Age? A beautiful book.

This chart is in the appendix. It reminds me that

  • we bucket stories of types like journalism and history as “fact” and types like legend and novels as “fiction,” this binary division
  • whereas we could (like the Kesh) accept that no story is clearly fact nor fiction, but instead is somewhere on a continuum.

Myth often has more truth in it than some journalism, right?


There’s a nice empirical typology that breaks down real/not real in this paper about the characters that kids encounter:

To what extent do children believe in real, unreal, natural and supernatural figures relative to each other, and to what extent are features of culture responsible for belief? Are some figures, like Santa Claus or an alien, perceived as more real than figures like Princess Elsa or a unicorn? …

We anticipated that the categories would be endorsed in the following order: ‘Real People’ (a person known to the child, The Wiggles), ‘Cultural Figures’ (Santa Claus, The Easter Bunny, The Tooth Fairy), ‘Ambiguous Figures’ (Dinosaurs, Aliens), ‘Mythical Figures’ (unicorns, ghosts, dragons), and ‘Fictional Figures’ (Spongebob Squarepants, Princess Elsa, Peter Pan).

(The Wiggles are a children’s musical group in Australia.)

btw the researchers found that aliens got bucketed with unicorns/ghosts/dragons, and dinosaurs got bucketed with celebrities (The Wiggles). And adults continue to endorse ghosts more highly than expected, even when unicorns drop away.

Ref.

Kapitány, R., Nelson, N., Burdett, E. R. R., & Goldstein, T. R. (2020). The child’s pantheon: Children’s hierarchical belief structure in real and non-real figures. PLOS ONE, 15(6), e0234142. https://doi.org/10.1371/journal.pone.0234142


What I find most stimulating about this paper is what it doesn’t touch.

Like, it points at the importance of cultural rituals in the belief in the reality of Santa. But I wonder about the role of motivated reasoning (you only receive gifts if you’re a believer). And the coming of age moment where you realise that everyone has been lying to you.

Or the difference between present-day gods and historic gods.

Or the way facts about real-ness change over time: I am fascinated by the unicorn being real-but-unseen to the Medieval mind and fictional to us.

Or how about the difference between Wyatt Earp (real) and Luke Skywalker (not real): the former is intensely fictionalised (the western is a genre and public domain, although based on real people), whereas Star Wars is a “cinematic universe”, which is like a genre but privately owned and with policed continuity (Star Wars should be a genre).


I struggle to find the words to tease apart these types of real-ness.

Not to mention concepts like the virtual (2021): "The virtual is real but not actual" – like, say, power, as in the power of a king to chop off your head.


So I feel like reality is fracturing this century, so much.

Post-truth and truthiness.

The real world, like cyberspace, now a consensual hallucination – meaning that fiction can forge new realities. (Who would have guessed that a post on social media could make Greenland part of the USA? It could happen.)

That we understand the reality that comes from dreams and the subjectivity of reality…

Comedians doing a “bit,” filters on everything, celebrities who may not exist, body doubles, conspiracy theories that turn out to be true, green screen, the natural eye contact setting in FaceTime

Look, I’m not trained in this. I wish I were, it has all been in the academic discourse forever.

Because we’re not dumb, right? We know that celebs aren’t real in the same sense that our close personal friends are real, and - for a community - ghosts are indeed terrifically true, just as the ghost in Hamlet was a consensus hallucination made real, etc.

But I don’t feel like we have, in the mainstream, words that match our intuitions and give us easy ways to talk about reality in this new reality. And I think we could use them.

My top posts in 2025

Hello! This is my summary of 2025 and the “start here” post for new readers. Links and stats follow…

According to Fathom, my most trafficked posts of 2025 were (in descending order):

Here are all the most popular posts: 20 most popular in 2025.

Even more AI than last year.


My personal faves aren’t always the ones that get the most traffic…

  • Homing pigeons fly by the scent of forests and the song of mountains
  • Keeping the seat warm between peaks of cephalopod civilisation
  • Diane, I wrote a lecture by talking about it

Also MAGA fashion, pneumatic elevators, and what the play Oedipus is really about.

Check out my speculative faves from 2025.


Also check out the decade-long Filtered for… series.

Links and rambling interconnectedness. I like these ones.

Posts in 2025 include:

And more.

Here’s the whole Filtered for… series. 2025 posts at the top.


Looking back over 2025, I’ve been unusually introspective.

Possibly because I hit my 25th anniversary with this blog? (Here are my reflections and a follow-up interview.)

Or something else, who knows.

Anyway here’s a collection from this year:


In other writing, I…

A talk I did in June for a WIRED event has just broken a million views. Watch AI Agents: Your Next Employee, or Your Next Boss (YouTube).




Other ways to read:

Or just visit the website: interconnected.org/home.

If you read a post you like, please do pass it along and share on the discords or socials or whatever new thing.

I like email replies. I like it when people send me links they think I’ll enjoy (they’re almost always correct). I especially like hearing about how a post has had a flapping-of-the-butterfly’s-wings effect somehow, whether personally or at work.

I like talking to people most of all. I started opening my calendar for Unoffice Hours over 5 years ago. 400+ calls later and it’s still the highlight of my week. Learn more and book a time here.


You should totally start a blog yourself.

Here are my 15 personal rules for blogging.

If you’re interested in my tech stack, here’s the colophon.

But really, use whatever tech makes it easy for you to write. Just make sure your blogging or newsletter platform lets you publish your posts with an RSS feed. That’s a great marker that you own your own words.


Stats for the stats fans.

  • 2025: 61 posts (58,160 words, 549 links)
  • 2024: 60 posts (62,670 words, 586 links)
  • 2023: 68 posts (69,067 words, 588 links)
  • 2022: 96 posts (104,645 words, 712 links)
  • 2021: 128 posts (103,682 words, 765 links)

My current streak: I’ve been posting weekly or more for 301 weeks.


Looking back over 2025, I’m increasingly straddling this awkward divide:

Where “everything else” is everything from policy suggestions on the need for a strategic fact reserve to going to algoraves to my other speculative faves this year.

Whereas the more bloggy spitball thoughts (which I love, and this is mainly what I wrote in 2020/21/22) are now relegated to occasional compilation posts a.k.a "scraps" – it would be great to give these more space but that doesn’t seem to be where my time is going.

I don’t know what to do about this.

I don’t know if I need to do anything about this.

One of the big reasons that I write here is that it’s my public notebook and so it’s this core sample that cuts across everything that I’m thinking about, which is indeed a weird admixture or melange, and that’s precisely the value for me because that’s how new ideas come, even if that makes this blog hard to navigate and many visitors will just bounce off.

All of which makes me appreciate YOU all the more, dear reader, for sticking with.

Happy 2026.


More posts tagged: meta (20).


More scraps from my notes file

I’m away with family this week so here are some more scraps from my notes (previously).


Disney is considering a reboot of the Indiana Jones franchise. Goodness knows how many Jurassic Park movies there are.

We need to create new IP.

Culture creates new ideas downstream. Without new IP, it’s like trying to feed yourself by eating your own arm.

So: moratorium on re-using IP in movies. The UK makes heavy use of movie subsidies. We should use this to disincentivise anything that sits inside an existing franchise. If a movie’s success is likely more to do with existing mindshare than its content, don’t support it.

Radically reduce copyright down to 10 years or something. More than that: invent a new super-anti-copyright which actively imposes costs on any content which is too close to any existing content in an AI-calculated vibes database or something.

i.e. tax unoriginality.

The past is a foreign country that we should impose tariffs on.


There’s a kind of face that we don’t get anymore.

Neil Armstrong, Christopher Reeve as Superman, Keir Dullea as Dave Bowman in 2001.

I don’t know how to characterise it: open, emotionally imperturbable, happy. Where did it go?


I’m at the cricket today and England are losing. It’s an interesting feeling to be with, losing, especially while 90,000 people in the stadium (plus some visiting fans) are yelling for the winners – Australia, at this point.


I have zero memory for where cutlery goes in the cutlery drawer. I don’t consciously look when I take things out but if everything was moved around, it wouldn’t make any difference. On the occasions that all the knives, forks and spoons have been used and I’m unloading the dishwasher (which I do daily), I cannot for the life of me remember which sections they go in, so I return them in any old order. Raw extended mind.


A few weeks ago I was on a zoom call where someone had a standing mirror in their room in the background. I’ve never seen that before. It kept me weirdly on edge throughout like it violated some previously unstated video call feng shui or something.

(I had another call in which the person’s screen was reflected in a shiny window behind them and so I could see my own face over their shoulder. But that seemed fine. This was not the same.)

My disquiet came because the mirror was angled such that it showed an off-screen part of the room. I could see beyond the bounds; it broke the container.


More posts tagged: scraps-from-the-scraps-file (3).

Filtered for conspiracy theories

1.

Why Were All the Bells in the World Removed? The Forgotten Power of Sound and Frequency (Jamie Freeman).

Church bells: "something strange happened in the 19th and 20th centuries: nearly all of the world’s ancient bells were removed, melted down, or destroyed."

(I don’t know whether that’s true, but go with it for a second.)

Why? Mainstream historians attribute this mass removal to wars and the need for metal, but when you dig deeper, the story doesn’t add up.

An explanation:

Some theorists believe that these bells were part of a Tartarian energy grid, designed to harmonise human consciousness, balance electromagnetic fields, and even generate free energy. Removing the bells would have disrupted this energy network, cutting us off from an ancient technology we no longer understand.

Tartarian?

Tartarian Empire (Wikipedia):

Tartary, or Tartaria, is a historical name for Central Asia and Siberia. Conspiracy theories assert that Tartary, or the Tartarian Empire, was a lost civilization with advanced technology and culture.

2.

Risky Wealth: Would You Dare to Open the Mysterious Sealed Door of Padmanabhaswamy Temple? (Ancient Origins).

"The Padmanabhaswamy Temple is a Hindu temple situated in Thiruvananthapuram, Kerala, a province on the southwestern coast of India."

It has a mysterious Vault B with an as-yet-unopened sealed door.

One of the legends surrounding Vault B is that it is impossible at present to open its door. It has been claimed that the door of the vault is magically sealed by sound waves from a secret chant that is now lost. In addition, it is claimed that only a holy man with the knowledge of this chant would be capable of opening the vault’s door.

Maybe the chant was intended to tap Tartarian energies.

3.

Claims that former US military project is being used to manipulate the weather are “nonsense” (RMIT University).

HAARP is a US research program that uses radio waves to study the ionosphere (Earth’s upper atmosphere) and cannot manipulate weather systems.

PREVIOUSLY:

Artificial weather as a military technology (2020), discussing a 1996 study from the US military, "Weather as a Force Multiplier: Owning the Weather in 2025."

4.

What conspiracy theorists get right (Reasonable People #42, Tom Stafford).

Stafford lists 4 “epistemic virtues” of conspiracy theorists:

  • "Listening to other people"
  • "A healthy skepticism towards state power"
  • "Being sensitive to hidden coalitions"
  • "Willing to believe the absurd"

As traits in a search for new truths, these are good qualities!

Where it goes wrong is "the vices of conspiracy theory seem only to be the virtues carried to excess."

Let’s try to keep to the right side of the line, folks.


More posts tagged: filtered-for (120).

My new fave thing to go to is algoraves

My new fave thing to go to is live coding gigs, a.k.a. algoraves.

There are special browser-based programming languages like strudel where you type code to define the beats and the sound, like mod synth in code, and it plays in a loop even while you’re coding. (The playhead moves along as a little white box.)

As you write more code and edit the code, you make the music.

So people do gigs like this: their laptop is hooked up to (a) speakers and (b) a projector. You see the code on the big screen in real-time as it is written and hear it too.

Here’s what it looks like (Instagram).

That pic is from a crypt under a church in Camberwell at an event called Low Stakes | High Spirits.

(There are more London Live Coding events. I’ve been to an AlgoRhythm night too and it was ace.)


It helps that these beeps and boops are the kind of music I listen to anyway.

But there is something special about the performer performing right there with the audience and vibing off them.

Like all art, there’s some stuff you prefer and some not so much, and sometimes you’ll get some music that is really, really what you’re into and it just builds and builds until you’re totally transported.

So you take a vid or a pic of what’s going on, wanting to capture the moment forever, and what you see when you’re going back through your photo library the next day is endless pics of a bunch of code projected on the wall and you’re like, what is this??

You have to be there.

(I suppose though it also means you can try out some of the code for yourself? View Source but for live music?)


Actually that’s art isn’t it.

All art galleries are a bit weird eh. Each time you visit, there are a hundred paintings scattered in rooms and you walk through like uh-huh, uh-huh, ok, that’s nice, uh-huh, ok. Then at random one of them skewers you through your soul and you’re transfixed by the image for life.


Often what happens is the musician is not alone!

There is also live coding software for visuals e.g. hydra. (hydra is browser-based too so you can try it right now.)

So the person live coding visuals sits right next to the person live coding music, with the music and the visuals projected side-by-side on adjacent big screens. Code overlaid on both.

The visuals don’t necessarily automatically correspond to the music. There may be no microphone involved.

The visuals person and the music person are jamming together but really not off each other directly; both are doing their thing but steered in part by the audience, which is itself responding to the music and visuals together.

So you get this strange loop of vibes and it’s wonderful.


I hadn’t expected to see comments in code.

At the last night I went to, the musician was writing comments in the code, i.e. lines of code that start with // so they are not executed but just there.

The comments like the rest of the code are projected.

There were comments like (not verbatim because this is from memory):

// i’ll make it faster. is this good?

And:

// my face is so red rn this is my first time

So there’s this explicit textual back-channel to the audience that people can read and respond to, separately to the music itself.

And I love the duality there, the two voices of the artist.


You get something similar at academic conferences?

I feel like I must have mentioned this before but I can’t find it.

One of my great joys is going to academic conferences and hearing people present work which is at the far reaches of my understanding. Either sciences/soft sciences or humanities, it’s all good.

My favourite trope is when the researcher self-glosses.

So they read out their paper or their written lecture, and that’s one voice with a certain tone and authority and cadence.

Then every couple of paras they shift their weight on their feet, maybe tilt their head, then add an extended thought or a side note, and their voice becomes brighter and more conversational, just for the duration of that sidebar.

Then they drop back into the regular tone and resume their notes.

Transcribed, a talk like this would read like a single regular essay.

But in person you’re listening to the speaker in dialogue with themselves and it’s remarkable, I love it, it adds a whole extra dimension of meaning.

If you’re an academic then you’ll know exactly what I mean. I’ve noticed these two voices frequently although culture/media studies is where I spot it most.


In Samuel Delany’s Stars in My Pocket Like Grains of Sand (Amazon) - one of my favourite books of all time - there is a species called evelmi and they have many tongues.

I swear there is a scene in which an evelm speaks different words with different tongues simultaneously.

(I can’t find it for you as I only have the paperback and it’s been a while since my last re-read.)

But there’s a precision here, right? To chord words, to triangulate something otherwise unreachable in semantic space or to make a self-contradicting statement, either playfully or to add depth and intention.


Anyway so I love all these dualities at these live coding nights: the music and visuals, the code and the comments, the genotype which I read and the phenotype which I hear.

It’s an incredibly welcoming scene here in London – lots of young people, of course, who are doing things that are minimum 10x cooler than anything I did at that age, and older people too, everyone together.


You know:

Last week the local pub had a band singing medieval carols and suddenly I got that adrift in time, atemporal feeling of knowing that I’m in the company of listeners who have been hearing these same songs for hundreds of years, an audience that is six hundred years deep.

(There was also a harp. Gotta love a harp.)

I never think of myself as a live music person but give me some folk or choral or modern classical or opera and I’m lost in it.

Or, well, electronica, but that’s more about the dancing.

Or the time that dude had a 3D printed replica of a Neanderthal bone flute, the oldest known musical instrument from 50,000 years ago if I remember it right, and he improv’d ancient music led by the sound of the instrument itself as we drove through the Norwegian fjords and holy shit that was a transcendent moment that I will remember forever.


More posts tagged: 20-most-popular-in-2025 (20).

Refinement without Specification

Imagine we have a SQL database with a user table, and users have a non-nullable is_activated boolean column. Having read That Boolean Should Probably Be Something else, you decide to migrate it to a nullable activated_at column. You can change any of the SQL queries that read/update the user table but not any of the code that uses the results of these queries. Can we make this change in a way that preserves all external properties?

Yes. If an update would set is_activated to true, instead set it to the current date. Now define the refinement mapping that takes a new_user and returns an old_user. All columns will be unchanged except is_activated, which will be

f(new_user).is_activated = 
    if new_user.activated_at == NULL 
    then FALSE
    else TRUE

Now new code can use new_user directly while legacy code can use f(new_user) instead, which will behave indistinguishably from the old_user.

A little more time passes and you decide to switch to an event sourcing-like model. So instead of an activated_at column, you have a user_events table, where every record is (user_id, timestamp, event). So adding an activate event will activate the user, adding a deactivate event will deactivate the user. Once again, we can update the queries but not any of the code that uses the results of these queries. Can we make a change that preserves all external properties?

Yes. If an update would change is_activated, instead have it add an appropriate record to the event table. Now, define the refinement mapping that takes newer_user and returns new_user. The activated_at field will be computed like this:

g(newer_user).activated_at =
        # last_activated_event
    let lae = 
            newer_user.events
                      .filter(event = "activate" | "deactivate")
                      .last,
    in
        if lae.event == "activate" 
        then lae.timestamp
        else NULL

Now new code can use newer_user directly while old code can use g(newer_user) and the really old code can use f(g(newer_user)).
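
To make the layering concrete, here is a minimal sketch of the two mappings in Python rather than SQL, using hypothetical dict-shaped rows (the post defines them over database rows, so treat this as illustration only):

def f(new_user):
    # old-style view: is_activated is derived from activated_at
    old_user = dict(new_user)
    old_user["is_activated"] = new_user["activated_at"] is not None
    del old_user["activated_at"]
    return old_user

def g(newer_user):
    # new-style view: activated_at is derived from the event log
    relevant = [e for e in newer_user["events"]
                if e["event"] in ("activate", "deactivate")]
    last = relevant[-1] if relevant else None
    return {"user_id": newer_user["user_id"],
            "activated_at": last["timestamp"]
                            if last and last["event"] == "activate" else None}

newer_user = {"user_id": 1,
              "events": [{"event": "activate", "timestamp": "2025-01-01"}]}
print(g(newer_user))     # {'user_id': 1, 'activated_at': '2025-01-01'}
print(f(g(newer_user)))  # {'user_id': 1, 'is_activated': True}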

Mutability constraints

I said "these preserve all external properties" and that was a lie. It depends on the properties we explicitly have, and I didn't list any. The real interesting properties for me are mutability constraints on how the system can evolve. So let's go back in time and add a constraint to user:

C1(u) = u.is_activated => u.is_activated'

This constraint means that if a user is activated, any change will preserve its activated-ness. This means a user can go from deactivated to activated but not the other way. It's not a particularly good constraint but it's good enough for teaching purposes. Such a SQL constraint can be enforced with triggers.

Now we can throw a constraint on new_user:

C2(nu) = nu.activated_at != NULL => nu.activated_at' != NULL

If nu satisfies C2, then f(nu) satisfies C1. So the refinement still holds.

With newer_u, we cannot guarantee that g(newer_u) satisfies C2 because we can go from "activated" to "deactivated" just by appending a new event. So it's not a refinement. This is fixable by removing deactivation events from the model; that would work too.
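
To make that failure concrete, here is a small self-contained Python check, using a hypothetical helper that computes the mapped activated_at from an event list: appending a deactivate event takes the mapped value from non-NULL back to NULL, which is exactly the transition C2 forbids.

def mapped_activated_at(events):
    # the activated_at that g would derive from this event list
    relevant = [e for e in events if e["event"] in ("activate", "deactivate")]
    if relevant and relevant[-1]["event"] == "activate":
        return relevant[-1]["timestamp"]
    return None

before = [{"event": "activate", "timestamp": "2025-01-01"}]
after = before + [{"event": "deactivate", "timestamp": "2025-02-01"}]

print(mapped_activated_at(before))  # '2025-01-01' -- C2's antecedent holds
print(mapped_activated_at(after))   # None -- the update broke C2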

So a more interesting case is bad_user, a refinement of user that has both activated_at and activated_until. We propose the refinement mapping b:

b(bad_user).is_activated =
    if bad_user.activated_at == NULL && bad_user.activated_until == NULL
    then FALSE
    else bad_user.activated_at <= now() < bad_user.activated_until

But now if enough time passes, b(bad_user).is_activated' = FALSE, so this is not a refinement either.

The punchline

Refinement is one of the most powerful techniques in formal specification, but also one of the hardest for people to understand. I'm starting to think that the reason it's so hard is that people learn refinement while they're also learning formal methods, so they're faced with an unfamiliar topic in an unfamiliar context. If that's the case, then maybe it's easier to introduce refinement in a more common context, like databases.

I've written a bit about refinement in the normal context here (showing one specification is an implementation of another). I kinda want to work this explanation into the book but it might be too late for big content additions like this.

(Food for thought: how do refinement mappings relate to database views?)

My Gripes with Prolog

For the next release of Logic for Programmers, I'm finally adding the sections on Answer Set Programming and Constraint Logic Programming that I TODOd back in version 0.9. And this is making me re-experience some of my pain points with Prolog, which I will gripe about now. If you want to know more about why Prolog is cool instead, go here or here or here or here.

No standardized strings

ISO "strings" are just atoms or lists of single-character atoms (or lists of integer character codes). The various implementations of Prolog add custom string operators but they are not cross compatible, so code written with strings in SWI-Prolog will not work in Scryer Prolog.

No functions

Code logic is expressed entirely in rules, predicates which return true or false for certain values. For example if you wanted to get the length of a Prolog list, you write this:

?- length([a, b, c], Len).

   Len = 3.

Now this is pretty cool in that it allows bidirectionality, or running predicates "in reverse". To generate lists of length 3, you can write length(L, 3). But it also means that if you want to get the length of a list plus one, you can't do that in one expression, you have to write length(List, Out), X is Out+1.

For a while I thought no functions was a necessary evil for bidirectionality, but then I discovered Picat has functions and works just fine. That by itself is a reason for me to prefer Picat for my LP needs.

(Bidirectionality is a killer feature of Prolog, so it's a shame I so rarely run into situations that use it.)

No standardized collection types besides lists

Aside from atoms (abc) and numbers, there are two data types:

  • Linked lists like [a,b,c,d].
  • Compound terms like dog(rex, poodle), which seem like record types but are actually tuples. You can even convert compound terms to linked lists with =..:
?- L =.. [a, b, c].
   L = a(b, c).
?- a(b, c(c)) =.. L.
   L = [a, b, c(c)].

There are no proper key-value maps or even struct types. Again, this is something that individual distributions can fix (without cross compatibility), but these never feel integrated with the rest of the language.

No boolean values

true and false aren't values, they're control flow statements. true is a noop and false says that the current search path is a dead end, so backtrack and start again. You can't explicitly store true and false as values, you have to implicitly have them in facts (passed(test) instead of test.passed? == true).

This hasn't made any tasks impossible, and I can usually find a workaround to whatever I want to do. But I do think it makes things more inconvenient! Sometimes I want to do something dumb like "get all atoms that don't pass at least three of these rules", and that'd be a lot easier if I could shove intermediate results into a sack of booleans.

(This is called "Negation as Failure". I think this might be necessary to make Prolog a Turing complete general programming language. Picat fixes a lot of Prolog's gripes and still has negation as failure. ASP has regular negation but it's not Turing complete.)

Cuts are confusing

Prolog finds solutions through depth first search, and a "cut" (!) symbol prevents backtracking past a certain point. This is necessary for optimization but can lead to invalid programs.

You're not supposed to use cuts if you can avoid it, so I pretended cuts didn't exist. Which is why I was surprised to find that conditionals are implemented with cuts. Because cuts are spooky dark magic, conditionals sometimes work as I expect them to and sometimes leave out valid solutions, and I have no idea how to tell which it'll be. Usually I find it safer to just avoid conditionals entirely, which means my code gets a lot longer and messier.

Non-cuts are confusing

The original example in the last section was this:

foo(A, B) :-
    \+ (A = B),
    A = 1,
    B = 2.

foo(1, 2) returns true, so you'd expect foo(A, B) to return A=1, B=2. But it returns false. Whereas this works as expected:

bar(A, B) :-
    A = 1,
    B = 2,
    \+ (A = B).

I thought this was because \+ was implemented with cuts, and the Clocksin book suggests it's call(P), !, fail, so this was my prime example about how cuts are confusing. But then I tried this:

?- member(A, [1,2,3]), \+ (A = 2), A = 3.
A = 3. % wtf?

There's no way to get that behavior with cuts! I don't think \+ uses cuts at all! And now I have to figure out why foo(A, B) doesn't return results. Is it floundering? Is it because \+ P only succeeds if P fails, and A = B always succeeds? A closed-world assumption? Something else?1

Straying outside of default queries is confusing

Say I have a program like this:

tree(n, n1).
tree(n, n2).
tree(n1, n11).
tree(n2, n21).
tree(n2, n22).
tree(n11, n111).
tree(n11, n112).

branch(N) :- % two children
    tree(N, C1),
    tree(N, C2),
    C1 @< C2. % ordering

And I want to know all of the nodes that are parents of branches. The normal way to do this is with a query:

?- tree(A, N), branch(N).
A = n, N = n2; % show more...
A = n1, N = n11.

This is interactively making me query for every result. That's usually not what I want: I know the result of my query is finite and I want all of the results at once, so I can count or farble or whatever them. It took a while to figure out that the proper solution is bagof(Template, Goal, Bag), which will "Unify Bag with the alternatives of Template":

?- bagof(A, (tree(A, N), branch(N)), As).

As = [n1], N = n11;
As = [n], N = n2.

Wait crap that's still giving one result at a time, because N is a free variable in bagof so it backtracks over that. It surprises me but I guess it's good to have as an option. So how do I get all of the results at once?

?- bagof(A, N^(tree(A, N), branch(N)), As).

As = [n, n1]

The only difference is the N^Goal, which tells bagof to ignore and group the results of N. As far as I can tell, this is the only place the ISO standard uses ^ to mean anything besides exponentiation. Supposedly it's the existential quantifier? In general whenever I try to stray outside simpler use-cases, especially if I try to do things non-interactively, I run into trouble.

I have mixed feelings about symbol terms

It took me a long time to realize the reason bagof "works" is because infix symbols are mapped to prefix compound terms, so that a^b is ^(a, b), and then different predicates can decide to do different things with ^(a, b).

This is also why you can't just write A = B+1: that unifies A with the compound term +(B, 1). A+1 = B+2 is false, as 1 \= 2. You have to write A+1 is B+2, as is is the operator that converts +(B, 1) to a mathematical term.

(And that fails because is isn't fully bidirectional. The lhs must be a single variable. You have to import clpfd and write A + 1 #= B + 2.)

I don't like this, but I'm a hypocrite for saying that because I appreciate the idea and don't mind custom symbols in other languages. I guess what annoys me is there's no official definition of what ^(a, b) is, it's purely a convention. ISO Prolog uses -(a, b) (aka a-b) as a convention to mean "pairs", and the only way to realize that is to see that an awful lot of standard modules use that convention. But you can use -(a, b) to mean something else in your own code and nothing will warn you of the inconsistency.

Anyway I griped about pairs so I can gripe about sort.

go home sort, ur drunk

This one's just a blunder:

?- sort([3,1,2,1,3], Out).
   Out = [1, 2, 3]. % wat

According to an expert online this is because sort is supposed to return a sorted set, not a sorted list. If you want to preserve duplicates you're supposed to lift all of the values into -($key, $value) compound terms, then use keysort, then extract the values. And, since there's no functions, this process takes at least three lines. This is also how you're supposed to sort by a custom predicate, like "the second value of a compound term".

(Most (but not all) distributions have a duplicate-preserving sort like msort. SWI-Prolog also has a sort by key but it removes duplicates.)

Please just let me end rules with a trailing comma instead of a period, I'm begging you

I don't care if it makes fact parsing ambiguous, I just don't want "reorder two lines" to be a syntax error anymore


I expect by this time tomorrow I'll have been Cunningham'd and there will be a 2000 word essay about how all of my gripes are either easily fixable by doing XYZ or how they are the best possible choice that Prolog could have made. I mean, even in writing this I found out some fixes to problems I had. Like I was going to gripe about how I can't run SWI-Prolog queries from the command line but, in doing due diligence, I finally figured it out:

swipl -t halt -g "bagof(X, Goal, Xs), print(Xs)" ./file.pl

It's pretty clunky but still better than the old process of having to enter an interactive session every time I wanted to validate a script change.

(Also, answer set programming is pretty darn cool. Excited to write about it in the book!)


  1. A couple of people mentioned using dif/2 instead of \+ A = B. Dif is great but usually I hit the negation footgun with things like \+ foo(A, B), bar(B, C), baz(A, C), where dif/2 isn't applicable. 

The Liskov Substitution Principle does more than you think

Happy New Year! I'm done with the newsletter hiatus and am going to try updating weekly again. To ease into things a bit, I'll try to keep posts a little more off the cuff and casual for a while, at least until Logic for Programmers is done. Speaking of which, v0.13 should be out by the end of this month.

So for this newsletter I want to talk about the Liskov Substitution Principle (LSP). Last week I read A SOLID Load of Bull by cryptographer Loup Vaillant, where he argues the SOLID principles of OOP are not worth following. He makes an exception for LSP, but also claims that it's "just subtyping" and further:

If I were trying really hard to be negative about the Liskov substitution principle, I would stress that it only applies when inheritance is involved, and inheritance is strongly discouraged anyway.

LSP is more interesting than that! In the original paper, A Behavioral Notion of Subtyping, Barbara Liskov and Jeannette Wing start by defining a "correct" subtyping as follows:

Subtype Requirement: Let ϕ(x) be a property provable about objects x of type T. Then ϕ(y) should be true for objects y of type S where S is a subtype of T.

From then on, the paper determines what guarantees that a subtype is correct.1 They identify three conditions:

  1. Each of the subtype's methods has the same or weaker preconditions and the same or stronger postconditions as the corresponding supertype method.2
  2. The subtype satisfies all state invariants of the supertype.
  3. The subtype satisfies all "history properties" of the supertype. 3 e.g. if a supertype has an immutable field, the subtype cannot make it mutable.

(Later, Elisa Baniassad and Alexander Summers would realize these are equivalent to "the subtype passes all black-box tests designed for the supertype", which I wrote a little bit more about here.)

I want to focus on the first rule about preconditions and postconditions. This refers to the method's contract. For a function f, f.Pre is what must be true going into the function, and f.Post is what the function guarantees on execution. A canonical example is square root:

sqrt.Pre(x) = x >= 0
sqrt.Post(x, out) = out >= 0 && out*out == x
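
As a rough sketch of what that contract means operationally, here it is as runtime assertions in Python (using approximate equality, since exact out*out == x doesn't hold for floats; this is an illustration, not how the paper formalises it):

import math

def sqrt(x: float) -> float:
    assert x >= 0                                      # sqrt.Pre(x)
    out = math.sqrt(x)
    assert out >= 0 and math.isclose(out * out, x)     # sqrt.Post(x, out)
    return out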

Mathematically we would write this as all x: f.Pre(x) => f.Post(x) (where => is the implication operator). If that relation holds for all x, we say the function is "correct". With this definition we can actually formally deduce the first subtyping requirement. Let caller be some code that uses a method, which we will call super, and let both caller and super be correct. Then we know the following statements are true:

  1. caller.Pre && stuff => super.Pre
  2. super.Pre => super.Post
  3. super.Post && more_stuff => caller.Post

Now let's say we substitute super with sub, which is also correct. Here is what we now know is true:

  1. caller.Pre => super.Pre
- 2. super.Pre => super.Post
+ 2. sub.Pre => sub.Post
  3. super.Post => caller.Post

When is caller still correct? When we can fill in the "gaps" in the chain, aka if super.Pre => sub.Pre and sub.Post => super.Post. In other words, if sub's preconditions are weaker than (or equivalent to) super's preconditions and if sub's postconditions are stronger than (or equivalent to) super's postconditions.

Notice that I never actually said sub was from a subtype of super! The LSP conditions (at least, the contract rule of LSP) don't just apply to subtypes but can be applied in any situation where we substitute one function or block of code for another. Subtyping is a common place where this happens, but by no means the only one! We can also substitute across time. Any time we modify some code's behavior, we are effectively substituting the new version in for the old version, and so the new version's contract must be compatible with the old version's to guarantee no existing code is broken.

For example, say we maintain an API or function with two required inputs, X and Y, and one optional input, Z. Making Z required strengthens the precondition ("input must have Z" is stronger than "input may have Z"), so potentially breaks existing users of our API. Making Y optional weakens the precondition ("input may have Y" is weaker than "input must have Y"), so is guaranteed to be compatible.
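
A small Python sketch of that, with a hypothetical create_widget function standing in for the API (x, y, z mirror the X, Y, Z above):

# Old contract: X and Y required, Z optional.
def create_widget(x, y, z=None):
    return {"x": x, "y": y, "z": z}

# Making Z required strengthens the precondition: an existing call like
# create_widget(1, 2) now raises TypeError, so this version is breaking.
def create_widget_strict(x, y, z):
    return {"x": x, "y": y, "z": z}

# Making Y optional weakens the precondition: every existing call still
# works, so this version is backwards compatible.
def create_widget_relaxed(x, y=None, z=None):
    return {"x": x, "y": y, "z": z}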

(This also underpins The robustness principle: "be conservative in what you send, be liberal in what you accept".)

Now the dark side of all this is Hyrum's Law. In the below code, are new's postconditions stronger than old's postconditions?

def old():
    return {"a": "foo", "b": "bar"}

def new():
    return {"a": "foo", "b": "bar", "c": "baz"}

On first appearance, this is a strengthened postcondition: out.contains_keys([a, b, c]) => out.contains_keys([a, b]). But now someone does this:

my_dict = {"c": "blat"} 
my_dict |= new()
assert my_dict[c] == "blat"

Oh no, their code now breaks! They saw old had the postcondition "out does NOT contain "c" as a key", and then wrote their code expecting that postcondition. In a sense, any change to the postcondition can potentially break someone. "All observable behaviors of your system will be depended on by somebody", as Hyrum's Law puts it.

So we need to be explicit in what our postconditions actually are, and properties of the output that are not part of our explicit postconditions are subject to being violated in the next version. You'll break people's workflows but you also have grounds to say "I warned you".

Overall, Liskov and Wing did their work in the context of subtyping, but the principles are more widely applicable, certainly to more than just the use of inheritance.


  1. Though they restrict it to just safety properties

  2. The paper lists a couple of other authors as introducing the idea of "contra/covariance rules", but part of being "off-the-cuff and casual" means not diving into every referenced paper. So they might have gotten the pre/postconditions thing from an earlier author, dunno for sure! 

  3. I believe that this is equivalent to the formal methods notion of a refinement

Some Fun Software Facts

Last newsletter of the year!

First some news on Logic for Programmers. Thanks to everyone who donated to the feedchicago charity drive! In total we raised $2250 for Chicago food banks. Proof here.

If you missed buying Logic for Programmers real cheap in the charity drive, you can still get it for $10 off with the holiday code hannukah-presents. This will last from now until the end of the year. After that, I'll be raising the price from $25 to $30.

Anyway, to make this more than just some record keeping, let's close out with something light. I'm one of those people who loves hearing "fun facts" about stuff. So here's some random fun facts I accumulated about software over the years:

  • Computer systems have to deal with leap seconds in order to keep UTC (where one day is 86,400 seconds) in sync with UT1 (where one day is exactly one full earth rotation). The people in charge recently passed a resolution to abolish the leap second by 2035, letting UTC and UT1 slowly drift out of sync.
  • The backslash character basically didn't exist in writing before 1930, and was only added to ASCII so mathematicians (and ALGOLists) could write /\ and \/. Its popular use in computing stems entirely from being a useless key on the keyboard.
  • Galactic Algorithms are algorithms that are theoretically faster than algorithms we use, but only at scales that make them impractical. For example, matrix multiplication of NxN is normally O(N^2.81). The Coppersmith Winograd algorithm is O(N^2.38), but is so complex that it's vastly slower for even 10,000 x 10,000 matrices. It's still interesting in advancing our mathematical understanding of algorithms!
  • Mergesort is older than bubblesort. Quicksort is slightly younger than bubblesort but older than the term "bubblesort". Bubblesort, btw, does have some uses.
  • Speaking of mergesort, most implementations of mergesort pre-2006 were broken. Basically the problem was that the "find the midpoint of a list" step could overflow if the list was big enough. For C with 32-bit signed integers, "big enough" meant over a billion elements, which was why the bug went unnoticed for so long. (There's a sketch of the overflow after this list.)
  • People make fun of how you have to flip USBs three times to get them into a computer, but there's supposed to be a guide: according to the standard, USBs are supposed to be inserted logo-side up. Of course, this assumes that the port is right-side up, too, which is why USB-C is just symmetric.
  • I was gonna write a fun fact about how all spreadsheet software treats 1900 as a leap year, as that was a bug in Lotus 1-2-3 and everybody preserved backwards compatibility. But I checked and Google sheets considers it a normal year. So I guess the fun fact is that things have changed!
  • Speaking of spreadsheet errors, in 2020 biologists changed the official nomenclature of 27 genes because Excel kept parsing their names as dates. F.ex MARCH1 was renamed to MARCHF1 to avoid being parsed as "March 1st". Microsoft rolled out a fix for this... three years later.
  • It is possible to encode any valid JavaScript program with just the characters ()+[]!. This encoding is called JSFuck and was once used to distribute malware on Ebay.
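
About that mergesort midpoint bug: here is a minimal Python sketch that simulates the 32-bit signed arithmetic (Python ints don't overflow on their own), just to show how mid = (low + high) / 2 goes wrong while low + (high - low) / 2 stays in range.

def as_int32(x):
    # reinterpret x as a two's-complement 32-bit signed integer
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

low, high = 1_500_000_000, 1_600_000_000  # plausible indices into a huge array
mid_broken = as_int32(low + high) // 2    # the classic overflow bug
mid_safe = low + (high - low) // 2        # stays within int32 range

print(mid_broken)  # -597483648
print(mid_safe)    # 1550000000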

Happy holidays everyone, and see you in 2026!


  1. Current status update: I'm finally getting line by line structural editing done and it's turning up lots of improvements, so I'm doing more rewrites than I expected to be doing. 

Web Perf Hero: Thiemo Kreuz

Today we recognise Thiemo’s broad impact in improving the performance of Wikimedia software: from optimizing code across the MediaWiki stack, as felt on Wikipedia.org, to speeding up CI for faster developer feedback. This work benefits us every day!

Thiemo Kreuz works in the Technical Wishes team at Wikimedia Deutschland. He did most of this performance work as a paid software developer. “We are free to spend a portion of our time on side projects like these”, Thiemo wrote to us.

Performance as part of a routine

The tools on performance.wikimedia.org are part of building a culture of performance. These tools help you understand how code performs in production and on real devices. These tools empower developers to maintain performance through regular assessment and incremental improvement. Perf matters, because improving performance is an essential step toward equity of access!

We celebrate Thiemo’s tireless efforts with a story about performance as part of a routine, rather than one specific change. We’ll look at a few examples, but there are many other interesting Git commits if you’re curious for more.

Wikitext editor

The CodeMirror extension for MediaWiki provides syntax highlighting, for example, when editing template pages.

“I found a nasty performance issue in CodeMirror’s syntax highlighter for wikitext that was sitting there for a really, really long time”, Thiemo wrote about T270317 and T270237, which would cause your browser to freeze on long articles. “But nobody could figure out why. Answer: Bad regexes with missing boundary assertions.”

VisualEditor template editor

With the WMDE Technical Wishes team, Thiemo worked on VisualEditor’s template dialog and dramatically improved its performance. “This is mostly about lazy-loading parts of the UI”, Thiemo wrote. This matters because the community maintains templates that sometimes define several hundred parameters.

Faster stylesheet compilation

ResourceLoader is the MediaWiki delivery system for frontend styles, scripts, and localisation. It uses the Less.php library for stylesheet compilation. Thiemo heavily optimized the stylesheet parser through native function calls, inlining, and other techniques. This resulted in a 15% reduction in this change, 8% in this change, 5% in another change, and several more changes after that.

The motivation for this work was faster feedback from CI. While we compile only a handful of Less stylesheets during a page view, we have several hundred Less stylesheet files in our codebase. Our CI automatically checks all frontend assets for compilation errors, without needing dedicated unit tests. This speed-up brought us one step closer to realising the 5-minute pipeline.

Codesniffer rules

MediaWiki has extensive static analysis rules that automate and codify things we learned over two decades. Many such rules are implemented using PHP_CodeSniffer and run both locally and in CI via the composer test command. New rules are developed all the time and discussed in Phabricator. These new rules come at a cost.

“I keep coming back to our MediaWiki ruleset for PHPCS to check if it still runs as fast as it used to”, Thiemo wrote. “I find this particularly interesting because it requires a very specific ‘unfair’ type of optimization: We don’t care how slow the unhappy path is when it finds errors, because that’s the exceptional case that typically never happens. But we care a lot about the happy path, because that gets executed over and over again with every CI run.”

Example changes: 3X faster MultipleEmptyLines rule, 10X faster EmptyTag documentation rule.

Back to basics

Thiemo likes improving low-level libraries and frameworks, such as wikimedia/services and OOUI. “The idea is that even the tiniest of optimizations can make a notable difference, because a piece of library code is executed so often”, Thiemo wrote.

Web Perf Hero award

The Web Perf Hero award is given to individuals who have gone above and beyond to improve the web performance of Wikimedia projects. The initiative started in 2020 and takes the form of a Phabricator badge. You can find past recipients at the Web Perf Hero award page on Wikitech.

Unifying our mobile and desktop domains

How we achieved 20% faster mobile response times, improved SEO, and reduced infrastructure load.

Until now, when you visited a wiki (like en.wikipedia.org), the server responded in one of two ways: a desktop page, or a redirect to the equivalent mobile URL (like en.m.wikipedia.org). This mobile URL in turn served the mobile version of the page from MediaWiki. Our servers have operated this way since 2011, when we deployed MobileFrontend.

Diagram of the technical change. Before: the Wikimedia CDN responds to requests from mobile clients with a redirect from en.wikipedia.org to en.m.wikipedia.org, and en.m.wikipedia.org then responds with the mobile HTML. After: the Wikimedia CDN responds directly with the mobile HTML.

Over the past two months we unified the mobile and desktop domain for all wikis (timeline). This means we no longer redirect mobile users to a separate domain while the page is loading.

We completed the change on Wednesday 8 October after deploying to English Wikipedia. The mobile domains became dormant within 24 hours, which confirms that most mobile traffic arrived on Wikipedia via the standard domains and thus experienced a redirect until now.[1][2]

Why?

Why did we have a separate mobile domain? And, why did we believe that changing this might benefit us?

The year is 2008 and all sorts of websites large and small have a mobile subdomain. The BBC, IMDb, Facebook, and newspapers around the world featured the iconic m-dot domain. For Wikipedia, a separate mobile domain made the mobile experiment low-risk to launch and avoided technical limitations. It became the default in 2011 by way of a redirect.

Fast-forward seventeen years, and much has changed. It is no longer common for websites to have m-dot domains. Wikipedia’s use of it is surprising to our present day audience, and it may decrease the perceived strength of domain branding. The technical limitations we had in 2008 have long been solved, with the Wikimedia CDN having efficient and well-tested support for variable responses under a single URL. And above all, we had reason to believe Google stopped supporting separate mobile domains, which motivated the project to start when it did.
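
This is not Wikimedia's actual implementation, but as a sketch of the general idea of variable responses under a single URL: one endpoint returns either the mobile or the desktop HTML based on a device signal, and sets a Vary header so caches store the variants separately.

from http.server import BaseHTTPRequestHandler, HTTPServer

DESKTOP_HTML = b"<html><body>Desktop article</body></html>"
MOBILE_HTML = b"<html><body>Mobile article</body></html>"

class ArticleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # crude device detection, for illustration only
        ua = self.headers.get("User-Agent", "")
        body = MOBILE_HTML if "Mobile" in ua else DESKTOP_HTML
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # tell caches that the response varies by User-Agent,
        # so mobile and desktop copies are keyed separately
        self.send_header("Vary", "User-Agent")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), ArticleHandler).serve_forever()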

You can find a detailed history and engineering analysis in the Mobile domain sunsetting RFC along with weekly updates on mediawiki.org.

Site speed

Google used to link from mobile search results directly to our mobile domain, but last year this stopped. This exposed a huge part of our audience to the mobile redirect and regressed mobile response times by 10-20%.[2]

Google supported mobile domains in 2008 by letting you advertise a separate mobile URL. While Google only indexed the desktop site for content, they stored this mobile URL and linked to it when searching from a mobile device.[3] This allowed Google referrals to skip over the redirect.

Google introduced a new crawler in 2016, and gradually re-indexed the Internet with it.[4-7] This new “mobile-first” crawler acts like a mobile device rather than a desktop device, and removes the ability to advertise a separate mobile or desktop link. It’s now one link for everyone! Wikipedia.org was among the last sites Google switched, with May 2024 as the apparent change window.[2] This meant the 60% of incoming pageviews referred by Google now had to wait for the same redirect that the other 40% of referrals have experienced since 2011.[8]

Persian Wikipedia saw a quarter second cut in the “responseStart” metric from 1.0s to 0.75s.

Unifying our domains eliminated the redirect and led to a 20% improvement in mobile response times.[2] This improvement is both a recovery and a net-improvement because it applies to everyone! It recovers the regression that Google-referred traffic started to experience last year, but also improves response times for all other traffic by the same amount.

The graphs below show how the change was felt worldwide. The “Worldwide p50” corresponds to what you might experience in Germany or Italy, with fast connectivity close to our data centers. The “Worldwide p80” resembles what you might experience in Iran browsing the Persian Wikipedia.

Worldwide p80 regressed 11% from 0.63s to 0.70s, then reduced 18% from 0.73s to 0.60s. Worldwide p75 regressed 13% to 0.61s, then reduced 19% to 0.52s. Worldwide p50 regressed 22% to 0.33s, then reduced 21% to 0.27s. Full table in the linked comment on Phabricator.
Check Perf report to explore the underlying data and for other regions.

SEO

The first site affected was not Wikipedia but Commons. Wikimedia Commons is the free media repository used by Wikipedia and its sister projects. Tim Starling found in June that only half of the 140 million pages on Commons were known to Google.[9] Of these known pages, 20 million had also been delisted, and that number had been growing by one million pages every month.[10] The cause of the delisting turned out to be the mobile redirect. You see, the new Google crawler, just like your browser, also has to follow the mobile redirect.

After following the redirect, the crawler reads our page metadata which points back to the standard domain as the preferred one. This creates a loop that can prevent a page from being updated or listed in Google Search. Delisting is not a matter of ranking, but about whether a page is even in the search index.

Tim and I disabled the mobile redirect for “Googlebot on Commons” through an emergency intervention on June 23rd. Referrals then began to come back, and kept rising for eleven weeks in a row, until reaching a 100% increase in Google referrals: from a baseline of 3 million weekly pageviews up to 6 million. Google’s data on clickthroughs shows a similar increase, from 1M to 1.8M “clicks”.[9]

Pageviews to Wikimedia Commons with type “user” (meaning not a known bot or spider) and referrer Google. After July 2025, they increase from 3 million to 6 million per week.
Google-referred pageviews in 2025.
A stable 1.0 million clicks per week in June and early July, then an increase to 1.8 million clicks per week in mid-July, where it stayed.
Weekly clicks (according to Google Search Console).

We reversed last year’s regression and set a new all-time high. We think there are three reasons Commons reached new highs:

  1. The redirect consumed half of the crawl budget, thus limiting how many pages could be crawled.[10][11]
  2. Google switched Commons to its new crawler some years before Wikipedia.[12] The index had likely been shrinking for two years already.
  3. Pages on Commons have a sparse link graph. Wikipedia has a rich network of links between articles, whereas pages on Commons represent a photo with an image description that rarely links to other files. This unique page structure makes it hard to discover Commons pages through recursive crawling without a sitemap.

Unifying our domains lifted a ceiling we didn’t know was there!

The MediaWiki software has a built-in sitemap generator, but we disabled this on Wikimedia sites over a decade ago.[13] We decided to enable it for Commons and submitted it to Google on August 6th.[14][15] Google has since indexed 70 million new pages for Commons, up 140% since June.[9]

We also found that less than 0.1% of videos on Commons were recognised by Google as video watch pages (for the Google Search “Videos” tab). I raised this in a partnership meeting with Google Search, and it may’ve been a bug on their end. Commons started showing up in Google Videos a week later.[16][17]

Link sharing UX

Links shared from a mobile device previously hardcoded the mobile domain, so they presented the mobile site even when opened on desktop. The “Desktop” link in the footer of the mobile site pointed to the standard domain and disabled the standard-to-mobile redirect for you, on the assumption that you arrived on the mobile site via the redirect. But this choice was not remembered on the mobile domain itself, and there existed no equivalent mobile-to-standard redirect for when you arrived there directly. This meant a shared mobile link always presented the mobile site, even after opting out on desktop.

Everyone now shares the same domain, which naturally shows the appropriate version.

There is a long tail of stable referrals from news articles, research papers, blogs, talk pages, and mailing lists that refer to the mobile domain. We plan to support this indefinitely. To limit operational complexity, we now serve these through a simple whole-domain redirect. This has the benefit of retroactively fixing the UX issue because old mobile links now redirect to the standard domain.[18]
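
For concreteness, the whole-domain redirect can be pictured as in the sketch below: any request to a dormant m-dot host receives a permanent redirect to the same path on the canonical domain. This is an illustrative C# sketch, not the actual CDN rule, and the 301 status code is my assumption for a permanent redirect.

    // Illustrative sketch of the m-dot redirect: same path, canonical host.
    // The 301 status is an assumption, not a quote from Wikimedia's config.
    using System;

    class MDotRedirectDemo
    {
        static (int Status, string Location) Redirect(Uri request) =>
            (301, new UriBuilder(request) { Host = request.Host.Replace(".m.", "."), Port = -1 }.Uri.ToString());

        static void Main()
        {
            var (status, location) = Redirect(new Uri("https://en.m.wikipedia.org/wiki/Ada_Lovelace"));
            Console.WriteLine($"{status} -> {location}");   // 301 -> https://en.wikipedia.org/wiki/Ada_Lovelace
        }
    }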

This resolves a long-standing bug with workarounds in the form of shared user scripts,[19] browser extensions,[20] and personal scripts.[24]

Infrastructure load

After publishing an edit, MediaWiki instructs the Wikimedia CDN to clear the cache of affected articles (“purge”). It has been a perennial concern from SRE teams at WMF that our CDN purge rates are unsustainable. For every purge from MediaWiki core, the MobileFrontend extension would add a copy for the mobile domain.

Daily purge workload.

After unifying our domains we turned off these duplicate purges, and cut the MediaWiki purge rate by 50%. Over the past weeks the Wikimedia CDN processed approximately 4 billion fewer purges a day. MediaWiki used to send purges at a baseline rate of 40K/second with spikes up to 300K/second, and both have been halved. Factoring in other services, the Wikimedia CDN now receives 20% to 40% fewer purges per second overall, depending on the edit activity.[18]
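
A tiny sketch of that fan-out (illustrative C#, not MediaWiki’s actual purge code): before the change, one edit produced purge URLs for both the canonical and the m-dot host; now only the canonical URL needs purging.

    // Illustrative only: the old behaviour doubled every purge for the m-dot
    // host; the new behaviour purges the canonical URL alone.
    using System;
    using System.Collections.Generic;

    class PurgeDemo
    {
        static IEnumerable<string> PurgeUrls(string article, bool duplicateForMobile)
        {
            yield return $"https://en.wikipedia.org/wiki/{article}";
            if (duplicateForMobile)                       // old MobileFrontend behaviour
                yield return $"https://en.m.wikipedia.org/wiki/{article}";
        }

        static void Main()
        {
            foreach (var url in PurgeUrls("Ada_Lovelace", duplicateForMobile: false))
                Console.WriteLine($"PURGE {url}");        // one purge instead of two
        }
    }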

Footnotes

  1. T403510: Main rollout, Wikimedia Phabricator.
  2. T405429: Detailed traffic stats and performance reports, Wikimedia Phabricator.
  3. Running desktop and mobile versions of your site (2009), developers.google.com.
  4. Mobile-first indexing (2016), developers.google.com.
  5. Google makes mobile-first indexing default for new domains (2019), TechCrunch.
  6. Mobile-first indexing has landed (2023), developers.google.com.
  7. Mobile indexing vLast final final (Jun 2024), developers.google.com.
  8. Mobile domain sunsetting RFC § Footnote: Wikimedia pageviews (Feb 2025), mediawiki.org.
  9. T400022: Commons SEO review, Wikimedia Phabricator.
  10. T54647: Image pages not indexed by Google, Wikimedia Phabricator.
  11. Crawl Budget Management For Large Sites, developers.google.com.
  12. I don’t have a guestimate for when Google switched Commons to its new crawler. I pinpointed May 2024 as the switch date for Wikipedia based on the new redirect impacting page load times (i.e. a non-zero fetch delay). For Commons, this fetch delay was already non-zero since at least 2018. This suggests Google’s old crawler linked mobile users to Commons canonical domain, unlike Wikipedia which it linked to the mobile domain until last year. Raw perf data: P73601.
  13. History of sitemaps at Wikimedia by Tim Starling, wikitech.wikimedia.org.
  14. T396684: Develop Sitemap API for MediaWiki, Wikimedia Phabricator.
  15. T400023: Deploy Sitemap API for Commons, Wikimedia Phabricator.
  16. T396168: Video pages not indexed by Google, Wikimedia Phabricator.
  17. Google Videos Search results for commons.wikimedia.org.
  18. T405931: Clean up and redirect, Wikimedia Phabricator.
  19. Wikipedia:User scripts/List on en.wikipedia.org. Featuring NeverUseMobileVersion, AutoMobileRedirect, and unmobilePlus.
  20. Redirector (10,000 users), Chrome Web Store.
  21. How can I force my desktop browser to never use mobile Wikipedia (2018), StackOverflow.
  22. Skip Mobile Wikipedia (726 users), Firefox Add-ons.
  23. Search for “mobile wikipedia”, Firefox Add-ons.
  24. Mobile domain sunsetting 2025 Announcement § Personal script workarounds (Sep 2025), mediawiki.org.

About this post

Featured image by PierreSelim, CC BY 3.0, via Wikimedia Commons.

The stack circuitry of the Intel 8087 floating point chip, reverse-engineered

Early microprocessors were very slow when operating with floating-point numbers. But in 1980, Intel introduced the 8087 floating-point coprocessor, performing floating-point operations up to 100 times faster. This was a huge benefit for IBM PC applications such as AutoCAD, spreadsheets, and flight simulators. The 8087 was so effective that today's computers still use a floating-point system based on the 8087.1

The 8087 was an extremely complex chip for its time, containing somewhere between 40,000 and 75,000 transistors, depending on the source.2 To explore how the 8087 works, I opened up a chip and took numerous photos of the silicon die with a microscope. Around the edges of the die, you can see the hair-thin bond wires that connect the chip to its 40 external pins. The complex patterns on the die are formed by its metal wiring, as well as the polysilicon and silicon underneath. The bottom half of the chip is the "datapath", the circuitry that performs calculations on 80-bit floating point values. At the left of the datapath, a constant ROM holds important constants such as π. At the right are the eight registers that form the stack, along with the stack control circuitry.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm. Click for a larger image.

The chip's instructions are defined by the large microcode ROM in the middle. This ROM is very unusual; it is semi-analog, storing two bits per transistor by using four transistor sizes. To execute a floating-point instruction, the 8087 decodes the instruction and the microcode engine starts executing the appropriate micro-instructions from the microcode ROM. The decode circuitry to the right of the ROM generates the appropriate control signals from each micro-instruction. The bus registers and control circuitry handle interactions with the main 8086 processor and the rest of the system. Finally, the bias generator uses a charge pump to create a negative voltage to bias the chip's substrate, the underlying silicon.

The stack registers and control circuitry (in red above) are the subject of this blog post. Unlike most processors, the 8087 organizes its registers in a stack, with instructions operating on the top of the stack. For instance, the square root instruction replaces the value on the top of the stack with its square root. You can also access a register relative to the top of the stack, for instance, adding the top value to the value two positions down from the top. The stack-based architecture was intended to improve the instruction set, simplify compiler design, and make function calls more efficient, although it didn't work as well as hoped.

The stack on the 8087. From The 8087 Primer, page 60.

The diagram above shows how the stack operates. The stack consists of eight registers, with the Stack Top (ST) indicating the current top of the stack. To push a floating-point value onto the stack, the Stack Top is decremented and then the value is stored in the new top register. A pop is performed by copying the value from the stack top and then incrementing the Stack Top. In comparison, most processors specify registers directly, so register 2 is always the same register.
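
Here is a minimal sketch of that behavior in C# (a model of the description above, not the 8087's actual circuitry): eight registers, a 3-bit Stack Top pointer that wraps modulo 8, and ST(i) addressing relative to the top.

    // Toy model of the 8087's register stack: eight slots, a wrapping 3-bit
    // Stack Top (ST) pointer, and ST(i) addressing. decimal stands in for the
    // 80-bit floating-point format.
    using System;

    class StackModel
    {
        readonly decimal[] regs = new decimal[8];
        int top;                                   // 3-bit Stack Top pointer

        public void Push(decimal value)
        {
            top = (top - 1) & 0b111;               // decrement ST, wrapping within 3 bits
            regs[top] = value;
        }

        public decimal Pop()
        {
            decimal value = regs[top];
            top = (top + 1) & 0b111;               // increment ST, wrapping within 3 bits
            return value;
        }

        // ST(i): the register i positions down from the top of the stack.
        public decimal St(int i) => regs[(top + i) & 0b111];

        static void Main()
        {
            var s = new StackModel();
            s.Push(2m);
            s.Push(3m);                            // ST(0)=3, ST(1)=2
            Console.WriteLine(s.St(0) + s.St(1));  // adding the top to ST(1) -> 5
        }
    }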

The registers

The stack registers occupy a substantial area on the die of the 8087 because floating-point numbers take many bits. A floating-point number consists of a fractional part (sometimes called the mantissa or significand), along with the exponent part; the exponent allows floating-point numbers to cover a range from extremely small to extremely large. In the 8087, floating-point numbers are 80 bits: 64 bits of significand, 15 bits of exponent, and a sign bit. An 80-bit register was very large in the era of 8-bit or 16-bit computers; the eight registers in the 8087 would be equivalent to 40 registers in the 8086 processor.

The registers in the 8087 form an 8×80 grid of cells. The close-up shows an 8×8 block. I removed the metal layer with acid to reveal the underlying silicon circuitry.

The registers store each bit in a static RAM cell. Each cell has two inverters connected in a loop. This circuit forms a stable feedback loop, with one inverter on and one inverter off. Depending on which inverter is on, the circuit stores a 0 or a 1. To write a new value into the circuit, one of the lines is pulled low, flipping the loop into the desired state. The trick is that each inverter uses a very weak transistor to pull the output high, so its output is easily overpowered to change the state.

Two inverters in a loop can store a 0 or a 1.

These inverter pairs are arranged in an 8 × 80 grid that implements eight words of 80 bits. Each of the 80 rows has two bitlines that provide access to a bit. The bitlines provide both read and write access to a bit; the pair of bitlines allows either inverter to be pulled low to store the desired bit value. Eight vertical wordlines enable access to one word, one column of 80 bits. Each wordline turns on 160 pass transistors, connecting the bitlines to the inverters in the selected column. Thus, when a wordline is enabled, the bitlines can be used to read or write that word.

Although the chip looks two-dimensional, it actually consists of multiple layers. The bottom layer is silicon. The pinkish regions below are where the silicon has been "doped" to change its electrical properties, making it an active part of the circuit. The doped silicon forms a grid of horizontal and vertical wiring, with larger doped regions in the middle. On top of the silicon, polysilicon wiring provides two functions. First, it provides a layer of wiring to connect the circuit. But more importantly, when polysilicon crosses doped silicon, it forms a transistor. The polysilicon provides the gate, turning the transistor on and off. In this photo, the polysilicon is barely visible, so I've highlighted part of it in red. Finally, horizontal metal wires provide a third layer of interconnecting wiring. Normally, the metal hides the underlying circuitry, so I removed the metal with acid for this photo. I've drawn blue lines to represent the metal layer. Contacts provide connections between the various layers.

A close-up of a storage cell in the registers. The metal layer and most of the polysilicon have been removed to show the underlying silicon.

The layers combine to form the inverters and selection transistors of a memory cell, indicated with the dotted line below. There are six transistors (yellow), where polysilicon crosses doped silicon. Each inverter has a transistor that pulls the output low and a weak transistor to pull the output high. When the word line (vertical polysilicon) is active, it connects the selected inverters to the bit lines (horizontal metal) through the two selection transistors. This allows the bit to be read or written.

The function of the circuitry in a storage cell.

Each register has two tag bits associated with it, an unusual form of metadata to indicate if the register is empty, contains zero, contains a valid value, or contains a special value such as infinity. The tag bits are used to optimize performance internally and are mostly irrelevant to the programmer. As well as being accessed with a register, the tag bits can be accessed in parallel as a 16-bit "Tag Word". This allows the tags to be saved or loaded as part of the 8087's state, for instance, during interrupt handling.
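
As an illustration of the Tag Word layout, the sketch below packs eight 2-bit tags into a 16-bit value. The encodings (00 valid, 01 zero, 10 special, 11 empty) are the conventional x87 ones; the code is only a model, not the chip's circuitry.

    // Pack the eight 2-bit register tags into a 16-bit Tag Word.
    using System;

    enum Tag { Valid = 0b00, Zero = 0b01, Special = 0b10, Empty = 0b11 }

    class TagWordDemo
    {
        static ushort Pack(Tag[] tags)               // tags[0] = physical register 0
        {
            ushort word = 0;
            for (int i = 0; i < 8; i++)
                word |= (ushort)((int)tags[i] << (2 * i));
            return word;
        }

        static void Main()
        {
            var tags = new Tag[8];
            Array.Fill(tags, Tag.Empty);             // an empty stack: all tags 11
            Console.WriteLine($"{Pack(tags):X4}");   // FFFF
            tags[7] = Tag.Valid;                     // one value loaded into register 7
            Console.WriteLine($"{Pack(tags):X4}");   // 3FFF
        }
    }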

The decoder

The decoder circuit, wedged into the middle of the register file, selects one of the registers. A register is specified internally with a 3-bit value. The decoder circuit energizes one of the eight register select lines based on this value.

The decoder circuitry is straightforward: it has eight 3-input NOR gates to match one of the eight bit patterns. The select line is then powered through a high-current driver that uses large transistors. (In the photo below, you can compare the large serpentine driver transistors to the small transistors in a bit cell.)

The decoder circuitry has eight similar blocks to drive the eight select lines.
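
Functionally, the decoder behaves like the sketch below: each select line is a 3-input NOR fed with true or complemented address bits, so exactly one line goes high for a given 3-bit register number. This is an illustrative model, not a trace of the actual gates.

    // Gate-level model of a 3-to-8 decoder built from 3-input NOR gates.
    using System;

    class DecoderDemo
    {
        static bool Nor(bool a, bool b, bool c) => !(a | b | c);

        static bool[] Decode(bool b2, bool b1, bool b0)
        {
            var select = new bool[8];
            for (int n = 0; n < 8; n++)
            {
                // Feed each NOR the bit or its complement so that all inputs
                // are low exactly when (b2 b1 b0) == n.
                bool i2 = ((n >> 2) & 1) == 1 ? !b2 : b2;
                bool i1 = ((n >> 1) & 1) == 1 ? !b1 : b1;
                bool i0 = (n & 1) == 1 ? !b0 : b0;
                select[n] = Nor(i2, i1, i0);
            }
            return select;
        }

        static void Main()
        {
            var lines = Decode(true, false, true);         // address 101 = register 5
            Console.WriteLine(Array.IndexOf(lines, true)); // 5
        }
    }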

The decoder has an interesting electrical optimization. As shown earlier, the register select lines are eight polysilicon lines running vertically, the length of the register file. Unfortunately, polysilicon has fairly high resistance, better than silicon but much worse than metal. The problem is that the resistance of a long polysilicon line will slow down the system. That is, the capacitance of transistor gates in combination with high resistance causes an RC (resistive-capacitive) delay in the signal.

The solution is that the register select lines also run in the metal layer, a second set of lines immediately to the right of the register file. These lines branch off from the register file about 1/3 of the way down, run to the bottom, and then connect back to the polysilicon select lines at the bottom. This reduces the maximum resistance through a select line, increasing the speed.

A diagram showing how 8 metal lines run parallel to the main select lines. The register file is much taller than shown; the middle has been removed to make the diagram fit.

The stack control circuitry

A stack needs more control circuitry than a regular register file, since the circuitry must keep track of the position of the top of the stack.3 The control circuitry increments and decrements the top of stack (TOS) pointer as values are pushed or popped (purple).4 Moreover, an 8087 instruction can access a register based on its offset, for instance the third register from the top. To support this, the control circuitry can temporarily add an offset to the top of stack position (green). A multiplexer (red) selects either the top of stack or the adder output, and feeds it to the decoder (blue), which selects one of the eight stack registers in the register file (yellow), as described earlier.

The register stack in the 8087. Adapted from Patent USRE33629E. I don't know what the GRX field is. I also don't know why this shows a subtractor and not an adder.

The physical implementation of the stack circuitry is shown below. The logic at the top selects the stack operation based on the 16-bit micro-instruction.5 Below that are the three latches that hold the top of stack value. (The large white squares look important, but they are simply "jumpers" from the ground line to the circuitry, passing under metal wires.)

The stack control circuitry. The blue regions on the right are oxide residue that remained when I dissolved the metal rail for the 5V power.

The three-bit adder is at the bottom, along with the multiplexer. You might expect the adder to use a simple "full adder" circuit. Instead, it is a faster carry-lookahead adder. I won't go into details here, but the summary is that at each bit position, an AND gate produces a Carry Generate signal while an XOR gate produces a Carry Propagate signal. Logic gates combine these signals to produce the output bits in parallel, avoiding the slowdown of the carry rippling through the bits.
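
A small model of the carry-lookahead idea for the 3-bit case: each bit position forms Generate and Propagate signals, and every carry is computed directly from them rather than rippling. Illustrative C#; the 8087's exact gate arrangement may differ.

    // 3-bit carry-lookahead adder: G = a AND b, P = a XOR b, carries in parallel.
    using System;

    class CarryLookahead
    {
        static (int sum, bool carryOut) Add3(int a, int b, bool carryIn = false)
        {
            bool[] g = new bool[3], p = new bool[3];
            for (int i = 0; i < 3; i++)
            {
                bool ai = ((a >> i) & 1) == 1, bi = ((b >> i) & 1) == 1;
                g[i] = ai & bi;    // Carry Generate
                p[i] = ai ^ bi;    // Carry Propagate
            }

            // All carries computed directly from g/p and the carry-in.
            bool c0 = carryIn;
            bool c1 = g[0] | (p[0] & c0);
            bool c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
            bool c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);

            int sum = (p[0] ^ c0 ? 1 : 0) | ((p[1] ^ c1 ? 1 : 0) << 1) | ((p[2] ^ c2 ? 1 : 0) << 2);
            return (sum, c3);
        }

        static void Main()
        {
            var (sum, carry) = Add3(0b101, 0b011);   // 5 + 3 = 8 -> 000 with carry out
            Console.WriteLine($"{Convert.ToString(sum, 2).PadLeft(3, '0')} carry={carry}");
        }
    }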

The incrementer/decrementer uses a completely different approach. Each of the three bits uses a toggle flip-flop. A few logic gates determine if each bit should be toggled or should keep its previous value. For instance, when incrementing, the top bit is toggled if the lower bits are 11 (e.g. incrementing from 011 to 100). For decrementing, the top bit is toggled if the lower bits are 00 (e.g. 100 to 011). Simpler logic determines if the middle bit should be toggled. The bottom bit is easier, toggling every time whether incrementing or decrementing.
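
The toggle logic can be modelled like this (an illustration of the description above, not the actual flip-flop circuit):

    // Toggle-based 3-bit incrementer/decrementer: each bit flips only when
    // the lower bits call for it.
    using System;

    class ToggleCounter
    {
        static int Step(int value, bool increment)
        {
            bool b0 = (value & 1) == 1, b1 = (value & 2) == 2, b2 = (value & 4) == 4;

            // Bottom bit toggles on every step, up or down.
            bool t0 = true;
            // Middle bit toggles when the bit below is 1 (increment) or 0 (decrement).
            bool t1 = increment ? b0 : !b0;
            // Top bit toggles when the lower bits are 11 (increment) or 00 (decrement).
            bool t2 = increment ? (b0 && b1) : (!b0 && !b1);

            if (t0) b0 = !b0;
            if (t1) b1 = !b1;
            if (t2) b2 = !b2;
            return (b0 ? 1 : 0) | (b1 ? 2 : 0) | (b2 ? 4 : 0);
        }

        static void Main()
        {
            Console.WriteLine(Step(0b011, increment: true));   // 3 -> 4
            Console.WriteLine(Step(0b100, increment: false));  // 4 -> 3
            Console.WriteLine(Step(0b000, increment: false));  // wraps to 7
        }
    }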

The schematic below shows the circuitry for one bit of the stack. Each bit is implemented with a moderately complicated flip-flop that can be cleared, loaded with a value, or toggled, based on control signals from the microcode. The flip-flop is constructed from two set-reset (SR) latches. Note that the flip-flop outputs are crossed when fed back to the input, providing the inversion for the toggle action. At the right, the multiplexer selects either the register value or the sum from the adder (not shown), generating the signals to the decoder.

Schematic of one bit of the stack.

Drawbacks of the stack approach

According to the designers of the 8087,7 the main motivation for using a stack rather than a flat register set was that instructions didn't have enough bits to address multiple register operands. In addition, a stack has "advantages over general registers for expression parsing and nested function calls." That is, a stack works well for a mathematical expression since sub-expressions can be evaluated on the top of the stack. And for function calls, you avoid the cost of saving registers to memory, since the subroutine can use the stack without disturbing the values underneath. At least that was the idea.

The main problem is "stack overflow". The 8087's stack has eight entries, so if you push a ninth value onto the stack, the stack will overflow. Specifically, the top-of-stack pointer will wrap around, obliterating the bottom value on the stack. The 8087 is designed to detect a stack overflow using the register tags: pushing a value to a non-empty register triggers an invalid operation exception.6
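
A sketch of that check, reusing the tag idea from earlier (a hypothetical model, not the 8087's microcode): the push first looks at the tag of the register it is about to overwrite, and a non-empty tag means overflow.

    // Pushing onto a full stack trips the overflow check via the register tags.
    using System;

    class OverflowDemo
    {
        enum Tag { Valid, Zero, Special, Empty }

        static readonly Tag[] tags = { Tag.Empty, Tag.Empty, Tag.Empty, Tag.Empty,
                                       Tag.Empty, Tag.Empty, Tag.Empty, Tag.Empty };
        static int top = 0;

        static void Push()
        {
            int dest = (top - 1) & 0b111;          // where the new value would go
            if (tags[dest] != Tag.Empty)           // non-empty destination: overflow
                throw new InvalidOperationException("stack overflow: invalid operation exception");
            top = dest;
            tags[dest] = Tag.Valid;
        }

        static void Main()
        {
            for (int i = 0; i < 8; i++) Push();    // fill all eight registers
            try { Push(); }                        // the ninth push overflows
            catch (InvalidOperationException e) { Console.WriteLine(e.Message); }
        }
    }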

The designers expected that stack overflow would be rare and could be handled by the operating system (or library code). After detecting a stack overflow, the software should dump the existing stack to memory to provide the illusion of an infinite stack. Unfortunately, bad design decisions made it difficult "both technically and commercially" to handle stack overflow.

One of the 8087's designers (Kahan) attributes the 8087's stack problems to the time difference between California, where the designers lived, and Israel, where the 8087 was implemented. Due to a lack of communication, each team thought the other was implementing the overflow software. It wasn't until the 8087 was in production that they realized that "it might not be possible to handle 8087 stack underflow/overflow in a reasonable way. It's not impossible, just impossible to do it in a reasonable way."

As a result, the stack was largely a problem rather than a solution. Most 8087 software saved the full stack to memory before performing a function call, creating more memory traffic. Moreover, compilers turned out to work better with regular registers than a stack, so compiler writers awkwardly used the stack to emulate regular registers. The GCC compiler reportedly needs 3000 lines of extra code to support the x87 stack.

In the 1990s, Intel introduced a new floating-point system called SSE, followed by AVX in 2011. These systems use regular (non-stack) registers and provide parallel operations for higher performance, making the 8087's stack instructions largely obsolete.

The success of the 8087

At the start, Intel was unenthusiastic about producing the 8087, viewing it as unlikely to be a success. John Palmer, a principal architect of the chip, had little success convincing skeptical Intel management that the market for the 8087 was enormous. Eventually, he said, "I'll tell you what. I'll relinquish my salary, provided you'll write down your number of how many you expect to sell, then give me a dollar for every one you sell beyond that."7 Intel didn't agree to the deal—which would have made a fortune for Palmer—but they reluctantly agreed to produce the chip.

Intel's Santa Clara engineers shunned the 8087, considering it unlikely to work: the 8087 would be two to three times more complex than the 8086, with a die so large that a wafer might not have a single working die. Instead, Rafi Nave, at Intel's Israel site, took on the risky project: “Listen, everybody knows it's not going to work, so if it won't work, I would just fulfill their expectations or their assessment. If, by chance, it works, okay, then we'll gain tremendous respect and tremendous breakthrough on our abilities.”

A small team of seven engineers developed the 8087 in Israel. They designed the chip on Mylar sheets: a millimeter on Mylar represented a micron on the physical chip. The drawings were then digitized on a Calma system by clicking on each polygon to create the layout. When the chip was moved into production, the yield was very low but better than feared: two working dies per four-inch wafer.

The 8087 ended up being a large success, said to have been Intel's most profitable product line at times. The success of the 8087 (along with the 8088) cemented the reputation of Intel Israel, which eventually became Israel's largest tech employer. The benefits of floating-point hardware proved to be so great that Intel integrated the floating-point unit into later processors starting with the 80486 (1989). Nowadays, most modern computers, from cellphones to mainframes, provide floating point based on the 8087, so I consider the 8087 one of the most influential chips ever created.

For more, follow me on Bluesky (@righto.com), Mastodon (@kenshirriff@oldbytes.space), or RSS. I wrote some articles about the 8087 a few years ago, including the die, the ROM, the bit shifter, and the constants, so you may have seen some of this material before.

Notes and references

  1. Most computers now use the IEEE 754 floating-point standard, which is based on the 8087. This standard has been awarded a milestone in computation. 

  2. Curiously, reliable sources differ on the number of transistors in the 8087 by almost a factor of 2. Intel says 40,000, as does designer William Kahan (link). But in A Numeric Data Processor, designers Rafi Nave and John Palmer wrote that the chip contains "the equivalent of over 65,000 devices" (whatever "equivalent" means). This number is echoed by a contemporary article in Electronics (1980) that says "over 65,000 H-MOS transistors on a 78,000-mil² die." Many other sources, such as Upgrading & Repairing PCs, specify 45,000 transistors. Designer Rafi Nave stated that the 8087 has 63,000 or 64,000 transistors if you count the ROM transistors directly, but if you count ROM transistors as equivalent to two transistors, then you get about 75,000 transistors. 

  3. The 8087 has a 16-bit Status Word that contains the stack top pointer, exception flags, the four-bit condition code, and other values. Although the Status Word appears to be a 16-bit register, it is not implemented as a register. Instead, parts of the Status Word are stored in various places around the chip: the stack top pointer is in the stack circuitry, the exception flags are part of the interrupt circuitry, the condition code bits are next to the datapath, and so on. When the Status Word is read or written, these various circuits are connected to the 8087's internal data bus, making the Status Word appear to be a monolithic entity. Thus, the stack circuitry includes support for reading and writing it. 

  4. Intel filed several patents on the 8087, including Numeric data processor, another Numeric data processor, Programmable bidirectional shifter, Fraction bus for use in a numeric data processor, and System bus arbitration, circuitry and methodology

  5. I started looking at the stack in detail to reverse engineer the micro-instruction format and determine how the 8087's microcode works. I'm working with the "Opcode Collective" on Discord on this project, but progress is slow due to the complexity of the micro-instructions. 

  6. The 8087 detects stack underflow in a similar manner. If you pop more values from the stack than are present, the tag will indicate that the register is empty and shouldn't be accessed. This triggers an invalid operation exception. 

  7. The 8087 is described in detail in The 8086 Family User's Manual, Numerics Supplement. An overview of the stack is on page 60 of The 8087 Primer by Palmer and Morse. More details are in Kahan's On the Advantages of the 8087's Stack, an unpublished course note (maybe for CS 279?) with a date of Nov 2, 1990 or perhaps August 23, 1994. Kahan discusses why the 8087's design makes it hard to handle stack overflow in How important is numerical accuracy, Dr. Dobbs, Nov. 1997. Another information source is the Oral History of Rafi Nave 

RuBee

I have at least a few readers for which the sound of a man's voice saying "government cell phone detected" will elicit a palpable reaction. In Department of Energy facilities across the country, incidences of employees accidentally carrying phones into secure areas are reduced through a sort of automated nagging. A device at the door monitors for the presence of a tag; when the tag is detected it plays an audio clip. Because this is the government, the device in question is highly specialized, fantastically expensive, and says "government cell phone" even though most of the phones in question are personal devices. Look, they already did the recording, they're not changing it now!

One of the things that I love is weird little wireless networks. Long ago I wrote about ANT+, for example, a failed personal area network standard designed mostly around fitness applications. There's tons of these, and they have a lot of similarities---so it's fun to think about the protocols that went down a completely different path. It's even better, of course, if the protocol is obscure outside of an important niche. And a terrible website, too? What more could I ask for.

The DoE's cell-phone nagging boxes, and an array of related but more critical applications, rely on an unusual personal area networking protocol called RuBee.

RuBee is a product of Visible Assets Inc., or VAI, founded in 2004 1 by John K. Stevens. Stevens seems a somewhat improbable founder, with a background in biophysics and eye health, but he's a repeat entrepreneur. He's particularly fond of companies called Visible: he founded Visible Assets after his successful tenure as CEO of Visible Genetics. Visible Genetics was an early innovator in DNA sequencing, and still provides a specialty laboratory service that sequences samples of HIV in order to detect vulnerabilities to antiretroviral medications.

Clinical trials in the early 2000s exposed Visible Genetics to one of the more frustrating parts of health care logistics: refrigeration. Samples being shipped to the lab and reagents shipped out to clinics were both temperature sensitive. Providers had to verify that these materials had stayed adequately cold throughout shipping and handling, otherwise laboratory results could be invalid or incorrect. Stevens became interested in technical solutions to these problems; he wanted some way to verify that samples were at acceptable temperatures both in storage and in transit.

Moreover, Stevens imagined that these sensors would be in continuous communication. There's a lot of overlap between this application and personal area networks (PANs), protocols like Bluetooth that provide low-power communications over short ranges. There is also clear overlap with RFID; you can buy RFID temperature sensors. VAI, though, coined the term visibility network to describe RuBee. That's visibility as in asset visibility: somewhat different from Bluetooth or RFID, RuBee as a protocol is explicitly designed for situations where you need to "keep tabs" on a number of different objects. Despite the overlap with other types of wireless communications, the set of requirements on a visibility network has led RuBee down a very different technical path.

Visibility networks have to be highly reliable. When you are trying to keep track of an asset, a failure to communicate with it represents a fundamental failure of the system. For visibility networks, the ability to actually convey a payload is secondary: the main function is just reliably detecting that endpoints exist. Visibility networks have this in common with RFID, and indeed, despite its similarities to technologies like BLE, RuBee is positioned mostly as a competitor to UHF RFID.

There are several differences between RuBee and RFID; for example, RuBee uses active (battery-powered) tags, generally built around a complete 4-bit microcontroller. That doesn't necessarily sound like an advantage, though. While RuBee tags advertise a battery life of "5-25 years", the need for a battery seems mostly like a liability. The real feature is what active tags enable: RuBee operates in the low frequency (LF) band, typically at 131 kHz.

At that low frequency, the wavelength is very long, about 2.3 km. With such a long wavelength, RuBee communications all happen at much less than one wavelength in range. RF engineers refer to this as near-field operation, and it has some properties that are intriguingly different from more typical far-field RF communications. In the near-field, the magnetic field created by the antenna is more significant than the electrical field. RuBee devices are intentionally designed to emit very little electrical RF signal. Communications within a RuBee network are achieved through magnetic, not electrical fields. That's the core of RuBee's magic.

The idea of magnetic coupling is not unique to RuBee. Speaking of the near-field, there's an obvious comparison to NFC which works much the same way. The main difference, besides the very different logical protocols, is that NFC operates at 13.56 MHz. At this higher frequency, the wavelength is only around 20 meters. The requirement that near-field devices be much closer than a full wavelength leads naturally to NFC's very short range, typically specified as 4 cm.
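
The wavelength numbers fall straight out of wavelength = c / f; the little calculation below (plain C#, just arithmetic) shows why 131 kHz leaves every practical link deep in the near field while 13.56 MHz does not.

    // wavelength = speed of light / frequency
    using System;

    class WavelengthDemo
    {
        const double C = 299_792_458;          // speed of light, m/s

        static void Main()
        {
            Console.WriteLine($"RuBee 131.072 kHz: {C / 131_072:F0} m");   // ~2287 m
            Console.WriteLine($"NFC   13.56 MHz:   {C / 13_560_000:F1} m"); // ~22.1 m
        }
    }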

At LF frequencies, RuBee can achieve magnetic coupling at ranges up to about 30 meters. That's a range comparable to, and often much better than, RFID inventory tracking technologies. Improved range isn't RuBee's only benefit over RFID. The properties of magnetic fields also make it a more robust protocol. RuBee promises significantly less vulnerability to shielding by metal or water than RFID.

There are two key scenarios where this comes up: the first is equipment stored in metal containers or on metal shelves, or equipment that is itself metallic. In that scenario, it's difficult to find a location for an RFID tag that won't suffer from shielding by the container. The case of water might seem less important, but keep in mind that people are made mostly of water. RFID reading is often unreliable for objects carried on a person, which are likely to be shielded from the reader by the water content of the body.

These problems are not just theoretical. WalMart is a major adopter of RFID inventory technology, and in early rollouts struggled with low successful read rates. Metal, moisture (including damp cardboard boxes), antenna orientation, and multipath/interference effects could cause read failure rates as high as 33% when scanning a pallet of goods. Low read rates are mostly addressed by using RFID "portals" with multiple antennas. Eight antennas used as an array greatly increase read rate, but at a cost of over ten thousand dollars per portal system. Even so, WalMart seems to now target a success rate of only 95% during bulk scanning.

95% might sound pretty good, but there are a lot of visibility applications where a failure rate of even a couple percent is unacceptable. These mostly go by the euphemism "high value goods," which depending on your career trajectory you may have encountered in corporate expense and property policies. High-value goods tend to be items that are both attractive to theft and where theft has particularly severe consequences. Classically, firearms and explosives. Throw in classified material for good measure.

I wonder if Stevens was surprised by RuBee's market trajectory. He came out of the healthcare industry and, it seems, originally developed RuBee for cold chain visibility... but, at least in retrospect, it's quite obvious that its most compelling application is in the armory.

Because RuBee tags are small and largely immune to shielding by metals, you can embed them directly in the frames of firearms, or as an aftermarket modification you can mill out some space under the grip. RuBee tags in weapons will read reliably when they are stored in metal cases or on metal shelving, as is often the case. They will even read reliably when a weapon is carried holstered, close to a person's body.

Since RuBee tags incorporate an active microcontroller, there are even more possibilities. Temperature logging is one thing, but firearm-embedded RuBee tags can incorporate an accelerometer (NIST-traceable, VAI likes to emphasize) and actually count the rounds fired.


Sidebar time: there is a long history of political hazard around "smart guns." The term "smart gun" is mostly used more specifically for firearms that identify their user, for example by fingerprint authentication or detection of an RFID fob. The idea has become vague enough, though, that mention of a firearm with any type of RFID technology embedded would probably raise the specter of the smart gun to gun-rights advocates.

Further, devices embedded in firearms that count the number of rounds fired have been proposed for decades, if not a century, as a means of accountability. The holder of a weapon could, in theory, be required to positively account for every round fired. That could eliminate incidents of unreported use of force by police, for example. In practice I think this is less compelling than it sounds: simple counting of rounds leaves too many opportunities to fudge the numbers and conceal real-world use of a weapon as range training, for example.

That said, the NRA has long been vehemently opposed to the incorporation of any sort of technology into weapons that could potentially be used as a means of state control or regulation. The concern isn't completely unfounded; the state of New Jersey did, for a time, have legislation that would have made user-identifying "smart guns" mandatory if they were commercially available. The result of the NRA's strident lobbying is that no such gun has ever become commercially available; "smart guns" have been such a political third rail that any firearms manufacturer that dared to introduce one would probably face a boycott by most gun stores. For better or worse, a result of the NRA's powerful political advocacy in this area is that the concept of embedding security or accountability technology into weapons has never been seriously pursued in the US. Even a tentative step in that direction can produce a huge volume of critical press for everyone involved.

I bring this up because I think it explains some of why VAI seems a bit vague and cagey about the round-counting capabilities of their tags. They position it as purely a maintenance feature, allowing the armorer to keep accurate tabs on the preventative maintenance schedule for each individual weapon (in armory environments, firearm users are often expected to report how many rounds they fired for maintenance tracking reasons). The resistance of RuBee tags to concealment is only positioned as a deterrent to theft, although the idea of RuBee-tagged firearms creates obvious potential for security screening. Probably the most profitable option for VAI would be to promote RuBee-tagged firearms as a tool for enforcement of gun control laws, but this is a political impossibility and bringing it up at all could cause significant reputational harm, especially with the government as a key customer. The result is marketing copy that is a bit odd, giving a set of capabilities that imply an application that is never mentioned.


VAI found an incredible niche with their arms-tracking application. Institutional users of firearms, like the military, police, and security forces, are relatively price-insensitive and may have strict accounting requirements. By the mid-'00s, VAI was into the long sales cycle of proposing the technology to the military. That wasn't entirely unsuccessful. RuBee shot-counting weapon inventory tags were selected by the Naval Surface Warfare Center in 2010 for installation on SCAR and M4 rifles. That contract had a five-year term; it's unclear to me if it was renewed. Military contracting opened quite a few doors to VAI, though, and created a commercial opportunity that they eagerly pursued.

Perhaps most importantly, weapons applications required an impressive round of safety and compatibility testing. RuBee tags have the fairly unique distinction of military approval for direct attachment to ordnance, something called "zero separation distance" as the tags do not require a minimum separation from high explosives. Central to that certification are findings of intrinsic safety of the tags (that they do not contain enough energy to trigger explosives) and that the magnetic fields involved cannot convey enough energy to heat anything to dangerous temperatures.

That's not the only special certification that RuBee would acquire. The military has a lot of firearms, but military procurement is infamously slow and mercurial. Improved weapon accountability is, almost notoriously, not a priority for the US military which has often had stolen weapons go undetected until their later use in crime. The Navy's interest in RuBee does not seem to have translated to more widespread military applications.

Then you have police departments, probably the largest institutional owners of firearms and a very lucrative market for technology vendors. But here we run into the political hazard: the firearms lobby is very influential on police departments, as are police unions which generally oppose technical accountability measures. Besides, most police departments are fairly cash-poor and are not likely to make a major investment in a firearms inventory system.

That leaves us with institutional security forces. And there is one category of security force that is particularly well-funded, well-equipped, and beholden to highly R&D-driven, almost pedantic standards of performance: the protection forces of atomic energy facilities.

Protection forces at privately-operated atomic energy facilities, such as civilian nuclear power plants, are subject to licensing and scrutiny by the Nuclear Regulatory Commission. Things step up further at the many facilities operated by the National Nuclear Security Administration (NNSA). Protection forces for NNSA facilities are trained at the Department of Energy's National Training Center, at the former Manzano Base here in Albuquerque. Concern over adequate physical protection of NNSA facilities has led Sandia National Laboratories to become one of the premier centers for R&D in physical security. Teams of scientists and engineers have applied sometimes comical scientific rigor to "guns, gates, and guards," the traditional articulation of physical security in the nuclear world.

That scope includes the evaluation of new technology for the management of protection forces, which is why Oak Ridge National Laboratory launched an evaluation program for the RuBee tagging of firearms in their armory. The white paper on this evaluation is curiously undated, but citations "retrieved 2008" lead me to assume that the evaluation happened right around the middle of the '00s. At the time, VAI seems to have been involved in some ultimately unsuccessful partnership with Oracle, leading to the branding of the RuBee system as Oracle Dot-Tag Server. The term "Dot-Tag" never occurs outside of very limited materials around the Oracle partnership, so I'm not sure if it was Oracle branding for RuBee or just some passing lark. In any case, Oracle's involvement seems to have mainly just been the use of the Oracle database for tracking inventory data---which was naturally replaced by PostgreSQL at Oak Ridge.

The Oak Ridge trial apparently went well enough, and around the same time, the Pantex Plant in Texas launched an evaluation of RuBee for tracking classified tools. Classified tools are a tricky category, as they're often metallic and often stored in metallic cases. During the trial period, Pantex tagged a set of sample classified tools with RuBee tags and then transported them around the property, testing the ability of the RuBee controllers to reliably detect them entering and exiting areas of buildings. Simultaneously, Pantex evaluated the use of RuBee tags to track containers of "chemical products" through the manufacturing lifecycle. Both seem to have produced positive results.

There are quite a few interesting and strange aspects of the RuBee system, a result of its purpose-built Visibility Network nature. A RuBee controller can have multiple antennas that it cycles through. RuBee tags remain in a deep-sleep mode for power savings until they detect a RuBee carrier during their periodic wake cycle. When a carrier is detected, they fully wake and listen for traffic. A RuBee controller can send an interrogate message and any number of tags can respond, with an interesting and novel collision detection algorithm used to ensure reliable reading of a large number of tags.

The actual RuBee protocol is quite simple, and can also be referred to as IEEE 1902.1 since the decision of VAI to put it through the standards process. Packets are small and contain basic addressing info, but they can also contain arbitrary payload in both directions, perfect for data loggers or sensors. RuBee tags are identified by something that VAI oddly refers to as an "IP address," causing some confusion over whether or not VAI uses IP over 1902.1. They don't, I am confident saying after reading a whole lot of documents. RuBee tags, as standard, have three different 4-byte addresses. VAI refers to these as "IP, subnet, and MAC," 2 but these names are more like analogies. Really, the "IP address" and "subnet" are both configurable arbitrary addresses, with the former intended for unicast traffic and the latter for broadcast. For example, you would likely give each asset a unique IP address, and use subnet addresses for categories or item types. The subnet address allows a controller to interrogate for every item within that category at once. The MAC address is a fixed, non-configurable address derived from the tag's serial number. They're all written in the formats we associate with IP networks, dotted-quad notation, as a matter of convenience.
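
To make the addressing concrete, here is a toy model in C# of how I understand those three addresses to be used: the configurable "IP" for unicast, the configurable "subnet" for group interrogation, and the fixed "MAC" derived from the serial number. All of the addresses, names, and the interrogation logic below are made up for illustration; this is not VAI's software.

    // Toy model of RuBee's dotted-quad addressing and subnet interrogation.
    using System;
    using System.Linq;

    record RubeeTag(string Ip, string Subnet, string Mac);

    class AddressingDemo
    {
        static void Main()
        {
            var armory = new[]
            {
                new RubeeTag("10.0.0.1", "10.0.0.0", "0.17.34.51"),   // rifle #1
                new RubeeTag("10.0.0.2", "10.0.0.0", "0.17.34.52"),   // rifle #2
                new RubeeTag("10.0.1.1", "10.0.1.0", "0.17.34.53"),   // classified tool
            };

            // A controller interrogating the "rifles" subnet gets replies from
            // every tag configured with that subnet address.
            var replies = armory.Where(t => t.Subnet == "10.0.0.0");
            foreach (var tag in replies)
                Console.WriteLine($"{tag.Ip} (MAC {tag.Mac}) responded");
        }
    }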

And that's about it as far as the protocol specification goes, besides of course the physical details: a 131,072 Hz carrier, a 1,024 Hz data clock, and either ASK or BPSK modulation. The specification also describes an interesting mode called "clip," in which a set of multiple controllers interrogate in exact synchronization and all tags then reply in exact synchronization. Somewhat counter-intuitively, this is ideal, because RuBee controllers can separate out multiple simultaneous tag transmissions using an anti-collision algorithm based on random phase shifts by each tag. It allows a room, say an armory, full of RuBee controllers to rapidly interrogate the entire contents of the room. I think this feature may have been added after the Oak Ridge trials...

RuBee is quite slow, typically 1,200 baud, so inventorying a large number of assets can take a while (Oak Ridge found that their system could only collect data on 2-7 tags per second per controller). But it's so robust that it can achieve a 100% read rate in some very challenging scenarios. Evaluation by the DoE and the military produced impressive results. You can read, for example, of a military experiment in which a RuBee antenna embedded in a roadway reliably identified rifles secured in steel containers in passing Humvees.

Paradoxically, then, one of the benefits of RuBee in the military/defense context is that it is also difficult to receive. Here is RuBee's most interesting trick: somewhat oversimplified, the strength of an electrical radio signal goes as 1/r, while the strength of a magnetic field goes as 1/r^3. RuBee equipment is optimized, by antenna design, to produce a minimal electrical field. The result is that RuBee tags can very reliably be contacted at short range (say, around ten feet), but are virtually impossible to contact or even detect at ranges over a few hundred feet. To the security-conscious buyer, this is a huge feature. RuBee tags are highly resistant to communications or electronic intelligence collection.
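
A back-of-the-envelope version of that comparison, using the same simplification of 1/r versus 1/r^3 and made-up distances:

    // Compare how much weaker each field is at an eavesdropper's range than
    // at the intended reading range. Distances are illustrative.
    using System;

    class FalloffDemo
    {
        static void Main()
        {
            double readRange = 10, eavesdropRange = 300;      // feet, illustrative
            double ratio = eavesdropRange / readRange;        // 30x farther away

            double electricLoss = ratio;                      // far-field E ~ 1/r
            double magneticLoss = Math.Pow(ratio, 3);         // near-field B ~ 1/r^3

            Console.WriteLine($"Electric field: {electricLoss:N0}x weaker");   // 30x
            Console.WriteLine($"Magnetic field: {magneticLoss:N0}x weaker");   // 27,000x
        }
    }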

Consider the logical implications of tagging the military's rifles. With conventional RFID, range is limited by the size and sensitivity of the antenna. Particularly when tags are incidentally powered by a nearby reader, an adversary with good equipment can detect RFID tags at very long range. VAI heavily references a 2010 DEFCON presentation, for example, that demonstrated detection of RFID tags at a range of 80 miles. One imagines that opportunistic detection by satellite is feasible for a state intelligence agency. That means that your rifle asset tracking is also revealing the movements of soldiers in the field, or at least providing a way to detect their approach.

Most RuBee tags have their transmit power reduced by configuration, so even the maximum 100' range of the protocol is not achievable. VAI suggests that typical RuBee tags cannot be detected by radio direction finding equipment at ranges beyond 20', and that this range can be made shorter by further reducing transmit power.

Once again, we have caught the attention of the Department of Energy. Because of the short range of RuBee tags, they have generally been approved as not representing a COMSEC or TEMPEST hazard to secure facilities. And that brings us back to the very beginning: why does the DoE use a specialized, technically interesting, and largely unique radio protocol to fulfill such a basic function as nagging people that have their phones? Because RuBee's security properties have allowed it to be approved for use adjacent to and inside of secure facilities. A RuBee tag, it is thought, cannot be turned into a listening device because the intrinsic range limitation of magnetic coupling will make it impossible to communicate with the tag from outside of the building. It's a lot like how infrared microphones still see some use in secure facilities, but so much more interesting!

VAI has built several different product lines around RuBee, with names like Armory 20/20 and Shot Counting Allegro 20/20 and Store 20/20. The founder started his career in eye health, remember. None of them are that interesting, though. They're all pretty basic CRUD applications built around polling multiple RuBee controllers for tags in their presence.

And then there's the "Alert 20/20 DoorGuard:" a metal pedestal with a RuBee controller and audio announcement module, perfect for detecting government cell phones.


I put a lot of time into writing this, and I hope that you enjoy reading it. If you can spare a few dollars, consider supporting me on ko-fi. You'll receive an occasional extra, subscribers-only post, and defray the costs of providing artisanal, hand-built world wide web directly from Albuquerque, New Mexico.


One of the strangest things about RuBee is that it's hard to tell if it's still a going concern. VAI's website has a press release section, where nothing has been posted since 2019. The whole website feels like it was last revised even longer ago. When RuBee was newer, back in the '00s, a lot of industry journals covered it with headlines like "the new RFID." I think VAI was optimistic that RuBee could displace all kinds of asset tracking applications, but despite some special certifications in other fields (e.g. approval to use RuBee controllers and tags around pacemakers in surgical suites), I don't think RuBee has found much success outside of military applications.

RuBee's resistance to shielding is impressive, but RFID read rates have improved considerably with new DSP techniques, antenna array designs, and the generally reduced cost of modern RFID equipment. RuBee's unique advantages, its security properties and resistance to even intentional exfiltration, are interesting but not worth much money to buyers other than the military.

So that's the fate of RuBee and VAI: defense contracting. As far as I can tell, RuBee and VAI are about as vital as they have ever been, but RuBee is now installed as just one part of general defense contracts around weapons systems, armory management, and process safety and security. IEEE standardization has opened the door to use of RuBee by federal contractors under license, and indeed, Lockheed Martin is repeatedly named as a licensee, as are firearms manufacturers with military contracts like Sig Sauer.

Besides, RuBee continues to grow closer to the DoE. In 2021, VAI appointed Lisa Gordon-Hagerty to its board of directors. Gordon-Hagerty was undersecretary of Energy and had led the NNSA until the year before. This year, the New Hampshire Small Business Development Center wrote a glowing profile of VAI. They described it as a 25-employee company with a goal of hitting $30 million in annual revenue in the next two years.

Despite the outdated website, VAI claims over 1,200 RuBee sites in service. I wonder how many of those are Alert 20/20 DoorGuards? Still, I do believe there are military weapons inventory systems currently in use. RuBee probably has a bright future, as a niche technology for a niche industry. If nothing else, they have legacy installations and intellectual property to lean on. A spreadsheet of VAI-owned patents on RuBee, with nearly 200 rows, encourages would-be magnetically coupled visibility network inventors not to go it on their own. I just wish I could get my hands on a controller....

  1. I have found some conflicting information on the date, it could have been as early as 2002. 2004 is the year I have the most confidence in.

  2. The documentation is confusing enough about these details that I am actually unclear on whether the RuBee "MAC address" is 4 bytes or 6. Examples show 6 byte addresses, but the actual 1902.1 specification only seems to allow 4 byte addresses in headers. Honestly all of the RuBee documentation is a mess like this. I suspect that part of the problem is that VAI has actually changed parts of the protocol and not all of their products are IEEE 1902.1 compliant.

Error'd: A Horse With No Name

Scared Stanley stammered "I'm afraid of how to explain to the tax authority that I received $NaN."


Our anonymous friend Anon E. Mous wrote "I went to look up some employee benefits stuff up and ... This isn't a good sign."


Regular Michael R. is not actually operating under an alias, but this (allegedly scamming?) site doesn't know.


Graham F. gloated "I'm glad my child's school have followed our naming convention for their form groups as well!"


Adam R. is taking his anonymous children on a roadtrip to look for America. "I'm planning a trip to St. Louis. While trying to buy tickets for the Gateway Arch, I noticed that their ticketing website apparently doesn't know how to define adults or children (or any of the other categories of tickets, for that matter)."



CodeSOD: Pawn Pawn in in Game Game of of Life Life

It feels like ages ago, when document databases like Mongo were all the rage. That isn't to say that they haven't stuck around and don't deliver value, but gone is the faddish "RDBMSes are dead, bro." The "advantage" they offer is that they turn data management problems into serialization problems.

And that's where today's anonymous submission takes us. Our submitter has a long list of bugs around managing lists of usernames. These bugs largely exist because the contract developer who wrote the code didn't write anything, and instead "vibe coded too close to the sun", according to our submitter.

Here's the offending C# code:

   [JsonPropertyName("invitedTraders")]
   [BsonElement("invitedTraders")]
   [BsonIgnoreIfNull]
   public InvitedTradersV2? InvitedTraders { get; set; }

   [JsonPropertyName("invitedTradersV2")]
   [BsonElement("invitedTradersV2")]
   [BsonIgnoreIfNull]
   public List<string>? InvitedTradersV2 { get; set; }

Let's start with the type InvitedTradersV2. This type contains a list of strings which represent usernames. The field InvitedTradersV2 is a list of strings which represent usernames. Half of our submitter's bugs exist simply because these two lists get out of sync- they should contain the same data, but without someone enforcing that correctly, problems accrue.

This is made more frustrating by the MongoDB attribute, BsonIgnoreIfNull, which simply means that the serialized object won't contain the key if the value is null. But that means the consuming application doesn't know which key it should check.

For the final bonus fun, note the use of JsonPropertyName. This comes from System.Text.Json, the serializer built into .NET, and tells it how to name the property when serializing to JSON. The problem here is that this application doesn't use the built-in serializer, and instead uses Newtonsoft.Json, a popular third-party JSON library. While Newtonsoft does recognize some built-in attributes for serialization, JsonPropertyName is not among them. This means the attribute does nothing in this example, aside from adding some confusion to the code base.
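
If the plain List<string> really is the intended shape going forward, the cleanup might look something like the sketch below, with Newtonsoft's own attribute doing the naming. This is a minimal sketch only; the type and property names here are illustrative, not the submitter's actual schema.

// A minimal sketch, assuming Newtonsoft.Json stays in place and the single
// List<string> is the surviving representation. Names are illustrative only.
using System.Collections.Generic;
using MongoDB.Bson.Serialization.Attributes;
using Newtonsoft.Json;

public class TradeInvitation
{
    // Newtonsoft honors [JsonProperty]; System.Text.Json's [JsonPropertyName]
    // would be silently ignored here.
    [JsonProperty("invitedTraders")]
    [BsonElement("invitedTraders")]
    [BsonIgnoreIfNull]
    public List<string>? InvitedTraders { get; set; }
}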

I suspect the developer responsible, if they even read this code, decided that the duplicated data was okay, because isn't that just a normal consequence of denormalization? And document databases are all about denormalization. It makes your queries faster, bro. Just one more shard, bro.


The Thanksgiving Shakedown

On Thanksgiving Day, Ellis had cuddled up with her sleeping cat on the couch to send holiday greetings to friends. There in her inbox, lurking between several well wishes, was an email from an unrecognized sender with the subject line, Final Account Statement. Upon opening it, she read the following:

[Image: 1880s stock delivery form agreement]

Dear Ellis,

Your final account statement dated -1 has been sent to you. Please log into your portal and review your balance due totaling #TOTAL_CHARGES#.

Payment must be received within 30 days of this notice to avoid collection. You may submit payment online via [Payment Portal Link] or by mail to:

Chamberlin Apartments
123 Main Street
Anytown US 12345

If you believe there is an error on your account, please contact us immediately at 212-555-1212.

Thank you for your prompt attention to this matter.

Chamberlin Apartments

Ellis had indeed rented an apartment managed by this company, but had moved out 16 years earlier. She'd never been late with a payment for anything in her life. What a time to receive such a thing, at the start of a long holiday weekend when no one would be able to do anything about it for the next 4 days!

She truly had so much to be grateful for that Thanksgiving, and here was yet more for her list: her broad technical knowledge, her experience working in multiple IT domains, and her many years of writing up just these sorts of stories for The Daily WTF. All of this added up to her laughing instead of panicking. She could just imagine the poor intern who'd hit "Send" by mistake. She also imagined she wasn't the only person who'd received this message. Rightfully scared and angry callers would soon be hammering that phone number, and Ellis was further grateful that she wasn't the one who had to pick up.

"I'll wait for the apology email!" she said out loud with a knowing smile on her face, closing out the browser tab.

Ellis moved on physically and mentally, going forward with her planned Thanksgiving festivities without giving it another thought. The next morning, she checked her inbox with curious anticipation. Had there been a retraction, a please disregard?

No. Instead, there were still more emails from the same sender. The second, sent 7 hours after the first, bore the subject line Second Notice - Outstanding Final Balance:

Dear Ellis,

Our records show that your final balance of #TOTAL_CHARGES# from your residency at your previous residence remains unpaid.

This is your second notice. Please remit payment in full or contact us to discuss the balance to prevent your account from being sent to collections.

Failure to resolve the balance within the next 15 days may result in your account being referred to a third-party collections agency, which could impact your credit rating.

To make payment or discuss your account, please contact us at 212-555-1212 or accounting@chamapts.com.

Sincerely,

Chamberlin Apartments

The third, sent 6 and a half hours later, threatened Final Notice - Account Will Be Sent to Collections.

Dear Ellis,

Despite previous notices, your final account balance remains unpaid.

This email serves as final notice before your account is forwarded to a third-party collections agency for recovery. Once transferred, we will no longer be able to accept payment directly or discuss the account.

To prevent this, payment of #TOTAL_CHARGES# must be paid in full by #CRITICALDATE#.

Please submit payment immediately. Please contact 212-555-1212 to confirm your payment.

Sincerely,

Chamberlin Apartments

It was almost certainly a mistake, but still rather spooky to someone who'd never been in such a situation. There was solace in the thought that, if they really did try to force Ellis to pay #TOTAL_CHARGES# on the basis of these messages, anyone would find it absurd that all 3 notices were sent mere hours apart, on a holiday no less. The first two had also mentioned 30 and 15 days to pay up, respectively.

Suddenly remembering that she probably wasn't the only recipient of these obvious form emails, Ellis thought to check her local subreddit. Sure enough, there was already a post revealing the range of panic and bewilderment they had wrought among hundreds, if not thousands. Current and more recent former tenants had actually seen #TOTAL_CHARGES# populated with the correct amount of monthly rent. People feared everything from phishing attempts to security breaches.

It wasn't until later that afternoon that Ellis finally received the anticipated mea culpa:

We are reaching out to sincerely apologize for the incorrect collection emails you received. These messages were sent in error due to a system malfunction that released draft messages to our entire database.

Please be assured of the following:
  • The recent emails do not reflect your actual account status.
  • If your account does have an outstanding balance, that status has not changed, and you would have already received direct and accurate communication from our office.
  • Please disregard all three messages sent in error. They do not require any action from you.

We understand that receiving these messages, especially over a holiday, was upsetting and confusing, and we are truly sorry for the stress this caused. The issue has now been fully resolved, and our team has worked with our software provider to stop all queued messages and ensure this does not happen again.

If you have any questions or concerns, please feel free to email leasing@chamapts.com. Thank you for your patience and understanding.

All's well that ends well. Ellis thanked the software provider's "system malfunction," whoever or whatever it may've been, that had granted the rest of us a bit of holiday magic to take forward for all time.


CodeSOD: The Destination Dir

Darren is supporting a Delphi application in the current decade. Which is certainly a situation to be in. He writes:

I keep trying to get out of doing maintenance on legacy Delphi applications, but they keep pulling me back in.

The bit of code Darren sends us isn't the largest WTF, but it's a funny mistake, and it's a funny mistake that's been sitting in the codebase for decades at this point. And as we all know, jokes only get funnier with age.

FileName := DestDir + ExtractFileName(FileName);
if FileExists(DestDir + ExtractFileName(FileName)) then
begin
  ...
end;

This code is inside of a module that copies a file from a remote server to the local host. It starts by sanitizing the FileName, using ExtractFileName to strip off any path components and replacing them with DestDir, storing the result in the FileName variable.

And they liked doing that so much, they go ahead and do it again in the if statement, repeating the exact same process.
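
The fix, for what it's worth, is just to reuse the value that was already computed. A minimal sketch, keeping the variable names from the snippet above and assuming nothing about the rest of the module:

// A minimal sketch reusing the already-computed FileName; everything outside
// these lines is assumed.
FileName := DestDir + ExtractFileName(FileName);
if FileExists(FileName) then
begin
  // ... handle the case where the destination file already exists ...
end;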

Darren writes:

As Homer Simpson said "Lather, rinse, and repeat. Always repeat."


CodeSOD: Formula Length

Remy's Law of Requirements Gathering states "No matter what the requirements document says, what your users really wanted was Excel." This has a corollary: "Any sufficiently advanced Excel file is indistinguishable from software."

Given enough time, any Excel file whipped up by any user can transition from "useful" to "mission critical software" before anyone notices. That's why Nemecsek was tasked with taking a pile of Excel spreadsheets and converting them into "real" software, which could be maintained and supported by software engineers.

Nemecsek writes:

This is just one of the formulas they asked me to work on, and not the longest one.

Nemecsek says this is a "formula", but I suspect it's a VBA macro. In reality, it doesn't matter.

InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).Losses = 
calcLossesInPart(InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).RatedFrequency, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).RadialPositionToMainDuct, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).IsTransposed, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).ParallelRadialCount, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).
ParallelAxialCount, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).Type, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
DimensionRadialElectric, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
DimensionAxialElectric + InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).InsulThickness, 
getElectricConductivityAtTemperatureT1(InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
InitechNeoDTActivePartPartConductorRawMaterial(0).ElectricConductivityT0, InitechNeoDTMachineDevice.
InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).
InitechNeoDTActivePartPart(iPart).InitechNeoDTActivePartPartSectionContainer(0).
InitechNeoDTActivePartPartSection(0).InitechNeoDTActivePartPartConductorComposition(0).
InitechNeoDTActivePartPartConductor(0).InitechNeoDTActivePartPartConductorRawMaterial(0).MaterialFactor, 
InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).InitechNeoDTActivePart(0).
InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartSectionContainer(0).InitechNeoDTActivePartPartSection(0).
InitechNeoDTActivePartPartConductorComposition(0).InitechNeoDTActivePartPartConductor(0).
InitechNeoDTActivePartPartConductorRawMaterial(0).ReferenceTemperatureT0, InitechNeoDTMachineDevice.
ReferenceTemperature), LayerNumberRatedVoltage, InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).InitechNeoDTActivePartPartContainer(0).InitechNeoDTActivePartPart(iPart).
InitechNeoDTActivePartPartLayerContainer(0),InitechNeoDTMachineDevice.InitechNeoDTActivePartContainer(0).
InitechNeoDTActivePart(0).RFactor)

Line breaks added to try and keep horizontal scrolling sane. This arguably hurts readability, in the same way that beating a dead horse arguably hurts the horse.

This may not be the longest one, but it's certainly painful. I do not know exactly what this is doing, and frankly, I do not want to.


Error'd: On the Dark Side

...matter of fact, it's all dark.

Gitter Hubber checks in on the holidays: "This is the spirit of the Black Friday on GitHub. That's because I'm using dark mode. Otherwise, it would have a different name… You know what? Let's just call it Error Friday!"


"Best get typing!" self-admonishes. Jason G. Suffering a surfeit of snark, he proposes "Not sure my battery will last long enough.
Finally, quantum resistant security.
I can't remember my number after the 5000th digit. " Any of those will do just fine.


Don't count Calle L. out. "This is for a calorie tracking app, on Thanksgiving. Offer was so delicious it wasn't even a number any more! Sadly it did not slim the price down more than expected."


"Snow and rain and rain and snow!" exclaims Paul N. "Weather so astounding, they just had to trigger three separate notifications at the same time."


It's not a holiday for everyone though, is it? Certainly not for Michael R., who is back with a customer service complaint about custom deliveries. "I am unlucky with my deliveries. This time it's DPD."



Classic WTF: Teleported Release

It's a holiday in the US today, one where we give thanks. And today, we give thanks to not have this boss. Original. --Remy

Matt works at an accounting firm, as a data engineer. He makes reports for people who don’t read said reports. Accounting firms specialize in different areas of accountancy, and Matt’s firm is a general firm with mid-size clients.

The CEO of the firm is a legacy from the last century. The most advanced technology on his desk is a business calculator and a pencil sharpener. He still doesn’t use a cellphone. But he does have a son, who is “tech savvy”, which gives the CEO a horrible idea of how things work.

Usually, the resulting requests are pretty light, in that it’s sorting Excel files or sorting the output of an existing report. Sometimes the requests are bizarre or utter nonsense. And, because the boss doesn’t know what the technical folks are doing, some of the IT staff may be a bit lazy about following best practices.

This means that most of Matt’s morning is spent doing what is essentially Tier 1 support before he gets into doing his real job. Recently, there was a worse crunch, as actual support person Lucinda was out for maternity leave, and Jackie, the one other developer, was off on vacation on a foreign island with no Internet. Matt was in the middle of eating a delicious lunch of take-out lo mein when his phone rang. He sighed when he saw the number.

“Matt!” the CEO exclaimed. “Matt! We need to do a build of the flagship app! And a deploy!”

The app was rather large, and a build could take upwards of 45 minutes, depending on the day and how the IT gods were feeling. But the process was automated, the latest changes all got built and deployed each night. Anything approved was released within 24 hours. With everyone out of the office, there hadn’t been any approved changes for a few weeks.

Matt checked GitHub to see if something went wrong with the automated build. Everything was fine.

“Okay, so I’m seeing that everything built on GitHub and everything is available in production,” Matt said.

“I want you to do a manual build, like you used to.”

“If I were to compile right now, it could take quite a while, and redeploying runs the risk of taking our clients offline, and nothing would be any different.”

“Yes, but I want a build that has the changes which Jackie was working on before she left for vacation.”

Matt checked the commit history, and sure enough, Jackie hadn’t committed any changes since two weeks before leaving on vacation. “It doesn’t look like she pushed those changes to GitHub.”

“Githoob? I thought everything was automated. You told me the process was automated,” the CEO said.

“It’s kind of like…” Matt paused to think of an analogy that could explain this to a golden retriever. “Your dishwasher, you could put a timer on it to run it every night, but if you don’t load the dishwasher first, nothing gets cleaned.”

There was a long pause as the CEO failed to understand this. “I want Jackie’s front-page changes to be in the demo I’m about to do. This is for Initech, and there’s millions of dollars riding on their account.”

“Well,” Matt said, “Jackie hasn’t pushed- hasn’t loaded her metaphorical dishes into the dishwasher, so I can’t really build them.”

“I don’t understand, it’s on her computer. I thought these computers were on the cloud. Why am I spending all this money on clouds?”

“If Jackie doesn’t put it on the cloud, it’s not there. It’s uh… like a fax machine, and she hasn’t sent us the fax.”

“Can’t you get it off her laptop?”

“I think she took it home with her,” Matt said.

“So?”

“Have you ever seen Star Trek? Unless Scotty can teleport us to Jackie’s laptop, we can’t get at her files.”

The CEO locked up on that metaphor. “Can’t you just hack into it? I thought the NSA could do that.”

“No-” Matt paused. Maybe Matt could try and recreate the changes quickly? “How long before this meeting?” he asked.

“Twenty minutes.”

“Just to be clear, you want me to do a local build with files I don’t have by hacking them from a computer which may or may not be on and connected to the Internet, and then complete a build process which usually takes 45 minutes- at least- deploy to production, so you can do a demo in twenty minutes?”

“Why is that so difficult?” the CEO demanded.

“I can call Jackie, and if she answers, maybe we can figure something out.”

The CEO sighed. “Fine.”

Matt called Jackie. She didn’t answer. Matt left a voicemail and then went back to eating his now-cold lo mein.


Announcements: We Want Your Holiday Horrors

As we enter into the latter portion of the year, folks are traveling to visit family, logging off of work in hopes that everything can look after itself for a month, and somewhere, someone, is going to make the choice "yes, I can push to prod on Christmas Eve, and it'll totally work out for me!"

Over the next few weeks, I'm hoping to get a chance to get some holiday support horrors up on the site, in keeping with the season. Whether it's the absurd challenges of providing family tech support, the last-minute pushes to production, or the five-alarm fires caused by a pointy-haired boss's incompetence, we want your tales of holiday IT woe.

So hit that submit button on the side bar, and tell us who's on Santa's naughty list this year.


Tales from the Interview: Interview Smack-Talk

In today's Tales from the Interview, our Anonymous submitter relates their experience with an anonymous company:

I had made it through the onsite, but along the way I had picked up some toxic work environment red flags. Since I had been laid off a couple months prior, I figured I wasn't in a position to be picky, so I decided I would still give it my best shot and take the job if I got it, but I'd continue looking for something better.

Then they brought me back onsite a second time for one final interview with 2 senior managers. I went in and they were each holding a printout of my resume. They proceeded to go through everything on it. First they asked why I chose the university I went to, then the same for grad school, which was fine.

[Image: WWF SmackDown Logo (1999-2001)]

Then they got to my first internship. I believe the conversation went something like this:

Manager: "How did you like it?"

Me: "Oh, I loved it!"

Manager: "Were there any negatives?"

Me: "No, not that I can think of."

Manager: "So it was 100% positive?"

Me: "Yep!"

And then they got to my first full-time job, where the same manager repeated the same line of questioning but pushed even harder for me to say something negative, at one point saying "Well, you left for (2nd company on my resume), so there must have been something negative."

I knew better than to bad-mouth a previous employer in an interview, it's like going into a first date and talking smack about your ex. But what do you do when your date relentlessly asks you to talk smack about all your exes and refuses to let the subject turn to anything else? This not only confirmed my suspicions of a toxic work environment, I also figured *they* probably knew it was toxic and were relentlessly testing every candidate to make sure they wouldn't blow the whistle on them.

That was the most excruciatingly awkward interview I've ever had. I didn't get the job, but at that point I didn't care anymore, because I was very, very sure I didn't want to work there in the long term.

I'm glad Subby dodged that bullet, and I hope they're in a better place now.

It seems like this might be some stupid new trend. I recently bombed an interview where I could tell I wasn't giving the person the answer on their checklist, no matter how many times I tried. It was a question about how I handled it when someone opposed what I was doing at work or gave me negative feedback. It felt like they wanted me to admit to more fur-flying drama and fireworks than had ever actually occurred.

I actively ask for and welcome critique on my writing; it makes my work so much better. And if my work is incorrect and needs to be redone, or someone has objections to a project I'm part of, I seek clarification and (A) implement the requested changes, (B) explain why things are as they are and offer alternate suggestions/solutions, or (C) seek compromise, depending on the situation. I don't get personal about it.

So, why this trend? Subby believed it was a way to test whether the candidate would someday badmouth the employer. That's certainly feasible, though if that were the goal, you'd think Subby would've passed their ordeal with flying colors. I'm not sure myself, but I have a sneaking suspicion that the nefarious combination of AI and techbro startup culture has something to do with it.

So perhaps I also dodged a bullet: one of the many things I'm grateful for this Thanksgiving.

Feel free to share your ideas, and any and all bullets you have dodged, in the comments.


CodeSOD: The Map to Your Confession

Today, Reginald approaches us for a confession.

He writes:

I've no idea where I "copied" this code from five years ago. The purpose of this code was to filter out Maps and Collections. Maybe the intention was to avoid a recursive implementation by an endless loop? I am shocked that I wrote such code.

Well, that doesn't bode well, Reginald. Let's take a look at this Java snippet:

/**
 * 
 * @param input
 * @return
 */
protected Map rearrangeMap(Map input) {
	Map retMap = new HashMap();

	if (input != null && !input.isEmpty()) {

		Iterator it = input.keySet().iterator();
		while (true) {
			String key;
			Object obj;
			do {
				do {
					if (!it.hasNext()) {
					}
					key = (String) it.next();

				} while (input.get(key) instanceof Map);

				obj = input.get(key);

			} while (obj instanceof Boolean && ((Boolean) obj).equals(Boolean.FALSE));

			if (obj != null) {
				retMap.put(key, obj);
				return retMap;
			}
		}
	} else {
		return retMap;
	}
}

The first thing that leaps out is that this is a non-generic Map, which is always a code smell, but I suspect that's the least of our problems.

We start by verifying that the input Map exists and contains data. If the input is null or empty, we return an empty map. In our main branch, we create an iterator across the keys, before entering a while(true) loop. So far, so bad.

Then we enter a pair of nested do loops. Which definitely hints that we've gone off the edge of the map here. In the innermost loop, we do a check- if there isn't a next element in the iterator, we… do absolutely nothing. Whether there is or isn't an element, we advance to the next element, risking a NoSuchElementException. We do this while the key points to an instance of Map. As always, an instanceof check is a nauseating code stench.

Okay, so the inner loop skips across any keys that point to maps, and throws an exception when it gets to the end of the list.

The surrounding loop skips over every key that is a boolean value that is also false.

If we find anything which isn't a Map and isn't a false Boolean and isn't null, we put it in our retMap and return it.

This function finds the first key that points to a non-map, non-false value and creates a new map that contains only that key/value pair. Which makes it hard to understand why I'd want that, especially since some Map implementations make no guarantee about order. And even if I did want that, I definitely wouldn't want to do it this way. A single for loop could have solved this problem.
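
For what it's worth, that single loop might look like the sketch below. This assumes the intended behavior really is "return a one-entry map holding the first non-Map, non-false, non-null value" (and an empty map otherwise), quirks and all.

// A minimal sketch of the same behavior in one loop; requires java.util.Map
// and java.util.HashMap. It keeps only the first non-Map, non-false,
// non-null entry, matching what the original appears to intend.
protected Map<String, Object> rearrangeMap(Map<String, Object> input) {
	Map<String, Object> retMap = new HashMap<>();
	if (input == null) {
		return retMap;
	}
	for (Map.Entry<String, Object> entry : input.entrySet()) {
		Object value = entry.getValue();
		// Skip nested Maps, nulls, and FALSE values.
		if (value == null || value instanceof Map || Boolean.FALSE.equals(value)) {
			continue;
		}
		retMap.put(entry.getKey(), value);
		break; // only the first matching entry is kept
	}
	return retMap;
}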

Reginald, I don't think there's any absolution for this. Instead, my advice would be to install a carbon monoxide detector in your office, because I have some serious concerns about whether or not your brain is getting enough oxygen.


CodeSOD: Copied Homework

Part of the "fun" of JavaScript is dealing with code which comes from before sensible features existed. For example, if you wanted to clone an object in JavaScript, circa 2013, that was a wheel you needed to invent for yourself, as this StackOverflow thread highlights.

There are now better options, and you'd think that people would use them. However, the only thing more "fun" than dealing with code that hasn't caught up with the times is dealing with developers who haven't, and still insist on writing their own versions of standard methods.

  const objectReplace = (oldObject, newObject) => {
    let keys = Object.keys(newObject)
    try {
      for (let key of keys) {
        oldObject[key] = newObject[key]
      }
    } catch (err) {
      console.log(err, oldObject)
    }     

    return oldObject
  }

It's worth noting that Object.entries returns an array containing both the keys and values, which would be a more sensible choice for this operation, but then again, if we're talking about using correct functions, Object.assign would replace this function.
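
In modern JavaScript the whole helper collapses to a one-liner. A minimal sketch, assuming the only goal is to copy newObject's own enumerable properties onto oldObject and hand back that same reference:

  // A minimal sketch: Object.assign already copies newObject's own enumerable
  // properties onto oldObject and returns oldObject.
  const objectReplace = (oldObject, newObject) => Object.assign(oldObject, newObject)

  // Or skip the wrapper entirely at the call site:
  // Object.assign(oldObject, newObject)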

There's no need to handle errors here, as nothing about this assignment should throw an exception.

The thing that really irks me about this though is that it pretends to be functional (in the programming idiom sense) by returning the newly modified value, but it's also just changing that value in place because it's a reference. So it has side effects, in a technical sense (changing the value of its input parameters) while pretending not to. Now, I probably shouldn't get too hung up on that, because that's also exactly how Object.assign behaves, but dammit, I'm going to be bothered by it anyway. If you're going to reinvent the wheel, either make one that's substantially worse, or fix the problems with the existing wheel.

In any case, the real WTF here is that this function is buried deep in a 15,000 line file, written by an offshore contract team, and there are at least 5 other versions of this function, all with slightly different names, but all basically doing the same thing, because everyone on the team is just copy/pasting until they get enough code to submit a pull request.

Our submitter wonders, "Is there a way to train an AI to not let people type this?"

No, there isn't. You can try rolling that boulder up a hill, but it'll always roll right back down. Always and forever, people are going to write bad code.


Error'd: Untimely

Sometimes, it's hard to know just when you are. This morning, I woke up to a MacBook that thinks it's in Paris, four hours ago. Pining for pain chocolate. A bevy of anonyms have had similar difficulties.

First up, an unarabian anonym observes "They say that visiting Oman feels like traveling back in time to before the rapid modernization of the Arab states. I just think their eVisa application system is taking this "time travel" thing a bit too far... "


Snecod, an unretired (anteretired?) anonym finds it hard to plan when the calendar is unfixed. "The company's retirement plan was having a rough time prior to Second June." Looks like the first wtf was second March.


And an unamerican anonym sent us this (uh, back in first March) "Was looking to change the cable package I have from them. Apparently my discounts are all good until 9th October 1930, and a second one looking good until 9th January 2024."


On a different theme, researcher Jennifer E. exclaimed "Those must have been BIG divorces! Guy was so baller Wikipedia couldn’t figure out when he divorced either of these women." Or so awful they divorced him continuously.


Finally, parsimonious Greg L. saved this for us. "I don't remember much about #Error!, but I guess it was an interesting day."



CodeSOD: Invalid Route and Invalid Route

Someone wanted to make sure that invalid routes logged an error in their Go web application. Artem found this when looking at production code.

if (requestUriPath != "/config:system") &&
    (requestUriPath != "/config:system/ntp") &&
    (requestUriPath != "/config:system/ntp/servers") &&
    (requestUriPath != "/config:system/ntp/servers/server") &&
    (requestUriPath != "/config:system/ntp/servers/server/config") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/address") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/key-id") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/minpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/maxpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/config/version") &&
    (requestUriPath != "/config:system/ntp/servers/server/state") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/address") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/key-id") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/minpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/maxpoll") &&
    (requestUriPath != "/config:system/ntp/servers/server/state/version") {
    log.Info("ProcessGetNtpServer: no return of ntp server state for ", requestUriPath)
    return nil
}

The most disturbing part of this, for Artem, isn't that someone wrote this code and pushed it to production. It's that, according to git blame, two people wrote this code, because the first developer didn't include all the cases.

For the record, the application does have an actual router module, which can trigger logging on invalid routes.
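
If a check like this really had to live here rather than in the router, a lookup table would at least keep the path list in one place and the condition readable. A minimal, self-contained sketch; the path list is abbreviated, and every name outside the original snippet is an assumption:

// A minimal sketch, assuming the goal is "log and bail on any path outside
// the known NTP config tree". The path list is abbreviated; nothing here is
// the production code.
package main

import "log"

var validNtpPaths = map[string]struct{}{
	"/config:system":                    {},
	"/config:system/ntp":                {},
	"/config:system/ntp/servers":        {},
	"/config:system/ntp/servers/server": {},
	// ...the remaining /config and /state leaves...
}

func isKnownNtpPath(p string) bool {
	_, ok := validNtpPaths[p]
	return ok
}

func main() {
	requestUriPath := "/config:system/ntp/bogus"
	if !isKnownNtpPath(requestUriPath) {
		log.Println("ProcessGetNtpServer: no return of ntp server state for", requestUriPath)
	}
}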


CodeSOD: Are You Mocking Me?

Today's representative line comes from Capybara James (most recently previously). It's representative, not just of the code base, but of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Or, "you get what you measure".

If, for example, you decide that code coverage metrics are how you're going to judge developers, then your developers are going to ensure that the code coverage looks great. If you measure code coverage, then you will get code coverage- and nothing else.

That's how you get tests like this:

Mockito.verify(exportRequest, VerificationModeFactory.atLeast(0)).failedRequest(any(), any(), any());

This test passes if the function exportRequest.failedRequest is called at least zero times, with any input parameters.

Which, as you might imagine, is a somewhat useless thing to test. But what's important is that there is a test. The standards for code coverage are met, the metric is satisfied, and Goodhart marks up another win on the board.
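
For contrast, a verification that can actually fail has to constrain the call count, and ideally the arguments. A minimal sketch, where expectedId and expectedReason are assumptions rather than anything from the real test:

// A minimal sketch of verifications that can actually fail. Assumes static
// imports from org.mockito.ArgumentMatchers (any, eq); expectedId and
// expectedReason are placeholders, not values from the real test.
Mockito.verify(exportRequest, Mockito.never())
       .failedRequest(any(), any(), any());

// Or, if a failure is the expected outcome, pin down the count and arguments:
Mockito.verify(exportRequest, Mockito.times(1))
       .failedRequest(eq(expectedId), eq(expectedReason), any());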


Using an ADE: Ancient Development Environment

One of the things that makes legacy code legacy is that code, over time, rots. Some of that rot comes from the gradual accumulation of fixes, hacks, and kruft. But much of the rot also comes from the tooling going unsupported or entirely out of support.

For example, many years ago, I worked in a Visual Basic 6 shop. The VB6 IDE went out of support in April 2008, but we continued to use it well into the next decade. This made it challenging to support the existing software, as the IDE frequently broke in response to OS updates. Even after we moved it into a VM running an antique copy of Windows 2000, we kept hitting endless issues getting projects to compile and build.

A fun side effect of that: the VB6 runtime remains supported. So you can run VB6 software on modern Windows. You just can't modify that software.

Greta has inherited an even more antique tech stack. She writes, "I often wonder if I'm the last person on Earth encumbered with this particular stack." She adds, "The IDE is long-deprecated from a vendor that no longer exists- since 2002." Given the project started in the mid-2010s, it may have been a bad choice to use that tech stack.

It's not as bad as it sounds- while the technology and tooling are crumbling ruins, the team culture is healthy and the C-suite has given Greta wide leeway to solve problems. But that doesn't mean that the tooling isn't a cause of anguish, and even worse than the tooling- the code itself.

"Some things," Greta writes, "are 'typical bad'" and some things "are 'delightfully unique' bad."

For example, the IDE has a concept of "designer" files, for the UI, and "code behind" files, for the logic powering the UI. The IDE frequently corrupts its own internal state, and loses the ability to properly update the designer files. When this happens, if you attempt to open, save, or close a designer file, the IDE pops up a modal dialog box complaining about the corruption, with a "Yes" and "No" option. If you click "No", the modal box goes away- and then reappears, because you're still on a broken designer file. If you click "Yes", the IDE "helpfully" deletes pretty much everything in your designer file.

Nothing about the error message indicates that this might happen.

The language used is a dialect of C++. I say "dialect" because the vendor-supplied compiler implements some cursed feature set between C++98 and C++11 standards, but doesn't fully conform to either. It's only capable of outputting 32-bit x86 code up to a Pentium Pro. Using certain C++ classes, like std::fstream, causes the resulting executable to throw a memory protection fault on exit.

Worse, the vendor supplied class library is C++ wrappers on top of an even more antique Pascal library. The "class" library is less an object-oriented wrapper and more a collection of macros and weird syntax hacks. No source for the Pascal library exists, so forget about ever updating that.

Because the last release of the IDE was circa 2002, running it on any vaguely modern environment is prone to failures, but it also doesn't play nicely inside of a VM. At this point, the IDE works for one session. If you exit it, reboot your computer, or try to close and re-open the project, it breaks. The only fix is to reinstall it. But the reinstall requires you to know which set of magic options actually lets the install proceed. If you make a mistake and accidentally install, say, CORBA support, attempting to open the project in the IDE leads to a cascade of modal error boxes, including one that simply says, "ABSTRACT ERROR" ("My favourite", writes Greta). And these errors don't limit themselves to the IDE; attempting to run the compiler directly also fails.

But, if anything, it's the code that makes the whole thing really challenging to work with. While the UI is made up of many forms, the "main" form is 18,000 lines of code, with absolutely no separation of concerns. Actually, the individual forms don't have a lot of separation of concerns; data is shared between forms via global variables declared in one master file, and then externed into other places. Even better, the various sub-forms are never destroyed, just hidden and shown, which means they remember their state whether you want that or not. And since much of the state is global, you have to be cautious about which parts of the state you reset.

Greta adds:

There are two files called main.cpp, a Station.cpp, and a Station1.cpp. If you were to guess which one owns the software's entry point, you would probably be wrong.

But, as stated, it's not all as bad as it sounds. Greta writes: "I'm genuinely happy to be here, which is perhaps odd given how terrible the software is." It's honestly not that odd; a good culture can go a long way to making wrangling a difficult tech stack happy work.

Finally, Greta has this to say:

We are actively working on a .NET replacement. A nostalgic, perhaps masochistic part of me will miss the old stack and its daily delights.


Why I (still) love Linux

I usually publish articles about how much I love the BSDs or illumos distributions, but today I want to talk about Linux (or, better, GNU/Linux) and why, despite everything, it still holds a place in my heart.

Meta Plans Deep Cuts to Metaverse Efforts

By: Nick Heer

Kurt Wagner, Bloomberg:

Meta Platforms Inc.’s Mark Zuckerberg is expected to meaningfully cut resources for building the so-called metaverse, an effort that he once framed as the future of the company and the reason for changing its name from Facebook Inc.

Executives are considering potential budget cuts as high as 30% for the metaverse group next year, which includes the virtual worlds product Meta Horizon Worlds and its Quest virtual reality unit, according to people familiar with the talks, who asked not to be named while discussing private company plans. Cuts that high would most likely include layoffs as early as January, according to the people, though a final decision has not yet been made.

Wagner’s reporting was independently confirmed by Mike Isaac, of the New York Times, and Meghan Bobrowsky and Georgia Wells, of the Wall Street Journal, albeit in slightly different ways. While Wagner wrote it “would most likely include layoffs as early as January”, Isaac apparently confirmed the budget cuts are likely large-scale personnel cuts, which makes sense:

The cuts could come as soon as next month and amount to 10 to 30 percent of employees in the Metaverse unit, which works on virtual reality headsets and a V.R.-based social network, the people said. The numbers of potential layoffs are still in flux, they said. Other parts of the Reality Labs division develop smart glasses, wristbands and other wearable devices. The total number of employees in Reality Labs could not be learned.

Alan Dye is just about to join Reality Labs. I wonder if this news comes as a fun surprise for him.

At Meta Connect a few months ago, the company spent basically the entire time on augmented reality glasses, but it swore up and down it was all related to its metaverse initiatives:

We’re hard at work advancing the state of the art in augmented and virtual reality, too, and where those technologies meet AI — that’s where you’ll find the metaverse.

The metaverse is whatever Meta needs it to be in order to justify its 2021 rebrand.

Our vision for the future is a world where anyone anywhere can imagine a character, a scene, or an entire world and create it from scratch. There’s still a lot of work to do, but we’re making progress. In fact, we’re not far off from being able to create compelling 3D content as easily as you can ask Meta AI a question today. And that stands to transform not just the imagery and videos we see on platforms like Instagram and Facebook, but also the possibilities of VR and AR, too.

You know, whenever I am unwinding and chatting with friends after a long day at work, I always get this sudden urge to create compelling 3D content.


Lisa Jackson and Kate Adams Out at Apple, Jennifer Newstead to Join

By: Nick Heer

Apple:

Apple today announced that Jennifer Newstead will become Apple’s general counsel on March 1, 2026, following a transition of duties from Kate Adams, who has served as Apple’s general counsel since 2017. She will join Apple as senior vice president in January, reporting to CEO Tim Cook and serving on Apple’s executive team.

In addition, Lisa Jackson, vice president for Environment, Policy, and Social Initiatives, will retire in late January 2026. The Government Affairs organization will transition to Adams, who will oversee the team until her retirement late next year, after which it will be led by Newstead. Newstead’s title will become senior vice president, General Counsel and Government Affairs, reflecting the combining of the two organizations. The Environment and Social Initiatives teams will report to Apple chief operating officer Sabih Khan.

What will tomorrow bring, I wonder?

Newstead has spent the past year working closely with Joel Kaplan, and fighting the FTC’s case against Meta — successfully, I should add. Before that, she was a Trump appointee at the U.S. State Department. Well positioned, then, to fight Apple’s U.S. antitrust lawsuit against a second-term Trump government that has successfully solicited Apple’s money.

John Voorhees, MacStories:

Although Apple doesn’t say so in its press release, it’s pretty clear that a few things are playing out among its executive ranks. First, a large number of them are approaching retirement age, and Apple is transitioning and changing roles internally to account for those who are retiring. Second, the company is dealing with departures like Alan Dye’s and what appears to be the less-than-voluntary retirement of John Giannandrea. Finally, the company is reducing the number of Tim Cook’s direct reports, which is undoubtedly to simplify the transition to a new CEO in the relatively near future.

A careful reader will notice Apple’s newsroom page currently has press releases for these departures and, from earlier this week, John Giannandrea’s, but there is nothing about Alan Dye’s. In fact, even in the statement quoted by Bloomberg, Dye is not mentioned. In fairness, Adams, Giannandrea, and Jackson all have bios on Apple’s leadership page. Dye’s was removed between 2017 and 2018.

Starting to think Mark Gurman might be wrong about that FT report.


Waymo Data Indicates Dramatic Safety Improvements Over Human Drivers, So It Is Making Its Cars More Human

By: Nick Heer

Jonathan Slotkin, a surgeon and venture capital investor, wrote for the New York Times about data released by Waymo indicating impressive safety improvements over human drivers through June 2025:

If Waymo’s results are indicative of the broader future of autonomous vehicles, we may be on the path to eliminating traffic deaths as a leading cause of mortality in the United States. While many see this as a tech story, I view it as a public health breakthrough.

[…]

There’s a public health imperative to quickly expand the adoption of autonomous vehicles. […]

We should be skeptical of all self-reported stats, but these figures look downright impressive.

Slotkin responsibly notes several caveats, though neglects to mention the specific cities in which Waymo operates: Austin, Los Angeles, Phoenix, and San Francisco. These are warm cities with relatively low annual precipitation, almost none of which is ever snow. Slotkin’s enthusiasm for widespread adoption should be tempered somewhat by this narrow range of climate data. Still, its data is compelling. These cars seem to crash less often than those driven by people in the same cities and, in particular, avoid causing serious injuries at an impressive rate.

It is therefore baffling to me that Waymo appears to be treating this as a cushion for experimentation.

Katherine Bindley, in a Wall Street Journal article published the very same day as Slotkin’s Times piece:

The training wheels are off. Like the rule-following nice guy who’s tired of being taken advantage of, Waymos are putting their own needs first. They’re bending traffic laws, getting impatient with pedestrians and embracing the idea that when it comes to city driving, politeness doesn’t pay: It’s every car for itself.

[…]

Waymo has been trying to make its cars “confidently assertive,” says Chris Ludwick, a senior director of product management with Waymo, which is owned by Google parent Alphabet. “That was really necessary for us to actually scale this up in San Francisco, especially because of how busy it gets.”

A couple years ago, Tesla’s erroneously named “Full Self-Driving” feature began cruising through crosswalks if it judged it could pass a crossing pedestrian in time, and I wrote:

Advocates of autonomous vehicles often say increased safety is one of its biggest advantages over human drivers. Compliance with the law may not be the most accurate proxy for what constitutes safe driving, but not to a disqualifying extent. Right now, it is the best framework we have, and autonomous vehicles should follow the law. That should not be a controversial statement.

I stand by that. A likely reason for Waymo’s impressive data is that its cars behave with caution and deference. Substituting that with “confidently assertive” driving is a move in entirely the wrong direction. It should not roll through stop signs, even if its systems understand nobody is around. It should not mess up the order of an all-way stop intersection. I have problems with the way traffic laws are written, but it is not up to one company in California to develop a proprietary interpretation. Just follow the law.

Slotkin:

This is not a call to replace every vehicle tomorrow. For one thing, self-driving technology is still expensive. Each car’s equipment costs $100,000 beyond the base price, and Waymo doesn’t yet sell cars for personal use. Even once that changes, many Americans love driving; some will resist any change that seems to alter that freedom.

[…]

There is likely to be some initial public trepidation. We do not need everyone to use self-driving cars to realize profound safety gains, however. If 30 percent of cars were fully automated, it might prevent 40 percent of crashes, as autonomous vehicles both avoid causing crashes and respond better when human drivers err. Insurance markets will accelerate this transition, as premiums start to favor autonomous vehicles.

Slotkin is entirely correct in writing that “Americans love driving” — the U.S. National Household Travel Survey, last conducted in 2022, found 90.5% of commuters said they primarily used a car of some kind (table 7-2, page 50). 4.1% said they used public transit, 2.9% said they walked, and just 2.5% said they chose another mode of transportation in which taxicabs are grouped along with bikes and motorcycles. Those figures are about the same in 2017, though with an unfortunate decline in the number of transit commuters. Commuting is not the only reason for travelling, of course, but this suggests to me that even if every taxicab ride was in an autonomous Waymo, there would still be a massive gap to achieve that 30% adoption rate Slotkin wants. And, if insurance companies begin incentivizing autonomous vehicles, it really means rich people will reap the reward of being able to buy a new car.

Any argument about road safety has to be more comprehensive than what Slotkin is presenting in this article. Regardless of how impressive Waymo’s stats are, it is a vision of the future that is an individualized solution to a systemic problem. I have no specialized knowledge in this area, but I am fascinated by it. I read about this stuff obsessively. The things I want to see are things everyone can benefit from: improvements to street design that encourage drivers to travel at lower speeds, wider sidewalks making walking more comfortable, and generous wheeling infrastructure for bicycles, wheelchairs, and scooters. We can encourage the adoption of technological solutions, too; if this data holds up, it would seem welcome. But we can do so much better for everyone, and on a more predictable timeline.

This is, as Slotkin writes, a public health matter. Where I live, record numbers of people are dying, in part because more people than ever are driving bigger and heavier vehicles with taller fronts while they are distracted. Many of those vehicles will still be on the road in twenty years’ time, even if we accelerate the adoption pace of more autonomous vehicles. We do not need to wait for a headline-friendly technological upgrade. There are boring things cities can start doing tomorrow that would save lives.


Alan Dye Out at Apple

By: Nick Heer

Mark Gurman, Bloomberg:

Meta Platforms Inc. has poached Apple Inc.’s most prominent design executive in a major coup that underscores a push by the social networking giant into AI-equipped consumer devices.

The company is hiring Alan Dye, who has served as the head of Apple’s user interface design team since 2015, according to people with knowledge of the matter. Apple is replacing Dye with longtime designer Stephen Lemay, according to the people, who asked not to be identified because the personnel changes haven’t been announced.

Big week for changes in Apple leadership.

I am sure more will trickle out about this, but one thing notable to me is that Lemay has been a software designer for over 25 years at Apple. Dye, on the other hand, came from marketing and print design. I do not want to put too much weight on that — someone can be a sufficiently talented multidisciplinary designer — but I am curious to see what Lemay might do in a more senior role.

Admittedly I also have some (perhaps morbid) curiosity about what Dye will do at Meta.

One more note from Gurman’s report:

Dye had taken on a more significant role at Apple after Ive left, helping define how the company’s latest operating systems, apps and devices look and feel. The executive informed Apple this week that he’d decided to leave, though top management had already been bracing for his departure, the people said. Dye will join Meta as chief design officer on Dec. 31.

Let me get this straight: Dye personally launches an overhaul of Apple’s entire visual interface language, then leaves. Is that a good sign for its reception, either internally or externally?


Microsoft Lowers A.I. Software Growth Targets

By: Nick Heer

Benj Edwards, Ars Technica:

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

Based on Edwards’ summary — I still have no interest in paying for the Information — it sounds like this mostly affects sales of A.I. “agents”, a riskier technology proposition for businesses. This sounds to me like more concrete evidence of a plateau in corporate interest than the surveys reported on by the Economist.


‘Mad Men’ on HBO Max, in 4K, Somehow Lacking VFX

By: Nick Heer

Todd Vaziri:

As far as I can tell, Paul Haine was the first to notice something weird going on with HBO Max’ presentation. In one of season one’s most memorable moments, Roger Sterling barfs in front of clients after climbing many flights of stairs. As a surprise to Paul, you can clearly see the pretend puke hose (that is ultimately strapped to the back side of John Slattery’s face) in the background, along with two techs who are modulating the flow. Yeah, you’re not supposed to see that.

It appears as though this represents the original photography, unaltered before digital visual effects got involved. Somehow, this episode (along with many others) does not include all the digital visual effects that were in the original broadcasts and home video releases. It’s a bizarro mistake for Lionsgate and HBO Max to make and not discover until after the show was streaming to customers.

Eric Vilas-Boas, Vulture:

How did this happen? Apparently, this wasn’t actually HBO Max’s fault — the streamer received incorrect files from Lionsgate Television, a source familiar with the exchange tells Vulture. Lionsgate is now in the process of getting HBO Max the correct files, and the episodes will be updated as soon as possible.

It just feels clumsy and silly for Lionsgate to supply the wrong files in the first place, and for nobody at HBO to verify they are the correct work. An amateur mistake, frankly, for an ostensibly premium service costing U.S. $11–$23 per month. If I were king for a day, it would be illegal to sell or stream a remastered version of something — a show, an album, whatever — without the original being available alongside it.

⌥ Permalink

John Giannandrea Out at Apple

By: Nick Heer

Apple:

Apple today announced John Giannandrea, Apple’s senior vice president for Machine Learning and AI Strategy, is stepping down from his position and will serve as an advisor to the company before retiring in the spring of 2026. Apple also announced that renowned AI researcher Amar Subramanya has joined Apple as vice president of AI, reporting to Craig Federighi. Subramanya will be leading critical areas, including Apple Foundation Models, ML research, and AI Safety and Evaluation. The balance of Giannandrea’s organization will shift to Sabih Khan and Eddy Cue to align closer with similar organizations.

When Apple hired Giannandrea from Google in 2018, the New York Times called it a “major coup”, given that Siri was “less effective than its counterparts at Google and Amazon”. The world changed a lot in the past six-and-a-half years, though: Siri is now also worse than a bunch of A.I. products. Of course, Giannandrea’s role at Apple was not limited to Siri. He spent time on the Project Titan autonomous car, which was cancelled early last year, before moving to generative A.I. projects. The first results of that effort were shown at WWDC last year; the most impressive features have yet to ship.

I feel embarrassed and dumb for hoping Giannandrea would help shake the company out of its bizarre Siri stupor. Alas, he is now on the Graceful Executive Exit Express, where he gets to spend a few more months at Apple in a kind of transitional capacity — you know the drill. Maybe Subramanya will help move the needle. Maybe this ex-Googler will make it so. Maybe I, Charlie Brown, will get to kick that football.

⌥ Permalink

⌥ A Questionable A.I. Plateau

By: Nick Heer

The Economist:

On November 20th American statisticians released the results of a survey. Buried in the data is a trend with implications for trillions of dollars of spending. Researchers at the Census Bureau ask firms if they have used artificial intelligence “in producing goods and services” in the past two weeks. Recently, we estimate, the employment-weighted share of Americans using AI at work has fallen by a percentage point, and now sits at 11% (see chart 1). Adoption has fallen sharply at the largest businesses, those employing over 250 people. Three years into the generative-AI wave, demand for the technology looks surprisingly flimsy.

[…]

Even unofficial surveys point to stagnating corporate adoption. Jon Hartley of Stanford University and colleagues found that in September 37% of Americans used generative AI at work, down from 46% in June. A tracker by Alex Bick of the Federal Reserve Bank of St Louis and colleagues revealed that, in August 2024, 12.1% of working-age adults used generative AI every day at work. A year later 12.6% did. Ramp, a fintech firm, finds that in early 2025 AI use soared at American firms to 40%, before levelling off. The growth in adoption really does seem to be slowing.

I am skeptical of the metrics used by the Economist to produce this summary, in part because they are all over the place, and also because they are mostly surveys. I am not sure people always know they are using a generative A.I. product, especially when those features are increasingly just part of the modern office software stack.

While the Economist has an unfortunate allergy to linking to its sources, I wanted to track them down because a fuller context is sometimes more revealing. I believe the U.S. Census data is the Business Trends and Outlook Survey, though I am not certain because its charts are just plain, non-interactive images. In any case, it is the Economist’s own estimate of falling — not stalling — adoption by workers, not an estimate produced by the Census Bureau, which is curious given that two of its other sources indicate a plateau rather than a decline.

The Hartley, et al. survey is available here and contains some fascinating results other than the specific figures highlighted by the Economist — in particular, that the construction industry has the fourth-highest adoption of generative A.I., that Gemini is shown in Figure 9 as more popular than ChatGPT even though the text on page 7 indicates the opposite, and that the word “Microsoft” does not appear once in the entire document. I have some admittedly uninformed and amateur questions about its validity. At any rate, this is the only source the Economist cites which indicates a decline.

The data point attributed to the tracker operated by the Federal Reserve Bank of St. Louis is curious. The Economist notes “in August 2024, 12.1% of working-age adults used generative A.I. every day at work. A year later 12.6% did”, but I am looking at the dashboard right now, and it says the share using generative A.I. daily at work is 13.8%, not 12.6%. In the same time period, the share of people using it “at least once last week” jumped from 36.1% to 46.9%. I have no idea where that 12.6% number came from.

Finally, Ramp’s data is easy enough to find. Again, I have to wonder about the Economist’s selective presentation. If you switch the chart from an overall view to a sector-based view, you can see adoption of paid subscriptions has more than doubled in many industries compared to October last year. This is true even in “accommodation and food services”, where I have to imagine use cases are few and far between.

Tracking down the actual sources of the Economist’s data has left me skeptical of the premise of this article. However, plateauing interest — at least for now — makes sense to me on a gut level. There is a ceiling on the work one can entrust to interns or entry-level employees, and that ceiling is roughly similar for many of today’s A.I. tools. There are also sector-level limits. Consider Ramp’s data showing high adoption in the tech and finance industries, with considerably less in sectors like healthcare and food services. (Curiously, Ramp says only 29% of the U.S. construction industry has a subscription to generative A.I. products, while Hartley, et al. says over 40% of the construction industry is using it.)

I commend any attempt to figure out how useful generative A.I. is in the real world. One of the problems with this industry right now is that its biggest purveyors are not public companies and, therefore, have fewer disclosure requirements. Like any company, they are incentivized to inflate their importance, but we have little understanding of how much they are exaggerating. If you want to hear some corporate gibberish, OpenAI interviewed executives at companies like Philips and Scania about their use of ChatGPT, but I do not know what I gleaned from either interview — something about experimentation and vague stuff about people being excited to use it, I suppose. It is not very compelling to me. I am not in the C-suite, though.

The biggest public A.I. firm is arguably Microsoft. It has rolled out Copilot to Windows and Office users around the world. Again, however, its press releases leave much to be desired. Levi Strauss employees, Microsoft says, “report the devices and operating system have led to significant improvements in speed, reliability and data handling, with features like the Copilot key helping reduce the time employees spend searching and free up more time for creating”. Sure. In another case study, Microsoft and Pantone brag about the integration of a colour palette generator that you can use with words instead of your eyes.

Microsoft has every incentive to pretend Copilot is a revolutionary technology. For people actually doing the work, however, its ever-nagging presence might be one of many nuisances getting in the way of the job that person actually knows how to do. A few months ago, the company replaced the familiar Office portal with a Copilot prompt box. It is still little more than a thing I need to bypass to get to my work.

All the stats and apparent enthusiasm about A.I. in the workplace are, as far as I can tell, a giant mess. A problem with this technology is that the ways in which it is revolutionary are often not very useful, its practical application in a work context is a mixed bag that depends on industry and role, and its hype encourages otherwise respectable organizations to suggest their proximity to its promised future.

The Economist being what it is, much of this article revolves around the insufficiently realized efficiency and productivity gains, and that is certainly something for business-minded people to think about. But there are more fundamental issues with generative A.I. to struggle with. It is a technology built on a shaky foundation. It shrinks the already-scant field of entry-level jobs. Its results are unpredictable and can validate harm. The list goes on, yet it is being loudly inserted into our SaaS-dominated world as a top-down mandate.

It turns out A.I. is not magic dust you can sprinkle on a workforce to double their productivity. CEOs might be thrilled by having all their email summarized, but the rest of us do not need that. We need things like better balance of work and real life, good benefits, and adequate compensation. Those are things a team leader cannot buy with a $25-per-month-per-seat ChatGPT business license.

An App Named Alan

By: Nick Heer

Tyler Hall:

Maybe it’s because my eyes are getting old or maybe it’s because the contrast between windows on macOS keeps getting worse. Either way, I built a tiny Mac app last night that draws a border around the active window. I named it “Alan”.

A good, cheeky name. The results are not what I would call beautiful, but that is not the point, is it? It works well. I wish it did not feel understandable for there to be an app that draws a big border around the currently active window. That should be something made sufficiently obvious by the system.

Unfortunately, this is a problem plaguing the latest versions of MacOS and Windows alike, which is baffling to me. The bar for what constitutes acceptable user interface design seems to have fallen low enough that it is tripping everyone at the two major desktop operating system vendors.

⌥ Permalink

Threads Continues to Reward Rage Bait

By: Nick Heer

Hank Green was not getting a lot of traction on a promotional post on Threads about a sale on his store. He got just over thirty likes, which does not sound awful, until you learn that was over the span of seven hours and across Green’s following of 806,000 accounts on Threads.

So he tried replying to rage bait with basically the same post, and that was far more successful. But, also, it has some pretty crappy implications:

That’s the signal that Threads is taking from this: Threads is like oh, there’s a discussion going on.

It’s 2025! Meta knows that “lots of discussion” is not a surrogate for “good things happening”!

I assume the home feed ranking systems are similar for Threads and Instagram — though they might not be — and I cannot tell you how many times my feed is packed with posts from many days to a week prior. So many businesses I frequent use it as a promotional tool for time-bound things I learn about only afterward. The same thing is true of Stories, since they are sorted based on how frequently you interact with an account.

Everyone is allowed one conspiracy theory, right? Mine is that a primary reason Meta is hostile to reverse-chronological feeds is because it requires businesses to buy advertising. I have no proof to support this, but it seems entirely plausible.

⌥ Permalink

⌥ Moraine Luck

By: Nick Heer

You have seen Moraine Lake. Maybe it was on a postcard or in a travel brochure, or it was on Reddit, or in Windows Vista, or as part of a “Best of California” demo on Apple’s website. Perhaps you were doing laundry in Lucerne. But I am sure you have seen it somewhere.

Moraine Lake is not in California — or Switzerland, for that matter. It is right here in Alberta, between Banff and Lake Louise, and I have been lucky enough to visit many times. One time I was particularly lucky, in a way I only knew in hindsight. I am not sure the confluence of events occurring in October 2019 is likely to be repeated for me.

In 2019, the road up to the lake would be open to the public from May until about mid-October, though the closing day would depend on when it was safe to travel. This is one reason why so many pictures of it have only the faintest hint of snow capping the mountains behind — it is only really accessible in summer.

I am not sure why we decided to head up to Lake Louise and Moraine Lake that Saturday. Perhaps it was just an excuse to get out of the house. It was just a few days before the road was shut for the season.

We visited Lake Louise first and it was, you know, just fine. Then we headed to Moraine.

I posted a higher-quality version of this on my Glass profile.
A photo of Moraine Lake, Alberta, frozen with chunks of ice and rocks on its surface.

Walking from the car to the lakeshore, we could see its surface was that familiar blue-turquoise, but it was entirely frozen. I took a few images from the shore. Then we realized we could just walk on it, as did the handful of other people who were there. This is one of several photos I took from the surface of the lake, the glassy ice reflecting that famous mountain range in the background.

I am not sure I would be able to capture a similar image today. Banff and Lake Louise have received more visitors than ever in recent years, to the extent private vehicles are no longer allowed to travel up to Moraine Lake. A shuttle bus is now required. The lake also does not reliably freeze at an accessible time and, when it does, it can be covered in snow or the water line may have receded. I am not arguing this is an impossible image to create going forward. I just do not think I am likely to see it this way again.

I am very glad I remembered to bring my camera.

OpenAI’s House Counsel to Be Deposed Over Deleted Pirated Material

By: Nick Heer

Winston Cho, the Hollywood Reporter:

To rewind, authors and publishers have gained access to Slack messages between OpenAI’s employees discussing the erasure of the datasets, named “books 1 and books 2.” But the court held off on whether plaintiffs should get other communications that the company argued were protected by attorney-client privilege.

In a controversial decision that was appealed by OpenAI on Wednesday, U.S. District Judge Ona Wang found that OpenAI must hand over documents revealing the company’s motivations for deleting the datasets. OpenAI’s in-house legal team will be deposed.

Wang’s decision (PDF), to the extent I can read it as a layperson, examines OpenAI’s shifting story about why it erased the books 1 and books 2 data sets — apparently, the only time possible training materials were deleted.

I am not sure it has yet been proven OpenAI trained its models on pirated books. Anthropic settled a similar suit in September, and Meta and Apple are facing similar accusations. For practical purposes, however, it is trivial to suggest it did use pirated data in general: if you have access to its Sora app, enter any prompt followed by the word “camrip”.

What is a camrip, a strictly law-abiding person might ask? It is a label added to a movie pirated in the old-fashioned way: by pointing a video camera at the screen in a theatre. As a result, these videos have a distinctive look and sound which is reproduced perfectly by Sora. It is very difficult for me to see a way in which OpenAI could have trained this model to understand what a camrip is without feeding it a bunch of them, and I do not know of a legitimate source for such videos.

⌥ Permalink

Internet Archive Wayback Machine Link Fixer

By: Nick Heer

The Internet Archive released a WordPress plugin not too long ago:

Internet Archive Wayback Machine Link Fixer is a WordPress plugin designed to combat link rot—the gradual decay of web links as pages are moved, changed, or taken down. It automatically scans your post content — on save and across existing posts — to detect outbound links. For each one, it checks the Internet Archive’s Wayback Machine for an archived version and creates a snapshot if one isn’t available.

Via Michael Tsai:

The part where it replaces broken links with archive links is implemented in JavaScript. I like that it doesn’t modify the post content in your database. It seems safe to install the plug-in without worrying about it messing anything up. However, I had kind of hoped that it would fix the links as part of the PHP rendering process. Doing it in JavaScript means that the fixed links are not available in the actual HTML tags on the page. And the data that the JavaScript uses is stored in an invisible <div> under the attribute data-iawmlf-post-links, which makes the page fail validation.

I love the idea of this plugin, but I do not love this implementation. I think I understand why it works this way: for the nondestructive property mentioned by Tsai, and also to account for its dependence on a third-party service of varying reliability. I would love to see a demo of this plugin in action.

⌥ Permalink

Investigating a Possible Scammer in Journalism’s A.I. Era

By: Nick Heer

Nicholas Hune-Brown, the Local:

Every media era gets the fabulists it deserves. If Stephen Glass, Jayson Blair and the other late 20th century fakers were looking for the prestige and power that came with journalism in that moment, then this generation’s internet scammers are scavenging in the wreckage of a degraded media environment. They’re taking advantage of an ecosystem uniquely susceptible to fraud—where publications with prestigious names publish rickety journalism under their brands, where fact-checkers have been axed and editors are overworked, where technology has made falsifying pitches and entire articles trivially easy, and where decades of devaluing journalism as simply more “content” have blurred the lines so much it can be difficult to remember where they were to begin with.

This is likely not the first story you have read about a freelancer managing to land bylines in prestigious publications thanks to dependency on A.I. tools, but it is one told very well.

⌥ Permalink

Web Development Tip: Disable Pointer Events on Link Images

By: Nick Heer

Good tip from Jeff Johnson:

My business website has a number of “Download on the App Store” links for my App Store apps. Here’s an example of what that looks like:

[…]

The problem is that Live Text, “Select text in images to copy or take action,” is enabled by default on iOS devices (Settings → General → Language & Region), which can interfere with the contextual menu in Safari. Pressing down on the above link may select the text inside the image instead of selecting the link URL.

I love the Live Text feature, but it often conflicts with graphics like these. There is a good, simple, two-line CSS trick for web developers that should cover most situations. Also, if you rock a user stylesheet — and I think you should — it seems to work fine as a universal solution. Any issues I have found have been minor and not worth noting. I say give it a shot.

Update: Adding Johnson’s CSS to a user stylesheet mucks up the layout of Techmeme a little bit. You can exclude it by adding “div:not(.ii) >” before “a:has(> img) { display: inline-block; }”.

⌥ Permalink

‘The iPad’s Software Problem Is Permanent’

By: Nick Heer

Quinn Nelson:

[…] at a moment when the Mac has roared back to the centre of Apple’s universe, the iPad feels closer than ever to fulfilling its original promise. Except it doesn’t, not really, because while the iPad has gained windowing and external display support, pro apps, all the trappings of a “real computer”, underneath it all, iPadOS is still a fundamentally mobile operating system with mobile constraints baked into its very DNA.

Meanwhile, the Mac is rumoured to be getting everything the iPad does best: touchscreens, OLED displays, thinner designs.

There are things I quibble with in Nelson’s video, including the above-quoted comparison to mere rumours about the Mac. The rest of the video is more compelling as it presents comparisons with the same or similar software on each platform in real-world head-to-head matches.

Via Federico Viticci, MacStories:

I’m so happy that Apple seems to be taking iPadOS more seriously than ever this year. But now I can’t help but wonder if the iPad’s problems run deeper than windowing when it comes to getting serious work done on it.

Apple’s post-iPhone platforms are only as good as Apple will allow them to be. I am not saying it needs to be possible to swap out Bluetooth drivers or monkey around with low-level code, but without more flexibility, platforms like the iPad and Vision Pro are destined to progress only at the rate Apple says is acceptable, and with the third-party apps it says are permissible. These are apparently the operating systems for the future of computers. They are not required to have similar limitations to the iPhone, but they do anyway. Those restrictions are holding back the potential of these platforms.

⌥ Permalink

Polarization in the United States Has Become the World’s Side Hustle

By: Nick Heer

Marina Dunbar, the Guardian:

Many of the most influential personalities in the “Make America great again” (Maga) movement on X are based outside of the US, including Russia, Nigeria and India, a new transparency feature on the social media site has revealed.

The new tool, called “about this account”, became available on Friday to users of the Elon Musk-owned platform. It allows anyone to see where an account is located, when it joined the platform, how often its username has been changed, and how the X app was downloaded.

This is a similar approach to adding labels or notes to tweets containing misinformation in that it is adding more speech and context. It is more automatic, but the function and intent are comparable, which means Musk’s hobbyist P.R. team must be all worked up. But I checked, and none seem particularly bothered. Maybe they actually care about trust and safety now, or maybe they are lying hacks.

Mike Masnick, Techdirt:

For years, Matt Taibbi, Michael Shellenberger, and their allies have insisted that anyone working on these [trust and safety] problems was part of a “censorship industrial complex” designed to silence political speech. Politicians like Ted Cruz and Jim Jordan repeated these lies. They treated trust & safety work as a threat to democracy itself.

Then Musk rolled out one basic feature, and within hours proved exactly why trust & safety work existed in the first place.

Jason Koebler, 404 Media, has been covering the monetization of social media:

This has created an ecosystem of side hustlers trying to gain access to these programs and YouTube and Instagram creators teaching people how to gain access to them. It is possible to find these guide videos easily if you search for things like “monetized X account” on YouTube. Translating that phrase and searching in other languages (such as Hindi, Portuguese, Vietnamese, etc) will bring up guides in those languages. Within seconds, I was able to find a handful of YouTubers explaining in Hindi how to create monetized X accounts; other videos on the creators’ pages explain how to fill these accounts with AI-generated content. These guides also exist in English, and it is increasingly popular to sell guides to make “AI influencers,” and AI newsletters, Reels accounts, and TikTok accounts regardless of the country that you’re from.

[…]

Americans are being targeted because advertisers pay higher ad rates to reach American internet users, who are among the wealthiest in the world. In turn, social media companies pay more money if the people engaging with the content are American. This has created a system where it makes financial sense for people from the entire world to specifically target Americans with highly engaging, divisive content. It pays more.

The U.S. market is a larger audience, too. But those of us in rich countries outside the U.S. should not get too comfortable; I found plenty of guides similar to the ones shown by Koebler for targeting Australia, Canada, Germany, New Zealand, and more. Worrisome — especially if you, say, are somewhere with an electorate trying to drive the place you live off a cliff.

Update: Several X accounts purporting to be Albertans supporting separatism appear to be from outside Canada, including a “Concerned 🍁 Mum”, “Samantha”, “Canada the Illusion”, and this “Albertan” all from the United States, and a smaller account from Laos. I tried to check more, but X’s fragile servers are aggressively rate-limited.

I do not think people from outside a country are forbidden from offering an opinion on what is happening within it. I would be a pretty staggering hypocrite if I thought that. Nor do I think we should automatically assume people who are stoking hostile politics on social media are necessarily external or bots. It is more like a reflection of who we are now, and how easily that can be exploited.

⌥ Permalink

Meta’s Accounting of Its Louisiana Data Centre ‘Strains Credibility’

By: Nick Heer

Jonathan Weil, Wall Street Journal:

It seems like a marvel of financial engineering: Meta Platforms is building a $27 billion data center in Louisiana, financed with debt, and neither the data center nor the debt will be on its own balance sheet.

That outcome looks too good to be true, and it probably is.

The phrase “marvel of financial engineering” does not seem like a compliment. In addition to the evidence from Weil’s article, Meta is taking advantage of a tax exemption created by Louisiana’s state legislature. But, by its own argument, it is merely a user of this data centre.

Also, colour me skeptical this data centre will truly be “the size of Manhattan” before the bubble bursts, despite the disruption to life in the area.

Update: Paris Martineau points to Weil’s bio noting he was “the first reporter to challenge Enron’s accounting practices”.

⌥ Permalink

A.I. Mania Looks and Feels Bigger Than the .Com Bubble

By: Nick Heer

Fred Vogelstein, Crazy Stupid Tech — which, again, is a compliment:

We’re not only in a bubble but one that is arguably the biggest technology mania any of us have ever witnessed. We’re even back reinventing time. Back in 1999 we talked about internet time, where every year in the new economy was like a dog year – equivalent to seven years in the old.

Now VCs, investors and executives are talking about AI dog years – let’s just call them mouse years – which is internet time divided by five? Or is it by 11? Or 12? Sure, things move way faster than they did a generation ago. But by that math one year today now equals 35 years in 1995. Really?

A sobering piece that, unfortunately, is somewhat undercut because it lacks a single mention of layoffs, jobs, employment, or any other indication that this bubble will wreck the lives of people far outside its immediate orbit. In fairness, few of the related articles linked at the bottom mention that, either. Articles in Stratechery, the Brookings Institution, and the New York Times want you to think a bubble is just a sign of building something new and wonderful. A Bloomberg newsletter mentions layoffs only in the context of changing odds in prediction markets — I chuckled — while M.G. Siegler notes all the people who are being laid off while new A.I. hires get multimillion-dollar employment packages. Maybe all the pain and suffering that is likely to result from the implosion of this massive sector is too obvious to mention for the MBA and finance types. I think it is worth stating, though, not least because it acknowledges other people are worth caring about at least as much as innovation and growth and all that stuff.

⌥ Permalink

Our mixed assortment of DNS server software (as of December 2025)

By: cks

Without deliberately planning it, we've wound up running an assortment of DNS server software on an assortment of DNS servers. A lot of this involves history, so I might as well tell the story of that history in the process. This starts with our three sets of DNS servers: our internal DNS master (with a duplicate) that holds both the internal and external views of our zones, our resolving DNS servers (which use our internal zones), and our public authoritative DNS server (carrying our external zones, along with various relics of the past). These days we also have an additional resolving DNS server that resolves from outside our networks and so gives the people who can use it an external view of our zones.

In the beginning we ran Bind on everything, as was the custom in those days (and I suspect we started out without a separation between the three types of DNS servers, but that predates my time here), and I believe all of the DNS servers were Solaris. Eventually we moved the resolving DNS servers and the public authoritative DNS server to OpenBSD (and the internal DNS master to Ubuntu), still using Bind. Then OpenBSD switched which nameservers they liked from Bind to Unbound and NSD, so we went along with that. Our authoritative DNS server had a relatively easy NSD configuration, but our resolving DNS servers presented some challenges and we wound up with a complex Unbound plus NSD setup. Recently we switched our internal resolvers to using Bind on Ubuntu, and then we switched our public authoritative DNS server from OpenBSD to Ubuntu but kept it still with NSD, since we already had a working NSD configuration for it.

This has wound up with us running the following setups:

  • Our internal DNS masters run Bind in a somewhat complex split horizon configuration.

  • Our internal DNS resolvers run Bind in a simpler configuration where they act as internal authoritative secondary DNS servers for our own zones and as general resolvers.

  • Our public authoritative DNS server (and its hot spare) run NSD as an authoritative secondary, doing zone transfers from our internal DNS masters.

  • We have an external DNS resolver machine that runs Unbound in an extremely simple configuration. We opted to build this machine with Unbound because we didn't need it to act as anything other than a pure resolver, and Unbound is simple to set up for that.

At one level, this is splitting our knowledge and resources among three different DNS server programs rather than focusing on one. At another level, two out of the three are being used in quite simple setups (and we already had the NSD setup written from prior use). Our only complex configurations are all Bind based, and we've explicitly picked Bind for complex setups because we feel we understand it fairly well from long experience with it.

(Specifically, I can configure a simple Unbound resolver faster and easier than I can do the same with Bind. I'm sure there's a simple resolver-only Bind configuration, it's just that I've never built one and I have built several simple and not so simple Unbound setups.)

Getting out of being people's secondary authoritative DNS server is hard

By: cks

Many, many years ago, my department operated one of the university's secondary authoritative DNS servers, which was used by most everyone with a university subdomain and as a result was listed in their DNS NS records. This DNS server was also the authoritative DNS server for our own domains, because this was in the era when servers were expensive and it made perfect sense to do this. At the time, departments who wanted a subdomain pretty much needed to have a Unix system administrator and probably run their own primary DNS server and so on. Over time, the university's DNS infrastructure shifted drastically, with central IT offering more and more support, and more than half a decade ago our authoritative DNS server stopped being a university secondary, after a lot of notice to everyone.

Experienced system administrators can guess what happened next. Or rather, what didn't happen next. References to our DNS server lingered in various places for years, both in the university's root zones as DNS glue records and in people's own DNS zone files as theoretically authoritative records. As late as the middle of last year, when I started grinding away on this, I believe that roughly half of our authoritative DNS server's traffic was for old zones we didn't serve and was getting DNS 'Refused' responses. The situation is much better today, after several rounds of finding other people's zones that were still pointing to us, but it's still not quite over and it took a bunch of tedious work to get this far.

(Why I care about this is that it's hard to see if your authoritative DNS server is correctly answering everything it should if things like tcpdumps of DNS traffic are absolutely flooded with bad traffic that your DNS server is (correctly) rejecting.)

In theory, what we should have done when we stopped being a university secondary authoritative DNS server was to switch the authoritative DNS server for our own domains to another name and another IP address; this would have completely cut off everyone else when we turned the old server off and removed its name from our DNS. In practice the transition was not clearcut, because for a while we kept on being a secondary for some other university zones that have long-standing associations with the department. Also, I think we were optimistic about how responsive people would be (and how many of them we could reach).

(Also, there's a great deal of history tied up in the specific name and IP address of our current authoritative DNS server. It's been there for a very long time.)

PS: Even when no one is incorrectly pointing to us, there's clearly a background Internet radiation of external machines throwing random DNS queries at us. But that's another entry.

In Linux, filesystems can and do have things with inode number zero

By: cks

A while back I wrote about how in POSIX you could theoretically use inode (number) zero. Not all Unixes consider inode zero to be valid; prominently, OpenBSD's getdents(2) doesn't return valid entries with an inode number of 0, and by extension, OpenBSD's filesystems won't have anything that uses inode zero. However, Linux is a different beast.

Recently, I saw a Go commit message with the interesting description of:

os: allow direntries to have zero inodes on Linux

Some Linux filesystems have been known to return valid entries with zero inodes. This new behavior also puts Go in agreement with recent glibc.

This fixes issue #76428, and the issue has a simple reproduction to create something with inode numbers of zero. According to the bug report:

[...] On a Linux system with libfuse 3.17.1 or later, you can do this easily with GVFS:

# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big

The resulting filesystem mount is in /run/user/$UID/gvfs (see the issue for the exact long path) and can be experimentally verified to have entries with inode numbers of zero (well, as reported by reading the directory). On systems using glibc 2.37 and later, you can look at this directory with 'ls' and see the zero inode numbers.

(Interested parties can try their favorite non-C or non-glibc bindings to see if those environments correctly handle this case.)
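If you want to poke at this from a non-glibc environment yourself, here is a minimal Go sketch of my own (not from the Go issue or the Go commit) that reads raw directory entries with Getdents from golang.org/x/sys/unix and prints each entry's d_ino, so zero inode numbers show up directly rather than through ls. It assumes Linux, a little-endian machine, and the linux_dirent64 record layout.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "log"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        dir := "."
        if len(os.Args) > 1 {
            dir = os.Args[1]
        }
        fd, err := unix.Open(dir, unix.O_RDONLY|unix.O_DIRECTORY, 0)
        if err != nil {
            log.Fatal(err)
        }
        defer unix.Close(fd)

        buf := make([]byte, 64*1024)
        for {
            n, err := unix.Getdents(fd, buf)
            if err != nil {
                log.Fatal(err)
            }
            if n == 0 {
                return
            }
            // Walk the raw linux_dirent64 records by hand: u64 d_ino,
            // s64 d_off, u16 d_reclen, u8 d_type, then a NUL-terminated
            // name. This assumes a little-endian machine.
            for off := 0; off < n; {
                ino := binary.LittleEndian.Uint64(buf[off : off+8])
                reclen := int(binary.LittleEndian.Uint16(buf[off+16 : off+18]))
                name := buf[off+19 : off+reclen]
                if i := bytes.IndexByte(name, 0); i >= 0 {
                    name = name[:i]
                }
                fmt.Printf("%d\t%s\n", ino, name)
                off += reclen
            }
        }
    }

Pointed at an ordinary directory this should print normal inode numbers; pointed at something like the gvfs mount from the reproduction above, it is where you would expect to see the zeros.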

That this requires glibc 2.37 is due to this glibc bug, first opened in 2010 (but rejected at the time for reasons you can read in the glibc bug) and then resurfaced in 2016 and eventually fixed in 2022 (and then again in 2024 for the thread safe version of readdir). The 2016 glibc issue has a bit of a discussion about the kernel side. As covered in the Go issue, libfuse returning a zero inode number may be a bug itself, but there are (many) versions of libfuse out in the wild that actually do this today.

Of course, libfuse (and gvfs) may not be the only Linux filesystems and filesystem environments that can create this effect. I believe there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might have the same bug as libfuse does.

(Both Go and Rust have at least one native binding to the kernel FUSE driver. I haven't looked at either to see what they do about inode numbers.)

PS: My understanding of the Linux (kernel) situation is that if you have something inside the kernel that needs an inode number and you ask the kernel to give you one (through get_next_ino(), an internal function for this), the kernel will carefully avoid giving you inode number 0. A lot of things get inode numbers this way, so this makes life easier for everyone. However, a filesystem can decide on inode numbers itself, and when it does it can use inode number 0 (either explicitly or by zeroing out the d_ino field in the getdents(2) dirent structs that it returns, which I believe is what's happening in the libfuse situation).

Some things on X11's obscure DirectColor visual type

By: cks

The X Window System has a long-standing concept called 'visuals'; to simplify, an X visual determines how pixel values are turned into colors. As I wrote about a number of years ago, these days X11 mostly uses 'TrueColor' visuals, which directly supply 8-bit values for red, green, and blue ('24-bit color'). However, X11 has a number of visual types, such as the straightforward PseudoColor indirect colormap (where every pixel value is an index into an RGB colormap; typically you'd get 8-bit pixels and 24-bit colormaps, so you could have 256 colors out of a full 24-bit gamut). One of the (now) obscure visual types is DirectColor. To quote:

For DirectColor, a pixel value is decomposed into separate RGB subfields, and each subfield separately indexes the colormap for the corresponding value. The RGB values can be changed dynamically.

(This is specific to X11; X10 had a different display color model.)

In a PseudoColor visual, each pixel's value is taken as a whole and used as an index into a colormap that gives the RGB values for that entry. In DirectColor, the pixel value is split apart into three values, one each for red, green, and blue, and each value indexes a separate colormap for that color component. Compared to a PseudoColor visual of the same pixel depth (size, eg each pixel is an 8-bit byte), you get less possible variety within a single color component and (I believe) no more colors in total.
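To make that decomposition concrete, here is a toy sketch in Go, purely my own illustration and nothing like real X server or Xlib code: a 24-bit pixel is split into three 8-bit subfields and each subfield indexes its own little colormap. The masks, shifts, and LUT contents are all invented for the example.

    package main

    import "fmt"

    // One small lookup table ("colormap") per color component. In a real
    // DirectColor visual these can be changed on the fly by programs.
    var redLUT, greenLUT, blueLUT [256]uint8

    func init() {
        for i := 0; i < 256; i++ {
            redLUT[i] = uint8(i)        // identity mapping
            greenLUT[i] = uint8(i)      // identity mapping
            blueLUT[i] = uint8(255 - i) // an arbitrary remapped channel
        }
    }

    // decode splits a 24-bit pixel into its red, green and blue subfields
    // and runs each subfield through its own colormap, which is the
    // DirectColor model in miniature.
    func decode(pixel uint32) (r, g, b uint8) {
        return redLUT[(pixel>>16)&0xff],
            greenLUT[(pixel>>8)&0xff],
            blueLUT[pixel&0xff]
    }

    func main() {
        r, g, b := decode(0x336699)
        fmt.Printf("pixel 0x336699 -> r=%d g=%d b=%d\n", r, g, b)
    }

The TrueColor case is just this with all three LUTs fixed as identity mappings.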

When this came up in my old entry about TrueColor and PseudoColor visuals, in a comment Aristotle Pagaltzis speculated:

[...] maybe it can be implemented as three LUTs in front of a DAC’s inputs or something where the performance impact is minimal? (I’m not a hardware person.) [...]

I was recently reminded of this old entry and when I reread that comment, an obvious realization struck me about why DirectColor might make hardware sense. Back in the days of analog video, essentially every serious sort of video connection between your computer and your display carried the red, green, and blue components separately; you can see this in the VGA connector pinouts, and on old Unix workstations these might literally be separate wires connected to separate BNC connectors on your CRT display.

If you're sending the red, green, and blue signals separately you might also be generating them separately, with one DAC per color channel. If you have separate DACs, it might be easier to feed them from separate LUTs and separate pixel data, especially back in the days when much of a Unix workstation's graphics system was implemented in relatively basic, non-custom chips and components. You can split off the bits from the raw pixel value with basic hardware and then route each color channel to its own LUT, DAC, and associated circuits (although presumably you need to drive them with a common clock).

The other way to look at DirectColor is that it's a more flexible version of TrueColor. A TrueColor visual is effectively a 24-bit DirectColor visual where the color mappings for red, green, and blue are fixed rather than variable (this is in fact how it's described in the X documentation). Making these mappings variable costs you only a tiny bit of extra memory (you need 256 bytes for each color) and might require only a bit of extra hardware in the color generation process, and it enables the program using the display to change colors on the fly with small writes to the colormap rather than large writes to the framebuffer (which, back in the days, were not necessarily very fast). For instance, if you're looking at a full screen image and you want to brighten it, you could simply shift the color values in the colormaps to raise the low values, rather than recompute and redraw all the pixels.

(Apparently DirectColor was often used with 24-bit pixels, split into one byte for each color, which is the same pixel layout as a 24-bit TrueColor visual; see eg this section of the Starlink Project's Graphics Cookbook. Also, this seems to be how the A/UX X server worked. If you were going to do 8-bit pixels, I suspect people preferred PseudoColor to DirectColor.)

These days this is mostly irrelevant and the basic simplicity of the TrueColor visual has won out. Well, what won out is PC graphics systems that followed the same basic approach of fixed 24-bit RGB color, and then X went along with it on PC hardware, which became more or less the only hardware.

(There probably was hardware with DirectColor support. While X on PC Unixes will probably still claim to support DirectColor visuals, as reported in things like xdpyinfo, I suspect that it involves software emulation. Although these days you could probably implement DirectColor with GPU shaders at basically no cost.)

Sending DMARC reports is somewhat hazardous

By: cks

DMARC has a feature where you can request that other mail systems send you aggregate reports about the DMARC results that they observed for email claiming to be from you. If you're a large institution with a sprawling, complex, multi-party mail environment and you're considering trying to make your DMARC policy stricter, it's very useful to get as many DMARC reports from as many people as possible. Especially, 'you' (in a broad sense) probably want to get as much information from mail systems run by sub-units as possible, and if you're a sub-unit, you want to report DMARC information up to the organization so they have as much visibility into what's going on as possible.
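As an aside on the mechanics, the request for reports lives in the domain's DMARC DNS record as a 'rua=' tag, and a report generator discovers it with an ordinary TXT lookup. Here is a minimal Go sketch of that lookup, my own illustration rather than anything from real DMARC reporting software, with very simplified parsing (real records have more tags and more quoting rules):

    package main

    import (
        "fmt"
        "net"
        "os"
        "strings"
    )

    // ruaAddresses looks up the _dmarc TXT record for a domain and returns
    // the report destinations from its rua= tag, if there are any.
    func ruaAddresses(domain string) ([]string, error) {
        txts, err := net.LookupTXT("_dmarc." + domain)
        if err != nil {
            return nil, err
        }
        for _, txt := range txts {
            if !strings.HasPrefix(txt, "v=DMARC1") {
                continue
            }
            for _, tag := range strings.Split(txt, ";") {
                tag = strings.TrimSpace(tag)
                if strings.HasPrefix(tag, "rua=") {
                    return strings.Split(strings.TrimPrefix(tag, "rua="), ","), nil
                }
            }
        }
        return nil, nil
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintf(os.Stderr, "usage: %s domain\n", os.Args[0])
            os.Exit(1)
        }
        addrs, err := ruaAddresses(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for _, a := range addrs {
            fmt.Println(a) // typically a mailto: URI
        }
    }

A bad or stale address in that rua= tag is exactly what produces the bounces and stuck queue entries I describe below.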

In related news, I've been looking into making our mail system send out DMARC reports, and I had what was in retrospect a predictable learning experience:

Today's discovery: if you want to helpfully send out DMARC reports to people who ask for them and you operate even a moderate sized email system, you're going to need to use a dedicated sending server and you probably don't want to. Because a) you'll be sending a lot of email messages and b) a lot of them will bounce because people's DMARC records are inaccurate and c) a decent number of them will camp out in your mail queue because see b, they're trying to go to non-responsive hosts.

Really, all of this DMARC reporting nonsense was predictable from first (Internet) principles, but I didn't think about it and was just optimistic when I turned our reporting on for local reasons. Of course people are going to screw up their DMARC reporting information (or, for spammers, just make it up); they screw everything up, and DMARC data will be no exception.

(Or they take systems and email addresses out of service without updating their DMARC records.)

If you operate even a somewhat modest email system that gets a wide variety of email, as we do, it doesn't take very long to receive email from hundreds of From: domains that have DMARC records in DNS that request reports. When you generate your DMARC reports (whether once a day or more often), you'll send out hundreds of email messages to those report addresses. If you send them through your regular outgoing email system, you'll have a sudden influx of a lot of messages and you may trigger any anti-flood ratelimits you have. Once your reporting system has upended those hundreds of reports into your mail system, your mail system has to process through them; some of them will be delivered promptly, some of them will bounce (either directly or inside the remote mail system you hand them off to), and some of them will be theoretically destined for (currently) non-responsive hosts and thus will clog up your mail queue with repeated delivery attempts. If you're sending these reports through a general purpose mail system, your mail queue probably has a long timeout for stalled email, which is not really what you want in this case; your DMARC reports are more like 'best effort one time delivery attempt and then throw the message away' email. If this report doesn't get through and the issue is transient, you'll keep getting email with that From: domain and eventually one of your reports will go through. DMARC reports are definitely not 'gotta deliver them all' email.

So in my view, you're almost certainly going to have to be selective about what domains you send DMARC reports for. If you're considering this and you can, it may help to trawl your logs to see what domains are failing DMARC checks and pick out the ones you care about (such as, say, your organization's overall domain or domains). It's somewhat useful to report even successful DMARC results (where the email passes DMARC checks), but if you're considering acting on DMARC results, it's important to get false negatives fixed. If you want to send DMARC reports to everyone, you'll want to set up a custom mail system, perhaps on the machine that generates the DMARC reports, which blasts everything out, efficiently handles potentially large queues and fast submission rates, and discards queued messages quickly (and obviously doesn't send you any bounces).

(Sending through a completely separate mail system also avoids the possibility that someone will decide to put your regular system on a blocklist because of your high rate of DMARC report email.)

PS: Some of those hundreds of From: domains with DMARC records that request reports will be spammer domains; I assume that putting a 'rua=' into your DMARC record makes it look more legitimate to (some) receiving systems. Spammers sending from their own domains can DKIM sign their messages, but having working reporting addresses requires extra work and extra exposure. And of course spammers often rotate through domains rapidly.

Password fields should usually have an option to show the text

By: cks

I recently had to abruptly replace my smartphone, and because of how it happened I couldn't directly transfer data from the old phone to the new one; instead, I had to have the new phone restore itself from a cloud backup of the old phone (made on an OS version several years older than the new phone's OS). In the process, a number of passwords and other secrets fell off and I had to re-enter them. As I mentioned on the Fediverse, this didn't always go well:

I did get our work L2TP VPN to work with my new phone. Apparently the problem was a typo in one bit of one password secret, which is hard to see because of course there's no 'show the whole thing' option and you have to enter things character by character on a virtual phone keyboard I find slow and error-prone.

(Phone natives are probably laughing at my typing.)

(Some of the issue was that these passwords were generally not good ones for software keyboards.)

There are reasonable security reasons not to show passwords when you're entering them. In the old days, the traditional reason was shoulder surfing; today, we have to worry about various things that might capture the screen with a password visible. But at the same time, entering passwords and other secrets blindly is error prone, and especially these days the diagnostics of a failed password may be obscure and you might only get so many tries before bad things start happening.

(The smartphone approach of temporarily showing the last character you entered is a help but not a complete cure, especially if you're going back and forth three ways between the form field, the on-screen keyboard, and your saved or looked up copy of the password or secret.)

Partly as a result of my recent experiences, I've definitely come around to viewing those 'reveal the plain text of the password' options that some applications have as a good thing. I think a lot of applications should at least consider whether and how to do this, and how to make password entry less error prone in general. This especially applies if your application (and overall environment) doesn't allow pasting into the field (either from a memorized passwords system or by the person involved simply copying and pasting it from elsewhere, such as support site instructions).

In some cases, you might want to not even treat a 'password' field as a password (with hidden text) by default. Often things like wireless network 'passwords' or L2TP pre-shared keys are broadly known and perhaps don't need to be carefully guarded during input the way genuine account passwords do. If possible I'd still offer an option to hide the input text in whatever way is usual on your platform, but you could reasonably start the field out as not hidden.

Unfortunately, as of December 2025 I think there's no general way to do this in HTML forms in pure CSS, without JavaScript (there may be some browser-specific CSS attributes). I believe support for this is on the CSS roadmap somewhere, but that probably means at least several years before it starts being common.

(The good news is that a pure CSS system will presumably degrade harmlessly if the CSS isn't supported; the password will just stay hidden, which is no worse than today's situation with a basic form.)

Go still supports building non-module programs with GOPATH

By: cks

When Go 1.18 was released, I said that it made module mode mandatory, which I wasn't a fan of because it can break backward compatibility in practice (and switching a program to Go modules can be non-trivial). Recently on the Fediverse, @thepudds very helpfully taught me that I wasn't entirely correct and Go still sort of supports non-module GOPATH usage, and in fact according to issue 60915, the current support is going to be preserved indefinitely.

Specifically, what's preserved today (and into the future) is support for using 'go build' and 'go install' in non-module mode (with 'GO111MODULE=off'). This inherits all of the behavior of Go 1.17 and earlier, including the use of things in the program's /vendor/ area (which can be important if you made local hacks). This allows you to rebuild and modify programs that you already have a complete GOPATH environment for (with all of their direct and indirect dependencies fetched). Since Go 1.22 and later don't support the non-module version of 'go get', assembling such an environment from scratch is up to you (if, for example, you need to modify an old non-module program). If you have a saved version of a suitable earlier version of Go, using that is probably the easiest way.

(Initially I thought Go 1.17 was the latest version you could use for this, but that was wrong; you can use anything up through Go 1.21. Go 1.17 is merely the latest version where you can do this without explicitly setting 'GO111MODULE=off'.)

Of course you could just build your old non-module programs with your saved copy of Go 1.21 (if it still runs in your current OS and hardware environment), but rebuilding things with a modern version of Go has various advantages and may be required to support modern architectures and operating system versions that you're targeting. The latest versions of Go have compiler and runtime improvements and optimizations, standard library improvements, support for various more modern things in TLS and so on, and a certain amount of security fixes; you'll also get better support for using 'go version -m' on your built binaries (which is useful for tracking things later).

Learning this is probably going to get me to change how I handle some of our old programs. Even if I don't update their code, rebuilding them periodically on the latest Go version to update their binaries is probably a good thing, especially if they deal with cryptography (including SSH) or HTTP things.

(In retrospect this was implied by what the Go 1.18 release notes said. In fact even at the time I didn't read enough of the release notes; in forced 'Go modules off' mode, the Go 1.18 'go get' will still get things for you. That ability was removed later, in Go 1.22. Right up through Go 1.21, 'GO111MODULE=off go get [-u]' will do the traditional dependency fetching and so on for you.)

Discovering that my smartphone had infiltrated my life

By: cks

While I have a smartphone, I think of myself as not particularly using it all that much. I got a smartphone quite late, it spends a lot of its life merely sitting there (not even necessarily in the same room as me, especially at home), and while I installed various apps (such as a SSH client) I rarely use them; they're mostly for weird emergencies. Then I suddenly couldn't use my current smartphone any more and all sorts of things came out of the woodwork, both things I sort of knew about but hadn't realized how much they'd affect me and things that I didn't even think about until I had a dead phone.

The really obvious and somewhat nerve wracking thing I expected from the start is that plenty of things want to send you text messages (both for SMS authentication codes and to tell you what steps to do to, for example, get your new replacement smartphone). With no operating smartphone I couldn't receive them. I found myself on tenterhooks all through the replacement process, hoping very much that my bank wouldn't decide it needed to authenticate my credit card usage through either its smartphone app or a text message (and I was lucky that I could authenticate some things through another device). Had I been without a smartphone for a more extended time, I could see a number of things where I'd probably have had to make in-person visits to a bank branch.

(Another obvious thing I knew about is that my bike computer wants to talk to a smartphone app (also). At a different time of year this would have been a real issue, but fortunately my bike club's recreational riding season is over so all it did was delay me uploading one commute ride.)

In less obvious things, I use my smartphone as my alarm clock. With my smartphone unavailable I discovered that I had no good alternative (although I had some not so good ones that are too quiet). I've also become used to using my phone for a quick check of the weather on the way out the door, and to check the arrival time of TTC buses, neither of which were available. Nor could I check email (or text messages) on the way to pick up my new phone because with no smartphone I had no data coverage. I was lucky enough to have another wifi-enabled device available that I took with me, which turned out to be critical for the pickup process.

(It also felt weird and wrong to walk out of the door without the weight of my phone in my pocket, as if I was forgetting my keys or something equally important. And there were times on the trip to get the replacement phone when I found myself realizing that if I'd had an operating smartphone, I'd have taken it out for a quick look at this or that or whatever.)

On the level of mere inconveniences, over time I've gotten pulled into using my smartphone's payment setup for things like grocery purchases. I could still do that in several other ways even without a smartphone, but none of them would have been as nice an experience. There would also have been paper cuts in things like checking the balance on my public transit fare card and topping it up.

Having gone through this experience with my smartphone, I'm now wondering what other bits of technology have quietly infiltrated both my personal life and things at work without me noticing their actual importance. I suspect that there are some more and I'll only realize it when they break.

PS: The smartphone I had to replace is the same one I got back in late 2016, so I got a bit over nine years of usage out of it. This is pretty good by smartphone standards (although for the past few years I was carefully ignoring that it had questionable support for security bugs; there were some updates, but also some known issues that weren't being fixed).

Do you care about (all) HTTP requests from cloud provider IP address space?

By: cks

About a month ago Mike Hoye wrote Raised Shields, in which Hoye said, about defending small websites from crawler abuse in this day and age:

If you only care about humans I strongly advise you to block every cloudhost subnet you can find, pretty easy given the effort they put into finding you. Most of the worst actors out there are living comfortably on Azure, GCP, Yandex and sometimes Huawei’s servers.

(As usual, there's no point in complaining about abusive crawlers to the cloud providers.)

I've said something similar on the Fediverse:

Today's idle thought: how many small web servers actually have any reason to accept requests from AWS or Google Cloud IP address space? If you search through your logs with (eg) grepcidr, you may find that there's little or nothing of value coming from there, and they sure are popular with LLM crawlers these days.

You definitely want to search your logs before doing this, and you may find that you want to make some exceptions even if you do opt for it. For example, you might want or need to let cloud-hosted things fetch your syndication feeds, because there are a fair number of people and feed readers that do their fetching from the cloud. Possibly you'll find that you have a significant number of real visitors who are using do-it-yourself personal VPN setups that have cloud exit points.
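If you don't have grepcidr handy, a rough equivalent check is only a few lines of script. Here's a minimal sketch in Python; the CIDR ranges are placeholders (you'd substitute the provider's actual published ranges), and it assumes the common log formats where the client IP is the first field on each line:

import ipaddress
import sys

# Placeholder networks; substitute the real published CIDRs you care about.
CLOUD_NETS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]

def from_cloud(ip_text):
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return any(addr in net for net in CLOUD_NETS)

hits = 0
for line in sys.stdin:
    # Common/combined log format: the client IP is the first field.
    if from_cloud(line.split(" ", 1)[0]):
        hits += 1
        sys.stdout.write(line)
print(hits, "requests came from the listed ranges", file=sys.stderr)

(You'd run this as, say, 'python3 cloudcheck.py < access_log' and then look at what, if anything, you'd actually miss by blocking those ranges.)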

(How many exceptions you want to make may depend on how much of a hard line you want to take. I suspect that Mike Hoye's line is much harder than mine.)

However, I think that for a lot of small, personal web servers and web sites you'll find that almost nothing of genuine value comes from the big cloud provider networks, from AWS, Google Cloud, Azure, Oracle, and so on. You're probably not getting real visitors from these clouds, people who are interested in reading your work and engaging with it. Instead you'll most likely see an ever-growing horde of obvious crawlers, increasingly suspicious user agents, claims to be things that they aren't, and so on.

On the one hand, it's in some sense morally pure to not block these cloud areas unless they're causing your site active harm; it's certainly what the ethos was on the older Internet, and it was a good and useful ethos for those times. On the other hand, that view is part of what got us here. More and more, these days are the days of Raised Shields, as we react to the new environment (much as email had to react to the new environment of ever increasing spam).

If you're doing this, one useful trick you can play if you have the right web server environment is to do your blocking with HTTP 429 Too Many Requests responses. Using this HTTP code is in some sense inaccurate, but it has the useful effect that very few things will take it as a permanent error the way they may take, for example, HTTP 403 (or HTTP 404). This gives you a chance to monitor your web server logs and add a suitable exemption for traffic that you turn out to want after all, without your error responses doing anything permanent (like potentially removing your pages from search engine indexes). You can also arrange to serve up a custom error page for this case, with an explanation or a link to an explanation.
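As a rough illustration of the idea (a sketch only, not how any real web server or Wandering Thoughts does it), here is what this can look like as a little WSGI middleware in Python; the networks and the explanation URL are made-up placeholders:

import ipaddress

# Placeholder networks and explanation URL, purely for illustration.
BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
EXPLANATION = b"429: blocked address range; see https://example.org/why-blocked\n"

def block_cloud_ranges(app):
    def middleware(environ, start_response):
        addr = ipaddress.ip_address(environ.get("REMOTE_ADDR", "0.0.0.0"))
        if any(addr in net for net in BLOCKED_NETS):
            start_response("429 Too Many Requests",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(EXPLANATION)))])
            return [EXPLANATION]
        return app(environ, start_response)
    return middleware

Real web servers have their own configuration mechanisms for this; the point is only that the 429 response and the pointer to an explanation can easily live together.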

(My view is that serving a 400-series HTTP error response is better than a HTTP 302 temporary redirect to your explanation, for various reasons. Possibly there are clever things you can do with error pages in general.)

We can't fund our way out of the free and open source maintenance problem

By: cks

It's in the tech news a lot these days that there are 'problems' with free and open source maintenance. I put 'problems' in quotes because the issue is mostly that FOSS maintenance isn't happening as fast or as much as the people who've come to depend on it would like, and the people who maintain FOSS are increasingly saying 'no' when corporations turn up (cf, also). But even with all the corporate presence, there are still a reasonable number of people who use non-corporate FOSS operating systems like Debian Linux, FreeBSD, and so on, and they too suffer when parts of the FOSS software stack struggle with maintenance. Every so often, people will suggest that the problem would be solved if only corporations would properly fund this maintenance work. However, I don't believe this can actually work even in a world where corporations are willing to properly fund such things (in this world, they're very clearly not).

One big problem with 'funding' as a solution to the FOSS maintenance problems is that for many FOSS maintainers, there isn't enough work available to support them. Many FOSS people write and support only a small number of things that don't necessarily need much active development and bug fixing (people have done studies on this), and so can't feasibly provide full-time employment (especially at something equivalent to a competitive salary). Certainly, there are plenty of large projects that are underfunded and could support one or more people working on them full time, but there's also a long tail of smaller, less obvious dependencies whose maintenance also matters.

(In a way, the lack of funding pushes people toward small projects. With no funding, you have to do your projects in your spare time and the easiest way to make that work is to choose some small area or modest project that simply doesn't need that much time to develop or maintain.)

There are models where people who work on FOSS can be funded to do a bit of work on a lot of projects. But that's not the same as having funding to work full time on your own little project (or set of little projects). It's much more like regular work, in that you're being paid to do development work on other people's stuff (and I suspect that it will be much more time consuming than one might expect, since anyone doing this will have to come up to speed on a whole bunch of projects).

(I'm assuming the FOSS funding equivalent of a perfectly spherical frictionless object from physics examples, so we can wave away all other issues except that there is not enough work on individual projects. In the real world there are a huge host of additional problems with funding people for FOSS work that create significant extra friction (eg, potential liabilities).)

PS: Even though we can't solve the whole problem with funding, companies absolutely should be trying to use funding to solve as much of it as possible. That they manifestly aren't is one of many things that is probably going to bring everything down as pressure builds to do something.

(I'm sure I'm far from the first person to write about this issue with funding FOSS work. I just feel like writing it down myself, partly as elaboration on some parts of past Fediverse posts.)

Sidebar: It's full time work that matters

If someone is already working a regular full time job, their spare time is a limited resource and there are many claims on it. For various reasons, not everyone will take money to spend (potentially) most of their spare time maintaining their FOSS work. Many people will only be willing to spend a limited amount of their spare time on FOSS stuff, even if you could fund them at reasonable rates for all of their spare time. The only way to really get 'enough' time is to fund people to work full time, so their FOSS work replaces their regular full time job.

One of the reasons I suspect some people won't take money for their extra time is that they already have one job and they don't want to effectively get a second one. They do FOSS work deliberately because it's a break from 'job' style work.

(This points to another, bigger issue; there are plenty of people doing all sorts of hobbies, such as photography, who have no desire to 'go pro' in their hobby no matter how avid and good they are. I suspect there are people writing and maintaining important FOSS software who similarly have no desire to 'go pro' with their software maintenance.)

Duplicate metric labels and group_*() operations in Prometheus

By: cks

Suppose that you have an internal master DNS server and a backup for that master server. The two servers are theoretically fed from the same data and so should have the same DNS zone contents, and especially they should have the same DNS zone SOAs for all zones in both of their internal and external views. They both run Bind and you use the Bind exporter, which provides the SOA values for every zone Bind is configured to be a primary or a secondary for. So you can write an alert with an expression like this:

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    bind_zone_serial{host="primary"}

This is a perfectly good alert (well, alert rule), but it has lost all of the additional labels you might want in your alert. Especially, it has lost both host names. You could hard-code the host name in your message about the alert, but it would be nice to do better and propagate your standard labels into the alert. To do this you want to use one of group_left() and group_right(), but which one you want depends on where you want the labels to come from.

(Normally you have to choose between the two depending on which side has multiple matches, but in this case we have a one-to-one matching.)

For labels that are duplicated between both sides, the group_*() operators pick which side's labels you get, but backwards from their names. If you use group_right(), the duplicate label values come from the left; if you use group_left(), the duplicate label values come from the right. Here, we might change the backup host's name but we're probably not going to change the primary host's name, so we likely want to preserve the 'host' label from the left side and thus we use group_right():

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    group_right (job,host,instance)
      bind_zone_serial{host="primary"}

One reason this little peculiarity is on my mind at the moment is that Cloudflare's excellent pint Prometheus rule linter recently picked up a new 'redundant label' lint rule that complains about this for custom labels such as 'host':

Query is trying to join the 'host' label that is already present on the other side of the query.

(It doesn't complain about job or instance, presumably because it understands why you might do this for those labels. As the pint message will tell you, to silence this you need to disable 'promql/impossible' for this rule.)

When I first saw pint's warning I didn't think about it and removed the 'host' label from the group_right(), but fortunately I actually tested what the result would be and saw that I was now getting the wrong host name.

(This is different from pulling in labels from other metrics, where the labels aren't duplicated.)

PS: I clearly knew this at some point, when I wrote the original alert rule, but then I forgot it by the time I was looking at pint's warning message. PromQL is the kind of complex thing where the details can fall out of my mind if I don't use it often enough, which I don't these days since our alert rules are relatively stable.

BSD PF versus Linux nftables for firewalls for us

By: cks

One of the reactions I saw to our move from OpenBSD to FreeBSD for firewalls was to wonder why we weren't moving all the way to nftables based Linux firewalls. It's true that this would reduce the number of different Unixes we have to operate and probably get us more or less state of the art 10G network performance. However, I have some negative views on the choice of PF versus nftables, both in our specific situation and in general.

(I've written about this before but it was in the implicit context of Linux iptables.)

In our specific situation:

  • We have a lot of existing, relatively complex PF firewall rules; for example, our perimeter firewall has over 400 non-comment lines of rules, definitions, and so on. Translating these from OpenBSD PF to FreeBSD PF is easy, if it's necessary at all. Translating everything to nftables is a lot more work, and as far as I know there's no translation tool, especially not one that we could really trust. We'd probably have to basically rebuild each firewall from the ground up, which is both a lot of work and a high-stakes thing. We'd have to be extremely convinced that we had to do this in order to undertake it.

  • We have a lot of well developed tooling around operating, monitoring, and gathering metrics from PF-based firewalls, most of it locally created. Much or all of this tooling ports straight over from OpenBSD to FreeBSD, while we have no equivalent tooling for nftables and would have to develop (or find) equivalents.

  • We already know PF and almost all of that knowledge transfers over from OpenBSD PF to FreeBSD PF (and more will transfer with FreeBSD 15, which has some PF and PF syntax updates from modern OpenBSD).

In general (much of which also applies to our specific situation):

  • There are a number of important PF features that nftables at best has in incomplete, awkward versions. For example, nftables' version of pflog is awkward and half-baked compared to the real thing (also). While you may be able to put together some nftables based rough equivalent of BSD pfsync, casual reading suggests that it's a lot more involved and complex (and maybe less integrated with nftables).

  • The BSD PF firewall system is straightforward and easy to understand and predict. The Linux firewall system is much more complex and harder to understand, and this complexity bleeds through into nftables configuration, where you need to know chains and tables and so on. Much of this Linux complexity is not documented in ways that are particularly accessible.

  • Nftables documentation is opaque compared to the BSD pf.conf manual page (also). Partly this is because there is no 'nftables.conf' manual page; instead, your entry point is the nft manual page, which documents both the command line tool and the format of nftables rules. I find that these are two tastes that don't go well together.

    (This is somewhat forced by the nftables decision to retain compatibility with adding and removing rules on the fly. PF doesn't give you a choice, you load your entire ruleset from a file.)

  • nftables is already the third firewall rule format and system that the Linux kernel has had over the time that I've been writing Linux firewall rules (ipchains, iptables, nftables). I have no confidence that there won't be a fourth before too long. PF has been quite stable by comparison.

What I mostly care about is what I have to write and read to get the IP filtering and firewall setup that we want (and then understand it later), not how it gets compiled down and represented in the kernel (this has come up before). Assuming that the nftables backend is capable enough and the result performs sufficiently well, I'd be reasonably happy with a PF like syntax (and semantics) on top of kernel nftables (although we'd still have things like the pflog and pfsync issues).

Can I get things done in nftables? Certainly, nftables is relatively inoffensive. Do I want to write nftables rules? No, not really, no more than I want to write iptables rules. I do write nftables and iptables rules when I need to do firewall and IP filtering things on a Linux machine, but for a dedicated machine for this purpose I'd rather use a PF-based environment (which is now FreeBSD).

As far as I can tell, the state of Linux IP filtering documentation is partly a result of the fact that Linux doesn't have a unified IP filtering system and environment the way that OpenBSD does and FreeBSD mostly does (or at least successfully appears to so far). When the IP filtering system is multiple more or less separate pieces and subsystems, you naturally tend to get documentation that looks at each piece in isolation and assumes you already know all of the rest.

(Let's also acknowledge that writing good documentation for a complex system is hard, and the Linux IP filtering system has evolved to be very complex.)

PS: There's no real comparison between PF and the older iptables system; PF is clearly far more high level than anything you can reasonably do in iptables, which by comparison is basically an IP filtering assembly language. I'm willing to tentatively assume that nftables can be used in a higher level way than iptables can (I haven't used it enough to have a well informed view either way); if it can't, then there's again no real comparison between PF and nftables.

Making Polkit authenticate people like su does (with group wheel)

By: cks

Polkit is how a lot of things on modern Linux systems decide whether or not to let people do privileged operations, including systemd's run0, which effectively functions as another su or sudo. Polkit normally has a significantly different authentication model than su or sudo, where an arbitrary login can authenticate for privileged operations by giving the password of any 'administrator' account (accounts in group wheel or group admin, depending on your Linux distribution).

Suppose, not hypothetically, that you want a su like model in Polkit, one where people in group 'wheel' can authenticate by providing the root password, while people not in group 'wheel' cannot authenticate for privileged operations at all. In my earlier entry on learning about Polkit and adjusting it I put forward an untested Polkit stanza to do this. Now I've tested it and I can provide an actual working version.

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // must exist but have a locked password
        return ["unix-user:nobody"];
    }
});

(This goes in /etc/polkit-1/rules.d/50-default.rules, and the filename is important because it has to replace the standard version in /usr/share/polkit-1/rules.d.)

This doesn't quite work the way 'su' does, where it will just refuse to work for people not in group wheel. Instead, if you're not in group wheel you'll be prompted for the password of 'nobody' (or whatever other login you're using), which you can never successfully supply because the password is locked.

As I've experimentally determined, it doesn't work to return an empty list ('[]'), or a Unix group that doesn't exist ('unix-group:nosuchgroup'), or a Unix group that exists but has no members. In all cases my Fedora 42 system falls back to asking for the root password, which I assume is a built-in default for privileged authentication. Instead you apparently have to return something that Polkit thinks it can plausibly use to authenticate the person, even if that authentication can't succeed. Hopefully Polkit will never get smart enough to work that out and stop accepting accounts with locked passwords.

(If you want to be friendly and you expect people on your servers to run into this a lot, you should probably create a login with a more useful name and GECOS field, perhaps 'not-allowed' and 'You cannot authenticate for this operation', that has a locked password. People may or may not realize what's going on, but at least they have a chance.)

PS: This is with the Fedora 42 version of Polkit, which is version 126. This appears to be the most recent version from the upstream project.

Sidebar: Disabling Polkit entirely

Initially I assumed that Polkit had explicit rules somewhere that authorized the 'root' user. However, as far as I can tell this isn't true; there's no normal rules that specifically authorize root or any other UID 0 login name, and despite that root can perform actions that are restricted to groups that root isn't in. I believe this means that you can explicitly disable all discretionary Polkit authorization with an '00-disable.rules' file that contains:

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Based on experimentation, this disables absolutely everything, even actions that are considered generally harmless (like libvirt's 'virsh list', which I think normally anyone can do).

A slightly more friendly version can be had by creating a situation where there are no allowed administrative users. I think this would be done with a 50-default.rules file that contained:

polkit.addAdminRule(function(action, subject) {
    // must exist but have a locked password
    return ["unix-user:nobody"];
});

You'd also want to make sure that nobody is in any special groups that rules in /usr/share/polkit-1/rules.d use to allow automatic access. You can look for these by grep'ing for 'isInGroup'.

The (early) good and bad parts of Polkit for a system administrator

By: cks

At a high level, Polkit is how a lot of things on modern Linux systems decide whether or not to let you do privileged operations. After looking into it a bit, I've wound up feeling that Polkit has both good and bad aspects from the perspective of a system administrator (especially a system administrator with multi-user Linux systems, where most of the people using them aren't supposed to have any special privileges). While I've used (desktop) Linuxes with Polkit for a while and relied on it for a certain amount of what I was doing, I've done so blindly, effectively as a normal person. This is the first time I've looked at the details of Polkit, which is why I'm calling these my early reactions.

On the good side, Polkit is a single source of authorization decisions, much like PAM. On a modern Linux system, there are a steadily increasing number of programs that do privileged things, even on servers (such as systemd's run0). These could all have their own bespoke custom authorization systems, much as how sudo has its own custom one, but instead most of them have centralized on Polkit. In theory Polkit gives you a single thing to look at and a single thing to learn, rather than learning systemd's authentication system, NetworkManager's authentication system, etc. It also means that programs have less of a temptation to hard-code (some of) their authentication rules, because Polkit is very flexible.

(In many cases programs couldn't feasibly use PAM instead, because they want certain actions to be automatically authorized. For example, in its standard configuration libvirt wants everyone in group 'libvirt' to be able to issue libvirt VM management commands without constantly having to authenticate. PAM could probably be extended to do this but it would start to get complicated, partly because PAM configuration files aren't a programming language and so implementing logic in PAM gets awkward in a hurry.)

On the bad side, Polkit is a non-declarative authorization system, and a complex one with its rules not in any single place (instead they're distributed through multiple files in two different formats). Authorization decisions are normally made in (JavaScript) code, which means that they can encode essentially arbitrary logic (although there are standard forms of things). This means that the only way to know who is authorized to do a particular thing is to read its XML 'action' file and then look through all of the JavaScript code to find and then understand things that apply to it.

(Even 'who is authorized' is imprecise by default. Polkit normally allows anyone to authenticate as any administrative account, provided that they know its password and possibly other authentication information. This makes the passwords of people in group wheel or group admin very dangerous things, since anyone who can get their hands on one can probably execute any Polkit-protected action.)

This creates a situation where there's no way in Polkit to get a global overview of who is authorized to do what, or what a particular person has authorization for, since this doesn't exist in a declarative form and instead has to be determined on the fly by evaluating code. Instead you have to know what's customary, like the group that's 'administrative' for your Linux distribution (wheel or admin, typically) and what special groups (like 'libvirt') do what, or you have to read and understand all of the JavaScript and XML involved.

In other words, there's no feasible way to audit what Polkit is allowing people to do on your system. You have to trust that programs have made sensible decisions in their Polkit configuration (ones that you agree with), or run the risk of system malfunctions by turning everything off (or allowing only root to be authorized to do things).

(Not even Polkit itself can give you visibility into why a decision was made or fully predict it in advance, because the JavaScript rules have no pre-filtering to narrow down what they apply to. The only way you find out what a rule really does is invoking it. Well, invoking the function that the addRule() or addAdminRule() added to the rule stack.)

This complexity (and the resulting opacity of authorization) is probably intrinsic to Polkit's goals. I even think they made the right decision by having you write logic in JavaScript rather than try to create their own language for it. However, I do wish Polkit had a declarative subset that could express all of the simple cases, reserving JavaScript rules only for complex ones. I think this would make the overall system much easier for system administrators to understand and analyze, so that we'd have a much better idea of (and much better control over) who was authorized for what.

Brief notes on learning and adjusting Polkit on modern Linuxes

By: cks

Polkit (also, also) is a multi-faceted user level thing used to control access to privileged operations. It's probably used by various D-Bus services on your system, which you can more or less get a list of with pkaction, and there's a pkexec program that's like su and sudo. There are two reasons that you might care about Polkit on your system. First, there might be tools you want to use that use Polkit, such as systemd's run0 (which is developing some interesting options). The other is that Polkit gives people an alternate way to get access to root or other privileges on your servers and you may have opinions about that and what authentication should be required.

Unfortunately, Polkit configuration is arcane and as far as I know, there aren't really any readily accessible options for it. For instance, if you want to force people to authenticate for root-level things using the root password instead of their password, as far as I know you're going to have to write some JavaScript yourself to define a suitable Administrator identity rule. The polkit manual page seems to document what you can put in the code reasonably well, but I'm not sure how you test your new rules and some areas seem underdocumented (for example, it's not clear how 'addAdminRule()' can be used to say that the current user cannot authenticate as an administrative user at all).

(If and when I wind up needing to test rules, I will probably try to do it in a scratch virtual machine that I can blow up. Fortunately Polkit is never likely to be my only way to authenticate things.)

Polkit also has some paper cuts in its current setup. For example, as far as I can see there's no easy way to tell Polkit-using programs that you want to immediately authenticate for administrative access as yourself, rather than being offered a menu of people in group wheel (yourself included) and having to pick yourself. It's also not clear to me (and I lack a test system) whether the default setup blocks people who aren't in group wheel (or group admin, depending on your Linux distribution flavour) from administrative authentication, or whether they instead get to pick an administrative user and authenticate with that person's password. I suspect it's the latter.

(All of this makes Polkit seem like it's not really built for multi-user Linux systems, or at least multi-user systems where not everyone is an administrator.)

PS: Now that I've looked at it, I have some issues with Polkit from the perspective of a system administrator, but those are going to be for another entry.

Sidebar: Some options for Polkit (root) authentication

If you want everyone to authenticate as root for administrative actions, I think what you want is:

polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});

If you want to restrict this to people in group wheel, I think you want something like:

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // might not work to say 'no'?
        return [];
    }
});

If you want people in group wheel to authenticate as themselves, not root, I think you return 'unix-user:' + subject.user instead of 'unix-user:0'. I don't know if people still get prompted by Polkit to pick a user if there's only one possible user.

You can't (easily) ignore errors in Python

By: cks

Yesterday I wrote about how there's always going to be a way to not write code for error handling. When I wrote that entry I deliberately didn't phrase it as 'ignoring errors', because in some languages it's either not possible to do that or at least very difficult, and one of them is Python.

As every Python programmer knows, errors raise exceptions in Python and you can catch those exceptions, either narrowly or (very) broadly (possibly by accident). If you don't handle an exception, it bubbles up and terminates your program (which is nice if that's what you want and does mean that errors can't be casually ignored). On the surface it seems like you can ignore errors by simply surrounding all of your code with a try:/except: block that catches everything. But if you do this, you're not ignoring errors in the same way as you do in a language where errors are return values. In a language where you can genuinely ignore errors, all of your code keeps on running when errors happen. But in Python, if you put a broad try block around your code, your code stops executing at the first exception that gets raised, rather than continuing on to the other code within the try block.

(If there's further code outside the try block, it will run but probably not work very well because there will likely be a lot that simply didn't happen inside the try block. Your code skipped right from the statement that raised the exception to the first statement outside the try block.)

To get the C or Go like experience that your program keeps running its code even after an exception, you need to effectively catch and ignore exceptions separately for each statement. You can write this out by hand, putting each statement in its own try: block, but you'll probably get tired of this very fast, the result will be hard to read, and it's extremely obviously not like regular Python. This is the sign that Python doesn't really let you ignore errors in any easy way. All Python lets you do easily is suppress messages about errors and potentially make them not terminate your program. The closer you want to get to actually ignoring all errors, the more work you'll have to do.
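To make the difference concrete, here's a small hand-written illustration of the 'keep going no matter what' style in Python; every step gets its own try block and a fallback value, which is roughly what silently discarding error returns gives you in C or Go (the file name here is just an example):

# Each statement gets its own try block so that a failure in one
# step doesn't skip the rest; this is the tedious hand-written version.
try:
    fp = open("settings.conf")
except OSError:
    fp = None

try:
    data = fp.read() if fp is not None else ""
except OSError:
    data = ""

try:
    print("read", len(data), "bytes of settings")
except Exception:
    pass

Three statements and it's already hard to read, which is the point.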

(There are probably clever things you can do with Python debugging hooks since I believe that Python debuggers can intercept exceptions, although I'm not sure if they can resume execution after unhandled ones. But this is not going to really be easy.)

There's always going to be a way to not code error handling

By: cks

Over on the Fediverse, I said something:

My hot take on Rust .unwrap(): no matter what you do, people want convenient shortcut ways of not explicitly handling errors in programming languages. And then people will use them in what turn out to be inappropriate places, because people aren't always right and sometimes make mistakes.

Every popular programming language lets your code not handle errors in some way, taking an optimistic approach. If you're lucky, your program notices at runtime when there actually is an error.

The subtext for this is that Cloudflare had a global outage where one contributing factor was using Rust's .unwrap(), which will panic your program if an error actually happens.

Every popular programming language has something like this. In Python you can ignore the possibility of exceptions, in C and Go you can ignore or explicitly discard error returns, in Java you can catch and ignore all exceptions, and so on. What varies from language to language is what the consequences are. In Python and Rust, your program dies (with an uncaught exception or a panic, respectively). In Go, your program either sails on making an increasingly big mess or panics (for example, if another return value is nil when there's an error and you try to do something with it that requires a non-nil value).

(Some languages let you have it either way. The default state of the Bourne shell is to sail onward in the face of failures, but you can change that with 'set -e' (mostly) and even get good error reports sometimes.)

These features don't exist because language designers are idiots (especially since error handling isn't a solved problem). They ultimately exist because people want a way not so much to ignore errors as to not write code to 'handle' them. These people don't expect errors; they think that in practice errors will either be extremely infrequent or not happen at all, and they don't want to write code to deal with them anyway (if they're forced to write code that does something, often their choice will be to end the program).

You could probably create a programming language that didn't allow you to do this (possibly Haskell and other monad-using functional languages are close to it). I suspect it would be unpopular. If it wasn't unpopular, I suspect people would write their own functions or whatever to ignore the possibility of errors (either with or without ending the program if an error actually happens). People want to not have to write error handling, and they'll make it happen one way or another.

(Then, as I mentioned, some of the time they'll turn out to be wrong about errors not happening.)

Automatically scrubbing ZFS pools periodically on FreeBSD

By: cks

We've been moving from OpenBSD to FreeBSD for firewalls. One advantage of this is that it gives us a mirrored ZFS pool for the machine's filesystems; we have a lot of experience operating ZFS and it's a simple, reliable, and fully supported way of getting mirrored system disks on important machines. ZFS has checksums, and you want to periodically 'scrub' your ZFS pools (ideally relatively frequently) to verify all of your data, in all of its copies, through those checksums. All of this is part of basic ZFS knowledge, so I was a little bit surprised to discover that none of our FreeBSD machines had ever scrubbed their root pools, despite some of them having been running for months.

It turns out that while FreeBSD comes with a configuration option to do periodic ZFS scrubs, the option isn't enabled by default (as of FreeBSD 14.3). Instead you have to know to enable it, which admittedly isn't too hard to find once you start looking.

FreeBSD has a general periodic(8) system for triggering things on a daily, weekly, monthly, or other basis. As covered in the manual page, the default configuration for this is in /etc/defaults/periodic.conf and you can override things by creating or modifying /etc/periodic.conf. ZFS scrubs are a 'daily' periodic setting, and as of 14.3 the basic thing you want is an /etc/periodic.conf with:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"

FreeBSD will normally scrub each pool a certain number of days after its previous scrub (either a manual scrub or an automatic scrub through the periodic system). The default number of days is 35, which is a bit high for my tastes, so I suggest that you shorten it, making your periodic.conf stanza be:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="14"

There are other options you can set that are covered in /etc/defaults/periodic.conf.

(That the daily automatic scrubs happen some number of days after the pool was last scrubbed means that you can adjust their timing by doing a manual scrub. If you have a bunch of machines that you set up at the same time, you can get them to space out their scrubs by scrubbing one a day by hand, and so on.)

Looking at the other ZFS periodic options, I might also enable the daily ZFS status report, because I'm not certain if there's anything else that will alert you if or when ZFS starts reporting errors:

# Find out about ZFS errors?
daily_status_zfs_enable="YES"

You can also tell ZFS to TRIM your SSDs every day. As far as I can see there's no option to do the TRIM less often than once a day; I guess if you want that you have to create your own weekly or monthly periodic script (perhaps by copying the 801.trim-zfs daily script and modifying it appropriately). Or you can just do 'zpool trim ...' every so often by hand.

We're (now) moving from OpenBSD to FreeBSD for firewalls

By: cks

A bit over a year ago I wrote about why we'd become interested in FreeBSD; to summarize, FreeBSD appeared promising as a better, easier to manage host operating system for PF-based things. Since then we've done enough with FreeBSD to have decided that we actively prefer it to OpenBSD. It's been relatively straightforward to convert our firewall OpenBSD PF rulesets to FreeBSD PF and the resulting firewalls have clearly better performance on our 10G network than our older OpenBSD ones did (with less tuning).

(It's possible that the very latest OpenBSD has significantly improved bridging and routing firewall performance so that it no longer requires the fastest single-core CPU performance you can get to go decently. But pragmatically it's too late; FreeBSD had that performance earlier and we now have more confidence in FreeBSD's performance in the firewall role than OpenBSD's.)

There are some nice things about FreeBSD, like root on ZFS, and broadly I feel that it's more friendly than OpenBSD. But those are secondary to its firewall network performance (and PF compatibility); if its network performance was no better than OpenBSD (or worse), we wouldn't be interested. Since it is better, it's now displacing OpenBSD for our firewalls and our latest VPN servers. We've stopped building new OpenBSD machines, so as firewalls come up for replacement they get rebuilt as FreeBSD machines.

(We have a couple of non-firewall OpenBSD machines that will likely turn into Ubuntu machines when we replace them, although we can't be sure until it actually happens.)

Would we consider going back to OpenBSD? Maybe, but probably not. Now that we've migrated a significant number of firewalls, moving the remaining ones to FreeBSD is the easiest approach, even if new OpenBSD firewalls would equal their performance. And the FreeBSD 10G firewall performance we're getting is sufficiently good that it leaves OpenBSD relatively little ground to exceed it.

(There are some things about FreeBSD that we're not entirely enthused about. We're going to be doing more firewall upgrades than we used to with OpenBSD, for one.)

PS: As before, I don't think there's anything wrong with OpenBSD if it meets your needs. We used it happily for years until we started being less happy with its performance on 10G Ethernet. A lot of people don't have that issue.

A surprise with how '#!' handles its program argument in practice

By: cks

Every so often I get to be surprised about some Unix thing. Today's surprise is the actual behavior of '#!' in practice on at least Linux, FreeBSD, and OpenBSD, which I learned about from a comment by Aristotle Pagaltzis on my entry on (not) using '#!/usr/bin/env'. I'll quote the starting part here:

In fact the shebang line doesn’t require absolute paths, you can use relative paths too. The path is simply resolved from your current directory, just as any other path would be – the kernel simply doesn’t do anything special for shebang line paths at all. [...]

I found this so surprising that I tested it on our Linux servers as well as a FreeBSD and an OpenBSD machine. On the Linux servers (and probably on the others too), the kernel really does accept the full collection of relative paths in '#!'. You can write '#!python3', '#!bin/python3', '#!../python3', '#!../../../usr/bin/python3', and so on, and provided that your current directory is in the right place in the filesystem, they all worked.

(On FreeBSD and OpenBSD I only tested the '#!python3' case.)
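If you want to see this for yourself, here's a small self-contained demonstration; it's a sketch that assumes a Unix where python3 lives in /usr/bin and where your temporary directory allows executing files:

import os
import subprocess
import tempfile

# Write a script whose '#!' line is a bare relative path.
fd, path = tempfile.mkstemp()
os.write(fd, b"#!python3\nprint('relative #! path resolved')\n")
os.close(fd)
os.chmod(path, 0o755)

# Works: the kernel resolves 'python3' relative to the current directory.
subprocess.run([path], cwd="/usr/bin")

# Fails: there's no ./python3 in /, so execve() gets ENOENT.
try:
    subprocess.run([path], cwd="/")
except FileNotFoundError:
    print("exec failed when run from /")

os.unlink(path)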

As far as I can tell, this behavior goes all the way back to 4.2 BSD (which isn't quite the origin point of '#!' support in the Unix kernel but is about as close as we can get). The execve() kernel implementation in sys/kern_exec.c finds the program from your '#!' line with a namei() call that uses the same arguments (apart from the name) as it did to find the initial executable, and that initial executable can definitely be a relative path.

Although this is probably the easiest way to implement '#!' inside the kernel, I'm a little bit surprised that it survived in Linux (in a completely independent implementation) and in OpenBSD (where the security people might have had a double-take at some point). But given Hyrum's Law there are probably people out there who are depending on this behavior so we're now stuck with it.

(In the kernel, you'd have to go at least a little bit out of your way to check that the new path starts with a '/' or use a kernel name lookup function that only resolves absolute paths. Using a general name lookup function that accepts both absolute and relative paths is the simplest approach.)

PS: I don't have access to Illumos based systems, other BSDs (NetBSD, etc), or macOS, but I'd be surprised if they had different behavior. People with access to less mainstream Unixes (including commercial ones like AIX) can give it a try to see if there are any Unixes that don't support relative paths in '#!'.

People are sending HTTP requests with X-Forwarded-For across the Internet

By: cks

Over on the Fediverse, I shared a discovery that came from turning over some rocks here on Wandering Thoughts:

This is my face when some people out there on the Internet send out HTTP requests with X-Forwarded-For headers, and maybe even not maliciously or lying. Take a bow, ZScaler.

The HTTP X-Forwarded-For header is something that I normally expect to see only on something behind a reverse proxy, where the reverse proxy frontend uses it to tell the backend the real originating IP (which the backend otherwise can't see, because the forwarded requests reach it from the proxy rather than from the client). As a corollary of this usage, if you're operating a reverse proxy frontend you want to remove or rename any X-Forwarded-For headers that you receive from the HTTP client, because it may be trying to fool your backend about who it is. You can use another X- header name for this purpose if you want, but using X-Forwarded-For has the advantage that it's a de-facto standard, and so random reverse proxy aware software is likely to have an option to look at X-Forwarded-For.

(See, for example, the security and privacy concerns section of the MDN page.)

Wandering Thoughts doesn't run behind a reverse proxy, and so I assumed that I wouldn't see X-Forwarded-For headers if I looked for them. More exactly, I assumed that I could take the presence of an X-Forwarded-For header as an indication of a bad request. As I found out, this doesn't seem to be the case; one source of apparently legitimate traffic to Wandering Thoughts appears to attach what are probably legitimate X-Forwarded-For headers to requests going through it. I believe this particular place operates partly as a (forward) HTTP proxy; if they aren't making up the X-Forwarded-For IP addresses, they're willing to leak the origin IPs of people using them to third parties.

All of this makes me more curious than usual to know what HTTP headers and header values show up on requests to Wandering Thoughts. But not curious enough to stick in logging, because that would be quite verbose unless I could narrow things down to only some requests. Possibly I should stick in logging that can be quickly turned on and off, so I can dump header information only briefly.
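If I do add that logging, the easy version is something that only dumps headers while some on/off toggle is set. A minimal sketch of the idea in Python (the flag file path and the choice of headers are purely illustrative, not anything DWiki actually has):

import os
import sys

FLAG_FILE = "/tmp/dump-request-headers"   # hypothetical on/off toggle
INTERESTING = ("HTTP_X_FORWARDED_FOR", "HTTP_VIA", "HTTP_USER_AGENT")

def maybe_log_headers(environ):
    # Only log while the flag file exists, so this can be turned on briefly.
    if not os.path.exists(FLAG_FILE):
        return
    found = ["%s=%r" % (name, environ[name])
             for name in INTERESTING if name in environ]
    if found:
        print(environ.get("REMOTE_ADDR", "-"), " ".join(found), file=sys.stderr)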

(These days I've periodically wound up in a mood to hack on DWiki, the underlying engine behind Wandering Thoughts. It reminds me that I enjoy programming.)

We haven't seen ZFS checksum failures for a couple of years

By: cks

Over on the Fediverse I mentioned something about our regular ZFS scrubs:

Another weekend, another set of ZFS scrubs of work's multiple terabytes of data sitting on a collection of consumer 4 TB SSDs (mirrored, we aren't crazy, and also we have backups). As usual there is not a checksum error to be seen. I think it's been years since any came up.

I accept that SSDs decay (we've had some die, of course) and random read errors happen, but our ZFS-based experience across both HDDs and SSDs has been that the rate is really low for us. Probably we're not big enough.

We regularly scrub our pools through automation, currently once every few weeks. Back in 2022 I wrote about us seeing only a few errors since we moved to SSDs in 2018, and then I had the impression that everything had been quiet since then. Hand-checking our records tells me that I'm slightly wrong about this and we had some errors on our fileservers in 2023, but none since then.

  • Starting in January of 2023, one particular SSD began experiencing infrequent read and checksum errors that persisted (off and on) through early March of 2023, when we gave in and replaced it. This was a relatively new 4 TB SSD that had only been in service for a few months at the time.

  • In late March of 2023 we saw a checksum error on a disk that later in the year (in November) experienced some read errors, and then in late February of 2024 had read and write errors. We replaced the disk at that point.

I believe these two SSDs are the only ones that we've replaced since 2022, although I'm not certain and we've gone through a significant amount of SSD shuffling since then for reasons outside the scope of this entry. That shuffling means that I'm not going to try to give any number for what percentage of our fileserver SSDs have had problems.

In the first case, the checksum errors were effectively a lesser form of the read errors we saw at the same time, so it was obvious the SSD had problems. In the second case the checksum error may have been a very early warning sign of what later became an obvious slow SSD failure. Or it could be coincidence.

(It also could be that modern SSDs have so much internal error checking and correction that if there is some sort of data rot or mis-read it's most likely to be noticed inside the SSD and create a read failure at the protocol level (SAS, SATA, NVMe, etc).)

I definitely believe that disk read errors and slow disk failures happen from time to time, and if you have a large enough population of disks (SSDs or HDDs or both) you definitely need to worry about these problems. We get all sorts of benefits from ZFS checksums and ZFS scrubs, and the peace of mind about this is one of them. But it looks like we're not big enough to have run into this across our fileserver population.

(At the moment we have 114 4 TB SSDs in use across our production fileservers.)

OIDC, Identity Providers, and avoiding some obvious security exposures

By: cks

OIDC (and OAuth2) has some frustrating elements that make it harder for programs to support arbitrary identity providers (as discussed in my entry on the problems facing MFA-enabled IMAP in early 2025). However, my view is that these elements exist for good reason, and the ultimate reason is that an OIDC-like environment is by default an obvious security exposure (or several of them). I'm not sure there's any easy way around the entire set of problems that push towards these elements or something quite like them.

Let's imagine a platonically ideal OIDC-like identity provider for clients to use, something that's probably much like the original vision of OpenID. In this version, people (with accounts) can authenticate to the identity provider from all over the Internet, and it will provide them with a signed identity token. The first problem is that we've just asked identity providers to set up an Internet-exposed account and password guessing system. Anyone can show up, try it out, and best of all if it works they don't just get current access to something, they get an identity token.

(Within a trusted network, such as an organization's intranet, this exposed authentication endpoint is less of a concern.)

The second problem is the identity token itself, because the IdP doesn't actually provide the identity token to the person; it provides the token to something that asked for it. One of the uses of that identity token is to present it to other things to demonstrate that you're acting on the person's behalf; for example, your IMAP client presents it to your IMAP server. If what the identity token is valid for is not restricted in some way, a malicious party could get you to 'sign up with your <X> ID' for their website, take the identity token it got from the IdP, and reuse it with your IMAP server.

To avoid issues, this identity token must have a limited scope (and everything that uses identity tokens needs to check that the token is actually for them). This implies that you can't just ask for an identity token in general; you have to ask for it for use with something specific. As a further safety measure, the identity provider doesn't want to give such a scoped token to anything except the thing that's supposed to get it. You (an attacker) should not be able to tell the identity provider 'please create a token for webserver X, and give it to me, not webserver X' (this is part of the restrictions on OIDC redirect URIs).

In OIDC, what deals with many of these risks is client IDs, optionally client secrets, and redirect URIs. Client IDs are used to limit what an identity token can be used for and where it can be sent to (in combination with redirect URIs), and a client secret can be used by something getting a token to prove that it really is the client it claims to be. If you don't have the right information, the OIDC IdP won't even talk to you. However, this means that all of this information has to be given to the client, or at least obtained by the client and stored by it.
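In concrete terms, the 'is this token really for me' check on the receiving side usually means verifying the ID token's signature and its 'aud' (audience) claim against your own client ID. A minimal sketch of that check using the PyJWT library (the audience value, issuer, and key are all made-up examples, not anything from a real deployment):

import jwt  # the PyJWT library

def verify_id_token(token, idp_public_key):
    # Raises an exception if the signature doesn't verify, the 'aud'
    # claim isn't our client ID, or the 'iss' claim isn't our IdP.
    return jwt.decode(
        token,
        idp_public_key,
        algorithms=["RS256"],
        audience="imap-service-client-id",    # our client ID (example)
        issuer="https://idp.example.org",     # our IdP (example)
    )

A token minted for some other client ID fails the audience check and gets rejected, which is exactly the behaviour you want from a scoped token.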

(These days OIDC has a specification for Dynamic Client Registration and can support 'open' dynamic registration of clients, if desired (although it's apparently not widely implemented). But clients do have to register to get the risk-mitigating information for the main IdP endpoint, and I don't know how this is supposed to handle the IMAP situation if the IMAP server wants to verify that the OIDC token it receives was intended for it, since each dynamic client will have a different client ID.)

My script to 'activate' Python virtual environments

By: cks

After I wrote about Python virtual environments and source code trees, I impulsively decided to set up the development tree of our Django application to use a Django venv instead of a 'pip install --user' version of Django. Once I started doing this, I quickly decided that I wanted a general script that would switch me into a venv. This sounds a little bit peculiar if you know Python virtual environments so let me explain.

Activating a Python virtual environment mostly means making sure that its 'bin' directory is first on your $PATH, so that 'python3' and 'pip' and so on come from it. Venvs come with files that can be sourced into common shells in order to do this (with the one for Bourne shells called 'activate'), but for me this has three limits. You have to use the full path to the script, they change your current shell environment instead of giving you a new one that you can just exit to discard this 'activation', and I use a non-standard shell that they don't work in. My 'venv' script is designed to work around all three of those limitations. As a script, it starts a new shell (or runs a command) instead of changing my current shell environment, and I set it up so that it knows my standard place to keep virtual environments (and then I made it so that I can use symbolic links to create 'django' as the name of 'whatever my current Django venv is').

(One of the reasons I want my 'venv' command to default to running a shell for me is that I'm putting the Python LSP server into my Django venvs, so I want to start GNU Emacs from an environment with $PATH set properly to get the right LSP server.)

My initial version only looked for venvs in my standard location for development related venvs. But almost immediately after starting to use it, I found that I wanted to be able to activate pipx venvs too, so I added ~/.local/pipx/venvs to what I really should consider to be a 'venv search path' and formalize into an environment variable with a default value.

I've stuffed a few other features into the venv script. It will print out the full path to the venv if I ask it to (in addition to running a command, which can be just 'true'), or something to set $PATH. I also found I sometimes wanted it to change directory to the root of the venv. Right now I'm still experimenting with how I want to build other scripts on top of this one, so some of this will probably change in time.
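For illustration, here's a rough Python sketch of the core idea; it's not my actual script (which is written for my own shell), and the venv search path directories are just examples:

#!/usr/bin/env python3
# Usage: venv NAME [COMMAND ...]
# Put the named venv's bin directory first on $PATH and run a command
# (or an interactive shell) in that environment.
import os
import sys

# Example search path; adjust to wherever you keep your venvs.
VENV_PATH = [os.path.expanduser(d)
             for d in ("~/lib/venvs", "~/.local/pipx/venvs")]

def find_venv(name):
    for d in VENV_PATH:
        cand = os.path.join(d, name)
        if os.path.isdir(os.path.join(cand, "bin")):
            return cand
    sys.exit("venv not found: " + name)

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: venv NAME [COMMAND ...]")
    venv = find_venv(sys.argv[1])
    cmd = sys.argv[2:] or [os.environ.get("SHELL", "/bin/sh")]
    env = dict(os.environ)
    env["PATH"] = os.path.join(venv, "bin") + os.pathsep + env["PATH"]
    env["VIRTUAL_ENV"] = venv
    os.execvpe(cmd[0], cmd, env)

if __name__ == "__main__":
    main()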

One of my surprises about writing the script is how much nicer it's made working with venvs (or working with things in venvs). There's nothing it does that wasn't possible before, but the script has removed friction (more friction than I realized was there, which is traditional for me).

PS: This feels like a sufficiently obvious idea that I suspect that a lot of people have written 'activate a venv somewhere along a venv search path' scripts. There's unlikely to be anything special about mine, but it works with my specific shell.

Getting feedback as a small web crawler operator

By: cks

Suppose, hypothetically, that you're trying to set up a small web crawler for a good purpose. These days you might be focused on web search for text-focused sites, or small human-written sites, or similar things, and certainly, given the bad things that are happening with the major crawlers, we could use more crawlers like that. As a small crawler operator, you might want to get feedback and problem reports from web site operators about what your crawler is doing (or not doing). As it happens, I have some advice and views on this.

  • Above all, remember that you are not Google or even Bing. Web site operators need Google to crawl them, and they have no choice but to bend over backward for Google and to send out plaintive signals into the void if Googlebot is doing something undesirable. Since you're not Google and you need websites much more than they need you, the simplest thing for website operators to do with and about your crawler is to ignore the issue, potentially block you if you're causing problems, and move on.

    You cannot expect people to routinely reach out to you. Anyone who does reach out to you is axiomatically doing you a favour, at the expense of some amount of their limited time and at some risk to themselves.

  • Website operators have no reason to trust you or trust that problem reports will be well received. This is a lesson plenty of people have painfully learned from reporting spam (email or otherwise) and other abuse; a lot of the time your reports can wind up in the hands of people who aren't well intentioned toward you (either going directly to them or 'helpfully' being passed on by the ISP). At best you confirm that your email address is alive and get added to more spam address lists; at worst you get abused in various ways.

    The consequence of this is that if you want to get feedback, you should make it as low-risk as possible for people. The lowest risk way (to website operators) is for you to have a feedback form on your site that doesn't require email or other contact methods. If you require that website operators reveal their email addresses, social media handles, or whatever, you will get much less feedback (this includes VCS forge handles if you force them to make issue reports on some VCS forge).

    (This feedback form should be easy to find, for example being directly linked from the web crawler information URL in your User-Agent.)

  • As far as feedback goes, both your intentions and your views on the reasonableness of what your web crawler is doing (and how someone's website behaves) are irrelevant. What matters is the views of website operators, who are generally doing you a favour by not simply blocking or ignoring your crawler and moving on. If you disagree with their feedback, the best thing to do is be quiet (and maybe say something neutral if they ask for a reply). This is probably most important if your feedback happens through a public VCS forge issue tracker, where future people who are thinking about filing an issue the way you asked may skim over past issues to see how they went.

    (You may or may not ignore website operator feedback that you disagree with depending on how much you want to crawl (all of) their site.)

At the moment, most website operators who notice a previously unknown crawler will likely assume that it's an (abusive) LLM crawler. One way to lower the chances of this is to follow social conventions around crawlers for things like crawler User-Agents and not setting the Referer header. I don't think you have to completely imitate how Googlebot, bingbot, Applebot, the archive.org bot and so on format their User-Agent strings, but it's going to help to generally look like them and clearly put the same sort of information into yours. Similarly, if you can it will help to crawl from clearly identified IPs with reverse DNS. The more that people think you're legitimate and honest, the more likely they are to spend the time and take the risk to give you feedback; the more sketchy or even uncertain you look, the less likely you are to get feedback.
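As a concrete (and entirely made-up) illustration of those User-Agent conventions, this is roughly the shape I mean: a clear bot name, a version, and an information URL, with no attempt to look like a browser. All of the names and URLs here are examples, not a real crawler:

import urllib.request

# An identifiable, non-browser User-Agent with a crawler information URL.
USER_AGENT = "ExampleCrawler/1.0 (+https://crawler.example.org/about)"

req = urllib.request.Request(
    "https://www.example.com/robots.txt",
    headers={"User-Agent": USER_AGENT},   # and no Referer header at all
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, len(resp.read()))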

(In general, any time you make website operators uncertain about an aspect of your web crawler, some number of them will not be charitable in their guess. The more explicit and unambiguous you are in the more places, the better.)
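
For illustration, a conventionally shaped crawler User-Agent, with made-up names and URLs, looks something like:

ExampleCrawler/1.2 (+https://crawler.example.org/about.html; feedback form at https://crawler.example.org/feedback)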

Building and running a web crawler is not an easy thing on today's web. It requires both technical knowledge of various details of HTTP and how you're supposed to react to things (eg), and current social knowledge of what is customary and expected of web crawlers, as well as what you may need to avoid (for example, you may not want to start your User-Agent with 'Mozilla/5.0' any more, and in general the whole anti-crawling area is rapidly changing and evolving right now). Many website operators revisit blocks and other reactions to 'bad' web crawlers only infrequently, so you may only get one chance to get things right. This expertise can't be outsourced to a random web crawling library because many of them don't have it either.

(While this entry was sparked by a conversation I had on the Fediverse, I want to be explicit that it is in no way intended as a subtoot of that conversation. I just realized that I had some general views that didn't fit within the margins of Fediverse posts.)

Firefox's sudden weird font choice and fixing it

By: cks

Today, while I was in the middle of using my normal browser instance, it decided to switch from DejaVu Sans to Noto Sans as my default font:

Dear Firefox: why are you using Noto Sans all of a sudden? I have you set to DejaVu Sans (and DejaVu everything), and fc-match 'sans' and fc-match serif both say they're DejaVu (and give the DejaVu TTF files). This is my angry face.

This is a quite noticeable change for me because it changes the font I see on Wandering Thoughts, my start page, and other things that don't set any sort of explicit font. I don't like how Noto Sans looks and I want DejaVu Sans.

(I found out that it was specifically Noto Sans that Firefox was using all of a sudden through the Web Developer tools 'Font' information, and confirmed that Firefox should still be using DejaVu through the way to see this in Settings.)

After some flailing around, it appears that what I needed to do to fix this was explicitly set about:config's font.name.serif.x-western, font.name.sans-serif.x-western, and font.name.monospace.x-western to specific values instead of leaving them set to nothing; leaving them unset seems to have caused Firefox to arrive at Noto Sans through some mysterious process (since the generic system font name 'sans' was still mapping to DejaVu Sans). I don't know if these are exposed through the Fonts advanced options in Settings → General, which are (still) confusing in general. It's possible that these are what are used for 'Latin'.

(I used to be using the default 'sans', 'serif', and 'monospace' font names that cascaded through to the DejaVu family. Now I've specifically set everything to the DejaVu set, because if something in Fedora or Firefox decides that the default mapping should be different, I don't want Firefox to follow it, I want it to stay with DejaVu.)
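
For reference, the end result looks something like this in about:config (the exact family names depend on what fonts are installed):

font.name.serif.x-western        DejaVu Serif
font.name.sans-serif.x-western   DejaVu Sans
font.name.monospace.x-western    DejaVu Sans Mono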

I don't know why Firefox would suddenly decide these pages are 'western' instead of 'unicode'; all of them are served as or labeled as UTF-8, and nothing about that has changed recently. Unfortunately, as far as I know there's no way to get Firefox to tell you what font.name preference name it used to pick (default) fonts for a HTML document. When it sends HTTP 304 Not Modified responses, Wandering Thoughts doesn't include a Content-Type header (with the UTF-8 character set), but as far as I know that's a standard behavior and browsers presumably cope with it.

(Firefox does see 'Noto Sans' as a system UI font, which it uses on things like HTML form buttons, so it didn't come from nowhere.)

It makes me sad that Firefox continues to have no global default font choice. You can set 'Unicode' but as I've just seen, this doesn't make what you set there the default for unset font preferences, and the only way to find out what unset font preferences you have is to inspect about:config.

PS: For people who aren't aware of this, it's possible for Firefox to forget some of your about:config preferences. Working around this probably requires using Firefox policies (via), which can force-set arbitrary about:config preferences (among other things).

Discovering orphaned binaries in /usr/sbin on Fedora 42

By: cks

Over on the Fediverse, I shared a somewhat unwelcome discovery I made after upgrading to Fedora 42:

This is my face when I have quite a few binaries in /usr/sbin on my office Fedora desktop that aren't owned by any package. Presumably they were once owned by packages, but the packages got removed without the files being removed with them, which isn't supposed to happen.

(My office Fedora install has been around for almost 20 years now without being reinstalled, so things have had time to happen. But some of these binaries date from 2021.)
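
Finding such files is straightforward; a minimal sketch, assuming 'rpm -qf' exits with a non-zero status for files that no package owns:

import glob
import subprocess

# Report /usr/sbin files that rpm says no package owns.
for path in sorted(glob.glob("/usr/sbin/*")):
    res = subprocess.run(["rpm", "-qf", path],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if res.returncode != 0:
        print(path)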

There seem to be two sorts of these lingering, unowned /usr/sbin programs. One sort, such as /usr/sbin/getcaps, seems to have been left behind when its package moved things to /usr/bin, possibly due to this RPM bug (via). The other sort is genuinely unowned programs dating to anywhere from 2007 (at the oldest) to 2021 (at the newest), which have nothing else left of them sitting around. The newest programs are what I believe are wireless management programs: iwconfig, iwevent, iwgetid, iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was also part of a 'wireless-tools' package). I had the wireless-tools package installed on my office desktop until recently, but I removed it some time during Fedora 40, probably sparked by the /sbin to /usr/sbin migration, and it's possible that binaries didn't get cleaned up properly due to that migration.

The most interesting orphan is /usr/sbin/sln, dating from 2018, when apparently various people discovered it as an orphan on their system. Unlike all the other orphan programs, the sln manual page is still shipped as part of the standard 'man-pages' package and so you can read sln(8) online. Based on the manual page, it sounds like it may have been part of glibc at one point.

(Another orphaned program from 2018 is pam_tally, although it's coupled to pam_tally2.so, which did get removed.)

I don't know if there's any good way to get mappings from files to RPM packages for old Fedora versions. If there is, I'd certainly pick through it to try to find where various of these files came from originally. Unfortunately I suspect that for sufficiently old Fedora versions, much of this information is either offline or can't be processed by modern versions of things like dnf.

(The basic information is used by eg 'dnf provides' and can be built by hand from the raw RPMs, but I have no desire to download all of the RPMs for decade-old Fedora versions even if they're still available somewhere. I'm curious but not that curious.)

PS: At the moment I'm inclined to leave everything as it is until at least Fedora 43, since RPM bugs are still being sorted out here. I'll have to clean up genuinely orphaned files at some point but I don't think there's any rush. And I'm not removing any more old packages that use '/sbin/<whatever>', since that seems like it has some bugs.

Python virtual environments and source code trees

By: cks

Python virtual environments are mostly great for actually deploying software. Provided that you're using the same version of Python (3) everywhere (including CPU architecture), you can make a single directory tree (a venv) and then copy and move it around freely as a self-contained artifact. It's also relatively easy to use venvs to switch the version of packages or programs you're using, for example Django. However, venvs have their frictions, at least for me, and often I prefer to do Python development outside of them, especially for our Django web application.

(This means using 'pip install --user' to install things like Django, to the extent that it's still possible.)

One point of friction is in their interaction with working on the source code of our Django web application. As is probably common, this source code lives in its own version control system controlled directory tree (we use Mercurial for this for reasons). If Django is installed as a user package, the native 'python3' will properly see it and be able to import Django modules, so I can directly or indirectly run Django commands with the standard Python and my standard $PATH.

If Django is installed in a venv, I have two options. The manual way is to always make sure that this Django venv is first on my $PATH before the system Python, so that 'python3' is always from the venv and not from the system. This has a little bit of a challenge with Python scripts, and is one of the few places where '#!/usr/bin/env python3' makes sense. In my particular environment it requires extra work because I don't use a standard Unix shell and so I can't use any of the venv bin/activate things to do all the work for me.

The automatic way is to make all of the convenience scripts that I use to interact with Django explicitly specify the venv python3 (including for things like running a test HTTP server and invoking local management commands), which works fine since a program can be outside the venv it uses. This leaves me with the question of where the Django venv should be, and especially if it should be outside the source tree or in a non-VCS-controlled path inside the tree. Outside the source tree is the pure option but leaves me with a naming problem that has various solutions. Inside the source tree (but not VCS controlled) is appealingly simple but puts a big blob of otherwise unrelated data into the source tree.

(Of course I could do both at once by having a 'venv' symlink in the source tree, ignored by Mercurial, that points to wherever the Django venv is today.)

Since 'pip install --user' seems more and more deprecated as time goes by, I should probably move to developing with a Django venv sooner or later. I will probably use a venv outside the source tree, and I haven't decided about an in-tree symlink.
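
A minimal sketch of that sort of setup, with a purely hypothetical out-of-tree location:

import venv

# Create a self-contained venv outside the source tree (the location here is
# a made-up example); with_pip=True so Django can then be installed with the
# venv's own pip.
venv.create("/path/to/django-venv", with_pip=True)

# Convenience scripts outside the venv can then run Django commands by
# invoking the venv's interpreter directly, e.g.
# '/path/to/django-venv/bin/python3 manage.py ...', without ever sourcing
# bin/activate.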

(I'll still have the LSP server problem but I have that today. Probably I'll install the LSP server into the Django venv.)

PS: Since this isn't a new problem, the Python community has probably come up with some best practices for dealing with it. But in today's Internet search environment I have no idea how to find reliable sources.

A HTTP User-Agent that claims to be Googlebot is now a bad idea

By: cks

Once upon a time, people seem to have had a little thing for mentioning Googlebot in their HTTP User-Agent header, much like browsers threw in claims to make them look like Firefox or whatever (the ultimate source of the now-ritual 'Mozilla/5.0' at the start of almost every browser's User-Agent). People might put in 'allow like Googlebot' or just say 'Googlebot' in their User-Agent. Some people are still doing this today, for example:

Gwene/1.0 (The gwene.org rss-to-news gateway) Googlebot

This is now an increasingly bad idea on the web and if you're doing it, you should stop. The problem is that there are various malicious crawlers out there claiming to be Googlebot, and Google publishes their crawler IP address ranges. Anything claiming to be Googlebot that is not from a listed Google IP is extremely suspicious and in this day and age of increasing anti-crawler defenses, blocking all 'Googlebot' activity that isn't from one of their listed IP ranges is an obvious thing to do. Web sites may go even further and immediately taint the IP address or IP address range involved in impersonating Googlebot, blocking or degrading further requests regardless of the User-Agent.

(Gwene is not exactly claiming to be Googlebot but they're trying to get simple Googlebot-recognizers to match them against Googlebot allowances. This is questionable at best. These days such attempts may do more harm than good as they get swept up in precautions against Googlebot forgery, or rules that block Googlebot from things it shouldn't be fetching, like syndication feeds.)

A similar thing applies to bingbot and the User-Agent of any other prominent web search engines, and Bing does publish their IP address ranges. However, I don't think I've ever seen someone impersonate bingbot (which probably doesn't surprise anyone). I don't know if anyone ever impersonates Archive.org (no one has in the past week here), but it's possible that crawler operators will fish to see if people give special allowances to them that can be exploited.

(The corollary of this is that if you have a website, an extremely good signal of bad stuff is someone impersonating Googlebot and maybe you could easily block that. I think this would be fairly easy to do in an Apache <If> clause that then Allow's from Googlebot's listed IP addresses and Denies everything else, but I haven't actually tested it.)
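
The same check is easy to express outside Apache too. Here's a rough Python illustration of the idea; the IP ranges below are examples only, not Google's authoritative list (which they publish as JSON and update over time):

import ipaddress

# Example ranges only, in the shape of Google's published crawler list; the
# real list should be fetched from Google, not hardcoded.
GOOGLEBOT_NETS = [ipaddress.ip_network(n)
                  for n in ("66.249.64.0/27", "66.249.64.64/27")]

def impersonating_googlebot(user_agent, remote_ip):
    # True if the request claims to be Googlebot but comes from an IP that
    # isn't in any of the listed ranges.
    if "Googlebot" not in user_agent:
        return False
    addr = ipaddress.ip_address(remote_ip)
    return not any(addr in net for net in GOOGLEBOT_NETS)

print(impersonating_googlebot("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.7"))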

Containers and giving up on expecting good software installation practices

By: cks

Over on the Fediverse, I mentioned a grump I have about containers:

As a sysadmin, containers irritate me because they amount to abandoning the idea of well done, well organized, well understood, etc installation of software. Can't make your software install in a sensible way that people can control and limit? Throw it into a container, who cares what it sprays where across the filesystem and how much it wants to be the exclusive owner and controller of everything in sight.

(This is a somewhat irrational grump.)

To be specific, it's by and large abandoning the idea of well done installs of software on shared servers. If you're only installing software inside a container, your software can spray itself all over the (container) filesystem, put itself in hard-coded paths wherever it feels like, and so on, even if you have completely automated instructions for how to get it to do that inside a container image that's being built. Some software doesn't do this and is well mannered when installed outside a container, but some software does and you'll find notes to the effect that the only supported way of installing it is 'here is this container image', or 'here is the automated instructions for building a container image'.

To be fair to containers, some of this is due to missing Unix APIs (or APIs that theoretically exist but aren't standardized). Do you want multiple Unix logins for your software so that it can isolate different pieces of itself? There's no automated way to do that. Do you run on specific ports? There's generally no machine-readable way to advertise that, and people may want you to build in mechanisms to vary those ports and then specify the new ports to other pieces of your software (that would all be bundled into a container image). And so on. A container allows you to put yourself in an isolated space of Unix UIDs, network ports, and so on, one where you won't conflict with anyone else and won't have to try to get the people who want to use your software to create and manage the various details (because you've supplied either a pre-built image or reliable image building instructions).

But I don't have to be happy that software doesn't necessarily even try, that we seem to be increasingly abandoning much of the idea of running services in shared environments. Shared environments are convenient. A shared Unix environment gives you a lot of power and avoids a lot of complexity that containers create. Fortunately there's still plenty of software that is willing to be installed on shared systems.

(Then there is the related grump that the modern Linux software distribution model seems to be moving toward container-like things, which has a whole collection of issues associated with it.)

Go's runtime may someday start explicitly freeing some internal memory

By: cks

One of my peculiar hobbies is that I read every commit message for the Go (development) repository. Often this is boring, but sometimes I discover things I find amusing:

This is my amused face when Go is adding explicit, non-GC freeing of memory from within the runtime and compiler-generated code under some circumstances. It's perfectly sensible, but still.

It turns out that right now, the only thing that's been added is a 'GOEXPERIMENT=runtimefree' Go experiment, which you can set without build errors. There's no actual use of it in the current development tree.

The proposal that led to this doesn't seem to currently be visible in a mainline commit in the Go proposal repository, but until it surfaces you can access Directly freeing user memory to reduce GC work from the (proposed?) change (update: see below for the final version), and also Go issue 74299: runtime, cmd/compile: add runtime.free, runtime.freetracked and GOEXPERIMENT=runtimefree and the commit itself, which only adds the Go experiment flag. A preview of performance results (from a link in issue 74299) is in the message of slices: free intermediate memory in Collect via runtime.freeSlice.

(Looking into this has caused me to find the Go Release Dashboard, and see eg the pending proposals section, where you can find multiple things for this proposal.)

Update: The accepted proposal is now merged in the Go proposals repository, Directly freeing user memory to reduce GC work.

I feel the overall idea is perfectly sensible, for all that it feels a bit peculiar in a language with a mark and sweep garbage collector. As the proposal points out, there are situations where the runtime knows that something doesn't escape but it has to allocate it on the heap instead of the stack, and also situations where the runtime knows that some value is dead but the compiler can't prove it. In both situations we can reduce pressure on memory allocation and to some extent garbage collection by explicitly marking the objects as free right away. A runtime example cited in the proposal is when maps grow and split, which is safe since map values are unaddressable so no one can have (validly formed) pointers to them.

(Because unused objects aren't traversed by the garbage collector, this doesn't directly reduce the amount of work GC has to do but it does mean GC might not have to run as much.)

Sadly, so far only the GOEXPERIMENT setting has landed in the Go development tree so there's nothing to actually play with (and no code to easily read). We have to look from afar and anticipate, and at this point it's possible no actual code will land until after Go 1.26, since based on the usual schedule there will be a release freeze soon, leaving not very much time to land all of these changes.

(The whole situation turns out to be less exciting than I thought when I read the commit message and made my Fediverse post, but that's one reason to write these entries.)

PS: In general, garbage collected languages can also have immediate freeing of memory, for example if they use reference counting. CPython is an example and CPython people can be quite used to deterministic, immediate collection of unreferenced objects along with side effects such as closing file descriptors. Sometimes this can mask bugs.
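
A tiny illustration of that CPython-specific behaviour (a sketch; other Python implementations make no such promise):

import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\n")

def first_line(path):
    # The file object is unreferenced as soon as readline() returns. CPython's
    # reference counting closes it right away; an implementation with only a
    # tracing GC may keep the file descriptor open until some later collection,
    # which is exactly the kind of difference that can mask bugs.
    return open(path).readline()

print(first_line(tmp.name))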

A problem for downloading things with curl

By: cks

For various reasons, I'm working to switch from wget to curl, and generally this has been going okay. However, I've now run into one situation where I don't know how to make curl do what I want. The culprit is, of course, a project that doesn't bother to make its downloads easily fetchable, although in a very specific way. In fact it's Django (again).

The Django URLs for downloads look like this:

https://www.djangoproject.com/download/5.2.8/tarball/

The way the websites of many projects turn these into actual files is to provide a filename in the HTTP Content-Disposition header in the reply. In curl, these websites can be handled with the -J (--remote-header-name) option, which uses the filename from the Content-Disposition if there is one.

Unfortunately, Django's current website does not operate this way. Instead, the URL above is a HTTP redirection to the actual .tar.gz file (on media.djangoproject.com). The .tar.gz file is then served without a Content-Disposition header as an application/octet-stream. Wget will handle this with --trust-server-names, but as far as I can tell from searching through the curl manpage, there is no option that will do this in curl.

(In optimistic hope I even tried --location-trusted, but no luck.)

If curl is directed straight to the final URL, 'curl -O' alone is enough to get the right file name. However, if curl goes through a redirection, there seems to be no option that will cause it to re-evaluate the 'remote name' based on the new URL; the initial URL and the name derived from it sticks, and you get a file unhelpfully called 'tarball' (in this case). If you try to be clever by running the initial curl without -O but capturing any potential redirection with "-w '%{redirect_url}\n'" so you can manually follow it in a second curl command, this works (for one level of redirections) but leaves you with a zero-length file called 'tarball' from the first curl.

It's possible that this means curl is the wrong tool for the kind of file downloads I want to do from websites like this, and I should get something else entirely. However, that something else should at least be a completely self contained binary so that I can easily drag it around to all of the assorted systems where I need to do this.

(I could always try to write my own in Go, or even take this as an opportunity to learn Rust, but that way lies madness and a lot of exciting discoveries about HTTP downloads in the wild. The more likely answer is that I hold my nose and keep using wget for this specific case.)
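
For scale, a rough standard-library Python sketch of the 'write my own' idea is not much code; this one names the file after the final, post-redirect URL and ignores Content-Disposition and all the messier realities of HTTP downloads:

import os.path
import urllib.request
from urllib.parse import urlsplit

def fetch(url):
    # urllib follows HTTP redirections by default; geturl() reports the final
    # URL, which is what we want to name the file after. This reads the whole
    # body into memory, which is fine for a sketch.
    with urllib.request.urlopen(url) as resp:
        name = os.path.basename(urlsplit(resp.geturl()).path) or "download"
        with open(name, "wb") as out:
            out.write(resp.read())
    return name

print(fetch("https://www.djangoproject.com/download/5.2.8/tarball/"))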

PS: I think it's possible to write a complex script using curl that more or less works here, but one of the costs is that you have to make first a HEAD and then a GET request to the final target, and that irritates me.

Some notes on duplicating xterm windows

By: cks

Recently on the Fediverse, Dave Fischer mentioned a neat hack:

In the decades-long process of getting my fvwm config JUST RIGHT, my xterm right-click menu now has a "duplicate" command, which opens a new xterm with the same geometry, on the same node, IN THE SAME DIRECTORY. (Directory info aquired via /proc.)

[...]

(See also a followup note.)

This led to @grawity sharing an xterm-native approach to this, using xterm's spawn-new-terminal() internal function that's available through xterm's keybindings facility.
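
As a rough, untested illustration, hooking this up is an X resources translation along these lines (the exact key chord is arbitrary):

XTerm*vt100.translations: #override Ctrl Shift <Key>N: spawn-new-terminal()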

I have a long-standing shell function in my shell that attempts to do this (imaginatively called 'spawn'), but this is only available in environments where my shell is set up, so I was quite interested in the whole area and did some experiments. The good news is that xterm's 'spawn-new-terminal' works, in that it will start a new xterm and the new xterm will be in the right directory. The bad news for me is that that's about all that it will do, and in my environment this has two limitations that will probably make it not something I use a lot.

The first limitation is that this starts an xterm that doesn't copy the command line state or settings of the parent xterm. If you've set special options on the parent xterm (for example, you like your root xterms to have a red foreground), this won't be carried over to the new xterm. Similarly, if you've increased (or decreased) the font size in your current xterm or otherwise changed its settings, spawn-new-terminal doesn't duplicate these; you get a default xterm. This is reasonable but disappointing.

(While spawn-new-terminal takes arguments that I believe it will pass to the new xterm, as far as I know there's no way to retrieve the current xterm's command line arguments to insert them here.)

The larger limitation for me is that when I'm at home, I'm often running SSH inside of an xterm in order to log in to some other system (I have a 'sshterm' script to automate all the aspects of this). What I really want when I 'duplicate' such an xterm is not a copy of the local xterm running a local shell (or even starting another SSH to the remote system), but the remote (shell) context, with the same (remote) current directory and so on. This is impossible to get in general and difficult to set up even for situations where it's theoretically possible. To use spawn-new-terminal effectively, you basically need either all local xterms or copious use of remote X forwarded over SSH (where the xterm is running on the remote system, so a duplicate of it will be as well and can get the right current directory).

Going through this experience has given me some ideas on how to improve the situation overall. Probably I should write a 'spawn' shell script to replace or augment my 'spawn' shell function so I can readily have it in more places. Then when I'm ssh'd in to a system, I can make the 'spawn' script at least print out a command line or two for me to copy and paste to get set up again.

(Two command lines is the easiest approach, with one command that starts the right xterm plus SSH combination and the other a 'cd' to the right place that I'd execute in the new logged in window. It's probably possible to combine these into an all-in-one script but that starts to get too clever in various ways, especially as SSH has no straightforward way to pass extra information to a login shell.)

My GPS bike computer is less distracting than the non-computer option

By: cks

I have a GPS bike computer primarily for following pre-planned routes, because it became a better supported option than our old paper cue sheets. One of the benefits of switching from paper cue sheets to a GPS unit was better supported route following, but after I made the switch, I found that it was also less distracting than using paper cue sheets. On the surface this might sound paradoxical, since people often say that computer screens are more distracting. It's true that a GPS bike computer has a lot that you can look at, but for route following, a GPS bike computer also has features that let me not pay attention to it.

When I used paper cue sheets, I always had to pay a certain amount of attention to following the route. I needed to keep track of where we were on the cue sheet's route, and either remember what the next turn was or look at the cue sheet frequently enough that I could be sure I wouldn't miss it. I also needed to devote a certain amount of effort to scanning street signs to recognize the street we'd be turning on to. All of this distracted me from looking around and enjoying the ride; I could never check out completely from route following.

When I follow a route on my GPS bike computer, it's much easier to not pay attention to route following most of the time. My GPS bike computer will beep at me and display a turn alert when we get close to a turn, and I always have it display the distance to the next turn so I can take a quick glance to reassure myself that we're nowhere near the turn. If there's any ambiguity about where to turn, I can look at the route's trace on a map and see that the turn is, for example, two streets ahead, and of course the GPS bike computer is always keeping track of where in the route I am.

Because the GPS bike computer can tell me when I need to pay attention to following the route, I'm free to not pay attention at other times. I can stop thinking about the route at all and look around at the scenery, talk with my fellow club riders, and so on.

(When I look around there are similar situations at work, with some of our systems. Our metrics, monitoring, and alerting system often has the net effect that I don't even look at how things are going because I assume that silence means all is okay. And if I want to do the equivalent of glancing at my GPS bike computer to check the distance to the next turn, I can look at our dashboards.)

How I handle URLs in my unusual X desktop

By: cks

I have an unusual X desktop environment that has evolved over a long period, and as part of that I have an equally unusual and slowly evolved set of ways to handle URLs. By 'handle URLs', what I mean is going from an URL somewhere (email, text in a terminal, etc) to having the URL open in one of my several browser environments. Tied into this is handling non-URL things that I also want to open in a browser, for example searching for various sorts of things in various web places.

The simplest place to start is at the end. I have several browser environments and to go along with them I have a script for each that opens URLs provided as command line arguments in a new window of that browser. If there are no command line arguments, the scripts open a default page (usually a blank page, but for my main browser it's a special start page of links). For most browsers this works by running 'firefox <whatever>' and so will start the browser if it's not already running, but for my main browser I use a lightweight program that uses Firefox's X-based remote control protocol, which means I have to start the browser outside of it.

Layered on top of these browser specific scripts is a general script to open URLs that I call 'openurl'. The purpose of openurl is to pick a browser environment based on the particular site I'm going to. For example, if I'm opening the URL of a site where I know I need JavaScript, the script opens the URL in my special 'just make it work' JavaScript enabled Firefox. Most URLs open in my normal, locked down Firefox. I configure programs like Thunderbird to open URLs through this openurl script, sometimes directly and sometimes indirectly.

(I haven't tried to hook openurl into the complex mechanisms that xdg-open uses to decide how to open URLs. Probably I should but the whole xdg-open thing irritates me.)
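
A stripped-down sketch of the dispatch idea (the wrapper script names, the site list, and the default URL here are all made up; the real scripts know far more rules):

import subprocess
import sys
from urllib.parse import urlsplit

# Hypothetical sites that need the JavaScript-enabled browser.
JS_NEEDED = {"youtube.com", "bank.example.com"}

def openurl(url):
    host = (urlsplit(url).hostname or "").lower()
    needs_js = any(host == s or host.endswith("." + s) for s in JS_NEEDED)
    # 'js-firefox' and 'main-firefox' stand in for the per-browser wrappers.
    wrapper = "js-firefox" if needs_js else "main-firefox"
    subprocess.run([wrapper, url])

if __name__ == "__main__":
    openurl(sys.argv[1] if len(sys.argv) > 1 else "about:blank")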

Layered on top of openurl and the specific browser scripts is a collection of scripts that read the X selection and do a collection of URL-related things with it. One script reads the X selection, looks for it being a URL, and either feeds the URL to openurl or just runs openurl to open my start page. Other scripts feed the URL to alternate browser environments or do an Internet search for the selection. Then I have a fvwm menu with all of these scripts in it and one of my fvwm mouse button bindings brings up this menu. This lets me select a URL in a terminal window, bring up the menu, and open it in either the default browser choice or a specific browser choice.

(I also have a menu entry for 'open the selection in my main browser' in one of my main fvwm menus, the one attached to the middle mouse button, which makes it basically reflexive to open a new browser window or open some URL in my normal browser.)

The other way I handle URLs is through dmenu. One of the things my dmenu environment does is recognize URLs and open them in my default browser environment. I also have short dmenu commands to open URLs in my other browser environments, or open URLs based on the parameters I pass the command (such as a 'pd' script that opens Python documentation for a standard library module). Dmenu itself can paste in the current X selection with a keystroke, which makes it convenient to move URLs around. Dmenu is also how I typically open a URL if I'm typing it in instead of copying it from the X selection, rather than opening a new browser window, focusing the URL bar, and entering the URL there.

(I have dmenu set up to also recognize 'about:*' as URLs and have various Firefox about: things pre-configured as hidden completions in dmenu, along with some commonly used website URLs.)

As mentioned, dmenu specifically opens plain URLs in my default browser environment rather than going through openurl. I may change this someday but in practice there aren't enough special sites that it's an issue. Also, I've made dedicated little dmenu-specific scripts that open up the various sites I care about in the appropriate browser, so I can type 'mastodon' in dmenu to open up my Fediverse account in the JavaScript-enabled Firefox instance.

Trying to understand Firefox's approaches to tracking cookie isolation

By: cks

As I learned recently, modern versions of Firefox have two different techniques that try to defeat (unknown) tracking cookies. As covered in the browser addon JavaScript API documentation, in Tracking protection, these are called first-party isolation and dynamic partitioning (or storage partitioning, the documentation seems to use both). Of these two, first party isolation is the easier to describe and understand. To quote the documentation:

When first-party isolation is on, cookies are qualified by the domain of the original page the user visited (essentially, the domain shown to the user in the URL bar, also known as the "first-party domain").

(In practice, this appears to be the top level domain of the site, not necessarily the site's domain itself. For example, Cookie Manager reports that a cookie set from '<...>.cs.toronto.edu' has the first party domain 'toronto.edu'.)

Storage partitioning is harder to understand, and again I'll quote the Storage partitioning section of the cookie API documentation:

When using dynamic partitioning, Firefox partitions the storage accessible to JavaScript APIs by top-level site while providing appropriate access to unpartitioned storage to enable common use cases. [...]

Generally, top-level documents are in unpartitioned storage, while third-party iframes are in partitioned storage. If a partition key cannot be determined, the default (unpartitioned storage) is used. [...]

If you read non-technical writeups like Firefox rolling out Total Cookie Protection (from 2022), it certainly sounds like they're describing first-party isolation. However, if you check things like Status of partitioning in Firefox and the cookies API documentation on first-party isolation, as far as I can tell what Firefox actually normally uses for "Total Cookie Protection" is storage partitioning.

Based on what I can decode from the two descriptions and from the fact that Tor Browser defaults to first-party isolation, it appears that first-party isolation is better and stricter than storage partitioning. Presumably it also causes problems on more websites, enough so that Firefox either no longer uses it for Total Cookie Protection or never did, despite their description sounding like first-party isolation.

(So far I haven't run into any issues with first-party isolation in my cookie-heavy browser environment. It's possible that websites have switched how they do things to avoid problems.)

First-party isolation can be enabled in about:config by setting privacy.firstparty.isolate to true. If and when you do this, the normal Settings → Privacy and Security will show a warning banner at the top to the effect of:

You are using First Party Isolation (FPI), which overrides some of Firefox’s cookie settings.

All of this is relevant to me because one of my add-ons, Cookie AutoDelete, probably works with first-party isolation but almost certainly doesn't work with storage partitioning (ie, it will fail to delete some cookies under storage partitioning, although I believe it can still delete unpartitioned cookies). Given what I've learned, I'm likely to turn on first-party isolation in my main browser environment soon.

If Cookie Manager is reporting correct information to me, it's possible to have cookies that are both first-party isolated and partitioned; the one I've seen so far is from Youtube. Cookie Manager can't seem to remove these cookies. Based on what I've read about (storage or dynamic) partitioned cookies, I suspect that these are created by embedded iframes.

(Turning on or off first-party isolation effectively drops all of the cookies you currently have, so it's probably best to do it when you restart your browser.)

My mistake with swallowing EnvironmentError errors in our Django application

By: cks

We have a little Django application to handle requests for Unix accounts. Once upon a time it was genuinely little, but it's slowly accreted features over the years. One of the features it grew over the years was a command line program (a Django management command) to bulk-load account request information from files. We use this to handle things like each year's new group of incoming graduate students; rather than force the new graduate students to find the web form on their own, we get information on all of them from the graduate program people and load them into the system in bulk.

One of the things that regularly happens with new graduate students is that they were already involved on the research side of the department. For example, as an undergraduate you might work on a research project with a professor, and then you get admitted as a graduate student (maybe with that professor, or maybe with someone else). When this happens, the new graduate student already has an account and we don't want to give them another one (for various reasons). To detect situations where someone already has an existing account, the bulk loader reads some historical data out of a couple of files and looks through it to match any existing accounts to the new graduate students.

When I originally wrote the code to load data from files, for some reason I decided that it wasn't particularly bad if the files didn't exist or couldn't be read, so I wrote code that looked more or less like this:

try:
  fp = open(fname, "r")
  [process file]
  fp.close()
except EnvironmentError:
  pass

Of course, for testing purposes (and other reasons, for example to suppress this check) we should be able to change where the data files were read from, so I made the file names of the data files be argparse options, set the default values to the standard locations where the production application records things, and called it all good.

Except that for the past two years, one of the default file names was wrong; when I added this specific file, I made a typo in the file name. Using the command line option to change the file name worked so this passed my initial testing when I added the specific type of historical data, but in production, using my typo'd default file name, we silently never detected existing Unix logins for new graduate students (and others) through this particular type of historical data.

All of this happened because I made a deliberate design decision to silently swallow all EnvironmentError exceptions when trying to open and read these files, instead of either failing or at least reporting a warning. When I made the decision (back in 2013, it turns out), I was probably thinking that the only source of errors was if you ran it as the wrong user or deliberately supplied nonexistent files; I doubt it ever occurred to me that I could make an embarrassing typo in the name of any of the production files. One of the lessons I draw from this is that I don't always even understand the possible sources of errors, which makes it all the more dangerous to casually ignore them.
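
A version that keeps the "missing files are tolerable" behaviour but would have surfaced the typo might look something like this (a sketch, not the application's actual code):

import sys

def load_history(fname):
    try:
        with open(fname, "r") as fp:
            return fp.readlines()   # stand-in for the real processing
    except EnvironmentError as e:
        # Missing or unreadable files are still tolerated, but now they're
        # visible instead of silently looking like empty data.
        sys.stderr.write("warning: could not read %s: %s\n" % (fname, e))
        return []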

(Even silently ignoring nonexistent files is rather questionable in retrospect. I don't really know what I was thinking in 2013.)

Removing Fedora's selinux-policy-targeted package is mostly harmless so far

By: cks

A while back I discussed why I might want to remove the selinux-policy-targeted RPM package for a Fedora 42 upgrade. Today, I upgraded my office workstation from Fedora 41 to Fedora 42, and as part of preparing for that upgrade I removed the selinux-policy-targeted package (and all of the packages that depended on it). The result appears to work, although there were a few things that came up during the upgrade and I may reinstall at least selinux-policy-targeted itself to get rid of them (for now).

The root issue appears to be that when I removed the selinux-policy-targeted package, I probably should have edited /etc/selinux/config to set SELINUXTYPE to some bogus value, not left it set to "targeted". For entirely sensible reasons, various packages have postinstall scripts that assume that if your SELinux configuration says your SELinux type is 'targeted', they can do things that implicitly or explicitly require things from the package or from the selinux-policy package, which got removed when I removed selinux-policy-targeted.

I'm not sure if my change to SELINUXTYPE will completely fix things, because I suspect that there are other assumptions about SELinux policy programs and data files being present lurking in standard, still-installed package tools and so on. Some of these standard SELinux related packages definitely can't be removed without gutting Fedora of things that are important to me, so I'll either have to live with periodic failures of postinstall scripts or put selinux-policy-targeted and some other bits back. On the whole, reinstalling selinux-policy-targeted is probably the safest course; the issue that caused me to remove it only applies during Fedora version upgrades, and it might be fixed in Fedora 42 anyway.

What this illustrates to me is that regardless of package dependencies, SELinux is not really optional on Fedora. The Fedora environment assumes that a functioning SELinux environment is there and if it isn't, things are likely to go wrong. I can't blame Fedora for this, or for not fully capturing this in package dependencies (and Fedora did protect the selinux-policy-targeted package from being removed; I overrode that by hand, so what happens afterward is on me).

(Although I haven't checked modern versions of Fedora, I suspect that there's no official way to install Fedora without getting a SELinux policy package installed, and possibly selinux-policy-targeted specifically.)

PS: I still plan to temporarily remove selinux-policy-targeted when I upgrade my home desktop to Fedora 42. A few package postinstall glitches is better than not being able to read DNF output due to the package's spam.

Firefox, the Cookie AutoDelete add-on, and "Total Cookie Protection"

By: cks

In a comment on my entry on flailing around with Firefox's Multi-Account Containers, Ian Z aka nobrowser asked a good question:

The Cookie Autodelete instructions with respect to Total Cookie Protection mode are very confusing. Reading them makes me think this extension is not for me, as I have Strict Mode on in all windows, private or not. [...]

This is an interesting question (and, it turns out, relevant to my usage too) so I did some digging. The short answer is that I suspect the warning on Cookie AutoDelete's add-on page is out of date and it works fine. The long answer starts with the history of HTTP cookies.

Back in the old days, HTTP cookies were global, which is to say that browsers kept a global pool of HTTP cookies (both first party, from the website you were on, and third-party cookies), and it would send any appropriate cookie on any HTTP request to its site. This enabled third-party tracking cookies and a certain amount of CSRF attacks, since the browser would happily send your login cookies along with that request initiated by the JavaScript on some sketchy website you'd accidentally wound up on (or JavaScript injected through an ad network).

This was obviously less than ideal and people wound up working to limit the scope of HTTP cookies, starting with things like Firefox's containers and eventually escalating to first-party cookie isolation, where a cookie is restricted to whatever the first-party domain was when it was set. If you're browsing example.org and the page loads google.com/tracker, which sets a tracker cookie, that cookie will not be sent when you then visit a different site, example.com, where the page also loads google.com/tracker; the first tracking cookie is isolated to example.org.

(There is also storage isolation for cookies, but I think that's been displaced by first-party cookie isolation.)

However, first-party isolation can break things you expect to work, as covered in this Firefox FAQ. As a result of this, my impression is that browsers have been cautious and slow to roll out first-party isolation by default. However, they have made it available as an option or part of an option. Firefox calls this Total Cookie Protection (also, also).

(Firefox is working to go even further, blocking all third-party cookies.)

Firefox add-ons have special APIs that allow them to do privileged things, and these include an API for dealing with cookies. When first-party cookie isolation came to pass, these APIs needed to be updated to deal with such isolated cookies (and cookie tracking protection in general). For instance, cookies.remove() has to be passed a special parameter to remove a first-party isolated cookie. As covered in the documentation, an add-on using the cookies APIs without the necessary updates would only see non-isolated cookies, if there were any. So at the time the message on Cookie AutoDelete's add-on page was written, I suspect that it hadn't been updated for first-party isolation. However, based on checking the source code of Cookie AutoDelete, I believe that it currently supports first-party isolation for cookies, and in fact may have done so for some time, perhaps v3.5.0, or v3.4.0 or even earlier.

(It's also possible that this support is incomplete or buggy, or that there are still some things that you can't easily do through it that matter to Cookie AutoDelete.)

Cookie AutoDelete itself is potentially useful even if you have Firefox set to block all third-party cookies, because it will also clean up unwanted first-party cookies (assuming that it truly works with first-party isolation). Part of my uncertainty is that I'm not sure how you reliably find out what cookies you have in a browser world with first-party isolation. There's theoretically some information about this in Settings → Privacy & Security → Cookies and Site Data → "Manage Data...", but since that's part of the normal Settings UI that normal people use, I'm not sure if it's simplifying things.

PS: Now that I've discovered all of this, I'm not certain if my standard Cookie Quick Manager add-on properly supports first-party isolated cookies. There's this comment on an issue that suggests it does support first-party isolation but not storage partitioning (also). The available Firefox documentation and Settings UI is not entirely clear about whether first-party isolation is now on more or less by default.

(That comment points to Cookie Manager as a potential partition-aware cookie manager.)

Finally, run Docker containers natively in Proxmox 9.1 (OCI images)

Proxmox VE is a virtualization platform, like VMware, but open source and based on Debian. It can run KVM virtual machines and Linux Containers (LXC). I've been using it for over 10 years; the [first article I wrote mentioning it was in 2012](/s/tags/proxmox.html). At home I have a 2-node Proxmox VE cluster consisting of 2 HP EliteDesk Mini machines, both running with 16 GB RAM and both an NVMe and a SATA SSD with ZFS on root (256 GB). It's small enough (physically) and is just enough, specs-wise, for my homelab needs. Proxmox VE 9.1 was released [recently](https://www.proxmox.com/en/about/company-details/press-releases/proxmox-virtual-environment-9-1) and this new version is able to run Docker containers / OCI images natively, no more hacks or VMs required to run Docker. This post shows you how to run a simple container from a Docker image.

Sparkling Network

This is an overview of all the servers in the Sparkling Network, mostly as an overview for myself, but it might be interesting for others. It also has a status overview of the nodes. Prices are monthly, excluding VAT.

New Blog

By: cpldcpu

I relocated my blog to Hugo for easier maintenance and more control over content and layout. You can find it here.

All articles from this blog have been preserved, although I won’t list some that I found lacking in quality.

Recently

This’ll be the last Recently in 2025. It’s been a decent year for me, a pretty rough year for the rest of the world. I hope, for everyone, that 2026 sees the reversal of some of the current trends.

Watching

This video from Daniel Yang, who makes spectacular bikes of his own, covers a lot of the economics of being a bike-builder, which are all pretty rough. I felt a lot of resonance with Whit: when I ran a business I always felt like it was tough to be commercial about it, and had to fight my own instincts to over engineer parts of it. It’s also heartbreaking to think about how many jobs are so straightforwardly good for the maker and good for the buyer but economically unviable because of the world we live in. I feel like the world would be a lot different if the cost of living was lower.

Yes, the fan video! It’s a solid 48 minutes of learning how fans work. I finally watched it. Man, if every company did their advertising this way it would be so fun. I learned a lot watching this.

I have trouble finding videos about how things work that are actually about how things work. Titles like “how it’s made” or “how it works” or “how we did it” perform well in A/B tests and SEO so they get used for media that actually doesn’t explain how it’s made and how it works, greatly frustrating people like me. But the fan video delivers.

Reading

But then, you realize that the goal post has shifted. As the tech industry has become dramatically more navigable, YC became much less focused on making the world understandable, revolving, instead, around feeding consensus. “Give the ecosystem what they want.”

I have extremely mixed feelings about Build What’s Fundable, this article from Kyle Harrison. Some of it I think is bravely truth-telling in an industry that usually doesn’t do public infighting - Harrison is a General Partner at a big VC firm, and he’s critiquing a lot of firms directly on matters both financial and ethical.

But on the other hand, there’s this section about “Breaking the Normative Chains”:

When you look at successful contrarian examples, many of them have been built by existing billionaires (Tesla, SpaceX, Palantir, Anduril). The lesson from that, I think, isn’t “be a billionaire first then you can have independent thoughts.” It is, instead, to reflect on what other characteristics often lead to those outcomes. And, in my opinion, the other commonality that a lot of those companies have is that they’re led by ideological purists. People that believe in a mission.

And then, in the next section he pulls up an example of a portfolio company that encapsulates the idea of true believers, and he names Base, which has a job ad saying “Don’t tell your grandkids all you did was B2B SaaS.”

Now, Base’s mission is cool: they’re doing power storage at scale. I like the website. But I have to vent here that the founder is Zach Dell. Michael Dell’s son. Of Dell Computer, and a 151 billion dollar fortune.

I just think that if we’re going to talk about how the lesson isn’t that you should be a billionaire first before having independent thoughts and building a big tech company, it should be easy to find someone who is not the son of the 10th wealthiest person in the world to prove that point. I have nothing against Zach in particular: he is probably a talented person. But in the random distribution of talented, hardworking people, very few of them are going to be the son of the 10th wealthiest person in the world.


Like so many other bits of Times coverage, the whole of the piece is structured as an orchestrated encounter. Some people say this; however, others say this. It’s so offhand you can think you’re gazing through a pane of glass. Only when you stand a little closer, or when circumstances make you a little less blinkered, do you notice the fact which then becomes blinding and finally crazymaking, which is just that there is zero, less than zero, stress put on the relation between those two “sides,” or their histories, or their sponsors, or their relative evidentiary authority, or any of it.

I love this article on maybe don’t talk to the New York Times about Zohran Mamdani. It describes the way in which the paper launders its biases, which overlaps with one of my favorite rules from Wikipedia editing about weasel words.

I don’t want you to hate this guy. Yes, he actively promotes poisonous rhetoric – ignore that for now. This is about you. Reflect on all your setbacks, your unmet potential, and the raw unfairness of it all. It sucks, and you mustn’t let that bitterness engulf you. You can forgive history itself; you can practice gratitude towards an unjust world. You need no credentials, nor awards, nor secrets, nor skills to do so. You are allowed to like yourself.

Taylor Troesh on IQ is exactly what I needed that day.

The React team knows this makes React complicated. But the bet is clear: React falls on the sword of complexity so developers don’t have to. That’s admirable, but it asks developers to trust React’s invisible machinery more than ever.

React and Remix Choose Different Futures is perfect tech writing: it unpacks the story and philosophy behind a technical decision without cramming it into a right-versus-wrong framework.

When you consider quitting, try to find a different scoreboard. Score yourself on something else: on how many times you dust yourself off and get up, or how much incremental progress you make. Almost always, in your business or life, there are things you can make daily progress on that can make you feel like you’re still winning. Start compounding.

“Why a language? Because I believe that the core of computing is not based on operating system or processor technologies but on language capability. Language is both a tool of thought and a means of communication. Just as our minds are shaped by human language, so are operating systems shaped by programming languages. We implement what we can express. If it cannot be expressed, it will not be implemented.” – Carl Sassenrath

Alexis Sellier, whose work and aesthetic I’ve admired since the early days of Node.js, is working on a new operating system. A real new operating system, like playb.it or SerenityOS (bad politics warning). I’m totally into it: we need more from-scratch efforts like this!

Yes, the funds available for any good cause are scarce, but that’s not because of some natural law, some implacable truth about human society. It’s because oligarchic power has waged war on benign state spending, leading to the destruction of USAID and drastic cuts to the aid budgets of other countries, including the UK. Austerity is a political choice. The decision to impose it is driven by governments bowing to the wishes of the ultra-rich.

The Guardian on Bill Gates is a good read. I’ve had The Bill Gates Problem on my reading list for a long time. Maybe it’s next after I finish The Fort Bragg Cartel.

Contrast this with the rhetorical shock and awe campaign that has been waged by technology companies for the last fifteen years championing the notion of ephemerality.

Implicit, but unspoken, in this worldview is the idea of transience leading to an understanding of a world awash in ephemeral moments that, if not seized on and immediately capitalized to maximum effect, will be forever lost to the mists of time and people’s distracted lifestyles.

Another incredible article by Aaron Straup Cope about AI, the web, ephemerality, and culture. (via Perfect Sentences)


Also, no exact quote, but I’ve been subscribed to Roma’s Unpolished Posts and they have been pretty incredible: mostly technical articles about CSS, which have been ‘new to me’ almost every day, and the author is producing them once a day. Feels like a cheat code to absorb so much new information so quickly.

Listening

I didn’t add any major new albums to my collection this month. I did see Tortoise play a show, which was something I never expected to do. So in lieu of new albums, here’s a theme.

There are a bunch of songs in my library that use a meter change as a way to add or resolve tension. I’m not a big fan of key changes but I love a good rhythm or production shift.

First off: Kissing the Beehive. Yes, it’s nearly 11 minutes long. Magically feeling “off kilter” and “in the pocket” at the same time. I’m no drummer but I think the first part is something like three measures of 4/4 and one of 6/4. But then at 3:26, brand new connected song, and by the time we get to 7 minutes in, we’re in beautiful breezy easy 4/4!

An Andrew Bird classic: about a minute of smooth 4/4, and then over to 7/4 in the second half or so.

I adore Akron/Family’s Running, Returning. Starts in classic 5/4, then transitions to 4/4, then 6/8. For me it all feels very cohesive. Notably, the band is not from Akron Ohio but formed in Williamsburg and were from other East Coast places. If you’re looking for the band from Akron, it’s The Black Keys.

Slow Mass’s Schemes might be my song of the year. Everything they write just sounds so cool. The switchup happens around 2:40 when the vocals move to 6/4. Astounding.

Predictions

Back in January, I made some predictions about 2025. Let’s see how they turned out!

1: The web becomes adversarial to AI

I am marking this one as an absolute win: more and more websites are using Anubis, which was released in March, to block LLM scrapers. Cloudflare is rolling out more LLM bot protections. We at Val Town have started to turn on those protections to keep LLM bots from eating up all of our bandwidth and CPU. The LLM bots are being assholes and everyone hates them.

2: Copyright nihilism breeds a return to physical-only media

This was at most going to be a moderate win because physical-only media will be niche, but I think there are good signs that this is right. The Internet Phone Book, in which this site is featured, started publishing this year. Gen Z seems to be buying more vinyl and printing out more photos.

3: American tech companies will pull out of Europe because they want to do acquisitions

Middling at best: there are threats and there is speculation, but nothing major to report.

4: The tech industry’s ‘DEI backlash’ will run up against reality

Ugh, probably the opposite has happened. Andreessen Horowitz shut down their fund that focused on women, minorities, and people underrepresented in VC funding. We’ll know more about startups themselves when Carta releases their annual report, which looked pretty bad last year.

5: Local-first will have a breakthrough moment

Sadly, no. Lots and lots of promising projects, but the ecosystem really struggles to produce something production-ready that offers good tradeoffs. Tanstack DB might be the next contender.

6: Local, small AI models will be a big deal

Not yet. Big honkin’ models are still grabbing most of the headlines. LLMs still really thrive at vague tasks that can be satisfied by a wide range of acceptable outputs, like chatbots, and are pretty middling at tasks that require specific, strict, quantifiable outputs.

For my mini predictions:

  • Substack will re-bundle news. Sort of! Multi-editor newsletters like The Argument are all the rage right now.
  • TypeScript gets a Zeitwerk equivalent and lots of people use it. Sadly, no.
  • Node.js will fend off its competitors. Mostly! Bun keeps making headway but Node.js keeps implementing the necessary featureset, and doesn’t seem to be losing that much marketshare.
  • Another US city starts seriously considering congestion pricing. There are rumblings from Boston and Chicago!
  • Stripe will IPO. Nope. Unlimited tender offers from Sequoia are just as good as IPOing, so why would they?

Val Town 2023-2025 Retrospective

It’s the end of 2025, which means that I’m closing in on three years at Val Town. I haven’t written much about the company or what it’s really been like. The real story of a company is usually told years after the dust has settled. Founders usually tell a heroic story of success while they’re building.

Reading startup news really warps your perspective, especially when you’re building a startup yourself. Everyone else is getting fabulously rich! It makes me less eager to write about anything.

But I’m incurably honest and like working with people who are too. Steve, the first founder of Val Town (I joined shortly after as cofounder/CTO) is a shining example of this. He is a master of saying the truth in situations when other people are afraid to. I’ve seen it defuse tension and clear paths. It’s a big part of ‘the culture’ of the company.

So here’s some of the story so far.

Delivering on existing expectations and promises

Here’s what the Val Town interface looked like fairly early on:

Val Town user interface in mid-2023

When I initially joined, we had a prototype and a bit of hype. The interface was heavily inspired by Twitter - every time that you ran code, it would save a new ‘val’ and add it to an infinite-scrolling list.

Steve and Dan had really noticed the utter exhaustion in the world of JavaScript: runaway complexity. A lot of frameworks and infrastructure was designed for huge enterprises and was really, really bad at scaling down. Just writing a little server that does one thing should be easy, but if you do it with AWS and modern frameworks, it can be a mess of connected services and boilerplate.

Val Town scaled down to 1 + 1. You could type 1 + 1 in the text field and get 2. That’s the way it should work.

It was a breath of fresh air. And a bunch of people who encountered it even in this prototype-phase state were inspired and engaged.

The arrows marketing page

One of the pivotal moments of this stage was creating this graphic for our marketing site: the arrows graphic. It really just tied it all together: look how much power there was in this little val! And no boilerplate either. Where there otherwise is a big ritual of making something public or connecting an email API, there’s just a toggle and a few lines of code.

I kind of call this early stage, for me, the era of delivering on existing expectations and promises. The core cool idea of the product was there, but it was extremely easy to break.

Security was one of the top priorities. We weren’t going to be a SOC2 certified bank-grade platform, but we also couldn’t stay where we were. Basically, it was trivially easy to hack: we were using the vm2 NPM module to run user code. I appreciate that vm2 exists, but it really, truly, is a trap. There are so many ways to get out of its sandbox and access other people’s code and data. We had a series of embarrassing security vulnerabilities.

For example: we supported web handlers so you could easily implement a little server endpoint, and the API for this was based on express, the Node.js server framework. You got a request object and a response object from express, and in this case they were literally the same objects as our server’s. Unfortunately, there’s a method response.download(path: string) which sends an arbitrary file from your server to the internet. You can see how this one ends: not ideal.
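
The shape of the problem, as a sketch - hypothetical code, not what Val Town actually ran; the endpoint path and handler names are made up for illustration:

import express, { Request, Response } from "express";

const app = express();

app.all("/run", (req, res) => {
  // Imagine this handler came from an untrusted val. Because it receives
  // the host server's own req/res objects, every Express method is on the
  // table...
  const untrustedValHandler = (req: Request, res: Response) => {
    // ...including res.download(), which streams an arbitrary file from
    // the host's filesystem back to whoever made the request.
    res.download("/etc/passwd");
  };
  untrustedValHandler(req, res);
});

app.listen(3000);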

So, we had to deliver on a basic level of security. Thankfully, in the way that it sometimes does, the road rose to meet us. The right technology appeared just in time: Deno. Deno’s sandboxing made it possible to run people’s code securely without having to build a mess of Kubernetes and Docker sandbox optimizations. It delivered on being secure, fast, and simple to implement: we haven’t identified a single security bug caused by Deno.
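
The permission model is the whole trick: by default the process can’t touch the network, the filesystem, or the environment, and you grant access piece by piece. Something in this spirit - the exact flags and paths here are illustrative, not our real invocation:

# Illustrative only: the untrusted code gets network access and one
# scratch directory, and nothing else.
deno run --allow-net --allow-read=/tmp/val-scratch untrusted_val.ts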

That said, the context around JavaScript runtimes has been tough. Node.js is still dominant and Bun has attracted most of the attention as an alternative, with Deno in a distant third place, vibes-wise. The three are frustratingly incompatible - Bun keeps adding built-ins like an S3 client which would have seemed unthinkable in the recent past, and Node added a built-in SQLite client in version 22. Contrary to what I hoped in 2022, JavaScript has gotten more splintered and inconsistent as an ecosystem.

Stability was the other problem. The application was going down constantly for a number of reasons, but most of all was the database, which was Supabase. I wrote about switching away from Supabase, which they responded to in a pretty classy way, and I think they’ve since improved. But Render has been a huge step up in maintainability and maturity for how we host Val Town.

Adding Max was a big advance in our devops-chops too: he was not only able to but excited to work on the hard server capacity and performance problems. We quietly made a bunch of big improvements like allowing vals to stay alive after serving requests - before that, every run was a cold start.

What to do about AI

Townie

Townie, the Val Town chatbot, in early 2024

Believe it or not, in early 2023 there were startups that didn’t say “AI” on the front page of their marketing websites. The last few years have been a dizzying shift in priorities and vibes, which I’ve had mixed feelings about and have written about a lot.

At some point it became imperative to figure out what Val Town was supposed to do about all that. Writing code is undeniably one of the sweet spots of what LLMs can do, and over the last few years the fastest-growing most hyped startups have emerged from that ability.

This is where JP Posma comes in. He was Steve’s cofounder at a previous startup, Zaplib, and was our ‘summer intern’ - the quotes because he’s hilariously overqualified for that title. He injected some AI abilities into Val Town: RAG-powered search, and the first version of Townie, a chatbot that can write code.

Townie has been really interesting. Basically it lets you write vals (our word for apps) with plain English. This development happened around the same time as a lot of the ‘vibe-coding’ applications, like Bolt and Lovable. But Townie was attached to a platform that runs code and has community elements and a lot more. It’s an entry point to the rest of the product, while a lot of other vibe-coding tools were the core product that would eventually expand to include stuff like what Val Town provides.

Ethan Ding has written a few things about this: it’s maybe preferable to sell compute instead of being the frontend for LLM-vibe-coding. But that’s sort of a long-run prediction about where value accrues rather than an observation about what companies are getting hype and funding in the present.

Vibe coding companies

There are way too many companies providing vibe-coding tools without having a moat or even a pathway to positive margins. But having made a vibe-coding tool, I completely see why: it makes charts look amazing. Townie was a huge growth driver for a while, and a lot of people were hearing about Townie first, and only later realizing that Val Town could run code, act as a lightweight GitHub alternative, and power a community.

Unlike a lot of AI startups, we didn’t burn a ton of money running Townie. We did have negative margins on it, but to the tune of a few thousand dollars a month during the most costly months.

Introducing a pro plan made it profitable pretty quickly and today Townie is pay-as-you-go, so it doesn’t really burn money at all. But on the flip side, we learned a lot about the users of vibe-coding tools. In particular, they use the tools a lot, and they really don’t want to pay for them. This kind of makes sense: vibe-coding actual completed apps without ever dropping down to write or read code is Zeno’s paradox: every prompt gets you halfway there, so you inch closer and closer but never really reach your destination.

So you end up chatting for eight hours, typically getting angrier and angrier, and using a lot of tokens. This would be great for business in theory, but in practice it doesn’t work for obvious reasons: people like to pay for results, not the process. Vibe-coding is a tough industry - it’s simultaneously one of the most expensive products to run, and one of the most flighty and cost-sensitive user-bases I’ve encountered.

So AI has been complicated. On one hand, it’s amazing for growth and obviously has spawned wildly successful startups. On the other, it can be a victim of its own expectations: every company seems to promise perfect applications generated from a single prompt and that just isn’t the reality. And that results in practically every tool falling short of those expectations and thus getting the rough end of user sentiment.

We’re about to launch MCP support, which will make it possible to use Val Town via existing LLM interfaces like Claude Code. It’s a lot better than previous efforts - more powerful and flexible, plus it requires us to reinvent less of the wheel. The churn in the ‘state of the art’ feels tremendous: first we had tool-calling, then MCPs, then tool calls that write code to call MCPs. It’s hard to tell if this is fast progress or just churn.

As a business

When is a company supposed to make money? It’s a question that I’ve thought about a lot. When I was running a bootstrapped startup, the answer was obviously as soon as possible, because I’d like to stop paying my rent from my bank account. Venture funding lets you put that off for a while, sometimes a very long while, and then when companies start making real revenue they at best achieve break-even. There are tax and finance reasons for all of this – I don’t make the rules!

Anyway, Val Town is far from break-even. But that’s the goal for 2026, and it’s optimistically possible.

One thing I’ve thought for a long time is that people building startups are building complicated machines. They carry out a bunch of functions, maybe they proofread your documents or produce widgets, or whatever, but the machine also has a button on it that says “make money.” And everything kind of relates to that button as you’re building it, but you don’t really press it.

The nightmare is if the rest of the machine works, you press the button, and it doesn’t do anything. You’ve built something useful but not valuable. This hearkens back to the last section about AI: you can get a lot of people using the platform, but if you ask them for money and they’re mostly teenagers or hobbyists, they’re not going to open their wallets. They might not even have wallets.

So we pressed the button. It kind of works.

But what I’ve learned is that making revenue is a lot like engineering: it requires a lot of attempts, testing, and hard work. It’s not something that just results from a good product. Here’s where I really saw Charmaine and Steve at work, on calls, making it happen.

The angle right now is to sell tools for ‘Go To Market’ - stuff like capturing user signups on your website, figuring out which users are from interesting companies or have interesting use-cases, forwarding that to Slack, pushing it to dashboards, and generally making the sales pipeline work. It’s something Val Town can do really well: most other tools for this kind of task have some sort of limit on how complicated and custom they can get, and Val Town doesn’t.

Expanding and managing the complexity

Product-wise, the big thing about Val Town that has evolved is that it can do more stuff and it’s more normal. When we started out, a Val was a single JavaScript expression - this was part of what made Val Town scale down so beautifully and be so minimal, but it was painfully limited. Basically people would type into the text box

const x = 10;
function hi() {};
console.log(1);

And we couldn’t handle that at all: if you ran the val, did it run that function? Export the x variable? It was magic but too confusing. The other tricky niche choice was that we had a custom import syntax like this:

@tmcw.helper(10);

In which @tmcw.helper was the name of another val and this would automatically import and use it. Extremely slick but really tricky to build off of because this was non-standard syntax, and it overlapped with the proposed syntax for decorators in JavaScript. Boy, I do not love decorators: they have been under development for basically a decade and haven’t landed, just hogging up this part of the unicode plane.

But regardless this syntax wasn’t worth it. I have some experience with this problem and have landed squarely on the side of normality is good.

So, in October 2023, we ditched it, adopted standard ESM import syntax, and became normal. This was a big technical undertaking, in large part because we tried to keep all existing code running by migrating it. Thankfully JavaScript has a very rich ecosystem of tools that can parse and produce code and manipulate syntax trees, but it was still a big, dramatic shift.
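
For a sense of the before and after - the import URL and export shape below are illustrative, not necessarily the exact paths Val Town generates:

// Before, with the custom syntax:
//   @tmcw.helper(10);

// After the migration, it's just a standard ESM import:
import { helper } from "https://esm.town/v/tmcw/helper";

helper(10);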

This is one of the core tensions of Val Town as well as practically every startup: where do you spend your user-facing innovation energy?

I’m a follower of the use boring technology movement when it comes to how products are built: Val Town intentionally uses some boring established parts like Postgres and React Router, but what about when it comes to the product itself? I’ve learned the hard way that most of what people call intuition is really familiarity: it’s good when an interface behaves like other interfaces. A product that has ten new concepts and a bunch of new UI paradigms is going to be hard to learn and probably will lose out to one that follows some familiar patterns.

Moving to standard JavaScript made Val Town more learnable for a lot of people while also removing some of its innovation. Now you can copy code into & out of Val Town without having to adjust it. LLMs can write code that targets Val Town without knowing everything about its quirks. It’s good to go with the flow when it comes to syntax.

Hiring and the team

Office Sign

Val Town has an office. I feel like COVID made everything remote by default and the lower-COVID environment that we now inhabit (it’s still not gone!) has led to a swing-back, but the company was founded in the latter era and has never been remote. So, we work from home roughly every other Friday.

This means that we basically try to hire people in New York. It hasn’t been too hard in the past. About 6% of America lives in the New York City metro area and the Northeast captures about 23% of venture funding, so there are lots of people who live here or want to.

Stuff on the window sill in the office

Here’s something hard to publish: we’re currently at three people. It was five pretty recently. Charmaine got poached by Anthropic, where she’ll definitely kick ass, and Max is now at Cloudflare, where he’s writing C++, which will be even more intimidating than his chess ranking. The company’s weirdly good at people leaving: we have parties and everyone exchanges hand-written cards. How people handle hard things says a lot.

But those three are pretty rad: Jackson was a personal hero of mine before we hired him (he still is). He’s one of the best designers I’ve worked with, and an incredibly good engineer to boot. He’s worked at a bunch of startups you’ve heard of, had a DJ career, gotten to the highest echelons of tech without acquiring an ego. He recently beat me to the top spot in our GitHub repo’s lines-changed statistic.

Steve has what it takes for this job: grit, optimism, curiosity. The job of founding a company and being a CEO is a different thing every few months - selling, hiring, managing, promoting. Val Town is a very developer-consumer oriented product and that kind of thing requires a ton of promotion. Steve has done so much, in podcasts, spreading the word in person, writing, talking to customers. He has really put everything into this. A lot of the voice and the attitude of the company flows down from the founder, and Steve is that.

Did I mention that we’re hiring?

In particular, we’re looking for a customer-facing technical promoter type - now called a “GTM” hire. Basically, someone who can write a bit of code but has the attitude of someone in sales, who can see potential and handle rejection. Not necessarily the world’s best programmer, but someone who can probably code, and definitely someone who can write. Blogging and writing online is a huge green flag for this position.

And the other role that we really need is an “application engineer.” These terms keep shifting, so if full-stack engineer means more, sure, that too. Basically someone who can write code across boundaries. This is more or less what Jackson and I do - writing queries, frontend code, fixing servers, the whole deal. Yeah, it sounds like a lot but this is how all small companies operate, and I’ve made a lot of decisions to make this possible: we’ve avoided complexity like the plague in Val Town’s stack, so it should all be learnable. I’ve written a bunch of documentation for everything, and constantly tried to keep the codebase clean.

Sidenote, but even though I think that the codebase is kind of messy, I’ve heard from very good engineers (even the aforementioned JP Posma) that it’s one of the neatest and most rational codebases they’ve seen. Maybe it is, maybe it isn’t, see for yourself!

What we’re really looking for in hires

Tech hiring has been broken the whole time I’ve been in the industry, for reasons that would take a whole extra article to ponder. But one thing that makes it hard is vagueness, both on the part of applicants and companies. I get it - cast a wide net, don’t put people off. But I can say that:

  • For the GTM position, you should be able to write for the internet. This can be harder than it looks: there are basically three types of writing: academic, corporate, and internet, and they are incompatible.
  • You should also be kind of entrepreneurial: which means optimistic, resilient, and opportunistic.
  • For the application engineering role, you should be a good engineer who understands the code you write and is good at both writing and reading code. Using LLM tools is great, but relying on them exclusively is a dealbreaker. LLMs are not that good at writing code.

What the job is like

The company’s pretty low drama. Our office is super nice. We work hard but not 996. We haven’t had dinner in the office. But we all do use PagerDuty so when the servers go down, we wake up and it sucks. Thankfully the servers go down less than they used to.

We all get paid the same: $175k. Lower than FAANG, but pretty livable for Brooklyn. Both of the jobs listed - the Product Engineer, and Growth Engineer - are set at 1% equity. $175k is kind of high-average for where we’re at, but 1% in my opinion is pretty damn good. Startups say that equity is “meaningful” at all kinds of numbers but it’s definitely meaningful at that one. If Val Town really succeeds, you can get pretty rich off of that.

Of course, will it succeed? It’s something I think about all the time. I was born to ruminate. We have a lot going for us, and a real runway to make things happen. Some of the charts in our last investor update looked great. Some days felt amazing. Other days were a slog. But it’s a good team, with a real shot of making it.

Recently

Hello! Only a day late this time. October was another busy month but it didn’t yield much content. I ran a second half-marathon, this time with much less training, but only finished a few minutes slower than I did earlier this year. Next year I’m thinking about training at a normal mileage for the kinds of races I’m running - 25 miles or so per week instead of this year’s roughly 15.

And speaking of running, I just wrote up this opinion I have about how fewer people should run marathons.

Reading

I enjoyed reading Why Functor Doesn’t Matter, but I don’t really agree. The problem that I had with functional programming jargon isn’t that the particular terms are strange or uncommon, but that their definitions rely on a series of other jargon terms, and the discipline tends to omit good examples, metaphors, or plain-language explanations. It’s not that the strict definition is bad, but when a functor is defined as “a mapping that associates each morphism f: X -> Y in category C to a morphism F(f): F(X) -> F(Y) in category D”, you now have to define morphisms, categories, and objects, all of which have domain-specific definitions.
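
For what it’s worth, the everyday version fits in a few lines of TypeScript - this is my sketch of the kind of example I wish those explanations led with, not anything from the article:

// A tiny hand-rolled functor: Box<A> wraps a value, and `map` lifts an
// ordinary function f: A -> B into Box<A> -> Box<B>, leaving the "box"
// structure itself untouched.
class Box<A> {
  constructor(readonly value: A) {}
  map<B>(f: (a: A) => B): Box<B> {
    return new Box(f(this.value));
  }
}

// Arrays are the same idea in the standard library: map applies
// (s: string) => number inside the container without changing its shape.
const lengths = ["one", "two", "three"].map((s) => s.length); // [3, 3, 5]
const boxed = new Box("hello").map((s) => s.length);          // Box { value: 5 }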

I am loving Sherif’s posting about building a bike from the frame up.

Maximizers are biased to speed, optionality, breadth, momentum, opportunism, parallel bets, hype, luck exposure, “Why not both?”, “Better to move fast than wait for perfect”. Maximizers want to see concrete examples before they’ll make tradeoffs. They anchor decisions in the tangible. “Stop making things so complicated.” “Stop overthinking.”

Focusers are biased to focus, coherence, depth, meaningful constraints, doing less for more, sequential experiments, intentionality, sustainability, “What matters most?”, compounding clarity. Focusers are comfortable with abstraction. A clear constraint or principle is enough to guide them. “Stop mistaking chaos for progress.” “Stop overdoing.”

John Cutler’s post about maximizers vs. focusers matches my experience in tech. Like many young engineers, I think I started out as a focuser, and I’ve tried to drift toward the center over time, but the tension - both internal and interpersonal - is present at every job.

I recently remarked to a friend that traveling abroad after the advent of the smartphone feels like studying biology after the advent of microplastics. It has touched every aspect of life. No matter where you point your microscope you will see its impact.

Josh Erb’s blog about living in India is great, personal, a classic blog’s blog.

For me, the only reason to keep going is to try and make AI a wonderful technology for the world. Some feel the same. Others are going because they’re locked in on a path to generational wealth. Plenty don’t have either of these alignments, and the wall of effort comes sooner.

This article about AI researchers working all the time and burning out is interesting, in part because I find the intention of AI researchers so confusing. I can see the economic intention: these guys are making bank! Congrats to all of them. But it’s so rare to talk to anyone who has a concrete idea about how they are making the world better by doing what they’re doing, and that’s the reason why they’re working so hard. OpenAI seems to keep getting distracted from that cancer cure, and their restructuring into a for-profit company kind of indicates that there’s more greed than altruism in the mix.

every vc who bet on the modern data stack watched their investments get acquired for pennies or go to zero. the only survivors: the warehouses themselves, or the companies the warehouses bought to strengthen their moats.

It’s niche, but this article about Snowflake, dbt, fivetran, and other ‘data lake’ architecture is really enlightening.

Listening

Totorro’s new album was the only one I picked up this month. It’s pretty good math-rock, very energetic and precise.

Watching

  • One Battle After Another was incredible.
  • eXistenZ is so gloriously weird, I really highly recommend it. It came out the same year as The Matrix and explores similar themes, but the treatment of futuristic technology is something you won’t see anywhere else: instead of retro-steampunk metal or fully dystopian grimness, it’s colorful, slimy, squelchy, organic, and weird.

Speaking of weird, Ben Levin’s gesamtkunstwerk videos are wild and glorious.

OpenAI employees… are you okay?

You might have seen an article making the rounds this week, about a young man who ended his life after ChatGPT encouraged him to do so. The chat logs are really upsetting.

Someone two degrees removed from me took their life a few weeks ago. A close friend related the story to me, about how this person had approached their neighbor one evening to catch up, make small talk, and casually discussed their suicidal ideation at some length. At the end of the conversation, they asked to borrow a rope, and their neighbor agreed without giving the request any critical thought. The neighbor found them the next morning.

I didn’t know the deceased, nor their neighbor, but I’m close friends with someone who knew both. I found their story deeply chilling – ice runs through my veins when I imagine how the neighbor must have felt. I had a similar feeling upon reading this article, wondering how the people behind ChatGPT and tools like it are feeling right now.

Two years ago, someone I knew personally took their life as well. I was not friendly with this person – in fact, we were on very poor terms. I remember at the time, I had called a crisis hotline just to ask an expert for advice on how to break this news to other people in my life, many of whom were also on poor terms with a person whose struggles to cope with their mental health issues caused a lot of harm to others.

None of us had to come to terms with any decisions with the same gravity as what that unfortunate neighbor had to face. None of us were ultimately responsible for this person’s troubles or were the impetus for what happened. Nonetheless, the uncomfortable and confronting feelings I experienced in the wake of that event perhaps give me some basis for empathy and understanding towards the neighbor, or for OpenAI employees, and others who find themselves in similar situations.

If you work on LLMs, well… listen, I’ve made my position as an opponent of this technology clear. I feel that these tools are being developed and deployed recklessly, and I believe tragedy is the inevitable result of that recklessness. If you confide in me, I’m not going to validate your career choice. But maybe that’s not necessarily a bad quality to have in a confidant? I still feel empathy towards you and I recognize your humanity and our need to acknowledge each other as people.

If you feel that I can help, I encourage you to reach out. I will keep our conversation in confidence, and you can reach out anonymously if that makes you feel safer. I’m a good listener and I want to know how you’re doing. Email me.


If you’re experiencing a crisis, 24-hour support is available from real people who are experts in getting you the help you need. Please consider reaching out. All you need to do is follow the link.

More tales about outages and numeric limits

Outages, you say? Of course I have stories about outages, and limits, and some limits causing outages, and other things just screwing life up. Here are some random thoughts which sprang to mind upon reading this morning's popcorn-fest.

...

I was brand new at a company that "everybody knew" had AMAZING infrastructure. They could do things with Linux boxes that nobody else could. As part of the new employee process, I had to get accounts in a bunch of systems, and one of them was this database used to track the states of machines. It was where you could look to see if a machine was (supposed to be) serving, or under repair, or whatever. You could also see (to some degree) what services were supposed to be running on it, what servers (that is, actual programs) and port numbers those services used, and whether all of that stuff was synced to the files on the box or not.

My request didn't go through for a while, and I found out that it had something to do with my employee ID being a bit over 32767. And yeah, for those of you who didn't just facepalm at seeing that number, that's one of those "magic numbers" which pops up a bunch when talking about limits. That one is what you get when you try to store numbers as 16 bit values... with a sign to allow negative values. Why you'd want a negative employee number is anyone's guess, but that's how they configured it.
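
If you want to watch the wraparound happen, a typed array does the trick. This is a quick illustration, not their actual schema, and the last ID is made up:

// Signed 16-bit storage tops out at 2^15 - 1 = 32767; anything bigger
// wraps around into negative territory.
const id = new Int16Array(1);
id[0] = 32767;       // fits
console.log(id[0]);  // 32767
id[0] = 32768;       // one too many
console.log(id[0]);  // -32768
id[0] = 34567;       // an ID "a bit over 32767" (illustrative value)
console.log(id[0]);  // -30969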

I assume they fixed the database schema at some point to allow more than ~15 bits of employee numbers, but they did an interesting workaround to get me going before then. They just shaved off the last digit and gave me that ID in their system instead. I ended up as 34xx instead of 34xxx, more or less.

This was probably my first hint that their "amazing infra" was in fact the same kind of random crazytown as everywhere else once you got to see behind the curtain.

...

Then there was the time that someone decided that a log storage system that had something like a quarter of a million machines (and growing fast) feeding it needed a static configuration. The situation unfolded like this:

(person 1) Hey, why is this thing crashing so much?

(person 2) Oh yeah, it's dumping cores constantly! Wow!

(person 1) It's running but there's nothing in the log?

(person 2) Huh, "runtime error ... bad id mapping?"

(person 2) It's been doing this for a month... and wait, other machines are doing it, too!

(person 1) Guess I'll dig into this.

(person 2) "range name webserv_log.building1.phase3 range [1-20000]"

(person 2) But this machine is named webserv20680...

(person 2) Yeah, that's enough for me. Bye!

The machines were named with a ratcheting counter: any time they were assigned to be a web server, they got names like "webserv1", "webserv2", ... and so on up the line. That had been the case all along.

Whoever designed this log system years later decided to put a hard-coded limiter into it. I don't know if they did it because they wanted to feel useful every time it broke so they could race in and fix it, or if they didn't care, or if they truly had no idea that numbers could in fact grow beyond 20000.

Incidentally, that particular "building1.phase3" location didn't even have 20000 machines at that specific moment. It had maybe 15000 of them, but as things went away and came back, the ever-incrementing counter just went up and up and up. So, there _had been_ north of 20K machines in that spot overall, and that wasn't even close to a surprising number.

...

There was a single line that would catch obvious badness at a particular gig where we had far too many Apache web servers running on various crusty Linux distributions:

locate access_log | xargs ls -la | grep 2147

It was what I'd send in chat to someone who said "hey, the customer's web server won't stay up". The odds were very good that they had a log file that had grown to 2.1 GB, and had hit a hard limit which was present in that particular system. Apache would try to write to it, that write would fail, and the whole process would abort.

"2147", of course, is the first 4 digits of the expected file size: 2147483647 ... or (2^31)-1.

Yep, that's another one of those "not enough bits" problems like the earlier story, but this one is 32 bits with one of them being for the sign, not 16 like before. It's the same problem, though: the counter maxes out and you're done.

These days, files can get quite a bit bigger... but you should still rotate your damn log files once in a while. You should probably also figure out what's pooping in them so much and try to clean that up, too!
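
If you have something like logrotate handy, even a boring stanza goes a long way. The path varies by distro and setup, so treat this as a shape, not a recipe:

# Rotate the Apache access log weekly, keep 8 old copies, compress them,
# and don't complain if the file is missing or empty.
/var/log/httpd/access_log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}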

...

As the last one for now, there was an outage where someone reported that something like half of their machines were down. They had tried to do a kernel update, and wound up hitting half of them at once. I suspect they wanted to do a much smaller quantity, but messed up and hit fully half of them somehow. Or, maybe they pointed it at all of them, and only half succeeded at it. Whatever the cause, they now had 1000 freshly-rebooted machines.

The new kernel was fine, and the usual service manager stuff came back up, and it went to start the workload for those systems, and then it would immediately crash. It would try to start it again. It would crash again. Crash crash crash. This is why we call it "crashlooping".

Finally, the person in question showed up in the usual place where we discussed outages, and started talking about what was going on.

(person 1) Our stuff isn't coming back.

(person 2) Oh yeah, that's bad, they're all trying to start.

(person 1) Start, abort, start, abort, ...

(person 2) Yep, aborting... right about here: company::project::client::BlahClient::loadConfig ... which is this code: <paste>

(person 2) It's calling "get or throw" on a map for an ID number...

(person 1) My guess is the config provider service isn't running.

(person 2) It's there... it's been up for 30 minutes...

(person 1) Restarting the jobs.

(person 2) Nooooooooooo...

<time passes>

(person 2) Why is there no entry for number 86 in the map in the config?

(person 1) Oh, I bet it's problems with port takeover.

(person 3) I think entry 86 is missing from <file>.

(person 2) Definitely is missing.

(person 4) Hey everyone, we removed that a while back. Why would it only be failing now?

(person 2) It's only loaded at startup, right?

(person 4) Right.

(person 2) So if they were running for a long time, then it changed, then they're toast after a restart...

(person 3) Hey, this change looks related.

(person 4) I'm going to back that out.

This is a common situation: program A reads config C. When it starts up, config C is on version C1, and everything is fine. While A is running, the config is updated from C1 to C2, but nothing notices. Later, A tries to restart and it chokes on the C2 config, and refuses to start.

Normally, you'd only restart a few things to get started, and you'd notice that your program can't consume the new config at that point. You'd still have a few instances down, but that's it - a *few* instances. Your service should keep running on whatever's left over that you purposely didn't touch.

This is why you strive to release things in increments.

Also, it helps when programs notice config changes while they're running, so this doesn't sneak up on you much later when you're trying to restart. If the programs notice the bad config right after the change is made, it's *far* easier to correlate it to the change just by looking at the timeline.
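
Here's the shape of the thing in a few lines - my sketch, not anyone's real code:

// The ID-to-name map is only loaded at startup, so this "get or throw"
// is exactly the call that starts crashlooping months after the bad edit.
type IdMap = Map<number, string>;

function getOrThrow(config: IdMap, id: number): string {
  const entry = config.get(id);
  if (entry === undefined) {
    throw new Error(`bad id mapping: no entry for ${id}`);
  }
  return entry;
}

// Cheap insurance: re-validate whenever the config changes, not just at
// startup, so a bad change shows up seconds after it lands instead of
// months later on some unlucky restart.
function validateConfig(config: IdMap, requiredIds: number[]): void {
  for (const id of requiredIds) {
    getOrThrow(config, id);
  }
}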

Tuesday, 11:23:51: someone applies change.

Tuesday, 11:23:55: first 1% of machines which subscribe to the change start complaining.

... easy, right? Now compare it to this:

Tuesday, November 18: someone applies a change

Wednesday, January 7: 50% of machines fail to start "for some reason"

That's a lot harder to nail down.

...

Random aside: restarting the jobs did not help. They were already restarting themselves. "Retry, reboot, reinstall, repeat" is NOT a strategy for success.

It was not the config system being down. It was up the whole time.

It was nothing to do with "port takeover". What does that have to do with a config file being bad?

The evidence was there: the processes were crashing. They were logging a message about WHY they were killing themselves. It included a number they wanted to see, but couldn't find. It also said what part of the code was blowing up.

*That* is where you start looking. You don't just start hammering random things.
