Feedback on lessons, leap seconds, and LLMs

I'll roll up some responses to reader feedback here.

...

Someone asked if they could view the old code lessons. The one I put back online last year is where I do a terrible little TCP listener, compile it, start it in the background, and then connect to it with netcat. It's awkward as hell, but it's there if you really want to see it.

There is also the six-part "protofeed" demo which showed how to fetch a feed in this protobuf-based scheme I rigged up. Spoiler: the feed itself is gone now, since nobody was using it, so following the instructions in that one won't get you very far.

Ironically, that fetcher program would run afoul of all kinds of badness if it were pointed at a production site. It doesn't do conditional requests, it doesn't know about Cache-Control headers, it won't recognize that a 429 is asking for throttling, and so on. I guess that's okay for something that was a proof of concept to show how to fetch something from the network and parse it, but *actual* feed readers get all of that stuff wrong, too.
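
For contrast, the polite version isn't much code. Here's a rough sketch - freshly written for this post, not the code from the lessons - of a libcurl fetcher that does a conditional request and honors a 429's Retry-After. The URL and timestamp are stand-ins, and curl_easy_header() needs libcurl 7.83 or newer.

  #include <curl/curl.h>
  #include <stdio.h>

  int main(void)
  {
      CURL *curl = curl_easy_init();
      if (!curl)
          return 1;

      /* stand-in feed URL and the mtime from the previous fetch */
      curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/atom.xml");
      curl_easy_setopt(curl, CURLOPT_TIMECONDITION,
                       (long)CURL_TIMECOND_IFMODSINCE);
      curl_easy_setopt(curl, CURLOPT_TIMEVALUE, 1733400000L);

      if (curl_easy_perform(curl) == CURLE_OK) {
          long code = 0;
          curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);

          if (code == 304) {
              puts("unchanged: nothing to download, nothing to parse");
          } else if (code == 429) {
              /* honor the throttling hint instead of hammering the server */
              struct curl_header *h;
              if (curl_easy_header(curl, "Retry-After", 0, CURLH_HEADER,
                                   -1, &h) == CURLHE_OK)
                  printf("throttled: Retry-After = %s\n", h->value);
          }
      }

      curl_easy_cleanup(curl);
      return 0;
  }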

...

Another reader asked if the Linux "hrtimer" glitch from the leap second was fixed. I have to assume it was, based on the fact that most people didn't hit that same problem three years later when we had the one in my "leap smearing" story. My worries were about userspace stuff.

This is an opportunity for me to share just why I went to those lengths. In short, it was because of a lack of confidence in everyone everywhere doing the right thing in terms of time handling. If everyone uses monotonic clocks for measuring durations and otherwise is okay with wall time going backwards now and then, then there's no reason to smear it out. My own personal systems have never smeared a leap second. They just ride it out and keep on going.
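
To make the distinction concrete, here's a quick sketch - just an illustration, not anything from the company - of measuring a duration the safe way:

  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
      struct timespec a, b;

      /* CLOCK_MONOTONIC only moves forward, even if someone steps
       * CLOCK_REALTIME (wall time) backwards underneath you. */
      clock_gettime(CLOCK_MONOTONIC, &a);
      /* ... the thing being timed goes here ... */
      clock_gettime(CLOCK_MONOTONIC, &b);

      double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
      printf("took %.6f s\n", secs);   /* never negative, leap or no leap */
      return 0;
  }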

I couldn't assume the correctness of such implementations at the company. Worse, even if I had deliberately injected backwards time steps and proved that they would crash some code, there was no guarantee of anything coming from it. Some parts of that company were completely unresponsive to the problems they were causing for themselves and sometimes for other people, and I was starting to tire of the "bad cop" schtick: showing up to say "your shit is broken", only to have them do nothing to work with us (the whole team) to fix it.

I just had this feeling that if we repeated the last UTC second of June 2015, we'd end up breaking something. What's kind of amazing is that later on that year, it actually happened.

Someone misconfigured the ailing NTP appliances to *not* apply the correction factor from GPS to UTC. This ended up forcing one appliance into shipping unadjusted GPS time to roughly half of production via NTP, and the difference at the time was something like 17 seconds. (This changes, and indeed, it's no longer 17.)
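
In case the 17 seems arbitrary, it falls out of well-known constants: GPS time contains no leap seconds, so it runs ahead of UTC by however many were inserted after the GPS epoch in 1980, when TAI was already 19 seconds ahead of UTC. In other words:

  #include <stdio.h>

  int main(void)
  {
      /* GPS - UTC = (TAI - UTC) - 19, since TAI - GPS is fixed at 19s. */
      int tai_minus_utc_2015 = 36;  /* after the June 2015 leap second */
      int tai_minus_utc_now  = 37;  /* after the December 2016 one     */

      printf("GPS-UTC in late 2015: %d\n", tai_minus_utc_2015 - 19); /* 17 */
      printf("GPS-UTC today:        %d\n", tai_minus_utc_now  - 19); /* 18 */
      return 0;
  }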

Anyway, I got to working on this after hearing about it and found roughly half the fleet running 17 seconds fast. It was completely unreasonable to try to "smear off" 17 seconds to get things back to normal - at the carefully-qualified rate of about 20 hours per second, that would have taken two weeks. I made the decision to fix the setting and then let every broken machine individually have its clock dragged backwards the 17 seconds to where things should be.

This broke stuff. Some kind of sharding mechanism deep inside the fabric of things was using wall time to determine something or other, and when it jumped back, it fired an assertion and killed the program. This nuked the web server (or whatever else was using that library).

So, basically, every single machine which had been poisoned with the bad time and which was running this library was going to crash exactly once and there wasn't really anything which could be done about it. It was something like 2 in the morning by this point and I opted to let it happen.

About the only good thing about this is that the adjustments happened at different intervals depending on the ntpd poll rate, so it's not like hundreds of thousands of machines all crashed their workloads at the same time. One would pop off here, then one there, and so on... over the span of an hour or two... until it was all done.

Thus, some services didn't really go down, but they did have a bad time, with a bunch of failed or dropped requests on the affected systems.

That one was dubbed the "Back to the Future" SEV. At least one team turned a screenshot of some display showing the 17-second offset into the banner of the group where they talked about production issues.

Stuff like that is why I smeared it out. When you can't be sure of the correctness of implementations, and there are good chances that attempts to fix them will be ignored, rebuffed, or actively attacked, you have to "treat the whole situation as damage" and route around it. You remove the discontinuity in wall time to save them from themselves.

...

A reader asked for my take about "AI" and LLMs and all of this.

In the vein of the "annoyances" post from earlier in the month, I'll start by saying that I don't push any of that on you here, either. All of this stuff is straight off my keyboard with a sprinkling of ispell applied after the fact. Even that's of limited utility since there are a bunch of technical terms and not-really-words that I use for various reasons.

I think all of the hype and waste has generated an enormous mountain of useless nonsense that has attracted the absolute worst of the vampires and buzzards and bottom-feeders who are looking to exploit this stuff for their own benefit.

The LAST thing we needed was a better way to generate plausible-looking horse shit for random gullible people to consume unwittingly, but here we are, and it's only going to get worse.

I think a lot of this falls into "the Internet we deserve".

So no, I don't use anything of the sort, and I tell people not to quote any of that crap at me, or to send me screenshots of it pretending to be an answer to something, and that they need to find actual sources for their data. This has not made me the most popular person.

But hey, I've already said that I'm obviously out of touch with what most people are up to. My green-on-black terminal with nano in it that's writing up a bunch of plain text with a handful of triggers for callouts to other posts should be proof positive of that already. Hardly anyone else does things this way any more. That makes me the weirdo, not them. I know this. I'm okay with this.

"Screenshot of my X session showing the post being written and the list of posts off to the side. Both are just ordinary text files in a boring old text editor."

Feed score update: new hostname in effect today

Right, so, one of the things that can happen when you're trying to collect fresh data on the behaviors of something dynamic is that you get bogged down under the load of what happened previously. With the feed reader score project, this is what's been happening. A lot of clients were started up and pointed at it, and we gathered a lot of behavioral data.

The problem is that some of them are not changing, and having a few dozen of them call back every five minutes is not doing anyone any favors. So, I did what I promised I would do, and I updated the hostname.

If you are participating in the test and want to continue, go back to the original mail from me you got with the code(s), and load the instruction page. There, you will find the very slightly changed base hostname that can be used to construct your new unique feed URL. The keys are the same.

This also gives us the benefit of seeing what a fresh start looks like with the latest batch of feed reader software. Many of them have done a lot of excellent work these past six months, and they deserve to leave the historical baggage behind. I want to see where they are now, and this is how we get there.

For anyone wondering, I was looking at some of the reports before I cleared things out a few minutes ago. Some of the problem spots that I had mentioned in multiple report posts were still there. A lot of this was just people running old versions when they really need to upgrade. Some of it was just nobody at the wheel for the various clown services.

Seeing a whole bunch of unchanged behavior just reinforced the need to do a fresh start on this stuff. The people who have invested in improving their software deserve it.

Once there's a fresh set of data built up, I guess I'll write up another summary.

Oh, side note for anyone keeping track: this is not the "wildcard DNS" thing that I mentioned a few weeks back that would be needed to track down the *really* goofy stuff that polls all kinds of extra crap paths. That would require (more) actual work, and I'm not ready to do that just yet. (Plus, for that kind of effort, I might want to charge for it. Just saying.)

Pushing the whole company into the past on purpose

Every six months or so, this neat group called the International Earth Rotation and Reference Systems Service (IERS) issues a bulletin saying whether a leap second will be inserted at the end of that six-month period. You usually find out at the beginning of January or the beginning of July, and thus would have a leap second event at the end of June or December, respectively.

Ten years ago, in January 2015, they announced a leap second would be added at the end of June 2015. The last one had been three years prior, and when it happened, it screwed things up pretty bad for the cat picture factory. They hit kernel problems, userspace problems, and worse.

This time, I was working there, and decided there would not be a repeat. The entire company's time infrastructure would be adjusted so it would simply slow down for about 20 hours before the event, and so it would become a whole second "slow" relative to the rest of the world. Then at midnight UTC, the rest of the world would go 58, 59, 60, 0, and we'd go 57, 58, 59, 0, and then we'd be in lock-step again.

So how do you do something like this? Well, you have to get yourself into a position where you can add a "lie" to the time standard. This company had a handful of these devices which had a satellite receiver for GPS on one side and an Ethernet port for NTP on the other with a decent little clock on the inside. I just had to get between those and everyone else so they would receive my adjusted time scale for the duration, then we could switch back when things were ready.

This is the whole "leap smearing" thing that you might have heard of if you run in "time nut" circles. Someone else came up with it and they had only published their formula for computing the lie over a spread of time. The rest of it was "left as an exercise for the reader", so to speak.

Work like this benefits from being highly visible, so I bought a pair of broadcast-studio style clocks which spoke NTP over Ethernet and installed them on my desk. One of them was pointed at the usual GPS->NTP infrastructure, and the other was pointed at the ntp servers running my hacked-up code which could have "lies" injected.

I'd start up a test and watch them drift apart. At first, you can't even tell, but after a couple of hours, you get to where one subtly updates just a bit before the other one. You can even see it in pictures: parts of one light up before the other.

"Two digital clocks stacked vertically, one green (top), one amber; the green clock shows 41 seconds while the amber one still showing bits of the 0 in 40"

Then at the end of the ramp, they're a full second apart, but they're still updating at the same time. It's just that one goes from 39 to 40 when the other goes from 40 to 41.

Back and forth I went with my test clocks, test systems, and a handful of guinea pig boxes that volunteered to subscribe to the hacked-up time standard during these tests. We had to find a rate-of-change that would be accepted by the ntp daemons all over the fleet. There's only so much of a change you can introduce to the rate of change itself, and that meant a lot of careful experimentation to find out just what would work.

We ended up with something like 20 hours to smear off a single second.
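
If you want a feel for the shape of the lie, here's a toy linear version - the published formula was fancier, and this is certainly not the production code. One second spread over 20 hours works out to running about 14 parts per million slow.

  #include <stdio.h>

  /* Offset (seconds) to add to true time at time t, for a leap at
   * "leap" with a smear window of "window" seconds before it. */
  static double smear_offset(double t, double leap, double window)
  {
      if (t <= leap - window) return 0.0;      /* ramp hasn't started   */
      if (t >= leap)          return -1.0;     /* fully one second slow */
      return -(t - (leap - window)) / window;  /* partway down the ramp */
  }

  int main(void)
  {
      double leap = 1435708800.0;   /* 2015-07-01 00:00:00 UTC */
      double window = 72000.0;      /* 20 hours */

      for (double t = leap - window; t <= leap; t += window / 4)
          printf("t=%.0f  offset=%.2f s\n", t, smear_offset(t, leap, window));
      return 0;
  }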

The end of June approached, and it was time to do a full-scale test. I wanted to be sure that we could survive being a second out of whack without having the confounding factor of the whole rest of the world simultaneously dealing with their own leap second stuff. We needed to know if we'd be okay, and the only way to know was to smear it off, hold a bit to see if anything happened, then *smear it back on*.

This is probably the first time anyone outside the company has heard of it, but about a week before, I smeared off a whole second and left the ENTIRE company's infra (laptops and all) running a second slow relative to the rest of the world. Then we stayed there for a couple of hours if I remember correctly, and then went forward again and caught back up.

A week later, we did it for real and it just worked.

"Same two clocks during the leap second itself: local time is 16:59:60 PDT, company time is 16:59:59 PDT"

So, yes, in June 2015, I slowed down the whole company by a second.

Of course, here it is ten years later, and the guy in charge just sent it back fifty years. Way to upstage me, dude.

Web page annoyances that I don't inflict on you here

I've been thinking about things that annoy me about other web pages. Safari recently gained the ability to "hide distracting items" and I've been having great fun telling various idiot web "designers" to stuff it. Reclaiming a simple experience free of wibbly wobbly stuff has been great.

In doing this, I figured maybe I should tell people about the things I don't do here, so they realize how much they are "missing out" on.

I don't force people to have Javascript to read my stuff. The simplest text-based web browser going back about as far as you can imagine should be able to render the content of the pages without any trouble. This is because there's no JS at all in these posts.

I don't force you to use SSL/TLS to connect here. Use it if you want, but if you can't, hey, that's fine, too.

The last two items mean you could probably read posts via telnet as long as you were okay with skipping over all of the HTML <tag> <crap>. You might notice that the text usually word-wraps at about 72 columns, so it's not that much of a stretch.

I don't track "engagement" by running scripts in the post's pages that report back on how long someone's looked at it... because, again, no JS.

I don't set cookies. I also don't send unique values for things like Last-Modified or ETag which also could be used to identify individuals. You can compare the values you get with others and confirm they are the same.

I don't use visitor IP addresses outside of a context of filtering abuse.

I don't do popups anywhere. You won't see something that interrupts your reading to ask you to "subscribe" and to give up your e-mail address.

I don't do animations outside of one place. Exactly one post has something in it which does some graphical crap that changes by itself. It's way back in July 2011, and it's in a story ABOUT animating a display to show the absence of a value. It doesn't try to grab your attention or mislead you, and it's not selling anything.

I don't use autoplaying video or audio. There are a couple of posts where you can click on your browser's standard controls to start playback of a bit of audio that's related to the post. Those are also not used to grab your attention, mislead you, or sell something.

I don't try to "grab you" when you back out of a page to say "before you go, check out this other thing". The same applies to closing the window or tab: you won't get this "are you sure?" crap. If you want out, you get out *the first time*.

I don't pretend that posts are evergreen by hiding their dates. Everything has a clear date both in the header of the page and built into the URL. If it's out of date, it'll be pretty obvious.

I don't put crap in the pages which "follows you" down the page as you scroll. You want to see my header again? Cool, you can scroll back up to it if it's a particularly long post. I don't keep a "dick bar" that sticks to the top of the page to remind you which site you're on. Your browser is already doing that for you.

There are no floating buttons saying things like "contact me" or "pay me" or "check out this service I totally didn't just write this post to hawk on the red or orange sites". I don't put diagonal banner things across the corners. I don't blur it out and force you to click on something to keep going. TL;DR I don't cover up the content, period.

I don't mess with the scrolling of the page in your browser. You won't get some half-assed attempt at "smoothing" from anything I've done. You won't get yanked back up to the top just because you switched tabs and came back later.

I don't do some half-assed horizontal "progress bar" as you scroll down the page. Your browser probably /already/ has one of those if it's graphical. It's called the scroll bar. (See also: no animations.)

I don't litter the page with icons that claim to be for "sharing" or "liking" a post but which frequently are used to phone home to the mothership for a given service to report that someone (i.e., YOU) has looked at a particular page somewhere. The one icon you will find on all posts links to the "how-to" page for subscribing to my Atom feed, and that comes from here and phones home to nobody.

I don't use "invisible icons" or other tracker crap. You won't find evil 1x1s or things of that nature. Nobody's being pinged when you load one of these posts.

I don't load the page in parts as you scroll it. It loads once and then you have it. If you get disconnected after that point, you can still read the whole thing. There's nothing more to be done.

I don't add images without ALTs and/or accompanying text in the post which aims to describe what's going on for the sake of those who can't get at the image for whatever reason (and there are a great many). (Full disclosure: I wasn't always so good at writing the descriptions, and old posts that haven't been fixed yet are hit or miss.)

I don't do nefarious things to "outgoing links" to report back on which ones have been clicked on by visitors. A link to example.com is just <a href="http://example.com/">blah blah blah</a> with no funny stuff added. There are no ?tracking_args added or other such nonsense, and I strip them off if I find them on something I want to use here. If you click on a link, that's between you and your browser, and I'm none the wiser. I really don't want to know, anyway. I also don't mess with whether it opens in a tab or new window or whatever else.

I don't redirect you through other sites and/or domains in order to build some kind of "tracking" "dossier" on you. If you ask for /w/2024/12/17/packets/, you get that handed to you directly. (And if you leave off the trailing slash, you get a 301 back to that, because, well, it's a directory, and you really want the index page for it.)

I don't put godawful vacuous and misleading clickbait "you may be interested in..." boxes of the worst kind of crap on the Internet at the bottom of my posts, or anywhere else for that matter.

My pages actually have a bottom, and it stays put. If you hit [END] or scroll to the bottom, you see my footer and that's it. The page won't try to jam more crap in there to "keep you engaged". If you want more stuff to read, that's entirely up to you, and you can click around to do exactly that.

I don't make any money just because someone lands on one of my posts. You won't find ads being injected by random terrible companies. In fact, keeping this stuff up and available costs me a chunk every month (and always has). I sell the occasional book and get the occasional "buy me a cup of tea or lunch" type of thing, and I do appreciate those. (I tried doing paid watch-me-code "lessons" years ago, but it really didn't go anywhere, and it's long gone now.)

I'm pretty sure everything that loads as part of one of my posts is entirely sourced from the same origin - i.e., http[s]://rachelbythebay.com/ something or other. The handful of images (like the feed icon or the bridge pic), sounds, the CSS, and other things "inlined" in a post are not coming from anywhere else. You aren't "leaving tracks" with some kind of "trust me I'm a dolphin" style third-party "CDN" service. You connect to me, ask for stuff, and I provide it. Easy.

I say "pretty sure" on the last one because there are almost 1500 posts now, and while my page generation stuff doesn't even allow for an IMG SRC that comes from another origin, there are some "raw" bits of HTML in a few old weird posts that break the usual pattern. I don't think I've ever done an IMG or SOURCE or LINK from off-site in a raw block, though.

I don't even WANT stuff coming from off-site, since it tends to break. I find that I can really only rely on myself to keep URLs working over time.

Phew! That's all I can think of for the moment.

Feed readers which don't take "no" for an answer

I don't think people really appreciate what kind of mayhem some of their software gets up to. I got a bit of feedback the other night from someone who's been confounded by the site becoming unreachable. Based on running traceroutes, this person thinks that maybe it's carrier A or carrier B, or maybe even my own colocation host.

I would have responded to this person directly, but they didn't leave any contact info, so all I can do is write a post and hope it reaches them and others in the same situation.

It's not any of the carriers and it's not Hurricane Electric. It's my end, and it's not an accident. Hosts that get auto-filtered are usually running some kind of feed reader that flies in the face of best practices, and then annoys the web server, receives 429s, and then ignores those and keeps on going.

The web server does its own thing. I'm not even in the loop. I can be asleep and otherwise entirely offline and it'll just chug along without me.

A typical timeline goes like this:

  • 00:04:51 GET /w/atom.xml, unconditional.
    Fulfilled with 200, 502 KB.
  • 00:24:51 GET /w/atom.xml, unconditional.
    Rejected with 429.
    Advised (via Retry-After header) to come back in one day since they are unwilling or unable to do conditional requests.
  • 00:44:51 GET /w/atom.xml, unconditional.
    Same 429 + Retry-After.
  • 01:04:51 GET /w/atom.xml, unconditional.
    Just like last time.
  • 01:24:51 GET /w/atom.xml, unconditional.
    Same thing, again.

Somewhere around here, the web server decided that it wasn't being listened to, and so it decided it was going to stop listening, too.

Some time after this, it will "forgive" and then things will work again, but of course, if there's still a bad feed reader running out there, it will eventually start this process all over again.

A 20 minute retry rate with unconditional requests is wasteful. That's three requests per hour, so 72 requests per day. That'd be about 36 MB of traffic that's completely useless because it would be the same feed contents over and over and over.

Multiply that by a bunch of people because it's a popular feed, and that should explain why I've been tilting at this windmill for a while now.

If you're running a feed reader and want to know what its behavior looks like, the "feed reader score" project thing I set up earlier this year is still running, and is just humming along, logging data as always.

You just point your reader at a special personalized URL, and you will receive a feed with zero nutritional content but many of your reader's behaviors (*) will be analyzed and made available in a report page.

It's easy... and I'm not even charging for it. (Maybe I should?)

...

(*) I say _many_ of the behaviors since a bunch of these things have proven that my approach of just handing people a bunch of uniquely-keyed paths on the same host is not nearly enough. Some of these feed readers just go and make up their own paths and that's garbage, but it also means my dumb little CGI program at /one/particular/path doesn't see it. It also means that when they drill / or /favicon.ico or whatever, it doesn't see it. I can't possibly predict all of their clownery, and need a much bigger hammer.

There's clearly a Second System waiting to be written here.

As usual, the requirements become known after you start doing the thing.

Please upgrade past Pleroma 2.7.0 (or at least patch it)

Hey there. Are you one of these "Fediverse" enthusiasts? Are you hard core enough to run an instance of some of this stuff? Do you run Pleroma? Is it version 2.7.0? If so, you probably should do something about that, like upgrading to 2.7.1 or something.

Based on my own investigations into really bad behavior in my web server logs, there's something that got into 2.7.0 that causes dumb things to happen. It goes like this: first, it shows up and does a HEAD. Then it comes back and does a GET, but it sends complete nonsense in the headers. Apache hates it, and it gets a 400.

What do I mean by nonsense? I mean sending things like "etag" *in the request*. Guess what, that's a server-side header. Or, sending "content-type" and "content-length" *in the request*. Again, those are server-side headers unless you're sending a body, and why the hell would you do that on a GET?

I mean, seriously, I had real problems trying to understand this behavior. Who sends that kind of stuff in a request, right? And why?

This is the kind of stuff I was seeing on the inbound side:

raw_header {
  name: "user-agent"
  value: "Pleroma 2.7.0-1-g7a73c34d; < guilty party removed >"
}
raw_header {
  name: "date"
  value: "Thu, 05 Dec 2024 23:52:38 GMT"
}
raw_header {
  name: "server"
  value: "Apache"
}
raw_header {
  name: "last-modified"
  value: "Tue, 30 Apr 2024 04:03:30 GMT"
}
raw_header {
  name: "etag"
  value: "\"26f7-6174873ecba70\""
}
raw_header {
  name: "accept-ranges"
  value: "bytes"
}
raw_header {
  name: "content-length"
  value: "9975"
}
raw_header {
  name: "content-type"
  value: "text/html"
}
raw_header {
  name: "Host"
  value: "rachelbythebay.com"
}

Sending date and server? What what what?

Last night, I finally got irked enough to go digging around in their git repo, and I think I found a smoking gun. I don't know Elixir *at all*, so this is probably wrong on multiple levels, but something goofy seems to have changed with a commit in July, resulting in this:

  def rich_media_get(url) do
    headers = [{"user-agent", Pleroma.Application.user_agent() <> "; Bot"}]

    # Note: the pattern match below rebinds "headers" to the *response*
    # headers that came back from the HEAD request...
    with {_, {:ok, %Tesla.Env{status: 200, headers: headers}}} <-
           {:head, Pleroma.HTTP.head(url, headers, http_options())},
         {_, :ok} <- {:content_type, check_content_type(headers)},
         {_, :ok} <- {:content_length, check_content_length(headers)},
         # ...so this GET ships the server's own headers right back at it.
         {_, {:ok, %Tesla.Env{status: 200, body: body}}} <-
           {:get, Pleroma.HTTP.get(url, headers, http_options())} do
      {:ok, body}

Now, based on my addled sense of comprehension for this stuff, this is just a guess, but it sure looks like it's populating "headers" with a user-agent, then firing that off as a HEAD. Then the pattern match in the "with" rebinds "headers" to the *incoming* response headers, and it turns the whole mess around and sends it as a GET.

Assuming I'm right, that would explain the really bizarre behavior.

There was another commit about a month later and the code changed quite a bit, including a telling change to NOT send "headers" back out the door on the second request:

  defp head_first(url) do
    with {_, {:ok, %Tesla.Env{status: 200, headers: headers}}} <-
           {:head, Pleroma.HTTP.head(url, req_headers(), http_options())},
         {_, :ok} <- {:content_type, check_content_type(headers)},
         {_, :ok} <- {:content_length, check_content_length(headers)},
         {_, {:ok, %Tesla.Env{status: 200, body: body}}} <-
           {:get, Pleroma.HTTP.get(url, req_headers(), http_options())} do
      {:ok, body}
    end
  end

Now both requests call a function (req_headers) which itself just supplies the user-agent as seen before.

What's frustrating is that the commit for this doesn't explain that it's fixing an inability to fetch previews of links or anything of the sort, and so the changelog for 2.7.1 doesn't say it either. This means users of the thing would have no idea if they should upgrade past 2.7.0.

Well, I'm changing that. This is your notification to upgrade past that. Please stop regurgitating headers at me. I know my servers are named after birds, but they really don't want to be fed that way.

...

One small side note for the devs: having version numbers and even git commit hashes made it possible to bracket this thing. Without those in the user-agent, I would have been stuck trying to figure it out based on the dates the behavior began, and that's never fun. The pipeline from "git commit" to actual users causing mayhem can be rather long.

So, whoever did that, thanks for that.

Circular dependencies for socket activation and WireGuard

One of the more interesting things you can do with systemd is to use the "socket activation" feature: systemd itself opens a socket of some sort for listening, and then it hands it over to your program, inetd-style. And yes, I know by saying "inetd-style" that it's not even close to being a new thing. Obviously. This is about what else you can do with it.

Like in my previous tale about systemd stuff, you can add "deny" and "allow" rules which bring another dimension of filtering to whatever you're doing. That applies to the .socket files which are part of this socket activation thing. It can even forcibly bind it to a specific interface, e.g.:

[Socket]
ListenStream=443
IPAddressDeny=any
IPAddressAllow=192.0.2.0/24
BindToDevice=wg0

That gives you a socket which listens to TCP port 443 and which will do some bpf shenanigans to drop traffic unless the other end is in that specific /24. Then it also locks it down so it's not listening to the entire world, but instead is bound to this wg0 interface (which in this case means WireGuard).

This plus the usual ip[6]tables rules will keep things pretty narrowly defined, and that's just the way I like it.
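
For reference, the program on the receiving end of all this is almost insultingly simple. A minimal sketch - not any particular daemon of mine - using sd_listen_fds(3) from libsystemd:

  #include <systemd/sd-daemon.h>
  #include <sys/socket.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int n = sd_listen_fds(0);   /* how many fds did systemd hand us? */
      if (n < 1) {
          fprintf(stderr, "not socket activated\n");
          return 1;
      }

      int fd = SD_LISTEN_FDS_START;   /* first inherited fd, usually 3 */
      for (;;) {
          int conn = accept(fd, NULL, NULL);  /* already bound + listening */
          if (conn < 0)
              continue;
          /* ... speak whatever protocol on conn ... */
          close(conn);
      }
  }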

I did this in a big way over the past year, and then never rebooted the box in question after installing such magic. Then earlier this week, I migrated that system's "personality" to new hardware, and that meant boots and reboots here and there, and wasn't it weird how it was taking almost two minutes to reboot every time? What the hell, right?

Digging into the systemd journal turned up that some of the "wg" stuff wasn't coming up, and it sure looked like a dependency cycle. A depends on B, which depends on C, which depends on D, which depends on A again? If not for the thing eventually timing out, it wouldn't have EVER booted.

I'm thankful for that timeout, since the rest of the box came up and I was able to get into that little headless monster to work on the problem.

The problem is basically this: if you have a .socket rigged up in the systemd world, you by default pick up a couple of dependencies in terms of sequencing/ordering at boot time, and one of them is "sockets.target". Your foo.socket essentially has a "Before=sockets.target", which means that sockets.target won't succeed until you're up and running.

But, what if your foo.socket has a BindToDevice that's pointing at WireGuard? You now have a dependency on that wg0 thing coming up, and, well, at least on Debian, that gets interesting, because it ("wg-quick@wg0" or similar) wants basic.target to be done, and basic.target in turn wants sockets.target to happen first.

foo.socket waits on wg waits on basic waits on sockets waits on foo.socket. There's the cycle.

Getting out of this mess means breaking the cycle, and the way you do that is to remove the default dependencies from your .socket file, like this:

[Unit]
DefaultDependencies=no

After that, it's on you to set up the appropriate WantedBy, Wants, Before or After declarations on your .socket to make sure it's attached to the call graph somewhere.
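
In this case, that meant something along these lines - a sketch, with "foo.socket" standing in for the real unit name:

[Unit]
DefaultDependencies=no
Requires=wg-quick@wg0.service
After=wg-quick@wg0.service

[Socket]
ListenStream=443
IPAddressDeny=any
IPAddressAllow=192.0.2.0/24
BindToDevice=wg0

[Install]
# WantedBy adds a "want", not ordering, so this doesn't recreate the cycle.
WantedBy=sockets.target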

I should mention that it took a LOT of rebooting, journal analysis, cursing, and generally complaining about things before I got to this point. If you're in a mess like this, "systemd-analyze dump <whatever>" is generally your friend, because it will point out the *implicit* dependencies which are just as important but which won't show up in your .socket or .service files. Then you get to sketch it out on paper, curse some more, and adjust things to not loop any more.

There doesn't seem to be a good way to spot this kind of problem before you step in it during a boot. It's certainly not the sort of thing which would stop you before you aimed a cannon directly at your foot. Apparently, "systemd-analyze verify <whatever>" will at least warn you that you have a cycle, but figuring out how you got there and what to do about it is entirely up to you. Also, if you don't remember to run that verify step, then obviously it's not going to help you. I only learned about it just now while writing up this post - far too late for the problem I was having.

I sure like the features, but the complexity can be a real challenge.

Words fail me sometimes when it comes to feed readers

What in the name of clowntown is going on here?

ip - - [04/Dec/2024:23:18:21 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:22 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:22 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:23 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:23 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:23 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:23 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:24 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:24 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:24 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:24 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:25 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:25 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:25 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:25 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:26 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:26 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:26 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:26 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:26 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:27 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:27 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:27 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:27 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:28 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:28 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:28 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:28 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:29 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:29 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:29 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:29 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:30 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:30 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:30 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:30 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:30 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:31 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:31 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:31 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:31 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:32 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:32 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:32 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:32 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
ip - - [04/Dec/2024:23:18:32 -0800] "GET /w/ HTTP/1.1" 200 229674 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"

Note: the post index page isn't the feed. It has never been the feed. Even so, what's up with the unconditional requests, and several per second? What is that supposed to accomplish? In what world does that make sense?

Do you ever wonder if feed reader authors point their stuff at their own servers? You'd think they'd notice this kind of thing.

Oh, also, this will no longer work.

Just cracking an embedded root password, no big deal

Yeah, I'm still here. I'm still processing things from last month. Obviously.

...

I just spent a few CPU hours on cracking something stupid and figured I'd share the results so nobody else goes through the same thing. Save some time and power, you know the thinking.

tukC0rEjV4g3g --> lvl7dbg

... and if this means anything to you, then you'll also be glad to know that the default value for "PermitRootLogin" of "prohibit-password" should be in place so nobody can get into your device that way. Test it yourself to be sure if you're worried. I sure did.
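
For anyone who wants to check this kind of thing themselves: that's a traditional DES crypt(3) hash, and the stored hash doubles as the salt argument, so verification is a couple of lines of C. A sketch, assuming the mapping above:

  /* build with: cc check.c -lcrypt */
  #include <crypt.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *hash  = "tukC0rEjV4g3g";   /* salt is the "tu" prefix */
      const char *guess = "lvl7dbg";
      const char *out = crypt(guess, hash);

      puts(out && strcmp(out, hash) == 0 ? "match" : "no match");
      return 0;
  }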

I've had a change of heart regarding employee metrics

I know that if you go back far enough in these posts of mine, you will find some real crap in there. Sometimes that's because I had a position on something that turned out to not be very useful, or in some cases, actively harmful. This sucks, but that's life: you encounter new information and you are given the opportunity to change your mind, and then sometimes you actually do exactly that.

Recently, I realized that my position on something else has changed over time. It started when someone reached out to me a few weeks ago because they wanted me to get involved with them on some "employee metrics" product. It's some bullshit thing that has stuff like "work output" listing how many commits they've done, or comments, or whatever else. I guess they wanted me to shill for something of theirs, because from my posts, clearly I was such a fan of making that kind of tool, right?

I mean, sure, way back in 2004-2006, I was making all kinds of tools to show who was actually doing work and who was just sitting there doing nothing. I've written about a number of those tools, with their goofy names and the "hard truths" they would expose, showing who's a "slacker" and all of this.

When this company reached out, I did some introspection and decided that what I had done previously was the wrong thing to do, and I should not recommend it any more.

Why? It's surprisingly simple. It's the job of a manager to know what their reports are up to, whether they're doing a good job of it, and whether they're generally effective. If they can't do that, then they themselves are ineffective, and *that* is the sort of thing that is the responsibility of THEIR manager, and so on up the line. They shouldn't need me (or anyone else) to tell them what's going on with their damn direct reports!

In theory, at least, that's how it's supposed to work. That's their job: actually managing people!

So, my new position on that sort of thing is: fuck them. Don't help them. Don't write tools like that, don't run tests to see if your teammates will take care of basic "service hygiene" issues, and definitely don't say anything substantive in a performance review. None of it will "move the needle" in the way you think it will, and it will only make life worse for you overall. "Peer reviews actually improve things" is about the biggest crock of shit that people in tech still believe in.

Once again, if management is too stupid to notice what's going on, they deserve every single thing that will happen when it finally "hits the fan".

Make them do their own damn jobs. You have enough stuff to do as it is.

...

I feel like giving this a second spin right here in case I failed to reach some of the audience with the first approach. Here's a purely selfish way of looking at things, for those who are so inclined.

Those tools I wrote 20 years ago didn't really indicate who was slacking at working tickets or whatever. What they *actually indicated* was that the management at Rackspace, by and large, had no clue what was going on right under their noses. And, hey, while that was true, that can be a dangerous thing to say! You want enemies? That's a great way to get them.

So, why expose yourself? Suppress the urge to point out who's slacking. It will only come back on you.

Set me right on whether caring is even possible

I know I have a weird and kind of skewed view of the world, since I expect nobody else to care about some things, and then the world goes and surprises me by jumping on it anyway. About six months ago, I expected the worst for my whole "let's improve the feed reader ecosystem" thoughts, and figured it would go unnoticed, and yet it's been anything but.

Clearly, I am miscalibrated for what the world is willing to do. So, with that in mind, I figured that I need to bring more things up in this context to see what the reaction is, instead of just discarding them early due to my own doubts.

Tonight's concern goes like this: I was flipping through some old posts and ran across something from 2013 where I said "your simple (shell) script is someone else's bad day". It was inspired by the "onboarding" process for dev servers at FB which I was going through that week, and it was a total shit-show. It had plenty of the aforementioned scripts that you couldn't restart without making things worse. It took way too long for me to get my hands on a Linux box that had things set up so I could crank on stupid www stuff.

In re-reading that post the other night, it got me wondering what I could do in this space to try to raise awareness of various topics. Maybe I'd look at a few old posts, decide that there's a cluster of problems ("idempotence" in this case), and that maybe it would be a net win for humanity if more people were slightly better at it. Then I'd try to address that directly... somehow.

My goal would be to get more people in this life who, at least for that topic, look at everything they encounter and go "what if this fails, and is restarted". I basically would love to see more people going "what if" for that particular topic. Worry about the robot and the five quarts of oil. (Go look at the 2013 post if that made no sense.)

I don't know if that's plausible. I actually worry that it's not even possible for most people, and that caring about such things is some weird genetic corner case that is born, not made, and it would be unreasonable to expect it from the rest of the world.

This is where I need input. I have this feeling that such topics are not seen as important by reasonable people, and that they honestly never will be. Just because it matters to me is no indication that it will resonate anywhere else. That much has been made crystal clear over the past few years.

Wouldn't it be amazing, though? Just imagine it: groups of people heading out into the world armed with additional things to consider when designing something, and the resulting improvements in reliability for the things they work on down the road.

Then we'd have to continue on to some other fundamental topic, and try to move the needle on that, too.

Finally, I should mention another possible outcome: those same people go into the world with that knowledge, go to apply it, and run into the same brick walls that some of us have already been hitting because nobody else sees the point. They get all bitter and dejected, emit their own "delightful burnout sauce" (I love that line, btw), and start writing their own screeds on things like this, and we start over.

You can bake an amazing cake, but if they aren't hungry for cake, well, you might have just wasted a bunch of time.

Friday night feed reader score report

Oh no, look out, it's another list of feed reader behaviors. This is based on my observations for those test keys which have been active in the past few days. Anything that stops polling eventually ages out of the report and won't show up here.

My thanks to those feed reader authors and contributors who have been cranking away to tweak things here and there. It's making a clear difference in the real world. I see more 304s and fewer 429s in general, and that makes me very happy.

Here is what I have to say for this time around, with some amount of aggregation applied. Some entries cover multiple test keys if the behavior is the same.

...

No UA. Sends bogus "" INM at startup, then makes up IMS values subsequently. Broken caching behavior that will miss updates in the real world.

Miniflux dev stuff. Seems a whole lot better now. Need to do some more pathological injections to see how they handle it.

Miniflux, older versions. Needs to upgrade. (Several different test keys).

SpaceCowboys Android RSS Reader 2.6.{31,32,33}. It's still doing unconditionals seemingly at random, and sometimes has bad timing. No idea what the hell is going on here.

FreshRSS dev stuff. Seems decent of late. This one also seems to be honoring Cache-Control "max-age" stuff, which is a bonus I never expected anyone to do. That's awesome. The useless referrers are also gone, so hooray! Lots of good stuff happening here.

FreshRSS/1.24.3. Didn't get tripped up on my last round of pathological header values. Will probably get even better when it picks up the latest dev work, like dumping the referrers (see earlier dev entry).

Audrey2 RSS reader, various versions. Seems pretty chill. Also seems to be honoring Cache-Control max-age! Great stuff.

Emacs Elfeed 3.4.1. No idea what controls the polling interval but it's doing fine. Some early weirdness that might be multiple instances running in parallel but nothing recently.

Slackbot. Broken beyond belief. Must have been some intern's project that's been left online to irritate people. Sends HEADs, gets 405s, sends more. Probably will force me to write "let's see what happens when you start getting 403s and 404s and 410s" and whatnot, and then I'll complain about it not honoring those, too. Watch and see.

Newsboat, various versions. Probably okay now in terms of caching behavior but I need to shove some more pathological cases at it to be sure.

feedmail.org/0. This one looks like it's trying desperately to honor the max-age thing, but sometimes it arrives a hair too early - on the order of about 100 msec here. So weird. How damn often is it considering these polls for it to be able to hit it that close? Does it ever rest?

SpaceCowboys RSS Reader, 2.6.{29,30,31,32,33}. This one doesn't seem to be doing the unconditionals or short timing any more. It did some early on but then it stopped the silliness.

Unread RSS Reader. Plugging away at a 21 minute interval, up from 15. At least they're conditional. (At least two distinct test keys.)

Some browser UA. Godawful timing, always unconditional. This would trip automated blocks on the real site, no doubt.

cry-reader v0.0. No idea how it picks its intervals, but they're always far enough apart and are always conditional, so that's great. Haven't managed to trip it up with any pathological header values.

Netvibes. Same super broken caching behavior, still.

SpaceCowboys ... etc ... 2.6.31. Never does IMS, only occasionally does INM, wildly unpredictable polling intervals. Not good behavior. No idea how there can be so much variance between these things (see above).

NextCloud-News dev. The caching behavior is getting better, and -huzzah- it has a real version number at last! This means it will be possible to treat all of those old versions as damaged and route around them. Would be nice if it didn't keep pulling / 3x and favicon 1x for every. single. feed. request. Those are wholly unnecessary.

NextCloud-News/1.0. This version is heavily broken in the caching department, but help is on the way (see dev version above). If you're running this, you should plan to upgrade to pick up those essential fixes.

NewsBlur. Usually okay but sometimes it does crazy stuff like 3 *unconditional* polls in two seconds. What? Why?! (UA is also crazy long since it's waving a dead chicken in the form of including a complete forgery of a Safari UA, (nested parens) and all).

Friendica, various versions. 100% unconditional requests. Occasionally shows up < 10s after the previous poll. Not good.

TTRSS. Not recommended.

ureq, various versions. Seems okay. Occasional short-spacing between polls, but at least they're conditional.

NetNewsWire. Multiple distinct test keys, but none of them have a version number so there's no way to know if anything's changed upstream. Since I started writing this list, one instance came back fast enough to demonstrate that it still has the buggy caching behavior.

haven, 100% unconditional, bumpy timing, unchanged from before.

inforss, multiple versions. Seems fine.

curl, various versions. It does conditionals properly and just has the usual 59m/60m fencepost scheduling thing.

Inoreader/1.0. This thing periodically flips to unconditional mode, and also has the 59m/60m thing going on. On the prod side of the house, it gets itself auto-blocked fairly often for bad behavior.

Feedbin. Seems fine now. (Three distinct test keys, same behavior.)

NetNewsWire. Probably still buggy - was about three weeks ago, at least. Need to see what happens when I do a fresh round of tests.

Liferea/1.15.3. Something is very wrong with the caching on this thing. It drops to unconditionals a lot, and recently started sending some bullshit 1970 IMS header with things like 5s and 14s intervals. No idea what the hell is going on there.

rawdog/2.24rc1. 138 days with no complaints from me.

NetNewsWire. Just like the last one: was showing signs of buggy cache behavior before, and probably will again when I next launch into the tests.

NewsBlur. Not quite as animated as the earlier instance but still uncorks an unconditional from time to time for some reason.

rss2email, various versions. Occasional unconditionals for some reason.

Newsboat/2.36.0. Jumpy timing, sometimes too fast. At least they're conditional.

CommaFeed, various versions. Has had caching bugginess in the past, and did during the last test. Probably still does.

Some browser UA. Keeps banging away with <1s unconditionals. Would definitely get auto-blocked in prod. Maybe multiple instances are running, but why would one always send unconditionals?

theoldreader.com. This seems to have switched behaviors to conditional polls in early September. I see a pair of too-quick polls since then, but even those were conditional so it's not that bad. This is also way better than before. Thank you for the fix, devs!

Generic "Go-http-client/1.1" UA. Seems fairly chill.

Rapids. This one turned a corner with some upgrades back in July and has been delightful since.

er0k feeds. Was upgraded at some point and then didn't trip over any of my "inject some crazy" stuff over the past month. Smooth sailing.

Emacs Elfeed 3.4.1. Mildly wacky behavior that suggests multiple instances banging away at the same key - either that, or it has really scary caching bugs. I hope it's the former and not the latter.

walrss/0.3.7. 145 days of complaint-free behavior. Nice.

Yarr/1.0. Just needs to tweak the 59m/60m fenceposting thing.

Some browser UA. This one settled down into a good polling interval and does conditional requests properly. Good enough.

Feedly. Still doing the unconditional requests daily for some reason. Fail. (Multiple distinct test keys, same behavior.)

feedparser/6.0.2. Just the 59m/60m thing. Otherwise is managing to do conditional (INM) requests so it's not the end of the world.

Reeder, various versions. This one acts strangely sometimes, and sends the wrong value in an INM header, like it got hit over the head and reverted to something from a few days *or weeks* before. Also sometimes sends unconditional requests. Could be multiple instances, so don't do that. One key = one instance of one reader. Anything else is chaos. (If this is you, ask me for more keys!)

Lots of mixed up stuff, but of late it's something claiming to be a browser that bangs away with unconditionals at 10 minute intervals. Would be auto-blocked in prod in a jiffy.

Yarr/1.0. Sends conditionals but has terrible timing - *way* too fast on those polls.

Bloggulus/0.4.0. Seems to be doing fine.

newsraft/0.27. This got some tweaks recently to deal with pathological ETag/LM behavior. Thanks for that!

feedparser/6.0.10. Still running too quickly: 29m/30m. They're conditional, but ... eh.

Mojolicious (Perl). 147 days of polls with no issues to report.

Some browser UA. Seems to be fine.

com.vanniktest.rssreader, various versions. Weird timing, as before. No idea what kind of scheduling it's using - repeats within 1-2s make no sense.

Broadsheet/0.1. Late update to this post: it's doing fine, but there's an unusual bit of wackiness showing up here. The URL it's using is checked into a github repo, and something or someone (not the author, and not Broadsheet) is pulling it. Maybe this is github trying to be "helpful". Joy.

[This entry originally said it was changing UAs and doing wild unconditional requests from time to time. It is not. Something else with the key is taking it for a ride. Raar.]

Occasional odd behavior - UA changes to something else, and it fires off multiple unconditional requests within a second. No idea if this is someone running multiple programs against the same key. (If this is you, ask for another key, or maybe get two, and stop using this one.)

feedbase-fetcher.pl/0.5. This settled into a good groove.

Artykul/1.0. The clown leak. I really need to just block them, or at least, start sending them really broken stuff to see what happens. Installing their app shouldn't have subscribed their backend to my early testing URL, but it did, and uninstalling it didn't stop it!

sshfs isn't really what I wanted for my backup situation

Okay, it seems I missed the mark yesterday while telling my story about a dumb little backup scheme, because a bunch of people have reached out to ask "why not sshfs?"

So, okay, some assumptions, first. I'm assuming from some basic reading that sshfs is one of those things that lets you take a filesystem that's on a distant host and mount it (with FUSE, sigh) on a local host. If that is in fact what it does, that is very much *not* what I want.

I don't want my stuff to be visible as a filesystem on the remote host. That is, I don't want the copies of my data to be sitting there, accessible to anyone who can get on that machine. I want it to be an amorphous blob of (obviously?) encrypted data. The only place where it should make sense is on the client machine after I've supplied the key.

Speaking of keys, that key material (like a typed passphrase) never leaves the client system, since all of the crypto stuff happens locally. If this stuff was a regular filesystem on the other end, then that host would need access to the key, and then it's no longer really under my control, right?

Given all that, let me come at this again.

I have a host. I want to back it up to something that's not in the same physical location, and ideally is rather far away. Having different power companies, ISPs, you name it? That's all a good thing in this case. (Also, "different fault lines" is always a consideration here.)

I have access to that host, but assume I don't have root access to it for whatever reason. I'm just a user on there, and I can ssh in, and I have a healthy amount of disk quota. I'm also welcome to use oodles of (network) bandwidth to do whatever I need to do.

I don't, however, want the contents of my backed-up data to be visible to anyone on that host, _even if_ they have root on it, whether legitimately or otherwise. The remote host is just being a dumb block device for me.

I don't want to have any extra daemons running that would provide network access to the blob o' bits that constitutes the encrypted image. It's a whole bunch of extra complications that I don't need or want, since ssh gives me a perfectly good pipe without changing anything fundamental about how that system works.

So, the end result is that I extend nbd over a wacky ssh connection to a Unix domain socket on the source machine, do my crypto magic to make it into a viable filesystem, then fsck it, mount it, and rsync my stuff onto it.

The only part of this I had to create was the aforementioned wacky ssh connection. nbd-client itself won't do an "ssh to the far end, start an nbdkit server, and then stay alive copying bytes around" thing. It's all about getting the connection up and handing it off to the kernel. It only sticks around if it's doing TLS mode, and since that means running a persistent server on the far end, I'm not using it that way.

That's it. It's just one of those dumb things you build to make other stuff possible. Tools to make tools.

Janky remote backups without root on the far end

Sometimes I do dumb things to solve my own problems. This is one of those times. In this case, I wanted something that would give me access to a block device on a physically distant machine for backup purposes. I didn't want to do anything particularly fancy on the distant box, so rootly powers are out of the question. I just need disk space, a bit of bandwidth, and some CPU time every now and then.

Here's how it works. Perhaps you have heard of "nbd" if you're in the Linux world. It lets you load a kernel module on the client machine and then it'll turn a network type connection into a block device. That is in fact right in the name: "nbd" equals "network block device". Nice, right?

I wanted this, but didn't want it making a "bare" connection to the far end. By default, it's just a plain old TCP session to the other side. While you can rig it to use TLS, there are a bunch of problems with this. It means you need to leave a daemon running on the far end, and again, if you aren't root out there, that can be troublesome. It means the distant host ends up stuck with a listening port, and that's not always nice. It also forces you to deal with getting the whole certificate thing hooked up. Hope you're good at wrangling OpenSSL.

This seemed wholly unnecessary to me. I already have ssh access to the machines in question, so that right there will give me a relatively solid transport. I just needed to convince the nbd stuff to use it.

Now, let me stop you right here: if you're saying "SOCKS proxy" and/or "port forwarding", you have missed the part where I'd rather not have a persistent daemon on the far end that's listening on some port, *even if* it's "merely bound to loopback". That's still something running as me that presents an unnecessary orifice to unwelcome visitors. No thank you.

There's another way the nbd client can work: it will totally connect to a stream-style Unix domain socket. As long as whatever is on the other end of that socket speaks the right language, it doesn't matter where it is or what it is. Thus, I needed to make a Unix domain socket reach over the network to the other end of a ssh connection.

Here's what happened: I wrote some grungy plumbing stuff (my specialty) that sets up a Unix domain socket on my local machine. It does it in a directory where only I can access it, and it waits for a connection. Once it gets one, it forks, and the child fires up a ssh to the far end to invoke the nbd server (a trivial userspace thing I can leave in my personal bin directory).

Meanwhile, the parent process of the client sticks around and fires up a pair of threads to do a bucket-brigade thing. It takes any data from the Unix domain socket connection it just received and flings it at the ssh connection. The other one does the reverse.

This keeps going until one of the fds shows up as disconnected, at which point it shuts down everything and exits politely.
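
For the curious, here's a minimal sketch of that plumbing - not my actual program, just the general shape of it in Python. The socket path, the host name, and the nbdkit invocation on the far end are all stand-ins:

    #!/usr/bin/env python3
    # Sketch only: a Unix domain socket that, on connect, runs an nbd
    # server on the far end over ssh and bucket-brigades bytes both
    # ways. The socket path, host, and remote command are assumptions.
    import os, socket, subprocess, threading

    SOCK_PATH = os.path.expanduser("~/.backup/nbd.sock")  # dir must be 0700
    REMOTE = ["ssh", "backuphost", "nbdkit", "-s", "file", "blob.img"]

    def sock_to_pipe(conn, pipe, done):
        # One half of the bucket brigade: socket -> ssh's stdin.
        try:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                pipe.write(data)
                pipe.flush()          # don't let nbd traffic sit in a buffer
        except OSError:
            pass
        finally:
            done.set()

    def pipe_to_sock(pipe, conn, done):
        # The other half: ssh's stdout -> socket.
        try:
            while True:
                data = pipe.read1(65536)   # whatever is available right now
                if not data:
                    break
                conn.sendall(data)
        except OSError:
            pass
        finally:
            done.set()

    def handle(conn):
        child = subprocess.Popen(REMOTE, stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE)
        done = threading.Event()
        threading.Thread(target=sock_to_pipe, daemon=True,
                         args=(conn, child.stdin, done)).start()
        threading.Thread(target=pipe_to_sock, daemon=True,
                         args=(child.stdout, conn, done)).start()
        done.wait()        # either side disconnecting tears the session down
        child.terminate()
        conn.close()

    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    while True:
        c, _ = srv.accept()
        threading.Thread(target=handle, args=(c,), daemon=True).start()

The important part is that nothing ever listens on a network port: the only way in is the ssh connection itself.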

The other part of this is currently a bit of *local* rootly scripting madness which points the NBD client at that socket, then does the crypto gunk to attach it, then fscks it, mounts it, and fires up rsync. Then after it's done, it undoes everything and declares victory.
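
If you want the shape of that script, here's an approximation. Every device name and path is made up, and I'm assuming LUKS for the crypto gunk just to have something concrete - substitute whatever you actually use:

    #!/usr/bin/env python3
    # Approximate shape of the rootly glue: attach, decrypt, check,
    # mount, sync, then unwind it all in reverse. All names here are
    # placeholders, and the LUKS part is a guess. Run as root.
    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)   # bail out if any step fails

    run("modprobe", "nbd")
    run("nbd-client", "-unix", "/home/me/.backup/nbd.sock", "/dev/nbd0")
    run("cryptsetup", "open", "/dev/nbd0", "backup")   # key typed locally
    run("fsck", "-a", "/dev/mapper/backup")
    run("mount", "/dev/mapper/backup", "/mnt/backup")
    run("rsync", "-a", "--delete", "/home/", "/mnt/backup/home/")
    run("umount", "/mnt/backup")
    run("cryptsetup", "close", "backup")
    run("nbd-client", "-d", "/dev/nbd0")

(Cleaning up properly when a middle step fails is left as an exercise.)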

It's not meant to be fast, and it's definitely not meant to be beautiful, but it does work. My only footprint on the far end is a giant blob of a file that is entirely meaningless to anyone without a suitable key. The data is only ever seen in a usable form here on the client machine which already has access to the original data by definition. I didn't have to install anything special on the far end, I don't have to run any daemons, and I didn't need root out there.

It's the type of thing you can stand up and just let run every now and then, kind of like Time Machine on a Mac. You hope you never need to use it, but it's there if you really get stuck somehow.

So, yeah, when my part of CA "falls into the ocean", I will bob up to the surface, swim to shore, and then walk to where my offsites are and restore from backups. You know, that whole thing.

Just basic sysadmin stuff. Nothing fancy.

Let the network tell you where you are: a nerd snipe story

I was successfully nerd-sniped a few days ago and figured I'd share my proposed solution with everyone just in case they could benefit from it. I've added a few of my own constraints based on expectations for how things could go wrong. So, if this seems familiar, maybe it is, but I've made it a little more complicated.

The situation is basically this: there's a large space with a bunch of dumb Linux boxes which are attached to displays of some sort. Different things are displayed depending on where it is in the space. This means the Linux boxes need some sense of identity to be able to tell the server "I need this particular set of stuff". They're "hands off" - nobody wants to log in to them to manage them.

So now let me present some of the issues:

Problem: relying on IP assignments is no good, since the DHCP mechanisms for that network can't be trusted. Things are sufficiently flaky to where you can't rely on it, and part of this is from the next item.

Problem: the hardware is wonky enough to where it gets replaced sometimes, so the MAC addresses also end up changing. This means any other on-board identifier (board id, cpuid, whatever) would also change. It would be nice if someone didn't have to keep updating mappings on the server every time the hardware got shuffled around.

Problem: it's a PITA to stamp out unique images for these things. The goal is to be able to stamp out a single image that's identical for all of them, and then have them all boot from that and figure out their identities some other way. (The same goes for netbooting, if you're thinking of that already.)

Problem: any other "ID marker" type solution that involves twiddling the set of installed hardware somehow means that you have to be sure each one ends up in a specific spot. If you have relatively unskilled people running around installing these things, this might not be what you want. It's easier to just say "make sure there's one plugged in at every space on this list" without worrying about matching specific units to specific spots, in other words.

Wouldn't it be cool if you could just grab a box, plug it in a given spot, and it would somehow start displaying the right set of media? That's what I was thinking when I came up with this idea.

So, how do you do it? My "solution" (which has its own caveats) comes in the form of LLDP.

For those who haven't encountered it yet, here's the scoop: certain higher-end Ethernet switches have the potential to _multi_cast identifying frames every so often. Any given station on such a switch might get a LLDP frame from it a couple of times per minute.

In that frame, you can see interesting things, like the MAC address(es) of the switch, its name, what OS it might be running (!), the management IP addresses (!!), and, finally, what the name of YOUR PORT is.

	Port Description TLV (4), length 4: eth2
        Port Description TLV (4), length 8: Gi1/0/24
          0x0000:  4769 312f 302f 3234

Yep, that's right, if you're on such a switch, you might well have something arriving on your interface every so often that says "oh by the way, you are connected to me on my Gi1/0/24". Another device on that same switch would get nearly the same announcement, but it might say "Gi1/0/48" or whatever else.

The point is: if you're that derpy little display machine, you now have a way to tell yourself apart from your "coworkers". You could build a tuple of (switch name, switch port) and kick that at the display server, and let it figure out what set of images/videos you're supposed to run.

Now, there's always a catch, and here it is: this assumes that you're not changing the mapping between ports on the switch and the actual RJ45s in the space. I'm assuming that port 23 on your switch will always be over there by the front door, port 24 will always be next to the bathrooms, and 25 will be by the patio door. You get the idea.

If those change, then yeah, you get to reconfigure everything, but at least it's still only on the server, and you don't have to deal with the dumb little display machines.

How do you see this? Well, tcpdump would be a start. A filter of "ether proto 0x88cc" will do the job, but make sure you stick a -v on there to make it actually expand the details (or it'll be pretty dull).
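
If you want the derpy little box to act on those frames instead of having a human squint at tcpdump output, here's a rough sketch of listening directly. It's Linux-only (AF_PACKET), needs root, and the interface name is an assumption:

    #!/usr/bin/env python3
    # Sketch: listen for LLDP frames and print the system name and
    # port description TLVs. Linux-only, needs root. "eth0" is assumed.
    import socket, struct

    ETH_P_LLDP = 0x88cc
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                      socket.htons(ETH_P_LLDP))
    s.bind(("eth0", 0))

    while True:
        frame = s.recv(2048)
        tlvs = frame[14:]                 # skip the Ethernet header
        off = 0
        while off + 2 <= len(tlvs):
            # TLV header: 7 bits of type, 9 bits of length, big-endian.
            hdr = struct.unpack_from("!H", tlvs, off)[0]
            ttype, tlen = hdr >> 9, hdr & 0x1ff
            val = tlvs[off + 2:off + 2 + tlen]
            if ttype == 0:                # end of LLDPDU
                break
            if ttype == 5:                # System Name
                print("switch:", val.decode(errors="replace"))
            elif ttype == 4:              # Port Description, e.g. "Gi1/0/24"
                print("port:", val.decode(errors="replace"))
            off += 2 + tlen

From there, building the (switch name, switch port) tuple and phoning it home to the display server is the easy part.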

Some other things: a relatively low-end switch won't do anything in this space. It won't generate it, and it won't "absorb" it from other sources. So, if your listening post running tcpdump is plugged into such a beast, it might hear LLDP stuff coming from multiple other sources but it won't see anything from the switch itself. Yep, it's more than switches which potentially send these out. "Server class" systems with their own little management device ("BMC") glued onto them tend to do this, too.

Normally, such traffic would not cross the switch, but that requires a switch that actually "gets" it and knows to not forward it. Ordinary dumb switches will treat it just like any other bit of traffic and will send it on down the line.

I should also note that other protocols exist which serve similar purposes and it depends a lot on which switch vendor you've chosen. It might not be LLDP but it might be something else which ends in "DP".

If in doubt, sniff the network and find out!

Thoughts on working inside a data center suite

I have a few more thoughts on the whole topic of colocation. First of all, Joel wrote in with a couple of tips beyond the basic "screwdriver and flashlight" that I mentioned. He says you should bring hearing protection, a step-stool or small ladder, and a jacket if you get cold. I like this thinking, and figured I'd expand on this for the benefit of those wondering what this all means.

First up, these places are LOUD. Everything you can imagine has fans on it. Obviously there are massive air handlers in the suites, but the (proper server class) computers and switches and everything else are also rocking a ton of fans. Some of them throttle back when the CPU load isn't too high, but a fair number of these things actually have rather high CPU load and so they never throttle back.

I mean, it's 2024, and people are writing CPU-bound computational stuff in languages that are interpreted, single-threaded, and slow as shit. OF COURSE they're running their CPUs as hard as they possibly can. But I digress.

So yes, it's loud as hell, and you could benefit from some kind of active protection. Just don't do what I did one time by cranking up the music in regular earbuds to cover the noise. Yeah, you might be able to hear your tunes, but you won't be hearing much else afterward. Stupid move, I know... now.

The step-stool or small ladder is not always a given. My particular cabinet isn't super tall and I'm able to reach all of it, but this might not apply for everyone else. Alternatively, you might need to use it to support something from below while doing an install, especially if you're flying solo.

The jacket is another one of those things that you might not appreciate until you've been on the inside. It's not like the 90s when everyone just had a giant room and did their very best to cool the entire thing down as far as possible. These days, there are "hot aisles" and "cold aisles": two rows of machines face each other across a walkway, and cold air is blown into there. Then the air goes through the machines thanks to all of those fans and ends up on the back side where it joins with hot air from yet another row that's also backed up to it. Finally, it's drawn into a chiller to start the process over again.

If it's a solid concrete floor type of setup, then the air will have to come from above, but if you're on a raised floor, it'll probably just emerge from beneath. If you're in that kind of setup, make note of what you wear before making a trip lest you turn into Marilyn Monroe.

(Yeah, I just gave some advice that only applies to a subset of the people who will ever read this and need it. Yep. That happened.)

The nature of things is that you will inevitably need to access your rack from both sides to reach certain parts of the equipment, so it will be really hot sometimes, and it'll be really cold other times. This is why I will amend that advice to "bring a jacket that zips".

Now, if you're in the Bay Area like me, you have hopefully long ago internalized the wisdom that "you will never be far from a light sweater", and so you *already* have one of those that goes everywhere with you. If so, you're set. Be ready to adjust it as appropriate.

Other stuff? Once you're past the "drop a Raspberry Pi in there" stage, you should not just jump on ebay and buy the first rack mounted server that looks like it'll work because you hate hardware and want it over with. This is because not all racks and not all servers are created equal, and you might find out the hard way that it won't work. I came super close to screwing up *hard* due to this.

I would recommend first going in with a tape measure to figure out exactly what you're working with. What are the posts like in the rack? Are they fixed in place or can they be adjusted? Are they round holes or square holes? How long can a server be while still fitting into the rack? Will the door(s) still be able to close and latch?

Bear in mind that it's not just the length of the machine, but you also have to include the loops of cables which emerge from it: power and Ethernet at the very least. You really don't want to force a regular power or network cable into a "hard 90" type of situation to make it all fit into the space. With a little research, you can get power cords which have their own built-in 90 degree turn on the end where it plugs into the machine that will let you claw back a bit of space, and likewise for the network stuff.

Or, you know, you can just buy a shorter machine and use normal cables.

Measure twice, buy once.

Another complication: if your provider gives you power by way of a "zero U" PDU (basically a big power strip that stands up vertically), that's both a plus and a minus. It's a plus in that you're not burning any rack units by definition: it's "zero U". But, it can be a problem because it still takes up space, and if you have anything really long in there, it'll probably bump into it. This constrains you to only using spots in the rack which are not blocked by the sneaky little PDU. It's just another 3-D Tetris problem for you to solve.

More jabbering from me about non-clown hosting

Late yesterday, I put up a post about how to get into colocation in about the crappiest way possible. I skipped a bunch of details just to get it out there. The inspiration came from finding out just how many people have no idea that this business model even exists.

I used to work in a cousin of this space: managed and dedicated hosting, and here's how this all lined up:

Colocation (as mentioned yesterday) gives you some space, some power, bandwidth, a network allocation of some kind (and/or ability to route your own stuff), and hopefully some decent HVAC to keep everything cool. It's your hardware in their space, and if something breaks on that stuff, you get to fix it. What if a drive fails? That's on you. Your switch or router goes insane? Same deal: that's all you.

You might be able to get some "remote hands" service from the provider for very simple tasks: reboot something, take a picture of the (blue) screen, that kind of thing. Some also let you pay them to go and do other stuff with a higher degree of complexity. Read the fine print.

Managed hosting is where you get access to a box somewhere, and are usually given root on it. You can do about what you want to it, but beware of the "spheres of support" as we used to call it - the hosting company's support people will only go so far. You might want Debian, but they only do RHEL. You get the idea. You can probably ask them to do kernel upgrades, troubleshoot why it's being slow or seems to be down, install and configure certain things for you, and so on.

If the hardware fails, that's on them. They rip open the box and replace the part. You still have to figure out how to make your stuff work before, during, and after the event. They probably won't do any migrations for you, since everyone's setup is an odd little snowflake and only the customer has any chance of knowing how it all works.

Higher-end setups might offer hardware firewalls, load balancers, backup solutions, and more. Now, granted, it's been almost 20 years since I last did this myself - I assume someone's still doing this somewhere.

Dedicated servers are a step down from managed. You (probably) get root on a box somewhere, and you can do a lot of stuff on it. Outside of the absolute bare essentials provided by the hosting company (like a Red Hat Network entitlement), they aren't going to do much for you. They might have rigged up some magic hardware to let you power-cycle what's just a boring old whitebox without explicit out-of-band access. They might also have some netboot magic to let you shunt it into a "repair environment" if you really screw things up.

All of the servers used for hosting this site were of the "dedicated" persuasion initially: first at ServerBeach, then at SoftLayer (which became IBM). The SB machines were relatively crappy whiteboxes shoved into a bread rack somewhere, while the SL/IBM stuff was on a Real Rackmounted Server with out of band remote access and the whole bit. They handled the hardware, and I handled the software.

I mentioned some of the details last year when departing the IBM shit-show and the Texas shit-show at the same time. This ended the era of dedicated hosting for me and moved me into colocation.

These days, there's a cabinet in a nice conditioned space not too far away. I initially parked an old switch in there and a derpy little Raspberry Pi, just like yesterday's post implied. This was to give me a little "platform" while the rest of the stuff happened. A few days later, I got some "real" hardware, hung that in there, and then migrated everything to it from the SL/IBM box.

That was a bit over a year ago, and things are pretty much unchanged. This setup should just sit there and run and run and run until something breaks beyond that which can be handled in software. Then I get to drive out there and flip some parts around and make it go again.

It's not rocket science. This way of doing things has always been possible. You don't have to deal with clown providers pulling the rug out from under you. There are a few dimensions in which the provider can mess with me now:

1. They could have something terrible happen - think flooding or a fire. It could take the whole location offline for an extended period, and it could well destroy the equipment inside, such that merely relocating it to another spot would not be sufficient to revive things.

2. They could decide they're done with this kind of business model, and now they're going to boot out all of the little people who pay a pittance per month. Maybe they're going to sell out to the latest clown provider who wants to melt the ice caps and drown the polar bears with their "AI" crap. (This would also apply if the private equity vultures swoop in and try to hollow the place out before nuking it entirely.)

3. They could crank the prices up beyond that which I could afford.

4. They could do something that somehow makes the basics incompatible: they stop supplying AC power, conditioned air, or IP over Ethernet. (I know, this is ridiculous, but I might as well mention it.)

5. They could shack up with some really evil people such that I no longer want to do business with them.

My responses would be approximately this:

1. Acquire some other hardware, install a fresh OS and restore my stuff from backups, then park it somewhere else.

2, 3, 4, 5. Find a new spot, then schlep the hardware there.

...

If this is old hat to you, great! It means you're probably a grizzled 1990s sysadmin just like me, consarn it! This isn't for you, then.

This is for the newer folks who might not have realized that there's an alternative to paying tribute to one of the three churches of the Clown: M, G, or A. If you want to "get your stuff online", there are other ways... and there always have been!

A terrible way to jump into colocating your own stuff

I've been wanting to do this for a while: basically, to write a really snarky post about the bare minimum required to run your own hardware in a colocation environment. I'm talking about doing as little as possible, and possibly screwing up bigtime while you're at it.

Based on chats with some of my friends, it seems like this model is not well known. One of them had a situation a while back where everyone was now working from home due to COVID, and so they would all VPN back to the office park where the servers were, including their NAS "box-o-disks". It was obnoxiously slow and laggy and failed a lot, too.

I said he could probably find a spot to colocate that stuff given that he lives in a major metro area, and sure enough, he did. A few weeks later, everything was parked in there, and the resulting boost in connectivity (throughput and availability both) made everyone so much happier.

I think they were also able to dump the remaining cruft at the office park as a result. Win-win, right?

So, without further ado, here's the terrible list I just scribbled down rather quickly to get this going:

0. Scrounge up at least an old dumb Ethernet switch (or *gasp* a hub!) and some random-ass hardware that'll run Linux, like a Raspberry Pi, or some old PC box or whatever.

1. Install Linux on the box. Turn everything off but sshd. Turn off password access to sshd. If you just locked yourself out of sshd because you didn't install ssh keys first, STOP HERE. You are not ready for this.

If you somehow survived, continue as usual.

2. Find a place that you can physically access that will sell you some chunk of a cabinet, rack, shelf, or whatever else. You have to be able to get there somehow: walk, bike, drive, take the bus, whatever ... *while carrying equipment with you*. Keep that in mind.

If it's on the other side of the country (or planet), it might not be what you want, in other words.

3. Make a deal with the place. Prepare to throw some money at them.

4. Wait for them to set up your account. Get the networking details. Make sure they know you're just going to be running a stupid little switch and not a full-on router.

They'll probably carve out a tiny little block of v4 space where their router is one of the IP addresses, and you get to use the rest. They'll probably do the same with IPv6, only it'll be a /64 because why the hell not? (If they don't do IPv6, ask for your money back and run away.)

5. Configure the Linux box to match those details: static IP assignments! Yes! No DHCP here!

6. Figure out how to get your laptop to work while plugged into that same switch within your IP space, and WITHOUT any sort of wifi access while it's still at your house. You won't be able to see the outside world but you will be able to see your server. Filter out the world pre-emptively, because where you're going, you're not going to be behind a firewall.

7. Haul yourself and your stuff to the actual co-lo space: switch, Linux box, laptop, power cords, and Ethernet cables. Consider bringing a screwdriver and a flashlight (any halfway decent place will provide those for you, but you never know).

8. Plug everything in and turn it on. Connect to the server from your laptop. Verify that the server can get out to the Internet. Verify that you can get back to the server FROM the Internet. (That is, plug your switch into their network. Obviously.)

9. Go home and set up the rest from there - you know, DNS, loading stuff on the actual box, and so on.

...

To the usual broken people: knock it off. I know that you don't HAVE to use Linux, and can install a BSD, or Windoze, or whatever the hell else. I'm giving you A path, not THE path. Now shut up.

Another feed reader score roundup

Hello again from the land of feed reader behavioral tests. I ran through the list of participants a couple of days ago and wrote up my results. This is only for those which had polled at least once in the past seven days relative to that point.

I'm going to group some of these together, but keep in mind that some behaviors are a function of however the user configures the program. Also, at this point I'm mostly focusing on their steady-state behavior, but any previously-reported goofiness at subscribe-time is still worthy of fixing for people so inclined.

Artykul's clown fetcher is still going. They're going to make me write a 404 generator and then a 410 generator, I just know it. Then I'm going to start complaining about them not honoring either of those. All unconditional, every 2-3 hours. Terrible.

feedbase-fetcher.pl/0.5 seems fine.

Broadsheet/0.1 seems fine.

com.vanniktech.rssreader (various versions) has weird timing.

NextCloud-News/1.0 - all of these have broken IMS caching. There's a fix that's been submitted but it hasn't been merged, never mind rolled up into a release. Some old versions hammer the favicon.ico needlessly. I can only guess at who's running what, since they don't ever bump their version number. It's been 1.0 forever, and I assume this will continue unabated. This makes it really hard to direct particular version-based brokenness to an appropriate handler.

Some browser extension thing. Seems fine.

Mojolicious (Perl), seems fine.

NetNewsWire with buggy LM/ETag caching, as covered in a post a while back. I don't think this has been changed upstream yet. Also has terrible timing.

newsraft/0.25, which still has buggy ETag caching. It also has terrible timing.

Bloggulus/0.3.7, seems fine.

Miniflux/2.1.4 has buggy Last-Modified caching. Miniflux/2.2.0 does too. I thought it was squared away previously, but it's still doing crazy stuff when I do the pathological test cases to check for bugginess.

FreshRSS/1.24.0 has buggy Last-Modified + ETag caching. But, others running 1.24.2 and 1.24.3 have it fixed! The author actually reached out to let me know about this. Thanks!

Feedly/1.0 which went into a "once per day, always unconditional" thing months ago. Meh.

Reeder, various versions. Has the minor timing issue where sometimes it's 59 minutes, and other times it's an hour, as explained in a prior post. Call it the "cron alone is insufficient limiting for poll scheduling" thing, I guess?

feedparser/6.0.10, running too fast: 29/30 mins.

feedparser/6.0.2, doing the minor 59m/60m timing thing.

Yarr/1.0 with terrible timing.

Another Yarr/1.0 with the 59m/60m thing.

walrss/0.3.7, seems fine.

Emacs Elfeed 3.4.1 with some odd behaviors, like there are multiple instances running that aren't sharing the data on the last-modified/etag headers from the last poll. Another instance is doing the same thing.

er0k/feeds v0.2.0, seems fine.

Rapids, seems fine.

Go-http-client/1.1, seems fine, whatever this really is. Could use a more descriptive UA, if only to not run afoul of filtering (not me, but other places).

theoldreader.com has started doing If-Modified-Since and If-None-Match. This warranted a straight-up "holy shit" from me when I saw it in the list. I have no idea if this project had anything to do with it, and I really don't care. They're doing the right thing now, and that's great.

Tiny Tiny RSS, multiple instances from various users. Not recommended.

rss2email release-3.3.1, which keeps doing unconditional requests seemingly randomly. No idea why.

NewsBlur also randomly does unconditionals: 8 of 526 polls that way for one instance (and other counts for other instances). Also no idea why.

Feedbin, which seems to have fixed its ETag bugginess. Yay?

Inoreader/1.0. It has the 59m/60m thing, and also launches unconditionals roughly daily. Dumb.

Some curl-based thing (maybe even the CLI tool?) that's doing conditional requests nicely. Has the typical (minor) 59m/60m thing.

inforss/2.3.3.0, seems fine.

haven sends 100% unconditional requests. It also happens to have the 59m/60m thing, but the complete lack of conditional requests is the showstopper here.

ureq/2.9.1, seems fine.

Friendica/2024.08, also 100% unconditional requests.

SpaceCowboys Android RSS Reader/2.4.15 occasionally has weird timing, and it just goes unconditional randomly. No idea what's going on here, either. Another one is 2.6.31 and that seems fine, so I suspect there was a bugfix release in the middle.

Netvibes, with super broken LM+ETag caching.

cry-reader v0.0, seems fine.

Another undifferentiated browser extension thing, but this one is 100% unconditional, and wacky timing on top of that.

Blogtrottr/2.0, seems fine.

Unread RSS Reader, which is no longer unloading like mad onto the test server, but it has settled into a 15 minute poll interval. Too fast.

Slackbot/1.0, this is a relatively new one, and it's terrible. It's 100% unconditional polls with awful timing: always less than an hour between checks, sometimes much less. They also inexplicably send HEADs, and receive a 405 error from me, but of course come back and do it again a few minutes later.

The whole point of HEAD is if you want the metadata but have no interest in the content. If you want the content but only want a fresh copy if it's changed, that's why we have conditional requests, and again, If-Modified-Since has been in the RFCs since *1996*.

For sites that build the feed dynamically (i.e., not mine), HEAD represents the same amount of work: they get to build the feed, provide the metadata, and then throw it away.

Unless you actually want the metadata, and never anything BUT the metadata, then fine, send HEADs. Otherwise, forget it existed.

Audrey2 RSS Reader/0.7.1, seems fine.

rss2email/3.14, has the 59m/60m thing going on, otherwise it's fine.

...

I should note that if a feed reader's only problem is the 59:xx vs. 60 minute thing then they're actually doing pretty well. There are a lot of way worse things out there.

I'd love it if they could fix this, obviously, but that's just how I am.

"SRE" doesn't seem to mean anything useful any more

This seems to be a thing now: someone finds out that you worked as an SRE ("site reliability engineer", something from the big G back in the day) somewhere, and now all you're good for is "devops" - that is, you're going to be the "ops bitch" for the "real" programmers. You are the consumer. They are the producer. They squeeze one out and you have to make it sing and dance. You keep things running and you shut the hell up. You wear the pager so they don't have to.

I've seen this from the hiring side of things: when we were trying to hire well-rounded people and put up a job posting with "SRE" in the title, all of a sudden we got a bunch of applications from people who basically *were* ops monkeys. They wanted to be that and do that. That was their life, and they enjoyed it. Those of us on the hiring side were taken aback by this and didn't want that kind of hire getting into the place.

Clearly, somewhere along the line, someone lost the thread, and it has completely destroyed any notion of what a SRE was supposed to be.

Just so we're operating on a level playing field here, I'll lay down my own personal definition of the term, and what I expected from people in that role and what I expected from myself.

To me, a SRE is *both* a sysadmin AND a programmer, developer, whatever you want to call it. It's a logical-and, not an XOR.

By sysadmin, I mean "runs a mean Unix box, including fixing things and diving deeply when they break", and by the programmer/whatever part of it, I mean "makes stuff come into existence that wasn't there before". In particular, I expect someone to run the *right* things on those boxes, to find the actual problems and not just reboot stuff that looks squirrelly, and that they write good, solid code that's respectful of the systems and the network. They probably write programs to make the sysadmin part of the job run itself. Automation for the win.

That, to me, is my first order approximation of what a SRE should be.

Now, there must be some reason I'm writing about this, right? Yes, and there is. I put out some feelers to see about maybe working with a small company that's building some interesting things. They're wrangling Linux, C and C++, embedded stuff, networking, and they're almost certainly going to need a certain amount of pickiness regarding correctness and security to keep bad people from breaking into stuff. Also, they didn't have the usual lists of godawful clown software that most places rattle off that you'd be expected to work with.

In short, it was a place that's Actually Building Stuff and isn't just throwing their money at one of the usual clown vendors. That's rare!

So, I reached out... and heard nothing. Then, much later, I reached out a different way, and eventually heard back: they looked me over, figured I'm a SRE and they have devops people already, so, uh, no thanks.

That's it. That's the whole thing. The door is closed.

I'm obviously not happy with this situation. In sharing with some friends, they also mentioned "having to have this conversation" (that they are not just an ops monkey) a great many times in the course of looking for employment over the years.

Clearly, things have gone to hell, and unless you WANT that kind of ops-only life, you probably don't want to sell yourself this way.

Just to be annoying, I'm going to rattle off an example of something an ops monkey would never do. I wrote this C++ build tool, right? I've mentioned it a few times in various posts over the years, and there have been a few anemic web pages on my main server talking about what it is and how it works.

I won't go into the full history of it here. A quick description is: it knows how to build C++ using #includes to figure out dependencies, and so you need not write a build script, a Makefile, a CMakefile, or any other config language file. This goes both for the stuff inside the local source tree, and for external system library dependencies: stuff like libcurl, a pgsql or mysql client library, GNU Radio, libmicrohttpd, jansson, or basically anything else you might think of. It knows how to use pkg-config to get the details for some well-known targets, and you can add entries to a config file to map the "#include <proj/lib.h>" onto the pkg-config name for anything else.

So, again, that's all old news. I first wrote that over a decade ago and have been using it all this time, with small improvements over the years. What's new? Well, a couple of months back, I decided it was finally time to make the thing run its operations in parallel to take advantage of a multi-processor system.

It now scans the targets to determine the dependency graph in parallel, then kicks off compiles, then kicks off the linker. Everything that used to be serialized has now been made as parallel as possible.

Obviously, you can't just do this as a rash decision. There are any number of things which can go terribly wrong if you don't manage your data structures properly for safe cross-thread accesses. You need to be able to put things to sleep and have them wake up later without having them needlessly "spin" on the CPU. You need to do depth-first processing and then have it kick off the "parents" as the "children" finish up. You still have to catch errors and stop the process where appropriate, and you also have to make sure you don't just boil the damn machine by launching EVERYTHING at once. "Just use std::async", it is not!
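
To make that concrete, here's a toy of the scheduling idea - in Python for brevity, and nothing to do with the actual tool's code. Every target keeps a count of unfinished children, and whoever finishes the last child wakes the parent:

    #!/usr/bin/env python3
    # Toy scheduler: run a build graph bottom-up with a bounded pool,
    # kicking off each "parent" once all of its "children" finish.
    # The graph is a made-up example.
    from concurrent.futures import ThreadPoolExecutor
    import os, threading

    deps = {"prog": ["a.o", "b.o"], "a.o": [], "b.o": []}
    waiting = {t: len(d) for t, d in deps.items()}
    parents = {}                           # reverse edges: dep -> dependents
    for t, ds in deps.items():
        for d in ds:
            parents.setdefault(d, []).append(t)

    lock = threading.Lock()
    pool = ThreadPoolExecutor(max_workers=os.cpu_count())  # don't boil the box
    done = threading.Event()
    remaining = [len(deps)]

    def build(target):
        print("building", target)          # compile/link would go here
        with lock:                         # shared counters need protection
            for p in parents.get(target, []):
                waiting[p] -= 1
                if waiting[p] == 0:
                    pool.submit(build, p)  # last child done: wake the parent
            remaining[0] -= 1
            if remaining[0] == 0:
                done.set()

    for t, n in list(waiting.items()):
        if n == 0:                         # leaves first: depth-first, bottom-up
            pool.submit(build, t)
    done.wait()                            # sleep, don't spin, until it's done
    pool.shutdown()

The real thing also has to stop cleanly on errors and deal with far messier graphs, but that's the skeleton.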

To give some idea of the impact, touching a file that gets included into a whole bunch of stuff forces everything above it in the graph to get rebuilt. This used to take about 77 seconds on the same 2011-built 8-way workstation box I've had all along. With just the early work on parallelization, it became something more like 21 seconds.

77s -> 21s, just like that. That's a lot of win that was just waiting to be realized by using the hardware appropriately.

Yeah, I did that. Little old me. This is not the behavior of someone who just twiddles other people's config files for a living.

Raar, I'm cranky.

A common bug in a bunch of feed readers

Yeah, it's another thing about feed readers. I don't blame you if you want to skip this one.

A reader (that is, a person!) reached out earlier and asked me to look at a bug report for a feed reader. It seems they passed along some of the details from one of my earlier posts, and it was closed with no action taken.

The program in question has cache problems, and it's something that's surprisingly common. A bunch of different programs do this, and it's interesting to wonder how they all came to this point.

Well, at least for some of them (the PHP-based ones), I think I know now: they're probably using the same library underneath, and it's been hacked to do some kind of "hash match" thing based on the body of the feed - like an md5sum or somesuch.

It goes something like this, apparently: fetch the feed for the first time. Then store the hash, the Last-Modified time, and the ETag.

Next time, send If-Modified-Since and If-None-Match using the stored values. But, then, if the hash of the body matches, return immediately and don't update anything else... *even if* the Last-Modified and/or ETag values changed! So, next time, they send the old values again, and it never gets any better.

Therein lies the shared bug: they're not designed around the notion of "always return the values you got last time from the web server". If they had been, they would not throw them away just because the hash matched.

So, when might the hash match even though the LM and/or ETag values have changed?

Easy! When someone does this, or the equivalent:

webserver$ touch feed.xml

That will bump both of those fields and will leave the body unchanged.

Not handling that case means they keep sending stale validators that will never match - functionally the same as unconditionals - until the body finally changes. That could be days or weeks later depending on how often something actually changes in the feed. Not good.
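
For anyone maintaining one of these, the fix is structurally tiny. Here's a sketch (placeholder URL, obviously): store whatever validators came back *before* doing the hash comparison, so a "nothing changed" early return can't strand you on stale values.

    #!/usr/bin/env python3
    # Sketch of the fix: always keep the most recent validators, even
    # when the body hash says nothing changed. The URL is a placeholder.
    import hashlib, urllib.request, urllib.error

    url = "https://example.com/feed.xml"
    etag = last_modified = last_hash = None

    def poll():
        global etag, last_modified, last_hash
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None           # not modified: keep what we had
            raise
        body = resp.read()
        # The fix: take the server's new values unconditionally, before
        # any "did the body actually change?" short-circuit.
        etag = resp.headers.get("ETag") or etag
        last_modified = resp.headers.get("Last-Modified") or last_modified
        h = hashlib.sha256(body).hexdigest()
        if h == last_hash:
            return None               # touched but unchanged: skip reparsing
        last_hash = h
        return body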

Decoding an old SkyScan wireless weather sensor

Many years ago, I was given one of those massive radio clocks that also reported indoor/outdoor temperature and humidity. They were apparently sold at Costco under the "SkyScan" brand, and had two parts: the (very large) indoor display which stood upright with a little plastic foot in the back, and a small sensor which was intended to be placed outside.

It was never quite 100%. It had issues receiving the 60 kHz WWVB clock signal and so it would get out of sync. It also apparently had its own rules for DST baked in, so when the US changed them in 2007, it would go stupid twice a year for a couple of weeks.

It had other problems. I once caught it saying that it was the 9th day of the 14th month, and oh hey, it's a Tuesday. Lousy Smarch weather or no, that's so unbelievably wrong I just had to stop using it.

However, I kept the sensor as more of a curiosity. In recent years I got into the whole business of tracking temperatures in certain interesting places, and this old sensor kept bugging me. The usual tools can see it doing its 433 MHz announcements but nobody seems to have decoded it, or at least, documented it. I've been taking a whack at it every now and then, and finally got it to a usable spot.

So, for the sake of the dozen or so people out there who might have one of these things nearly 20 years later and who want to use just the remote sensor, here's what I've learned about it.

The encoding is no frills: on is 1, off is 0. The on pulse or the off gap lasts about 3900 us (curiously close to 1 sec / 256, for what it's worth).

Transmissions are sent twice, twelve seconds apart. The first one has a particular bit set to 0, and the second one has it set to 1, so I've called it the transmission number. The whole process restarts every 60 seconds.

There appears to be some kind of preamble or sync which is just 10.

Bits 2-33 are followed by inverted versions of themselves in bits 34-65.

Humidity is BCD: 4 bits and another 4 bits, so [4] [7] for 47%.

The channel seems to be 3 bits but is only ever 1, 2, or 3.

Temperature is also BCD with 3 groups of 4 bits: [3] [2] [7] == 32.7C.

All of this has been determined empirically. There are timing diagrams for this and other similar devices on the usual "fccid" sites, but none of them match this scheme exactly.

So, from the top, 2 preamble bits (10), then...

4 + 4 humidity bits (BCD)

1 transmission bit

3 channel bits

1 sign bit (1 = negative temperature)

1 battery bit (1 = battery is low)

6 id bits

4 + 4 + 4 temperature bits (BCD)

... and then the part past the preamble repeats, inverted. This can be used to detect errors.
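
Putting all of that together, here's a decoder sketch. The one thing I'll hedge on is bit order within the nibbles - MSB-first is an assumption, and your demodulator may hand you bits in a different order:

    #!/usr/bin/env python3
    # Decoder sketch for the layout above. "bits" is a list of 66 ints
    # (0/1) as demodulated from the 433 MHz pulses; getting those is
    # up to your radio setup. MSB-first nibbles are an assumption.

    def bcd(nibbles):
        # Each 4-bit group is one decimal digit.
        return sum(d * 10 ** (len(nibbles) - 1 - i)
                   for i, d in enumerate(nibbles))

    def nib(b, off):
        return b[off] << 3 | b[off+1] << 2 | b[off+2] << 1 | b[off+3]

    def decode(bits):
        assert len(bits) == 66 and bits[0:2] == [1, 0], "bad preamble"
        payload, inv = bits[2:34], bits[34:66]
        assert all(a != b for a, b in zip(payload, inv)), "inversion check"
        b = payload
        humidity = bcd([nib(b, 0), nib(b, 4)])       # [4][7] -> 47%
        txnum    = b[8]                              # 0 first send, 1 second
        channel  = b[9] << 2 | b[10] << 1 | b[11]    # only ever 1, 2, or 3
        negative = b[12] == 1
        batt_low = b[13] == 1
        sensorid = int("".join(map(str, b[14:20])), 2)
        temp_c   = bcd([nib(b, 20), nib(b, 24), nib(b, 28)]) / 10.0
        if negative:
            temp_c = -temp_c
        return dict(humidity=humidity, channel=channel, id=sensorid,
                    temp_c=temp_c, battery_low=batt_low, tx=txnum)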

For the sake of grep bait: the SkyScan 81690 with a sensor bearing the FCC ID of OQH-000000-05-002, or perhaps OQH-CHUNGS-05-002. It might be a "C-8102" model, but again, all of the FCC stuff I've been able to find is not quite right.

An update on another feed reader from its author

I've been hearing from feed reader authors as a result of the whole "feed reader score" project. Last week, it was NetNewsWire, and this week I have an update for Unread Cloud.

The author put up a post which looks at what's been going on, and explains what happened with the poll frequency for a while on that one test instance. It also has some interesting graphs built from the raw "hits" table in the report.

I should point out how overwhelmingly reasonable everyone has been for this project. It's the sort of thing that could go over very badly - trust me, I've been in the position of "detecting badness" over the years at various jobs, and a lot of people do not appreciate it!

Instead, we've been slowly patching up a few small issues here and there across the collective fleet of feed readers, and hopefully this will benefit readers and writers both. Writers get lower "background radiation" for their servers, and readers get quicker updates without tripping rate limiters. Everyone wins.

To everyone, thank you for your efforts in this ecosystem.

Feedback on github, mangling code, and a feed reader

Before I launch into responding to some reader feedback, there's a strange meta-note that I'll mention in case it twigs any memories for other people who run web servers. There's some really strange set of web robots that seem to go crawling for URLs that end in something like "feedback" or "contact" or "support" and "help" and stuff like that. They'll pull an index page and then fetch *all of them* in one fell swoop. It's pretty clear that they don't give a damn about looking halfway plausible. Real people don't request pages like that.

What's really strange is that nothing else happens after that point. Maybe they're looking for pages that have feedback forms, or comment fields, or something like that?

Fortunately, I have no such things on my posts.

...

An anonymous reader asked for my Github URL. I actually used it with a paid account for a while, but then stopped using it a long time ago. I'm talking waaaaay back in 2012 when they were pulling all kinds of stupid "let's trust your data to this overly dynamic web shit" on a regular basis.

I moved out 12 years ago and haven't gone back. It turns out that it's rather simple to just set up a repo on an accessible box and point the participant boxes at it - the ones where I do dev work, the ones that run automatic builds for me these days, and so on.

They've done some dubious things with "public keys" that I disagreed with, too. Someone actually went and made a "ssh whoami" type thing where if you connect to it and present the public key that's on a github profile, it greets you by that profile's name. Cute.

(Aside: There's a DEF CON talk out this week which talks about how people *still* don't really know that sshd will totally tell a client whether a key pair is allowed in given just the *public key*. You should find the slides for "SSHamble". They're good.)

Since then, they had a bunch of crazy stuff involving the treatment of their employees and all of that kind of crap, but I was already gone. Then, hey, an old adversary from the 90s popped its head up and gobbled them up. Yep, Microsoft bought Github in 2018. That's the kind of devious move that is entirely consistent with their old ways of operating.

This is where a bunch of people who weren't even alive in the 90s pop up and tell me how they're different, and thus prove a point I was telling people about 10 years back: "you're not going to recognize them eventually, and younger folks will have no idea of the evil they've foisted upon the world over the years".

Basically, if I was an evil sort and wanted to keep my thumb on the scale of what the "dirty free software hippies" were doing, buying the thing that's practically become synonymous with "git" itself for many people would be a great place to start! So, when they went and did that, well, that's just cake.

That'd be like IBM buying Red Hat... but wait, that happened too!

It's ironic that a fully-distributed-capable system like git ended up with its own "network effect" that brought people into a single site to that degree. Sure, there are others, but you have to know something about the scene to even realize they exist in the first place.

I bet if that had happened in 1998 instead of 2018, it'd be "Microsoft ActiveGit" or some crap like that by now. They would have totally rebranded it to try to fit with the rest of their corporate hell.

When it finally does happen, try to act surprised.

So... TL;DR there's a github profile out there where you'd expect to find one for me, but it has no activity, and for good reason.

...

The previous reader also asked if I had ever taken someone else's code base and turned it into something that "bears little resemblance to the original creature". I'm pondering this one, and, well, I don't think I have? It's not like I turned a wolf into an eel or something like that.

Okay, there was this internal web service up at a place with WAY too much Python. It was (of course) written in Python as well, and I've mentioned it before. It was a tool which was intended to make it easier for people to collaborate on outages... you know, "SEVs", or whatever you might call them where you are.

It had been written by someone as a side project of sorts, and that particular company had no love for such pure engineering work. In fact, that person was on the verge of getting in trouble with their manager for hacking on it in the first place.

So, when I showed up and realized we needed some different behavior out of this tool, I reached out to the nominal owner, learned of the situation, and then got their blessing to go and hack on it, and hack I did.

I put in enough to get it to where other people could see where I was trying to go with it, and then a couple of solid developers showed up and started cranking on it. They fixed a bunch of stupid things I had inflicted upon the code base, and got it to where it was a joy to use.

Also, I had an intern come through who did an amazing job on all of the "followup" side of things - you know, getting a report together, bringing it to a review meeting, collecting follow up tasks, and making sure those actually happen. I had been doing it by hand, and it was awful. When he got done with it, it was a breeze.

Did it look like the original tool? I guess superficially it kind-of did. It still had a list of comments and a way to add a new comment. It picked up features for both the direct users and the reliability people behind the scenes, but it was still nominally an "incident manager", even though I had rebranded it "SEVPanel".

It effectively went from a young wolf to an adult wolf. No eels or other aquatic wildlife mangling was involved. That's all.

The better question is probably *why* I did what I did and set to work on an existing thing instead of trying to come up with something new. That's a long and messy answer. It has chunks of "because it was a right pain in the ass to start a new project there" along with "I was getting enough pushback just introducing the term 'SEV'", so changing the *host name* of the service from incidentassistant.${COMPANY}.${TLD} to something else was right out. I never would have heard the end of that one.

It was so hard to start a service there... ("how hard was it?") ... that one weekend, someone parked a critical piece of infra atop our "bookmark" shortcut service because it was needed to support a certain piece of legislation that had been passed by a city. It was a matter of "put it right there right then" or the company would have been in capital-T Trouble. Also, the company would have looked like giant assholes to the people who needed that level of service, but don't despair - they delivered on that sort of thing plenty of other times that year.

In the rush, the only real solution was to stack it atop something solely intended to support in-browser shortcuts for employees. Why there? It was simple, it was easy to hack on, and if they broke it for a few days, eh, it wasn't the end of the world. It was literally a thing that saved you typing out longer URLs when you wanted a shortcut to a particular internal web page.

As for why it was so urgent, apparently management sat on it as long as possible, THEN dropped it on the engineers. Typical crap for them.

And hey, now you know a bit more about the context of my "edit in prod" story from a few years back.

...

Another reader asked for my opinion of Tiny Tiny RSS because they are planning on self-hosting it. I pretty much covered the range of behaviors that are showing up to the feed test site in my summary post from the other day.

It's definitely doing a bunch of "double tap" polls all over the place, and I have no idea why. To be clear, I don't really care all that much WHY it happens, just that it does, and it happens a lot. It's not like a single user went and hacked something broken into the source. That doesn't fly when it shows up from multiple users who don't even know about each other.

Nope, whatever's going wrong is some weird stuff from upstream.

Feedback on feed stuff and those pesky blue screens

Wow, okay, things are cooking today. I posted last night with the results of the feed readers which were still actively pinging the test feed. I've heard from several feed reader authors, and they're cranking away at stuff! They obviously care a lot about doing the right thing.

One of them even put up a post already, and so I feel obliged to share it:

NetNewsWire and Conditional GET Issues

That post actually has a link to the report page, so if you want to see what a real batch of data looks like, there you go.

There are some non-feed-reader bits of feedback I'd like to address, too.

...

[Image: a hand-drawn analog clock pointing at 3:00 on Friday the 5th, with the 3 replaced by "SEV"]

One reader asked for my take on the whole CrowdStrike thing. This is where they shipped some update to a bunch of Windows machines and nuked them. I get the impression that this was something that ran rather low in the stack and had the ability to pretty much take control of the whole machine... or kill it entirely, as it turned out.

When stuff like this happens, I fall back to pointing out something that we tried to make into a "thing" at FB in the fall of 2014. There had been *far* too many outages in a really short period of time. It had started with "Call the Cops" (10 years ago now - August 1st!), and a month later we were getting tired of outages every Friday at 3 PM - I started calling it "SEV O'Clock".

The notion we tried to get known far and wide was "nothing goes everywhere at once". This means that code pushes, config changes, and even flag flips should happen piecemeal. Don't go from 0 to 100% in a single step. Take a bit and space it out. If you can't space it out, ask yourself why, then see if you can make it happen.
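
The whole principle fits in a dozen lines of anything. Here's the shape of it - the stage sizes, the bake time, and both helper functions are made up:

    #!/usr/bin/env python3
    # Sketch of "nothing goes everywhere at once": push in stages,
    # let it bake, and stop the moment health goes sideways. The
    # helpers are hypothetical stubs; wire them to real systems.
    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]    # fraction of the fleet per stage
    BAKE_SECONDS = 3600

    def push_to(fraction):              # hypothetical deploy hook
        print(f"pushing to {fraction:.0%} of hosts")

    def fleet_is_healthy():             # hypothetical: wire to real alarms
        return True

    for fraction in STAGES:
        push_to(fraction)
        time.sleep(BAKE_SECONDS)        # bake before going any wider
        if not fleet_is_healthy():
            print("hosts dropping after the update means something: STOP")
            raise SystemExit(1)
    print("rollout complete")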

I've seen some amazing things almost happen. One company was in the quiet period prior to their IPO when the execs know when it'll happen but the rest of us are in the dark. Basically, it could happen *any day*, and the bigwigs are out schmoozing with the press. It's the last time you want to tank the site, service, or whatever.

Once, I overheard some people saying they were going to convert the entire site's user database from goofy not-SQL db flavor A to goofy not-SQL db flavor B in one fell swoop on Sunday night. It was Thursday. I asked if they had a way to back out. Nope, it was a one-way trip. I asked if they could do it in stages. Nope, it was all or nothing.

"So you want to do something that will affect everyone at once, with no way back, on a weekend... during the IPO road show?"

They reconsidered.

To be clear, I have no idea if the CrowdStrike thing rolled out as fast as possible or if it had stages. It seems like if it had a phased rollout, then it might've nuked a few machines in a few places, but then the alarms would have gone off and they would have hit the big red STOP button... right? All of those hosts disappearing right after getting our update means something, doesn't it?

Well, look at that. "Breach hull--all die." Even had it underlined.

Two months of feed reader behavior analysis

Oh boy, it's been a while, so let's talk about feed readers. This whole scoring project has been running since the end of May, and much data has been collected. It's logged over 70K requests to tagged feed URLs in that period from 97 distinct keys - the unique IDs, each of which should be used by a single install of a single program.

Two of them have sent over 6000 requests in the time they've been running. Another half-dozen or so have sent over 2000. Keep in mind that not all of these started at the end of May, either. While some of them go back to the beginning, quite a few are much newer than that.

I've gone through and summarized anything that's still checking in, using an arbitrary cutoff of "has polled in the past week".

Once again, while any one entry does not reflect a program since a user might've configured it really badly (or really well), the patterns become obvious pretty quickly when you see multiple instances of the same thing.

Unlike before, now I'm actually going to list the names and versions. It's time.

[IMS = If-Modified-Since header. INM = If-None-Match header.]

...

Netvibes. Has some kind of caching error - stuck on old IMS/INM values for the past month basically. I'm going to kick it again to see what happens, but it doesn't bode well. Has some dumb 3 second rapid-fire thing at feed init, too.

SpaceCowboys Android RSS Reader / 2.4.15 (previously aka "sweatpants", more or less, apparently). This thing has a polling interval that's all over the place and it likes to just drop the IMS/INM and do unconditional polls now and then. No idea why.

NextCloud-News/1.0. Super broken caching. Terrible mess.

NewsBlur. Rapid-fire thing at startup, otherwise seems fine.

Friendica/2024.06-dev. No conditionals, ever. Also bum rushes the server when first adding a feed by hitting a bunch of goofy not-the-feed URLs (which can't be seen in the report, but *I* can see it in the logs).

Roy and Rianne's Righteously Royalty-Free RSS Reader v 0.2. Seems to have settled down into a nice polling pattern. No complaints.

Tiny Tiny RSS/23.04. Nothing but unconditionals. But then it also does this < 1 second "OMG LET'S POLL AGAIN" thing. What's the point? Annoying.

ureq/2.9.1. Seems to be in a good pace/pattern now.

Feedbin. Hella broken IMS/INM caching, terrible pacing, regular forced unconditional polls, etc. They trip 429s in prod constantly. There are a bunch of these and they're all broken the same way, so I chalk that up to the program/service and not any one user's config.

NetNewsWire. Bigtime brokenness in their IMS/INM caching. No idea why, but it makes it generate unconditionals constantly.

haven. All unconditional requests. Pacing is all over the place.

Miniflux/2.1.4 (plus older versions previously). They had some kind of caching bug but it looks like it's good and dead with this version. Hooray for doing the right thing!

inforss/2.3.2.0. Good pacing, proper caching behavior. Nice.

Miniflux/2.1.3 and then 2.1.4. As above. Yay fixes.

Some random thing that's just curl. Seems generally okay. Does a cheeky thing every week where it drops the IMS/INM, perhaps to see if the web server admin is a clown. (Spoiler: I definitely am.)

Inoreader/1.0. All over the place with timing and unconditionals. Also shoves a useless referrer down the pipe every time, as if that is going to accomplish anything. Has a really nasty rapid-fire startup sequence that'll probably get it autoblocked before it can even get subscribed properly.

Feedbin. Bleh. See above.

Feedbin. Meh.

Miniflux/2.1.0 that needs to be upgraded to 2.1.4. Currently isn't showing signs of cache insanity because I haven't deliberately triggered it in a while.

NetNewsWire. Just like the earlier one, broken caching and crazy timing.

Liferea/1.15.3. Has a tendency to just drop everything and start making unconditional requests... one second apart? Not great.

rawdog/2.24rc1. Behavior is spot on. More like this, please.

NetNewsWire. Same bad caching and timing stuff.

Something with no User-Agent, or IMS, or INM headers. It's like... okay, what exactly is going on here? At least it polls super slowly.

NewsBlur. Just like the other one.

FreshRSS/1.24.0. This one latched bad IMS/INM values back in *May* and is still sending them. Hella broken.

Miniflux/2.1.3 and 2.1.4. Seems fine now that it's been upgraded.

Tiny Tiny RSS - multiple 24.05 and 24.06 builds. Slightly better than the 23.04 one mentioned earlier. This one does send conditional requests, but it also sends tons of unconditionals, and it does the "OMG 1 SECOND HAS ELAPSED SO I MUST GO GET IT AGAIN" thing. Not cool.

Tiny Tiny RSS 24.02 and 24.05 - various builds. Just like the previous entry.

CommaFeed/4.4.0, 4.4.1, 4.5.0, 4.6.0. This had some caching bugs for a while and was sending bad IMS values. It's been behaving lately. It's probably time for me to deliberately do some weird (but totally valid) behavior again.

Vienna/8052, 8219. Timing is all over the place, but at least it sends conditional requests.

Some claims-to-be-a-browser thing (probably a browser extension). Sends derpy 2000-01-01 IMS at startup for no good reason. Caching headers jump around and disappear sometimes, like maybe multiple instances are warring over the state storage or something. Far too many < 1s polls for my liking.

Feedbin. Feh.

Tiny Tiny RSS/24.04. As explained above, OMGs and all.

theoldreader.com. I'm pretty sure Google Reader's poller did conditional requests. This thing doesn't. Ever. They seem to get around blocks in production by polling every 6-7 hours and using multiple hosts.

Some unidentified "Go-http-client/1.1" that seems to have a decent pace and which sends INM headers. No idea what it is, but it seems to be polite about the job it's doing.

Rapids. This one got upgraded and is now doing both IMS and INM and looks great. Thanks to the author for the effort in changing it.

er0k feeds. This one also seems to have been upgraded at some point and seems to be keeping up just fine now. I definitely need to do some "inject some crazy" testing to see how it does.

NetNewsWire. Broken, as above.

walrss/0.3.7. Just works. Nice.

Yarr/1.0. Startup had a bunch of unconditionals but then it settled into sending IMS and INM headers. Timing seems a little squirrelly, like it's not actually waiting for at least an hour to elapse, but rather just starts on some schedule and is subject to however long the last run took.

Bad data. Different programs using the same key.

FreshRSS/1.24.0, 1.24.1. Bad caching, bad timing.

Some other "might be a web browser extension" thing. Does the same 2000-01-01 IMS rapid-fire stuff at startup. Sending caching values but it's running a bit quick.

Feedly/1.0. This one is just abusive. At some point it switched to sending nothing but unconditionals every 30 minutes. It's like, okay, this *fake* feed doesn't say much. So chill out, already. Comes from a bunch of different hosts, too. Annoying.

NetNewsWire. Bro-ken.

feedparser/6.0.2. Does cached requests fine, but the timing is squirrelly here just like some of the others. A few weeks back, I wrote a post about why you can't just say 'start me every hour' and leave it at that.

Reeder, various builds. This thing is bizarre. It'll just flip and flop between INM values (things I served up as an ETag) from weeks in the past. Tell me that's not some kind of wicked caching bug. Also polls FAR too fast. Some programs have timing that is merely squirrelly. This is a whole cage of them.

Unusable data - multiple programs.

Unusable data - multiple programs.

Unread RSS Reader. Godawful poll timing. 6103 requests in 52 days is about one poll every 736 seconds _on average_, but they're hugely spread out. WTF? Put it this way: the list of unique intervals (nn seconds, nn minutes, ...) is *four pages tall* on my web browser.

Feedly. Also went to 30 minute all-unconditionals. Bah.

FreshRSS/1.23.1. Bad caching, like the others.

Yarr/1.0. Just like the other one: settled into sending IMS/INM but the timing is all over the place, and often far too fast.

Unusable data - multiple programs.

Miniflux/2.1.3 that needs to be upgraded to 2.1.4. As above.

Miniflux/2.1.3. Same.

Another unidentified "Go-http-client/1.1", but this one seems to have the weird timing issues like so many others.

newsraft/0.25. This one got into some bad places with stuck caching headers. Given the version number hasn't shifted, I assume it would happen again if I ran another test from my side. Wacky timing, too.

NetNewsWire. Broken record.

feedparser-perl/6.0.10 is doing conditionals but it's happening too quickly, and the timing is mighty inconsistent.

Mojolicious (Perl). Just works. No issues. Thanks.

NextCloud-News/1.0. Very broken caching. It's not even using the right timing to generate If-Modified-Since requests, and this has been a problem for years. This is the infamous "1800-01-01" IMS sender.

com.vanniktech.rssreader, various versions. Timing is all over the place, frequently far too fast - 2 seconds, 10 seconds? Also seems to "lose" the INM header now and then.

bdrss/4.0. Sends conditional requests properly. No idea how the timing works on this, but at least it's on the "long interval" side of things and not the "super fast polls over and over" like most of the others in this list. Maybe it throttles back when a feed is quiet? That's great, if so.

Broadsheet/0.1. Seems fine.

feedbase-fetcher.pl/0.5. Also seems fine.

Artykul/1.0. I installed an app on my Mac and it started checking in, and this stupid clown thing started checking in as well. The Mac app is long gone, but this albatross won't go away. Sends unconditional requests every single time, and it's visiting URLs you probably didn't intend to share with them! Wonderful. I wrote a post about this a while back.

...

There you go: a bunch of bright spots and then a bit of sadness.

Choose wisely.

Unintentionally troubleshooting a new way to filter traffic

I ran into a troubleshooting scenario the other day which ended up adding to the list of things that I need to check on when trying to figure out why packets seem to be disappearing. It went like this.

I showed up at a site where I'm running some weather station sensors and jumped on the console of one of the Linux boxes. My visit was about adding some sensors to some new areas, and I wanted to see how things were going. In particular, I wanted to see how the receiver on the local machine was doing, and what it had managed to log of late.

(Just imagine the port number is 1234 here.)

$ thermo_cli ::1 1234
rpc error: deadline exceeded while awaiting connection

... what? That made no sense. The thing was running.

I looked in 'ss' to make sure it was listening on that port and specifically was ready for IPv6 connections. It was.

LISTEN 0      1024         0.0.0.0:1234       0.0.0.0:*    users:(("thermo_server",pid=1141761,fd=4))                                                 
LISTEN 0      1024            [::]:1234          [::]:*    users:(("thermo_server",pid=1141761,fd=5))                                                 

I tried netcat instead... same thing. Instead of connecting, it just hung there. I looked in ip6tables... nothing. This host has no rules at the moment: nothing in 'filter', 'nat', 'mangle', etc. This should Just Work.

This site isn't running IPv6 natively due to a dumb ISP, but there are still ULA and link-local addresses, so I tried one of those from another host on the same network. That also went nowhere.

Looking in tcpdump, it was pretty clear: SYN comes in, nothing returns.
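
For loopback traffic like this, the capture amounts to something along these lines (the port is still the fictional one from above):

$ tcpdump -ni lo 'tcp port 1234'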

But, at the same time, I could change from that port to something like 22 and it would work fine. I'd get the usual banner from sshd.

Assuming it was something stupid I had done with my own code somehow, I ran another program as a test, then tried connecting to it over v6. It worked fine.

I straced it to make sure it was handing the IPv6 listener fd to poll(). It was. It wasn't getting any sort of notification from the kernel. But, actually, hold up, no SYN/ACK happened, so there's no way it would have gotten anywhere near userspace.

I stopped the service with systemctl and put up a netcat listening on the same TCP port (nc -6 -l -p 1234) ... and then I could connect to that just fine. So, it's not something magic about the port, somehow. It's just that port when it's going to the service which normally runs on it.

I started making a list to see what the patterns were. This box, this program, talking to ::1? Bad. Another box at this site, same program, also talking to ::1? Same problem.

Was it because this site has no v6 routing to the outside world? That makes no sense as to why ::1 wouldn't work, but, hey, one more thing to discard. I invented a fake route. Nothing happened (fortunately).

Next I started up a Debian VM on my laptop, hooked one of the radios to it, and started the receiver program on it by hand. It ran just fine, and accepted traffic over IPv6 to that same port without any trouble. It's on the same v6-route-less network as the other hosts, so what's up with that, right?

Maybe I did something stupid with the config file for the program, so I copied that across verbatim from one of the site hosts instead of just making a fresh one for testing. It didn't change things.

What if I dump the v4 listener on that port and just run the v6 listener? Nothing. What if I add a listener on another port? Nothing. Now that port also drops packets when I try to connect to it that way.

I don't know what it was about this last point, but somewhere around here, a couple of ideas finally connected in my head and I went "uh, systemd".

The failing instances were both running as systemd services. The successful instances (whether the thermo_server program, or my other test stuff) were just me doing ./foo from a shell.

That's when I thought about the hardening work I'd been applying to my systemd services of late. I've been taking away all kinds of abilities that they really don't need.

One of the newer tricks in systemd is that you can do "IPAddressDeny=" and then "IPAddressAllow=" and keep a program from exchanging traffic with the rest of the world. For a program that's only ever supposed to talk to the local network, this was a good idea.

That's when I saw it: I had 127.0.0.0/8 and the local RFC 1918 networks on the Allow line, but *not* ::1, never mind the ULA prefix or the link-local v6 stuff. Adding ::1 and doing the usual daemon-reload && restart <service> thing fixed it.
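
In .service file terms, the broken and fixed states look roughly like this - the networks here are illustrative, not my actual config:

[Service]
IPAddressDeny=any
# Before: v4-only, so v6 SYNs to ::1 were silently dropped
#IPAddressAllow=127.0.0.0/8 192.168.0.0/16
# After: the v6 loopback is allowed too (ULA fd00::/8 and link-local
# fe80::/10 would need the same treatment to be reachable)
IPAddressAllow=127.0.0.0/8 192.168.0.0/16 ::1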

Here's the deal: systemd implements that by injecting bpf program(s) when you ask it to filter traffic by IP addresses in the .service file. When this thing rejects traffic, it just drops it on the floor. It does this past the point where ip[6]tables would match it, and well before the point where it would generate a SYN/ACK or whatever else.

There are no counters associated with this, and it doesn't generate any messages in the syslog or whatever. The packets just disappear.

It's effectively equivalent to an ip[6]tables rule of "-j DROP", but at least those rules have byte and packet counters that you'd see incrementing when you're smashing your head against it. This just makes the packets disappear and nobody has any idea what's going on.

So, if you ever see traffic effectively being blackholed to the port or ports which are bound for a particular systemd-run service without anything showing up in your iptables (or let's face it, nftables) stuff, you'd better check to see if there are IPAddress* rules in the .service file. That might just explain it.

Hopefully you'll remember this and not waste a bunch of time like I just did.

What happened to my /edu page, and why it came back

I got some reader feedback the other day which amounted to "what happened to your /edu stuff" and "wasn't there a lot more in there?" These are legitimate questions, and I figured it's worth explaining what that was and then what happened to it.

Way back in 2013, I was post-one-big-company and pre-another, and got the idea in my head that people would want to watch screen captures of me writing code. I even had the thought that there might be some money to be made in the process.

I didn't like the fact that most of these "screen recordings" were literally videos of people's terminals (or IDEs, ick). They lost the ability to copy and paste from the screen since it was no longer character-based and instead was pixel-based. It also meant way more bandwidth to serve up such things.

Back then, I was still on my ancient ServerBeach machine which ran its Ethernet port at 10 Mbps *half duplex*. I'm fairly sure they provisioned their network this way deliberately. It had the effect of limiting just how much load you put on their network since they didn't charge for bandwidth back then. (Now you know why YouTube used them for their video serving pre-Google, and why only their db stuff was on the Rackspace side of things!)

As a result, I didn't want to get into the business of serving videos over that anemic pipe. The link already slowed down way too much when someone would post an image-heavy post like the "Apple Maps sucks" series to HN or reddit or whatever.

That got me thinking, and I built something that would record my terminal as text, control sequences and all. Then I took a whack at writing a VT100 emulator, got far enough along, and then realized it had already been done (tty.js) under a compatible license, so I grabbed a copy of that instead. Then I just chopped out all of the stuff that made it able to take input, and hard-wired it to my playback system.

Then I started putting up recordings of various dumb things, like me using my non-Makefile-based build system. The idea was that maybe if people saw me using a tool that didn't suck and which made my life better, they'd want a piece of it, too.

At some point, this craziness was linked up to my existing code which talked to Stripe, and so you could pay me a buck to add an item to your "account", and it would let you play back some "lessons".

Then I got hired to serve up cat pictures... or at least, to keep the serving of cat pictures working. All of my time and energy went into that, and things on this side of the world slowed way down and eventually ground to a halt. Weeks or months would pass with very little going on.

About two years into the cat-pic-wrangling gig, it was time to leave that ServerBeach machine behind in order to get IPv6 since they were still too clueless to offer it. All of the data was copied from one machine to another, but I didn't want to go to the work of rebuilding all of the CGI programs for RHEL 5, or validating that it actually worked. It required too much effort.

I just didn't have the energy to do much more than moving the web sites over and repointing DNS, and that included what it would have taken to make it work without the payment integration stuff - "free mode". The impact was that both /store and /edu were shut down.

That was pretty much it for /edu until someone reached out to me in May 2022 and said they were wondering what the page had been like, and if they could explore the old code. Fortunately for them, at that point in my life, I had cycles to spare, so I dug out some of the old stuff, chopped out the payment integration, and put some of the recordings of the old "protofeed" project online.

I didn't mention this anywhere, but if you happened to go there after that point, it would have Just Worked. It also means that some of the links in those old posts "came back to life", and that makes me happy on a certain nitpicky level that I don't expect most people to understand. (Cool [URLs] don't change, yadda yadda.)

So, if you haven't been keeping track, the virtual terminal is back online, and has been for about two years, but without new content.

If you like watching terrible code happen, you might enjoy it.

If you like badly-written web pages that tell you to reload to start over, you might really enjoy it.

How to waste bandwidth, battery power, and annoy sysadmins

Okay, let's talk about something other than feed readers for a moment. How about completely broken web browsers? Yeah, those.

This. This is a thing. Count the broken:

ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:26 -0700] "GET /w/css/main.css HTTP/1.1" 200 1651 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/feed.png HTTP/1.1" 200 689 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed/ HTTP/1.1" 200 8052 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 
ip - - [28/Jun/2024:14:44:27 -0700] "GET /w/2024/05/27/feed//favicon.ico HTTP/1.1" 404 20 

First up, why in the hell do you need to request the same link 12 times? No, scratch that, 15 times, since it does 3 more after getting the css and feed icon.

Then it goes for the favicon, and what clown decided that the right way to request "/favicon.ico" is to prepend the base path to it? This cursed thing that Microsoft foisted upon us back in the 90s is supposed to be at the top level. It's not part of individual directories. That would be stupid.

And yet, this thing decides to beat the shit out of the web server while trying to get it.
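
The mechanics of the bug are easy to demonstrate (the host here is made up):

from urllib.parse import urljoin

base = "https://example.org/w/2024/05/27/feed/"
# Correct: "/favicon.ico" is root-relative, so it resolves to the top level.
urljoin(base, "/favicon.ico")  # -> "https://example.org/favicon.ico"
# The broken client glues the strings together instead:
# base + "/favicon.ico"  -> ".../w/2024/05/27/feed//favicon.ico" (hello, 404s)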

I used to wonder just what could be this stupid. The user-agents on these bad requests aren't particularly helpful. But, then one day, I got lucky and noticed that the first request of the set has one very interesting little detail in it (while the others do not):

FxiOS/127.1

FxiOS. That is, Firefox for iOS. That by itself was enough to get me looking, and oh, look what I found.

Request spamming when visiting a site

Request spam for favicon and apple-touch icons on iOS 16 + Firefox 105

Request Flooding when opening app

favicon.ico is in / , it looks for favicon.ico in every directory except / .

Firefox on iPhone makes a flood of requests for icons

Lovely. So, if you're using this garbage, know that you're probably leaving a trail of badness in your wake.

A high-level view of all of this feed reader stuff

Yeah, I know, it's another post about feed reader stuff. I figured people are probably wondering what the results have been looking like. I also wanted to see things in aggregate, and fixing my mistakes turned out to really need a "30,000 foot view" approach.

So, here's a screen shot of the top third of my admin tool with the unique keys removed to give a taste of what we've all been collectively finding out.

feed reader score report table

This is showing the number of requests, how many were conditional or not, which ones presented a good If-Modified-Since value, one that was out of sequence ("oos"), or made up entirely. Then it does the same three counters for If-None-Match values.

Next up are the counters for cookies, referrers and query parameters which were needlessly presented for whatever reason, and finally it's whether the requests are showing up over IPv4 or IPv6.

I mostly rigged it this way so I could watch the "ims bogus" counts drop out as I fixed each set of entries in the database. (What a mess.)

There are more than a few keys which were issued but haven't been used, and that's what those grey rows are in the table. Even with that reduction, it still goes on much too far to present as a single screen shot - there are a *lot* of people taking part in this thing.

So, what can we learn from this view? The first row was just me running tests, so I deliberately tripped a bunch of badness to make it show up in the reports. Then there are a few boring ones which were more tests from me, and then it starts getting into actual programs.

See the one with 256 hits and 256 unconditional requests? That'd be the "leaked to the clown" situation from the other day. They're still showing up over and over to fetch that thing.

How about the one with 1258 bogus IMS values? That'd be a feed reader which is sending the feed update time, not the Last-Modified time. That's a nasty one since it looks conditional, but in practice it is not. Every one of those requests gets the full feed shipped out. (This is a rather popular program, so there are lots of these things hitting the real site all day every day. Groan.)

The really interesting parts of this are the ones which are consistently sending out-of-sequence If-Modified-Since and If-None-Match values. Out of sequence in this context means "they're sending a value, but it's an old one, not the most recent one served to them on their last hit". It seems we have managed to trip a LOT of caching bugs in these programs. They latch in some bad state, and just keep going like this forever. Not good!

One thing that has been gratifying about this is when feed reader authors have changed something in their code and then reached out to me for a fresh key. That lets them start from scratch with their new behavior and leave the old one behind. I'm always happy when this happens.

This also adds a new dimension that I'll have to start tracking, now that we have tests ending and others coming in to replace them: staleness. I'll only be looking at things which are still actively checking in so as not to penalize anyone who's improved their stuff. We will leave the old behaviors behind and focus on what's current.

That's the best you can hope for: a series of improvements.

Feed reader score project participants: I made a mistake

A short note to my feed reader score project participants: I screwed up something fierce the other day. I set some bad dates for the Last-Modified header. It's all done in code, and it's not based on an actual file, so it's possible to set *nearly* any value in there.

It started when I either fat-fingered or stupid-houred this date the other day:

Thu, 08 Jun 2023 00:04:45 GMT

That was in the GET handler for the test feed for about a day. Then, when I "fixed" it, I set it to something a few days in the future:

Thu, 20 Jun 2024 00:04:45 GMT

It was only June 16th when I did that. Oops.

Anyway, this started showing up as all kinds of crazy If-Modified-Since values being presented by clients, and I thought those clients were taking the value from my end and were "clamping" it to the current time. Instead, nope, it was all on my side.

Would you believe that Apache httpd will do that all by itself? Yep. If you have a CGI program which emits a Last-Modified header in the future, it'll totally squish it down to the current date/time instead.

Further complicating the matter is that I logged the values that I *thought* were being served up, and then used them to run comparisons and generate warnings when they didn't match.

My sincere apologies to anyone who spent the last couple of days chasing after a problem that was totally created by me on this end.

Now I get to figure out how to clean up my mess. Fun!

Leaking URLs to the clown

With all of this feed reader stuff going on, I've learned a few more things about goings-on in this space. Some of it is just strange.

During the early development stage of this project, I installed a couple of apps from the Mac app store on my machine to do some testing. Each one of them was given a unique URL so I could tell them apart. This meant I started seeing traffic from my laptop to the test server, which is exactly how it was supposed to work.

But, one of those unique URLs started getting requests from some random "cloud" service. There was no indication this would happen when I plugged it into the app. It just appeared, and it was running in parallel with the requests from my actual laptop. In fact, I've since stopped running all of those programs, and they're *still* polling it - just under every three hours, and always unconditionally. Great. I bet it'll keep going approximately forever.

Nothing on the app's web page suggests this will happen. It just does. If anything, the "all articles are available offline and without an internet connection" blurb on their web page suggests the opposite: the laptop does a fetch and keeps it locally. What a concept!

So, if you were thinking about using that particular app to read some feed containing something relatively private, guess what, they're reading it too.

Can you run in a tight loop and still be well-behaved?

Timing things to happen at specific intervals is yet another way that we collectively find out that dealing with time is a hard problem. I've been noticing this while working on feed reader stuff, and I realized that it can apply to other problems.

It goes like this: say you want to have a process that runs at most once an hour. You are okay with it taking a little more than an hour between runs, but really don't want to go faster than that. Maybe you have an arrangement with a service provider to not poke them too often. Whatever.

So maybe you rig something up using cron, and it looks like this:

15 * * * * /home/me/bin/do_something

Then, every hour, at 15 minutes past, cron will run your program. Unfortunately, this by itself is not nearly enough to deliver on your arrangement. It's not even the problem you might imagine at first, which is that system clocks can be sloppy and can get pulled around by external forces.

Nope, this has to do with the time it takes to actually do the work, and not accounting for that when allowing the work to proceed again.

Back to our cron job. We'll say it gets installed at midnight, so 15 minutes later at 00:15:00, it starts a run. Maybe it does a lot of work and talks to many sites over the Internet. Some of them respond quickly, but others are slow. Maybe their DNS is taking forever to resolve the hostnames. Maybe another site is offline and is just dropping packets, so you sit there until a timeout fires on your end. It burns a good minute doing this.

At 00:16:00, it finally gets around to doing the "once an hour" work, and it happens relatively quickly. Then it finishes and goes to sleep.

About an hour later at 01:15:00, cron will run your program again. This time, maybe all of the earlier work happens much more quickly, and all of it completes in 15 seconds. That means you get around to your "once an hour" work at 01:15:15.

Oops. You were supposed to wait at least 3600 seconds - that's one hour - between requests, but you just ran it after only 3555 seconds.

The problem is that you can't just rely on the start time of your program to know if enough time has elapsed since it last did some work which is supposed to be rate-limited. You have to actually track the time when the work *was attempted*, and then do the math of "elapsed = now - then" to see if enough time has gone by.

I tend to think of the timeline for this sort of thing as a series of fenceposts, like this:


start       action      end     (rest of the hour here)
    |          |        |
    v          v        v
----*----------*--------*---------------------------------------->

To avoid violating rate limits, you have to time things from when the action happens, not when the program starts up. If you want to really be paranoid about it, then you'll want to time it from when the program is all done with its work and is about to shut down (but this is a lot harder).

What ends up being much easier is to just remember whenever the work last started and/or finished, even if it didn't succeed. It should never select a target for refreshing until it has been idle for long enough. The program must never assume "well, I'm running again, so it must be time to do my thing". What if the box just rebooted, or any of a number of other possibilities? What then?
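
Here's a minimal sketch of that bookkeeping in Python. The state file path and the interval are made up for the example:

import time

MIN_INTERVAL = 3600  # seconds; matches the "once an hour" deal above
STATE = "/var/tmp/last_attempt"  # must survive across runs (and reboots)

def do_the_rate_limited_work():
    pass  # stand-in for the actual hourly action

def should_run_now():
    try:
        with open(STATE) as f:
            last = float(f.read())
    except (OSError, ValueError):
        last = 0.0  # no (usable) record of a previous attempt
    return time.time() - last >= MIN_INTERVAL

def record_attempt():
    # Record *before* doing the work: a crash mid-run still counts as
    # an attempt, so a start-crash-restart loop can't hammer anyone.
    with open(STATE, "w") as f:
        f.write(str(time.time()))

if should_run_now():
    record_attempt()
    do_the_rate_limited_work()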

Here's an easy way to know if a program is on the right track: could it be run in a tight loop without causing a giant mess for other people?

$ while true; do run-my-stuff; done

If you can run something in a loop like that and not have it beat the crap out of whatever it's supposed to periodically talk to, then you're probably headed in the right direction. It also means that if the program gets into a start-crash-restart loop some day, maybe it won't unleash a hellstorm on whatever it happens to talk to.

Running a program in an infinite loop like that might chew a lot of resources on the local machine, but that's (relatively) okay. It's your machine. Feel free to burn your own resources. Where it becomes troublesome is when it reaches out and starts burning those of other people.

As usual, the details are important here.

Some early results for feed reader behavior monitoring

I've had a few people ask me for results from the feed reader score project. It's been long enough to where I can start giving some details, now that we've had a good week or more of data collection.

There's one big thing to keep in mind here: I am assessing individual feed reader installations, including whatever config values the user might have set globally or on the test feed in particular. Those config values can be the difference between "amazing" and "get it away from me".

That means a single good entry doesn't necessarily mean that every install of that program will behave perfectly. It also means that a single bad entry doesn't mean that all of them will be terrible.

I've broken them down into a few groups.

Group A: No real complaints. They do their jobs quietly and don't make messes. Anomalies, if any, don't seem systemic and are probably just the result of the user clicking the "poll now" button (or equivalent). This is expected.

Group B: They tend to do spammy unconditional requests at startup, and usually at a needlessly fast rate, too - like less than a second apart. This is what most entries in group B have, and if that's their only problem, then fixing that would move most of them into group A. (There can be other small anomalies which put something here.)

Group X: Unusable data. This can be because there hasn't been enough data collected yet, like if someone just started it up, or if they shut it off before it ran for several days. It can also happen when someone points multiple feed reader instances (same version or not) at their unique tagged feed, or if they load it with a browser, curl, or similar.

Groups C, D, and F: Everything else (and I'm not identifying who's who, or what groups they might be in).

A few minutes ago, I went through all of the tests one by one and came up with my own assessment based on the available data. Ordering within a group is not meaningful.

Group A: instances of:

  • awkbot
  • rawdog
  • Awasu/3.3PE
  • walrss/0.3.7
  • Mojolicious (Perl)
  • com.vanniktech.rssreader:1.40.5, 1.40.6
  • Broadsheet/0.1
  • feedbase-fetcher.pl/0.5
  • ... something unknown that claims to be a web browser

Group B: instances of:

  • Liferea/1.15.3
  • NewsBlur
  • FreshRSS/1.23.1
  • bdrss/4.0
  • ... some unknown Thunderbird extension
  • ... another thing claiming to be a web browser

Anything not shown here is not being tested or is in another group, or I screwed something up and missed it. Contact me if you think I skipped your entry.

I should mention that more than a couple of systemic bugs have been found across multiple reader programs:

Bug: It's entirely possible for a feed's Last-Modified value (seconds) to remain the same while the ETag (length + microseconds on stock Apache) changes. More than a few feed readers assume if they get the same value for Last-Modified, then they don't have to update the cached ETag value. This causes them to effectively make unconditional requests until the feed changes again. Watch out for shortcut evaluations in your caching code!

Bug: If-Modified-Since is only really valid if you were served it as a Last-Modified value previously. Readers are inventing values, or are sourcing them from the wrong layer of the stack. Don't do this.

Only use the last Last-Modified value for If-Modified-Since, and only use the last ETag value for If-None-Match.
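
Put together, a well-behaved client loop looks something like this sketch (Python with the requests library; the dict-based cache is just for illustration):

import requests

def poll_feed(url, cache):
    headers = {}
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # not modified; leave the cached copy alone

    # Update BOTH validators on every 200. The ETag can change while
    # Last-Modified (whole seconds) stays the same, so never skip the
    # ETag update just because Last-Modified matched.
    cache["last_modified"] = resp.headers.get("Last-Modified")
    cache["etag"] = resp.headers.get("ETag")
    return resp.text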

Bug: Timing is too tight, and they aren't accounting for how long it takes to perform a poll. I'll probably do a separate post about this since it comes up in other things in the world, too.

Bug: Launching multiple identical requests at feed init time, and usually in a volley that triggers rate-limiting. There's something wrong with the network I/O design when this happens. Calls across the network are not "free" and should be executed sparingly. Don't discard the values only to fetch them again a moment later.
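
One way to defend against that last one is to coalesce identical in-flight requests, so a burst of "fetch the feed now" calls at init collapses into a single network hit. A minimal "single-flight" sketch - my own construction, not lifted from any particular reader:

import threading

class SingleFlight:
    # Only one call per key runs at a time; concurrent callers for the
    # same key block and then share the leader's result (or exception).
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> {"done": Event, "result"/"error": ...}

    def do(self, key, fn):
        with self._lock:
            entry = self._calls.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event()}
                self._calls[key] = entry
        if leader:
            try:
                entry["result"] = fn()
            except Exception as e:
                entry["error"] = e
            finally:
                with self._lock:
                    del self._calls[key]
                entry["done"].set()
        else:
            entry["done"].wait()
        if "error" in entry:
            raise entry["error"]
        return entry.get("result")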

Reader feedback: feed reader scores and "like" buttons

Well, it's been an interesting couple of days. I got all kinds of feedback submitted from people who wanted to participate in the feed reader score service project. I spent quite a while answering every one of those and issuing keys to anyone who asked.

The whole time, the data has been coming in, and we've been able to start seeing some interesting patterns in the noise. There are a fair number of people who have absolutely perfect feed readers. They show up at reasonable intervals, send the right header(s), and just work with no fuss. It's fantastic.

For those not currently participating, here's how it works. I send you a link to a web page with the instructions and a unique key that's just a bunch of random hex digits. The web page tells you how to turn that into a unique URL that can be handed to a feed reader.

Then, there's another key-based URL which shows the report. I provided a text dump of what it looked like in Thursday's post, and that's all it was at the time. I've since put in the effort to make it actually generate a Real Web Page, complete with actual HTML and all of this stuff (imagine that). Now it looks more like this:

"HTML tables and lists of statistics"

I've removed the identifying marks (the key at the top, and the user-agent string at the bottom).

This program got a little frisky at startup and actually sent multiple identical unconditional requests for some reason. I don't know why, but it's not good behavior.

It also made requests for some other made-up URLs that weren't supplied by the user (who was talking to me at the time), and I don't mean favicon or those apple-touch-icon things. I mean it took the URL provided by the user and glued on /other /stuff. That kind of stuff doesn't show up here, but I could add it later.

It's strange. Bum rushing a web server isn't a good first impression.

...

Separately, an anonymous reader asked if I would be willing to put up stats for the posts so they could see which ones were particularly popular. I actually don't really analyze that. I can tell on a basic level when things are popular because the "tail -f" of the logs really starts moving, and I can see an uptick on the bandwidth utilization.

But other than that, I have only my own random impressions of what's popular and what isn't based on remembering activity levels and the flavor of whatever feedback it might generate (or not). So, there's no data to provide, and I really don't want to build such a dataset anyway.

A post is a post. If it works out, that's great, but if not, eh. I have no particular reason to "tune" them. I don't have ads to serve, impressions to generate, or eyeballs to sell out to the highest bidder. It's just a whole mess of text.

Most of what I use my logging for is to handle abuse. Analytics? Feh.

The one exception is from 2020 when I noticed that a fair number of bug-tracking systems were leaking URLs (many internal) to specific posts about particular nerdly analyses of failure modes like "don't setenv in multi-threaded code" and "malloc(1213486160)" and all of this. The post I did about that links to the older posts and gives a little info on them and a few scant details about the incoming links. Such referrer data is almost completely dead now, since I just get bare hostnames without any paths. It's rare to see much more than that, so I don't expect a repeat of that post, well, ever.

They also asked for "like/dislike" and maybe more (FB style?) "reactions". I'm not likely to do that, either, for the same reason as why I don't have any kind of public comments: managing that kind of thing is serious work. If it's not managed and kept in the right groove, it will turn into a very bad place at the times of the week when people with lives are out living them, and THE ONE can run amok with nobody to tell them to shut the hell up. You know exactly which forums I'm talking about.

Also, I'm pretty sure that having me trying to operate the forum and yet also moderate it would quickly turn into yet another case of "don't do both" as I have few qualms about dropping the banhammer on bad behavior... but that tends to divide communities. (Been there, done that.)

Finally, I like the fact that all of this stuff is a forest of flat files. There are no connections to any sort of database in the serving path for the posts or the feed. That would be defeated if I added something to dip into a table and look to see how many "likes" it had before shoving the post out the door.

The entire /w/ path in the document root is about 151 MB at the moment. flicker (the current web server) is a monster: besides being physically massive, it also happens to have 128 GB of RAM in it. That means it can fit the entire thing into memory and basically keep it there, so there isn't even a concern about how fast the "disks" are, since it only touches that stuff once per item.

You might have noticed that things are usually pretty snappy. That's a big part of why: it's doing as little work as possible.

If there was dynamic stuff going on, that wouldn't work nearly as well. For me to cross that particular divide, it's going to have to be for a very good reason.

The feed reader score service is now online

The "feed reader score" service that I mentioned earlier this week is now up and running. Several people reached out to me and I have sent them their unique codes and links to the instructions so they can get started.

As of about a day ago, we are now logging metadata on the requests, and as of a couple of hours ago, the reports are starting to be built. I'm still adding features to it, but it's already pretty clear that not all feed readers are built to the same standards.

This is a sample report for the sort of real programs that I see plenty of on the actual site all day every day:

Time span analyzed: about 17 hours (63734.284600 sec)
-------------------------------------
Number of log entries: 34
-------------------------------------
Conditional requests: 0
-------------------------------------
❗ Unconditional requests: 34
One is normal for feed initialization.
-------------------------------------
Good If-Modified-Since values: 0
-------------------------------------
❗ Bogus If-Modified-Since values: 34
-------------------------------------
Good If-None-Match values: 0
-------------------------------------
User agent: (34) <redacted>
-------------------------------------
Useless cookies: none!
-------------------------------------
Useless referrers: none!
-------------------------------------
Useless query parameters: none!
-------------------------------------
(Still evolving.  Check back later.)

I've removed the user-agent info here so you'll just have to guess at which software this might be.

As you can see, in the short time it's been reporting in, it always sends requests with broken If-Modified-Since values. This makes it get a full copy of the feed every time, and that makes it an unconditional request.

The average polling interval works out to about 30 minutes - 34 requests across roughly 17.7 hours. If it ran all day at that pace, it would make about 45 requests and would use about 25 MB of data that's all redundant since nothing's actually changing.

So, it begins! If you want to participate, send me some feedback and let me know. Also, to that one "wants to be a good citizen" person who indicated interest but didn't supply an e-mail address, can you try that again? Thanks.

I'm going to give this "reader score" thing a spin

Wow, okay, the "feed reader score" post has been making some waves. There have been some good discussions about it, and very few of the usual failure modes for those forums.

I heard from a bunch of people who are feed reader authors or just feed reader users who care deeply about doing the right thing. They actually want to point their program at my proposed "score" site to see what it's doing, and so they can find out if it's behaving badly somehow.

I like hearing this! This kind of stuff warms my heart.

So, I've started taking steps to invest in this project. To that end, I ran out, had a couple of slices of pizza and some root beer, then came back and started digging in my parts box. I grabbed the old Raspberry Pi B+ (star of the armv6 post from last fall) and started bringing it up to date.

I got it all configured as a proper outside-facing box, then I actually *hopped in the car* earlier and drove it out to the data center to plug it in. It's there now, just waiting for me to start doing stuff with it.

Why a separate box? Well, I've done this kind of stuff before, and it's easy for a "test target" to turn into a smoking crater. This way, if it goes really badly, I can just turn it off and it won't affect the other boxes out there. It isn't affected by whatever rate-limiting I might do on the "real" web server.

Also, it caps the amount of resources that can be spent by this project - it's a goofy little box that'll barely move a few tens of Mbps across its godawful NIC. It's not going to be haul-ass fast or anything like that, but it doesn't have to be. It just has to log the incoming requests. Something else gets to do the analysis to figure out whether the observed behavior is good, bad, or just plain meh.

To be clear, the "feed" this thing will be serving is not going to have any real posts in it. You won't be able to read my latest stuff by subscribing to it. Also, I will warn everyone right now that I fully expect to have to yank the entire sub-zone out of the domain multiple times after people ruin one, and I have to NXDOMAIN them early.

The URLs you get for testing are not expected to be long-term durable, in other words. It all depends on how much abuse it gets. Considering you can figure out what a feed reader is doing after a day or two at most, that's probably not going to be a problem.

Why am I writing about this? I'm trying something different: basically, if I talk about it first, will that light a fire under me to make stuff happen sooner? This is not normally how I operate. I normally just show up with something already done. This kind of project is just big enough to where that won't work.

I guess it's time to ignite a petroleum product at zero-hundred hours.

So many feed readers, so many bizarre behaviors

It's been well over a year since I started serving 429s to clients which are hitting the feed too often. Since then, much has happened, and most of it is generally good news.

I've heard from users and authors alike of feed software. Sometimes the users have filed bug reports and/or feature requests and have gotten positive results from the project (or vendor). Other times, the authors of such software have gotten in touch, done some digging, found a few nuances of how their libraries work, and improved the situation.

Some of them are trying but are still not quite making it right.

Here's some of what's been going on.

...

At least one reader improved and no longer sends a date from 1800. Unfortunately, it's now sending the wrong value in its If-Modified-Since headers. Instead of sending the value it obtained from the Last-Modified header on the past fetch, it's using the value from the "updated" element in the feed itself.

These are different layers of the system, and you can't mix their values together. They're close, but not exactly the same. This is how it works.

I get the impression from stalking their issues that they don't really control their HTTP requests very well because it's done by some other library. That's where the 1800 thing came from in the first place. It sounds like yet another case of using libraries that really don't do their jobs properly.

...

There's another one which hits every 2 minutes without fail, and there's no way to change it. I've even installed that app on a test account and verified this myself - it's hard-coded. As a result, it's the one feed user-agent I've had to block outright. It doesn't stop requesting things even when it hits a brick wall of 403s. Clearly, I need to use a bigger hammer.

...

A fair number of people are sending conditional requests, but are doing it every 5 or 10 minutes. This is ridiculous. I don't write that often, and never have. Polling more is not going to get you anywhere, and indeed, will now get you delayed so you get your updates much later than the well-behaved people. Knock it off.

It seems like most of these come from things which appear to just be jammed into web browsers as some kind of extension. From what I've heard from at least one developer, it seems like they don't do conditional requests as a matter of course. This, despite being part of a web browser ecosystem which has understood the notion of a conditional request and caching things locally for nearly three decades. Amazing.

...

A while back, I added a "Retry-After: " header to the feed. Anyone who gets a 429 will also get intel on when they should try back. It's in seconds, so it'll be something like 3600 or 86400 depending on which kind of request was sent in the first place.

There are feed services which will actually reset their countdowns every time someone trips a 429. I'm not doing that. Yet.

This is why noticing and honoring that header matters.
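
Honoring it can be as simple as pushing the next allowed poll time out into the future. A sketch, assuming (as here) that Retry-After arrives as a number of seconds:

import time
import requests

next_allowed = 0.0  # wall-clock time before which we must not poll

def polite_poll(url):
    global next_allowed
    if time.time() < next_allowed:
        return None  # still inside the server's requested back-off
    resp = requests.get(url, timeout=30)
    if resp.status_code == 429:
        # The server said when to come back; believe it.
        delay = int(resp.headers.get("Retry-After", "3600"))
        next_allowed = time.time() + delay
        return None
    return resp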

...

Oh, here's a new thing: goofy programs that try to "guess" the feed URL. I see all kinds of stupid requests to paths that might have a feed on it. This is a new level of density on the part of the authors of those programs.

Here's the thing. I've had metadata in the top of every single /w/ post *and* its index since some time in 2012. It looks like this:

<link rel="alternate" type="application/atom+xml" href="/w/atom.xml">

If you view source on this post or any other on the web, you'll see it up there, just hanging out.

I did that way back then because browsers used to care about RSS and Atom, and they'd put that little yellow feed icon somewhere in the top bar when they spotted this sort of thing in a page. At least in the case of Firefox, you could click on it, and it would throw the target URL to a helper of your choice.

I wrote a feed reader system at the time (remember fred?), and indeed, I could click on that icon and it would flip the feed URL over to my "subscribe to new feed" handler. It was easy.

Then, something happened, and browsers gave up on feeds, and the icon disappeared. I kept it there anyway, figuring people would make use of it. It's still the right way to programmatically find out where to get an Atom feed for the content you're looking at.
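
Autodiscovery off that tag takes very little code, too. A sketch using nothing but the Python standard library:

from html.parser import HTMLParser
from urllib.parse import urljoin

class FeedLinkFinder(HTMLParser):
    # Collects hrefs from <link rel="alternate"> tags that declare an
    # Atom or RSS content type.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and "alternate" in (a.get("rel") or "").lower()
                and a.get("type") in ("application/atom+xml",
                                      "application/rss+xml")):
            self.hrefs.append(a.get("href"))

def discover_feeds(page_url, html_text):
    finder = FeedLinkFinder()
    finder.feed(html_text)
    # hrefs are often relative (like "/w/atom.xml"), so resolve them.
    return [urljoin(page_url, h) for h in finder.hrefs if h]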

So what's with all of the groping around in the dark with made-up URLs?

...

This one blows my mind. I put together a page which has the feed URL on it as just plain text, not a link. I've seen people paste it into their feed reader and include spaces and even newlines. Seriously!

I know this because I get requests for things like "/w/atom.xml%20" over and over from feed readers which obviously don't notice they get a 404 every time.

...

Now we get to the part where I pitch a way forward, and nobody takes me up on the offer. The idea is basically this: I get some kind of commitment and support from the people who do feed reader stuff, and in turn, I build a new kind of web site which amounts to a "feed reader correctness score".

It would probably work like this: you load up a page and it hands you a special (fake) feed URL that is keyed to you and you alone. You plug it into your feed reader program through whatever flow and it will keep track of every single request to that keyed URL.

Then, after it had collected data for a while, a report would eventually become available. Just off the top of my head, the kinds of things it might say could look like this:

* Poll history: 46 checks in the past 48 hours (average 62 minutes)

* Request types: (1) unconditional (45) conditional

* If-Modified-Since timestamps: (45) matches (0) made up from whole cloth

* ETag hashes: (45) matches (0) made up from whole cloth

* Useless cookies sent: none!

* Useless referrers sent: none!

* Useless CGI arguments sent: none!

* User-agents: (40) FooGronk/1.0 +http://fg.example.org/ (6) FooGronk/1.01 +http://fg.example.org/

That's the kind of stuff I'd expect to see from a nigh-perfect reader. It connects at a reasonable pace, it sends headers with correct values, and it doesn't send along stuff like cookies that I never set in the first place.

But, okay, this is nothing but vaporware unless someone actually wants it, is willing to support it, and will commit to take actions based on what it says.

There's a bigger lesson here: don't measure stuff if nobody's going to take actions based on the results. It only ever ends in misery. I wanted to write a separate post about this very topic, but figured I'd give a preview of it right here.

Okay world, surprise me. Do the right thing.

SSD death, tricky read-only filesystems, and systemd magic?

Oh, yesterday was a barrel of laughs. I've said a lot that I hate hardware, and it's pretty clear that hardware hates me right back.

I have this old 2012-ish Mac Mini which has long since stopped getting OS updates from Apple. It's been through a lot. I upgraded the memory on it at some point, and maybe four years ago I bought one of those "HDD to SSD" kits from one of the usual Mac rejuvenation places. Both of those moves gave it a lot of life, but it's nothing compared to the flexibility I got by moving to Debian.

Then a couple of weeks ago, the SSD decided to start going stupid on me. This manifested as smartd logging some complaint and then also barking about not having any way to send mail. What can I say - it's 2024 and I don't run SMTP stuff any more. It looked like this:

Apr 29 07:52:23 mini smartd[1140]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Apr 29 07:52:23 mini smartd[1140]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Apr 29 07:52:23 mini smartd[1140]: Warning via /usr/share/smartmontools/smartd-runner to root produced unexpected output (183 bytes) to STDOUT/STDERR:
Apr 29 07:52:23 mini smartd[1140]: /etc/smartmontools/run.d/10mail:
Apr 29 07:52:23 mini smartd[1140]: Your system does not have /usr/bin/mail.  Install the mailx or mailutils package

Based on the "(pending)" thing, I figured maybe it would eventually reallocate itself and go back to a normal and quiet happy place. I ran some backups and then took a few days to visit family. When I got back, it was still happening, so I went to the store and picked up a new SSD, knowing full well that replacing it was going to suck.

Thus began the multi-hour process of migrating the data from the failing drive to the new one across a temporary USB-SATA rig that was super slow. Even though I was using tar (and not dd, thank you very much), it still managed to tickle the wrong parts of the old drive, and it eventually freaked out. ext4 dutifully failed into read-only mode, and the copy continued.

I was actually okay with this because it meant I didn't have to go to any lengths to freeze everything on the box. Now nothing would change during the copy, so that's great! Only, well, it exposed a neat little problem: Debian's smartmontools can't send a notification if it's pointed at a disk that just made the filesystem fail into read-only mode.

Yes, really, check this out.

May 14 20:04:47 mini smartd[1993]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 14 20:04:47 mini smartd[1993]: Warning via /usr/share/smartmontools/smartd-runner to root produced unexpected output (92 bytes) to STDOUT/STDERR:
May 14 20:04:47 mini smartd[1993]: mktemp: failed to create file via template β€˜/tmp/tmp.XXXXXXXXXX’: Read-only file system
May 14 20:04:47 mini smartd[1993]: Warning via /usr/share/smartmontools/smartd-runner to root: failed (32-bit/8-bit exit status: 256/1)

There it is last night attempting to warn me that things are still bad (and in fact have gotten worse) ... and failing miserably. What's going on here? It comes from what they have in that smartd-runner script. Clearly, they meant well, but it has some issues in certain corner cases.

This is the entirety of that script:

#!/bin/bash -e

tmp=$(mktemp)
cat >$tmp

run-parts --report --lsbsysinit --arg=$tmp --arg="$1" \
    --arg="$2" --arg="$3" -- /etc/smartmontools/run.d

rm -f $tmp

Notice run-parts. It's an interesting little tool which lets you run a bunch of things that don't have to know about each other. This lets you drop stuff into the /etc/smartmontools/run.d directory and get notifications without having to modify anything else. When you have a bunch of potential sources for customizations, a ".d" directory can be rather helpful.

But, there's a catch: smartd (well, smartd_warning.sh) fires off this giant multi-line message to stdout when it invokes that handler. The handler obviously can't consume stdin more than once, so it first socks it away in a temporary file and then hands that off to the individual notifier items in the run.d path. That way, they all get a fresh copy of it.

Unfortunately, mktemp requires opening a file for writing, and it tends to use a real disk-based filesystem (i.e., whatever's behind /tmp) to do its thing. It *could* be repointed somewhere else with either -p or TMPDIR in the environment (/dev/shm? /run/something?), but it's not.
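
If you wanted to harden it, one approach is to stage the message somewhere tmpfs-backed, which generally stays writable even when the root filesystem flips to read-only. A minimal sketch, with an invented directory name:

#!/bin/bash -e

# Prefer a tmpfs-backed spot (/run is tmpfs on typical Debian systems)
# so staging the message still works when / has gone read-only.
dir=/run/smartd-runner
mkdir -p "$dir" 2>/dev/null || dir=/tmp
tmp=$(mktemp -p "$dir")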

This is another one of those "oh yeah" or "hidden gotcha" type things. Sometimes, the unhappy path on a system is *really* toxic. Things you take for granted (like writing a file) won't work. If you're supposed to operate in that situation and still succeed, it might take some extra work.

As for the machine, it's fine now. And hey, now I have yet another device I can plug in any time I want to make smartd start doing stuff. That's useful, right?

...

One random side note: you might be wondering how I have messages from the systemd journal about the disk not accepting writes. I was storing this stuff on another system as it happened, and it's in my notes, but I just pulled those lines back out of journalctl on this very machine, and that hit me while writing this. Now I'm wondering how I have them, too!

Honestly, I have no idea how this happened. Clearly, I have some learning to do here. How do you have a read-only filesystem that still manages to accept appends to the systemd journal? Where the hell does that thing live?

The box has /, /boot, /boot/efi, and swap. / (dm-1) went readonly. The journals are in /var/log/journal, which is just part of /.

If a tree falls in a forest and nobody's around...

...

Late update: yeah, okay, I missed something here. I'm obviously looking at the new SSD on the machine now, right? That SSD got a copy of whatever was readable from the old one, which turned out to be the entire system... *including* the systemd journal files.

Those changes weren't managing to get flushed to the old disk with the now-RO filesystem, but they were apparently hanging out in buffers and were available for reading... or something? That makes sense, right?

So, any time I copied something from the failing drive, I was scooping up whatever it could read from that filesystem. The telling part is that while these journals do cover the several hours it took to copy all of the stuff through that USB 2->SATA connection, they don't include the system shutdown. Clearly, that happened *after* the last copy ran. Obviously.

Now, if those journal entries had made it onto the original disk, then it would mean that I have a big hole in my understanding of what "read-only filesystem" means even after years of doing this. That'd be weird, right?

Just to be really sure before sending off this update, I broke out the failing SSD and hooked it up to that adapter again, then went through the incantations to mount it, and sure enough:

-rw-r-----+ 1 root systemd-timesync 16777216 May 14 17:06 system.journal

The last entry in that log is this:

May 14 17:06:38 mini kernel: ata1: EH complete

There we go. Not so spooky after all.

Reader feedback on autoconf and bugginess in general

It's time for some responses to reader feedback.

One person mentions that running git log with "--full-history" should show the change that's buried within the commit that also was a merge. Unfortunately, that's not it. I tried that and a bunch of other things before landing on "git show" the other night. They did mention that it might be some other flag.

That plus help from a friend who's clearly been down this road before turned up the right answer: you probably want to add -c or -cc. Now, if you go looking for those in the actual manual for git log, you'll trip over the usual problem that searching for "-c" matches some other term that starts with c, in this case, clear-decorations. So, if you then search for " -c" (note: leading space only), then you'll find this first:

Note that unless one of --diff-merges variants (including short -m, -c, and --cc options) is explicitly given, merge commits will not show a diff, even if a diff format like --patch is selected, nor will they match search options like -S. The exception is when --first-parent is in use, in which case first-parent is the default format.

... and now you're getting somewhere.

Why this isn't the default, I have no idea. You're asking it to show a patch, so show a patch, already!
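
For anyone else hunting a change that vanished into a merge, the recipe comes down to something like this (the path and commit are placeholders):

$ git log -p --cc -- path/to/file.cc
$ git show --cc <commit>

The --cc output takes some getting used to (merges get two columns of +/- markers), but at least the diff is actually there.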

...

I saw some other comments saying that I should report some of the wacky brokenness that I tripped over, and I need to respond to that.

It's been made very clear to me that plenty of people don't give a shit about what I report as "broken" in their projects. This isn't necessarily about the world of free software/open source, either. It's happened plenty of times at actual jobs when I was there specifically to find badness.

Case in point: someone apparently had a conversation along the lines of "hey, it would be bad if the <redacted spooky people> got their hands on this stuff and it didn't work, so let's have someone who isn't one of the devs test this out first". Then they got a hold of me and asked me to Do some Stuff with their products.

I got my hands on the hardware and then the software, and in so doing, found all kinds of crazy problems in it, almost like nobody had actually tried to use the full extent of the hardware with their driver software.

I started asking questions of the engineers, trying to figure out what they had done. At some point they went radio-silent and I started getting responses from their boss. At that point I thought "it's good to be a contractor" and referred it to MY boss - the person who hired me for the gig. It was clearly an internal matter and not anything for me to deal with. (One of the rare perks of being a contractor, gotta say.)

So, back to what happened the other night. Did I find some badness in pkg-config or pkgconf or whatever variant they're using? I probably did! Could it be exploited? Maybe? Probably? Do I care that much? Not really. Why? Because *far too few people* actually give a shit about this kind of correctness.

This is fairly common when you set me loose in an ecosystem: I find all kinds of dumb "mosquito bite" things that make things less than pleasant for someone who isn't expecting it. They're the things that experienced users know how to route around because they start accepting it as "normal".

If my "trouble reports" tend to be seen as more of a burden than a value, why would I keep generating them?

For those few people out there who truly care about this sort of thing, obviously I love and cherish you and people who do what you do. Please don't think I'm painting you with the same brush. Just realize that you are very much in the minority, and I have no idea if I'm going to encounter one of you (unlikely) or one of the others (very likely) when I file a report.

Another thing: I'm not paying any of these people, so why should they listen to me? There are exactly zero consequences for them if my stuff goes ignored. They don't work for me, and it's not like I could fire them or somehow use some other leverage in order to make things happen.

...

I believe that really good products have at least one person behind them who's spotting all of the goofiness and is having some success in getting those things ironed out. It's not enough to just notice them, and it's not enough to do the kind of gunky eng work it takes to find out why something happened or how it broke.

Whoever does this work has to actually have traction in the organization so that these problems will actually be fixed. Otherwise, the first two parts have no real value, and only serve to burn out whoever's making the attempts.

I'd love to do this kind of work for a place that truly cares about it, to the point of providing air cover to those who are tasked with finding the problems. If the company doesn't actually care, that's "fine" - but then don't hire people who are tasked with trying to improve those things, because that's a hypocrisy gap in your engineering culture.

Also, I get that companies change, and what's valued today is not what will be valued tomorrow. It seems like the least a good manager would do is *explicitly* say as much when that point arrives, so you can cut and run instead of trying to push back against an unstoppable tide. When the company shifts from engineering, reliability and delighting the customers to growth at all costs and exploiting the customers, it's probably time to go.

Is anyone actually doing this now? I'd love to hear about it if so.

...

Finally, to the handful of people who say that the results of ./configure "will change all the time" because of the kernel or libc version changing... that's just wrong. You're making that up based on a technicality that does not actually track with real-world use of our systems.

I'm going to guess that most of the people who are saying this are going based on the fact that you CAN change the C library or kernel arbitrarily, but probably never have. Meanwhile, I was there way back in the day when we first got ELF kernels in the 1.3 world and had to install a new compiler in order to link it in the first place, among plenty of other godawful things we had to do back in the 90s. I've been there, and I'm telling you, this is not the way of the world right now. Systems just don't change that often. When they do change to that extent, it tends to be associated with a major version bump (unless someone's mighty clueless).

The whole point of having things set up for you by the OS builder is that it takes the kernel, C library, and other localizations into account. If someone is changing those things to where syscalls and library calls are shifting materially, they are essentially signing up to maintain their own OS variant, and should update the "how you do things on this OS" data at the same time if it affects such things.

This *already* happens. Companies take a base OS (like CentOS), and then they run a far newer kernel and/or glibc atop it for their own reasons. This forces a whole raft of other changes to take advantage of the new stuff: new versions of perf, strace, ethtool, mcelog or whatever else. I know this because I did this to help out as a certain employer went from CentOS 5 to 6 to 7 over the span of several years.

Their build systems already know how to build stuff on this hybrid environment, particularly to use the overlay glibc instead of the system one. So, someone is maintaining this, and they track whatever changes when it actually matters.

Hitting every branch on the way down

I keep seeing people saying that the answer to my complaints about autoconf is to rub *more* autoconf on the problem. I don't like this. In the general vein of "this should not be that hard", I decided to revisit something from two years ago and tried to use my build tool to generate my stuff on a fresh BSD-flavored install. (The exact flavor is unimportant here, and mentioning it by name would only trigger the weenies in the crowd, so I won't.)

I wanted to prove to myself that yes, my stuff can Just Build on other (i.e., not Linux or Mac) systems without resorting to the kinds of stuff that I wish we had collectively left behind in the 90s.

The OS itself was fine. The install process on a throwaway VM image was quick and painless. I knew how to get my usual tools installed - bash, nano, that kind of thing. Unlike last time, I opted to not do X and just focused on getting my stuff to build.

But then I made a mistake: I told it to install "protobuf" since I use that library in my build tool. That actually installed "protobuf-24.4,1" which is some insane version number I'd never seen before. All of my other systems are all running 3.x.x type versions.

Now, realize, I didn't know this was a mistake yet, and kept on going. I did manage to bootstrap my build tool into a usable binary, and then started a "build the world" process, at which point it blew up. It was complaining about not being able to find a library called "google/protobuf/arena" inside my personal source tree.

This made no sense, so I started digging, and found out that the "protoc" compiler in that version of the software spits out code like this:

#include "google/protobuf/thing1.h"
#include "google/protobuf/thing2.h"
#include "google/protobuf/thing3.h"

... you get the idea. It's a third-party library that's installed at the system level, and yet it's using "" like it's all chummy and hanging out with your code in your local repo. Yeah, no. That's wrong. They should be <> includes, like this:

#include <google/protobuf/thing1.h>
#include <google/protobuf/thing2.h>
#include <google/protobuf/thing3.h>

What's weird is... it *is* that way on all of my other machines - my Mac with Macports and my Debian/Raspbian boxes all generate those #includes with <> like they're supposed to, and everything Just Works.

I won't lie. This really made me angry at first. I was like, okay, they did yet another stupid thing upstream, and now everyone else is going to have to work around it. It got me thinking thoughts like "just how hard would it be to NOT use protobuf, anyway". I figured that this abomination would eventually filter down to Macports and Debian's apt repo and whatnot, and then I'd have to deal with it, or toss it.

After a few minutes of cooling off, it occurred to me that I could do something super-duper obnoxious: wrap protoc, and run a nasty little sed command afterward to flip the "" to <>. So I did that, and things proceeded. Awful.
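
The wrapper itself was nothing special. A sketch of the idea (paths and globs invented; note the BSD-style sed -i ''):

#!/bin/sh
# Run the real protoc, then flip the generated includes from "" to <>.
/usr/local/bin/protoc "$@"

for f in *.pb.h *.pb.cc; do
  [ -e "$f" ] || continue
  sed -i '' 's|#include "google/protobuf/\(.*\)"|#include <google/protobuf/\1>|' "$f"
done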

Of course, then I ran into some other fun problems with my code, like IPPROTO_* definitions not being available. I have a wrapper for getaddrinfo() and it uses IPPROTO_TCP in the .ai_protocol field. I had all of the #includes that the man pages say to have for using that function, but that's not enough on this particular system.

I assume that there's some transitive #include on Macs and on glibc-flavored Linuxes that drags this in for me, but on this one BSD it doesn't work that way. The fix was simple enough, and mighty stupid:

#include <netinet/in.h>

And no, that's not listed in their getaddrinfo(3) manual page, even though IPPROTO_UDP and _TCP are both explicitly mentioned in it. Dig around online and you'll find this tripping up other people. That's the extent of my self-inflicted damage that had to be fixed to make it build: lack of a few #includes.
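
In case it saves someone else the search, the shape of the fix looks like this - a sketch, not my actual wrapper:

#include <netdb.h>        /* getaddrinfo, struct addrinfo */
#include <netinet/in.h>   /* IPPROTO_TCP: the include the man page skips */
#include <string.h>
#include <sys/socket.h>   /* AF_UNSPEC, SOCK_STREAM */

int resolve_tcp(const char *host, const char *port, struct addrinfo **res) {
  struct addrinfo hints;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_protocol = IPPROTO_TCP;  /* the field that needs netinet/in.h */
  return getaddrinfo(host, port, &hints, res);
}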

Stuff like this is why I tend to wall off calls into the C library with a bunch of compatibility gunk and then use my own interfaces above that.

At some point during this, I decided to go back into the protobuf git repo to see just when they decided to dump the angle brackets in favor of the double-quotes, and that's when I hit another wall of stupid. Apparently it's possible to change a git repo in such a way that "git log -p" will never show it. Did you know that? Before yesterday, I definitely didn't.

Here's how I discovered this: obviously, there was code that would do the <> stuff at some point. The last version of it I could find looked like this:

  std::string left = "\"";
  std::string right = "\"";
  if (use_system_include) {
    left = "<";
    right = ">";
  }
  return left + name + right;

It seems simple enough, if a little goofy: return "input" unless use_system_include gets set a few lines up, in which case it should return <input>. No big deal, right?

But... that code exists nowhere in the repo as it stands now. Silly naive me, I thought I could just "git log -p" and do a / search in less for "use_system_include" to find the commit which dropped it. I wanted to learn why they did this, because maybe they had a good reason, or basically, if I complained about it, what I would be up against.

I found nothing.

This started a terrible sequence where I started checking out different commits from the tree to see what it looked like at various points in the past. I got it down to a commit that contained the above code, and then one commit past that dropped it.

This must be it, right? I should be able to "git log -p" and see it, right? Nope.

commit d85c9944c55fb38f4eae149979a0f680ea125ecb (HEAD)
Merge: 7764c864b 0264866ce
Author: <removed because it's not their fault>
Date:   Mon Sep 19 14:10:44 2022 -0700

    Sync from Piper @475378801
    
    PROTOBUF_SYNC_PIPER

The next line in the git log output is the next commit. There's no "body" to this commit. It's just a "Merge:" and two other commits.

7764c864b and 0264866ce, right? I should be able to sync to those with git checkout and see which one dropped it, yeah? Well, I'll spare you the effort and just say that BOTH OF THEM have the old code in it.

So... this commit somehow drops the code even though it's merging two "ancestral commits" that both contain it, and there's no diff shown.

Confusing, right?

I don't know how I finally figured this out, but after a whole lot of cursing and thrashing, I found "git show <commit>" will FINALLY give me the results I want, ish. It contains the change which dumped the <> code and put in the new stuff.

--  std::string left = "\"";
--  std::string right = "\"";
--  if (use_system_include) {
--    left = "<";
--    right = ">";
--  }
--  return left + name + right;
++  return absl::StrCat("\"", basename, "\"");

There's no explanation or other context. Presumably that all got squashed out when it was exported from whatever they use internally.

"Why" is gone. I just have "when", and that's not very interesting: it was merged in September 2022, ho hum. That just means that whenever Linux distributions and Macports catch up with at least that point, I'm going to have to deal with this for real.

Oh, there's one more bit of batshittery which needs to be mentioned here. My stuff uses pkg-config to find out how to compile and link against these libraries, right? Well, when it was using this oh-so-new protobuf version, the commands it was running were so long, it was scrolling off my standard 80x25 terminal.

$ pkg-config --cflags protobuf | wc -c
    4326

Yep! 4 KB of cflags. Here's just the top part of it:

# pkg-config --cflags protobuf
-I/usr/local/include -DPROTOBUF_USE_DLLS -Wno-float-conversion 
-DNOMINMAX -Wno-float-conversion -DNOMINMAX -Wno-float-conversion 
-DNOMINMAX -Wno-float-conversion -DNOMINMAX -Wno-float-conversion 
-DNOMINMAX -Wno-float-conversion -DNOMINMAX -Wno-float-conversion 
-DNOMINMAX -Wno-float-conversion -DNOMINMAX -Wno-float-conversion 

... and it just goes on like this. It actually worked, though!

Finally, remember when I said that I made a problem by installing their "protobuf" package without realizing it? Yeah, it turns out they actually also have "protobuf3" which is a nice sane version just like the ones on my other machines, #include <...> and all. So, I removed the bad one, installed this other one, and dropped my sed hack.

What a night.

Going in circles without a real-time clock

I have a story about paper cuts when using a little Linux box.

One of my sites has an older Raspberry Pi installed in a spot that takes some effort to access. A couple of weeks ago, it freaked out and stopped allowing remote logins. My own simple management stuff was still running and was reporting that something was wrong, but it wasn't nearly enough detail to find out exactly what happened.

I had to get a console connected to it in order to find out that it was freaking out about its filesystem because something stupid had apparently happened to the SD card. I don't know exactly why it wouldn't let me log in. Back in the old days, you could still get into a machine with a totally dead disk as long as enough stuff was still in the cache - inetd + telnetd + login + your shell, or sshd + your shell and (naturally) all of the libraries those things rely on. I guess something happened and some part of the equation was missing. There are a LOT more moving parts these days, as we've been learning with the whole xz thing. Whatever.

So I rebooted it, and went about my business, and it wasn't until a while later that I noticed the thing's clock was over a day off. chrony was running, so WTF, right? chrony actually said that it had no sources, so it was just sitting there looking sad.

This made little sense to me, given that chrony is one of the more clueful programs which will keep trying to resolve sources until it gets enough to feel happy about using them for synchronization. In the case of my stock install, that meant it was trying to use 2.debian.pool.ntp.org.

I tried to resolve it myself on the box. It didn't work. I queried another resolver (on another box) and it worked fine. So now what? On top of chrony not working, unbound wasn't working either?

A little context here: this box was reconfigured at some point to run its own recursive caching resolver for the local network due to some other (*cough* TP-Link *cough*) problems I had last year. It was also configured to *only* use that local unbound for DNS resolution.

This started connecting some of the dots. chrony wasn't setting the clock because it couldn't resolve hosts in the NTP pool. It couldn't resolve hosts because unbound wasn't working. But, okay, why wasn't unbound working?

Well, here's the problem - it *mostly* was. I could resolve several other domains just fine. It's just that ntp.org stuff wasn't happening.

(This is where you start pointing at the screen if this has happened to you before.)

So, what would make only some domains not resolve... but not all of them... on a box... with a clock that's over a day behind?

Yeah, that's about when it fit together. I figured they must be running DNSSEC on that zone (or some part of it), and it must have a "not-before" constraint on some aspect of it. I've been down this road before with SSH certificates, so why not DNS?

I added another resolver to resolv.conf, then chrony started working, and that brought the time forward, and then unbound started resolving the pool, and everything else returned to normal.

By "everything else", I also mean WireGuard. Did you know that if your machine gets far enough out of sync, that'll stop working, too? I had no idea that it apparently includes time in its crypto stuff, but what other explanation is there?

Backing up, let's talk about what happened, because most of this is on me.

I have an old Pi running from an SD card. It freaked out. It took me about a day and a half to get to where it was so I could start working on fixing it.

This particular Pi doesn't have a real-time clock. The very newest ones (5B) *do*, but you have to actually buy a battery and connect it. By default, they are in the same boat. This means when they come up, they use some nonsense time for a while. I'm not sure exactly what that is offhand, because...

systemd does something of late where it will try to put the clock back to somewhere closer to "now" when it detects a value that's too far in the past. I suspect it grabs the last timestamp it can find on disk - the journal, or the little clock file that systemd-timesyncd leaves behind - and runs with it. This is usually pretty good, since if you're just doing a commanded reboot, the difference is a few seconds, and your time sync stuff fixes the rest not long thereafter.

But, recall that the machine sat there unable to write to its "disk" (SD card) for well over a day, so that's the timestamp it used. If I had gotten there sooner, I guess it wouldn't have been so far off, but that wasn't an option.

Coming up with time that far off made unbound unable to resolve the ntp.org pool servers, and that made chrony unable to update the clock... which made unbound unable to resolve the pool servers... which...

My own configuration choice which pointed DNS resolution only at localhost did the rest.

So, what now? Well, first of all, I gave it secondary and tertiary resolvers so that particular DNS anomaly won't be repeated. Then I explicitly gave chrony a "peer" source of a nearby host (another Pi, unfortunately) which might be able to help it out in a pinch even if the link to the outside isn't up for whatever reason.
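
Concretely, the chrony side of that amounts to a couple of lines in chrony.conf (the peer hostname is made up, and the peer has to point back at this box in its own config):

# /etc/chrony/chrony.conf - excerpt
pool 2.debian.pool.ntp.org iburst
peer pi2.lan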

There's a certain problem with thinking of these little boxes as cheap. They are... until they aren't. To mangle a line from jwz, a Raspberry Pi is only cheap if your time has no value.

As usual, this post is not a request for THE ONE to show up. If you are THE ONE, you don't make mistakes. We know. Shut up and go away.

autoconf makes me think we stopped evolving too soon

I've gotten a few bits of feedback asking for my thoughts and/or reactions to the whole "xz backdoor" thing that happened over the past couple of days. Most of my thoughts on the matter apply to autoconf and friends, and they aren't great.

I don't have to cross paths with those tools too often these days, but there was a point quite a while back when I was constantly building things from source, and a ./configure --with-this --with-that was a given. It was a small joy when the thing let me reuse the old configure invocation so I didn't have to dig up the specifics again.

I got that the whole reason for autoconf's derpy little "recipes" is that you want to know if the system you're on supports X, or can do Y, or exactly what flavor of Z it has, so you can #ifdef around it or whatever. It's not quite as relevant today, but sure, there was once a time when a great many Unix systems existed and they all had their own ways of handling stuff, and no two were the same.

So, okay, fine, at some point it made sense to run programs to empirically determine what was supported on a given system. What I don't understand is why we kept running those stupid little shell snippets and little bits of C code over and over. It's like, okay, we established that this particular system does <library function foobar> with two args, not three. So why the hell are we constantly testing for it over and over?

Why didn't we end up with a situation where it was just a standard thing that had a small number of possible values, and it would just be set for you somewhere? Whoever was responsible for building your system (OS company, distribution packagers, whatever) could leave something in /etc that says "X = flavor 1, Y = flavor 2" and so on down the line.

And, okay, fine, I get that there would have been all kinds of "real OS companies" that wouldn't have wanted to stoop to the level of the dirty free software hippies. Whatever. Those same hippies could have run the tests ONCE per platform/OS combo, put the results into /etc themselves, and then been done with it.

Then instead of testing all of that shit every time we built something from source, we'd just drag in the pre-existing results and go from there. It's not like the results were going to change on us. They were a reflection of the way the kernel, C libraries, APIs and userspace happened to work. Short of that changing, the results wouldn't change either.
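
Amusingly, autoconf does ship a half-forgotten hook for something like this: a site-wide file of preseeded cache values that configure reads before running its tests. Something like this (cache variable names vary by project, so treat these as examples):

$ cat /usr/local/share/config.site
ac_cv_func_fork_works=yes
ac_cv_sizeof_long=8

$ CONFIG_SITE=/usr/local/share/config.site ./configure

Hardly any distro or project actually preloads it, though.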

But no, we never got to that point, so it's still normal to ship a .tar.gz with an absolute crap-ton of dumb little macro files that run all kinds of inscrutable tests that give you the same answers that they did the last time they ran on your machine or any other machine like yours, and WILL give the same answers going forward.

That means it's totally normal to ship all kinds of really crazy looking stuff, and so when someone noticed that and decided to use that as their mechanism for extracting some badness from a so-called "test file" that was actually laden with their binary code, is it so surprising that it happened? To me, it seems inevitable.

Incidentally, I want to see what happens if people start taking tarballs from various projects and diff them against the source code repos for those same projects. Any file that "appears" in the tarball that's allegedly due to auto[re]conf being run on the project had better match something from the actual trees of autoconf, automake, ranlib, gettext, or whatever else goofy meta-build stuff is being used these days.

$ find . -type f | sort | xargs sha1sum
7d963e5f46cd63da3c1216627eeb5a4e74a85cac  ./ax_pthread.m4
c86c8f8a69c07fbec8dd650c6604bf0c9876261f  ./build-to-host.m4
0262f06c4bba101697d4a8cc59ed5b39fbda4928  ./getopt.m4
e1a73a44c8c042581412de4d2e40113407bf4692  ./gettext.m4
090a271a0726eab8d4141ca9eb80d08e86f6c27e  ./host-cpu-c-abi.m4
961411a817303a23b45e0afe5c61f13d4066edea  ./iconv.m4
46e66c1ed3ea982b8d8b8f088781306d14a4aa9d  ./intlmacosx.m4
ad7a6ffb9fa122d0c466d62d590d83bc9f0a6bea  ./lib-ld.m4
7048b7073e98e66e9f82bb588f5d1531f98cd75b  ./lib-link.m4
980c029c581365327072e68ae63831d8c5447f58  ./lib-prefix.m4
d2445b23aaedc3c788eec6037ed5d12bd0619571  ./libtool.m4
421180f15285f3375d6e716bff269af9b8df5c21  ./lt~obsolete.m4
f98bd869d78cc476feee98f91ed334b315032c38  ./ltoptions.m4
530ed09615ee6c7127c0c415e9a0356202dc443e  ./ltsugar.m4
230553a18689fd6b04c39619ae33a7fc23615792  ./ltversion.m4
240f5024dc8158794250cda829c1e80810282200  ./nls.m4
f40e88d124865c81f29f4bcf780512718ef2fcbf  ./po.m4
f157f4f39b64393516e0d5fa7df8671dfbe8c8f2  ./posix-shell.m4
4965f463ea6a379098d14a4d7494301ef454eb21  ./progtest.m4
15610e17ef412131fcff827cf627cf71b5abdb7e  ./tuklib_common.m4
166d134feee1d259c15c0f921708e7f7555f9535  ./tuklib_cpucores.m4
e706675f6049401f29fb322fab61dfae137a2a35  ./tuklib_integer.m4
41f3f1e1543f40f5647336b0feb9d42a451a11ea  ./tuklib_mbstr.m4
b34137205bc9e03f3d5c78ae65ac73e99407196b  ./tuklib_physmem.m4
f1088f0b47e1ec7d6197d21a9557447c8eb47eb9  ./tuklib_progname.m4
86644b5a38de20fb43cc616874daada6e5d6b5bb  ./visibility.m4
$ 

... there's no build-to-host.m4 with that sha1sum out there, *except* for the bad one in the xz release. That part was caught... but what about every other auto* blob in every other project out there? Who or what is checking those?
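
Mechanically, the experiment isn't hard - per release, something like this (names and URL are placeholders):

$ tar xf project-1.2.3.tar.gz
$ git clone --branch v1.2.3 <repo-url> project-git
$ diff -r -x .git project-1.2.3 project-git | less

Anything that exists only in the tarball had better be explainable as the output of a known version of the auto* tools.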

And finally, yes, I'm definitely biased. My own personal build system has a little file that gets installed on a machine based on how the libs and whatnot work on it. That means all of the Macs of a particular version of the OS get the same file. All of the Debian boxes running the same version get the same file, and so on down the line.

I don't keep asking the same questions every time I go to build stuff. That's just madness.

Port-scanning the fleet and trying to put out fires

There was this team which was running a pretty complicated data storage, leader election and "discovery" service. They had something like 3200 machines and had something like 300 different clusters/cells/ensembles/...(*) running across them. This service ran something kind of like etcd, only not that.

The way it worked was that a bunch of "participant" machines would start an election process, and then they'd decide who was going to lead them for a while. That leader got to handle all of the write traffic and it did all of the usual raft/paxos-ish spooky coordination stuff amongst the participants, including updating the others, and dealing with hosts that go away and come back later, and so on. It's all table stakes for this kind of service.

This group of clusters had started out relatively simple but had grown into a monster over the years. Nobody probably expected them to have hundreds of clusters and thousands of machines, but they now did, and they were having trouble keeping track of everything. There were constant outages, and since they were so low in the stack, when they broke, lots of other stuff broke.

I wanted to know just what the ground truth looked like and so started something really stupid from my development machine. It would take a list of their servers and would crawl them, interrogating the TCP ports on which the service ran. This was only about 10 ports per machine, so while it sounded obnoxiously high, it was still possible for prototyping purposes.

On these ports, there were simple text-based commands which could be sent, and it would return config information about what that particular instance was running. It was possible to derive the identity of the cluster from that. Given all of this and a scrape of the entire fleet, it was possible to see which host+port combinations were actually supporting any given cluster, and thus see how well they were doing.

Early results from this terrible manual scraping started showing promise. Misconfigurations were showing up all over the place - clusters that are supposed to have 5 hosts but only have 3 in practice with the other two missing in action somewhere, clusters with non-standard host counts, clusters in the wrong spots, and so on.

To get away from the "printf | nc the world in cron" thing, we wound up writing this dumb little agent thing that would run on all of the ~3200 hosts. It would do the same crawling, but it would happen over loopback so it was a good bit faster by removing long hauls over the production network from the equation. It also took the load of polling ~32000 ports off my singular machine, and was inherently parallel.

It was now possible to just query an agent and get a list of everything running on that box. It would refresh things every minute, so it was far more current than my terrible script which might run every couple of hours (since it was so slow). This made things even better, and so we needed an aggregator.

We did some magic to make each of these agents create a little "beacon" somewhere any time they were run. Our "aggregator" process would start up and would subscribe to the spot where beacons were being created. It would then schedule the associated host for checks, where it would speak to the agent on that host and ask for a copy of its results.

So now we had an agent on every one of the ~3200 hosts, each polling 10 local ports, plus an aggregator that talked to the ~3200 agents and refreshed the data from them.

Finally, all of the data was available in one place with a single query that was really fast. The next step was to write a bunch of simple "dashboard" web pages which allowed anyone to look at the entire fleet, or to narrow it down by certain parameters - a given cluster (of these servers), a given region, data center, whatever.

With all of this visible with just a few clicks, it was pretty clear that we needed something more to actually find the badness for us. It was all well and good to go clicking around while knowing what things are supposed to look like, but there were supposed to be rules about this sort of thing: this many hosts in a cluster, no more than N hosts per failure domain, and more.

...

Failure domains are a funny thing. Let's say you have five hosts which form a quorum and which are supposed to be high-availability. You'd probably want to spread them around, right? If they were serving clients from around the world, maybe you'd put them in different locations and never put two in the same spot? If something violated that, how would you know?

Here's an example of bad placement. We had this one cluster which was supposed to be spread out throughout an entire region which was composed of multiple datacenter buildings, each with multiple (compute) clusters in it, with different racks and so on down the line. But, because it had been turned up early in the life of that region when only a handful of hosts had existed, all of them were in the same two or three racks.

Worse still, those racks were physically adjacent. Put another way, if the servers had arms and hands, they could have high-fived each other across the hot and cold aisles in the datacenter suite. That's how close together they were. One bad event in a certain spot would have wiped out all of their data.

We had to write a schema which would let us express limits for a given cluster - how many regions it should be in, the maximum number of members per host, rack, (compute) cluster, building, region, etc. Then we wrote a tool to let us create rules, and then started using that to churn out rulesets. Next we came up with some tools which would fetch the current state of affairs (from the agent/aggr combo) and compare it to the rulesets. Anything out of "compliance" would show up right away.

...

Then there was the problem of managing the actual ~3200 hosts. With a footprint that big, there's always something happening. A location gets turned up and new hosts appear. Another location is taken down after the machines get too old and those hosts go away. We kept having outages where a decom would be scheduled, and then someone far away would run a script with a bunch of --force type commands, and it would just yank the machines and wipe them. It had no regard for what they were actually doing, and they managed to clobber a bunch of stuff this way. It just kept happening.

This is when I had to do something that does not scale. I said to the decom crew that they should treat any host owned by this team as off limits because we do not have things under control. That means never *ever* running a decom script against these hosts while they are still owned by the team.

I further added that while we're working to get things under control, if for some reason a decom is blocked due to this decree of mine, they are to contact me, any time of day or night, and I will get them unblocked... somehow. I figured it was my way of showing that I had "skin in the game" for making such a stupid and unreasonable demand.

I've often said that the way to get something fixed is to make sure someone is in the path of the badness so they will feel it when something screws up. This was my way of doing exactly that.

We stopped having decom-related outages. We instead started having these "fire drill" type events where one or two people on the team (and me) would have to drop what they were doing and spend a few hours manually replacing machines in various clusters to free them up.

Obviously, this couldn't stand, and so we started in on another project. This one was more of a "fleet manager", where a dumb little service would keep track of which machines the team owned, and it would store a series of bits for each one that I called "intents".

There were only three bits per host: drain, release, freeze. Not all combinations were valid.

If no bits were set on a host, that meant it was intended for production use. If it has a server on it, that's fine. If someone needs a replacement, it's potentially available (assuming it meets the other requirements, like being far enough away from the other participants).

If the "drain" bit was set, that meant it was not supposed to be serving. Any server on it should be taken off by replacing it with an available host which itself isn't marked for "drain" (or worse).

The "release" bit meant that if a host no longer had anything running on it, then it should be released back to the machine provisioning system. In doing this, the name of the machine changed, and thus the ownership (and responsibility) for it left the team, and it was no longer our problem. The people doing decoms would take it from there.

"Freeze" was a special bit which was intended as a safety mechanism to stop a runaway automation system. If that bit was set on a host, none of the tools would change anything on it. It's one of those things where you should never need to use it, but you'll be sorry if you don't write it and then need it some day.

"Drain" + "release" meant "keep trying to kick instances off this host and don't add any new ones", and then "once it becomes empty, give it back".

Other combinations of the bits (like "release" without "drain") were invalid and were rejected by the automation.
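
In code terms, the whole thing is tiny. A sketch (names invented; the real rules had a few more cases):

#include <stdbool.h>

struct intent {
  bool drain;    /* get instances off this host; add no new ones */
  bool release;  /* once empty, hand the host back to provisioning */
  bool freeze;   /* automation must not touch this host at all */
};

static bool intent_valid(struct intent in) {
  /* "release" without "drain" makes no sense: you can't give back
     a host that's still allowed to pick up new work. */
  if (in.release && !in.drain) return false;
  return true;
}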

I should note that this was meant to be level-triggered, meaning on every single pass, if a host had a bit set and yet wasn't matching up with that intent or those intents, something should try to drain it, or give it away, or whatever. Even if it failed, it should try again on the next pass, and failures should be unusual and thus reported to the humans.

...

Then there was also the pre-existing system which took config files and used them to install instances on machines. This system worked just fine, but it only did that part of the process. It didn't close the loop, and so many parts of the service lifecycle wound up unmanaged by it.

Looking back at this, you can now see that we could establish a bunch of "sets" with the data available.

Configs: "where we told it to run"

Agent + aggregator: "where it's actually managing to run"

Checker: "what rules these things should be obeying"

Fleet manager: "which machines should be serving (or not), which machines we should hang onto (or give back)".

Doing different operations on those sets yielded different things.

[configs] x [agent/aggr] = hosts which are doing what they are supposed to be doing, hosts which are supposed to be serving but aren't for some reason, and hosts which are NOT supposed to be running but are running it anyway. It would find sick machines, failures in the config system, weird hand-installed hack jobs in dark corners, and worse.

[agent/aggr] x [checker] = clusters which are actually spread out correctly, and clusters which are actually spread out incorrectly, (possibly because of bad configs, but could be any reason).

[agent/aggr] x [fleet manager] = hosts which are serving where that's okay, hosts which need to be drained until empty, and hosts which are now empty and can be given back.

[configs] x [checker] = are out-of-spec clusters due to the configs telling them to be in the wrong spot, or is something else going on? You don't really need to do this one, since if the first one checks out, then you know that everything is running exactly what it was told to run.

[configs] x [fleet manager] = if you ever get to a point where you completely trust that the configs are being implemented by the machines (because some other set operations are clear), then you could find mismatches this way. You wouldn't necessarily have to resort to the empirical data, and indeed, could stop scanning for it.

For that matter, the whole port-scanning agent/aggr combination shouldn't have needed to exist in theory, but in practice, independent verification was needed.

I should point out that my engagement with this team was not viewed kindly by management, and my reports about what had been going on ultimately got me in trouble more than anything else. It's kind of amazing, considering I was working with them as a result of a direct request for reliability help, but shooting the messenger is nothing new. This engagement taught me that a lot of so-called technical problems are in fact rooted in human issues, and those usually come from management.

There's more that happened as part of this whole process, but this post has gotten long enough.

...

(*) I'm using "clusters" here to primarily refer to the groups of 5, 7, or 9 hosts which participated in a quorum and kept the state of the world in sync. Note that there's also the notion of a "compute cluster", which is just a much larger group of perhaps tens of thousands of machines (all with various owners), and that does show up in this post in a couple of places, and is called out explicitly when it does.

Sometimes the dam breaks even after plenty of warnings

Oh dear, it's popcorn for breakfast yet again. Another outage in a massive set of web sites.

It's been about 10 years, so let's talk about the outage that marks the point where I started feeling useful in that job: Friday, August 1, 2014. That's the one where FB went down and people started calling 911 to complain about it, and someone from the LA County sheriff's office got on Twitter to say "knock it off, we know and it's not an emergency".

Right, so, it's been well-documented what happened that day, even on the outside world - SRECon talks, a bunch of references in papers, you name it. It was time for "push", and as it was being seeded, that process pretty much consumed all of the available memory (and swap) on the smallest machines.

Then there was this program which ran on every box as root, and its job was to run a bunch of awful subprocesses, capture their outputs, parse them somewhat, and ship the results to a time series database or a logging system. This program is the one that had the infamous bug in it where it would call fork() and save the return value, but didn't check it for failure: the -1 retval.

So, later on, it went to kill this "child process" that never started, and did the equivalent of 'kill -9 -1', and on Linux, that whacks everything but yourself and pid 1 (init). Unsurprisingly, this took down the web server and pretty much everything else. This was pre-systemd on CentOS 5 machines running Upstart, so the only things that "came back" were the "respawn" entries in inittab, like [a]getty on the text consoles.
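
Reduced to its essence, the bug looked something like this (my reconstruction, not the actual fbagent source):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static void run_and_reap(void) {
  pid_t pid = fork();   /* returns -1 on failure - never checked */

  /* ... pretend the child runs a subprocess and times out ... */

  kill(pid, SIGKILL);   /* if pid is -1, this signals everything the
                           caller may signal, not one child */
}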

This is how we were able to fire up a remote console on one of the affected machines and log in and see that there was basically init, the shell that had just been started, and this fbagent process which was responsible for assassinating the entire system that morning.

The rest of the story has also been told, which is where it took me a couple of weeks to figure out why we kept losing machines this way, and when I did, I found the source had already been patched. Another engineer unrelated to the fbagent project had been hitting the same problem, decided to go digging, found the "-1" pid situation leaking through, and fixed it.

Even though the fix was committed, it wasn't shipped, because this binary was big and scary and ran as root on (then) hundreds of thousands of machines, and the person who usually shipped it was on vacation getting married somewhere. As a result, the old version stayed in prod for much longer than it otherwise would have, complete with the hair-trigger bug that would nuke every process on the machine.

All it needed was something that would screw up fork, and on that morning, it finally happened.

What hasn't really been told is that the memory situation had been steadily getting worse on those machines that whole summer. We had been watching it creep up, and kept trying to make things happen, but by and large, few people really cared. Also, people had been adding more and more crap to what the web servers would run. Back in those days, you could just tell your endpoint to run arbitrary code, and it basically would, right there on the web server!

Case in point: people had started running ffmpeg on our web servers. They decided that was an AWESOME place to transcode videos. By doing that, they didn't have to build out their own "tier" of machines to do that work, which would have meant requesting resources, and all of that other stuff. Instead, they just slipped that into a release and slowly turned up the percentage knob until it was everywhere.

ffmpeg is no small thing. One instance could pull nine CPU cores and use 800 MB of memory - that's actual memory, not just virtual mappings. Also, this made requests run really long, and when that happened, the "treadmill" in the web server couldn't happen sufficiently quickly.

What's the treadmill? Well, when you have memory allocations for a bunch of requests that then finish, you have to garbage-collect them eventually. My understanding is that the treadmill essentially worked by waiting until every request that had been active at the same time was also gone, and then it would free up the resources.

This is a little confusing, so think about it this way. These machines were true multitasking, so they'd possibly have 100 or more web server threads running, each potentially servicing a request. Let's say requests A-M were running and then request N started up and allocated some memory. The memory allocated by N would only be freed once not only N was done, but A-M too, since they had overlapped it in time. If any of them were sticking around for a while, then N's resources couldn't be freed until that first one exited.

Given this, it's not too hard to see that really long-running requests effectively limit how often the "treadmill" can run, and thus how often the server will release memory for use in other things.

Also, there were other things going on which were just really expensive endpoints which could chew a gig of memory all by themselves. This was NOT scalable. You simply couldn't sustain that on these systems.

Basically, if you were to make a time-traveling phone call to me a few weeks before "Call the Cops" happened, and ask me what I was worried about, "web tier chewing memory and going into swap" probably would have been pretty high on the list.

To give some idea of how long this had been going on, that year, July 4th (a national holiday) fell on a Friday, so we had a "three day weekend". When this happened, the site didn't get pushed. This mattered because push would usually get the machines to free up a bunch of memory at once and generally become less-burdened.

A regular two-day weekend would leave things looking pretty thin by the time Monday's push rolled around, but a three-day weekend made things a lot worse... and this was a full month before everything finally broke.

So, yeah, the site broke that morning, but it's not like it was too surprising. The signs had been visible for quite a while in advance. Imagine standing on top of a massive dam and you start seeing one leak, then two, then four, and so on. You try to get help but it's just not happening.

Of course, once the dam actually fails, then somehow you find the resources to get people caring about dam maintenance. It's funny how that works.

Today's only half of the leap year fun

It's that time again, when code written in the past four years shows up in our lives and breaks something. Still, while you're enjoying the clown show of game companies telling people to manually set the clocks on their consoles and people not being able to fill up their cars, keep one thing in mind:

Only half of the fun of a leap year happens on February 29th.

The rest of it happens in ten months, when a bunch more code finds out that it's somehow day 366, and promptly flips out. Thus, instead of preparing to party, those people get to spend the day finding out why their device is being stupid all of a sudden.

So, if you got through today unscathed, but are somehow counting days in the year somewhere, you now have about 305 days to make sure you don't have your own Zune bug buried in your own code.
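
For reference, the Zune bug itself reportedly boiled down to a day-counting loop shaped like this (simplified reconstruction):

/* On day 366 of a leap year, neither branch changes anything,
   so this loop spins forever. */
static int year_from_days(int days, int year) {
  while (days > 365) {
    if (is_leap(year)) {        /* is_leap() as sketched below */
      if (days > 366) {
        days -= 366;
        year++;
      }
    } else {
      days -= 365;
      year++;
    }
  }
  return year;
}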

...

One more random thought on the topic: some of today's kids will be around to see what happens in 2100. That one will be all kinds of fun to see who paid attention to their rules and who just guessed based on a clean division by four.
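
And for anyone who wants to check their guesses against the actual rule:

/* Gregorian leap year: every 4 years, except centuries, except
   every 400 years. So 2000 was a leap year; 2100 won't be. */
static int is_leap(int year) {
  return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}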

1 << n vs. 1U << n and a cell phone autofocus problem

Maybe 15 years ago, I heard that a certain cell phone camera would lose the ability to autofocus for about two weeks, then it would go back to working for another two weeks, and so on. It had something to do with the time (<some unit> since the epoch), the bits in use, and a fun little thing called sign extension.

I got some of this from a leaflet that was posted around where I worked at the time. It was posted in areas where the public could see it, so I figure it's fair game.

Here's a nice little test program to show what I'm talking about:

#include <stdio.h>

/* Shifts a signed int; the result is sign-extended when it widens
   to unsigned long for the return value. */
static unsigned long set_bit_a(int bit) {
  return 1 << bit;
}

/* Shifts an unsigned int: no sign bit, so no sign extension. */
static unsigned long set_bit_b(int bit) {
  return 1U << bit;
}

int main() {
  printf("sizeof(unsigned long) here: %zu\n", sizeof(unsigned long));

  for (int i = 0; i < 32; ++i) {
    printf("1 << %d : 0x%lx | 0x%lx\n", i, set_bit_a(i), set_bit_b(i));
  }

  return 0;
}

This does something mildly interesting when run on a 64 bit system:

$ bin/exp/signext 
sizeof(unsigned long) here: 8
1 << 0 : 0x1 | 0x1
1 << 1 : 0x2 | 0x2
1 << 2 : 0x4 | 0x4
1 << 3 : 0x8 | 0x8
...
1 << 28 : 0x10000000 | 0x10000000
1 << 29 : 0x20000000 | 0x20000000
1 << 30 : 0x40000000 | 0x40000000
1 << 31 : 0xffffffff80000000 | 0x80000000

Meanwhile, the same code on a 32 bit machine is relatively boring:

$ ./t
sizeof(unsigned long) here: 4
1 << 0 : 0x1 | 0x1
1 << 1 : 0x2 | 0x2
1 << 2 : 0x4 | 0x4
1 << 3 : 0x8 | 0x8
...
1 << 28 : 0x10000000 | 0x10000000
1 << 29 : 0x20000000 | 0x20000000
1 << 30 : 0x40000000 | 0x40000000
1 << 31 : 0x80000000 | 0x80000000

Gotta love it.

A vintage network attack called smurf

In the vein of my "flash" story from a few years ago, here's one about "smurf".

Back around 1997, there was something new going around in the realm of net abuse: "smurfing" a target. This one involved a nice little trick that let you send out a relatively small amount of traffic and let someone else turn it into a much larger amount of traffic, and then that response would be directed onto your target.

This required two bits of cooperation from the environment. First, you had to be able to transmit a packet of some sort with the source address set to your target. Yes, this does mean "spoofing" the source address, and any responsible ISP should filter that nonsense on egress, but back then it was all too infrequent.

Then, you also had to send it to either the network or the broadcast address of some particularly juicy network that was laden with hosts that would reply to that sort of thing.

For example, let's say you had a network 192.0.2.0 with the netmask 255.255.255.0 (a /24). Then in that case, .0 would be the network and .255 would be the broadcast address. Back in those days, firing a packet at either of those would usually make the router spew it out to the *Ethernet* broadcast address, and so it would hit every host on that subnet which could then decide to reply or not.

So, just imagine a packet "from" your target, seemingly addressed to dozens or hundreds of machines, which then all answer at once. The attacker sends out a single ~1500 byte ping (for example), and the victim receives that multiplied by however many hosts decide to reply - not great!

There were some things which could be done about this. Routers eventually got a config knob that let you turn off "directed-broadcast" or similar, so anything arriving from the outside for a network or broadcast address would just be dropped on the floor. Unix boxes of different flavors also started gaining the ability to have packet filtering rules. (This took far too long in some cases.)

Besides that, people running networks could follow various best practices and not let traffic that's claiming to be from somewhere else in the world egress from their network. Any packet like that is either a misconfiguration on someone's part (possibly yours), or maybe some dummy trying to attack someone else. Either way, it needs to be tracked down and dealt with.
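
On a Linux border box, the egress half of that is a one-liner - a sketch with a placeholder prefix and interface:

# Drop anything leaving the uplink that doesn't carry one of our
# own source addresses (192.0.2.0/24 and eth1 are placeholders).
iptables -A FORWARD -o eth1 ! -s 192.0.2.0/24 -j DROP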

Sometimes people would use this kind of stuff as a reason to "block all ICMP", and then they would just create other problems like breaking path MTU discovery, and that would cause connections to hang with large packets and weird non-Ethernet-MTU-sized links.

Another related attack tool back then was called "fraggle" and it did the same sort of directed-broadcast shenanigans, but which used UDP instead of ICMP. The effect was the same, and it also got around anyone who thought that filtering all ICMP was somehow a good idea.

The old days weren't always good days.

LDAP differ feedback and the "666" I missed

It's another round of feedback, because there's been a lot going on.

I must admit that I did not expect my post about diffing LDAP to have such a response. I honestly just wanted to tell a story about some mildly rebellious activity I had seen happen and then had decided to do myself, and it turned into a whole thing. Lots of people wrote in to say that they have also been doing it, and others have started as a result of that post! That was not my intent but the net effect is definitely pleasing to me, so it all worked out.

Now, in response to some specific comments - a few people wrote in to say that the "epitaphs" internal page/service at Google (yep) now allows you to line some stuff up before you leave. That way, when you leave and your entry shows up, it'll have something on it that you submitted directly, and you don't have to "bounce through a friend" or whatever.

I think part of this is that people don't realize just how long it's been since I was plugged in to that ecosystem. I left in May of 2011 - almost thirteen years ago now! I thought it was broken badly enough to leave all the way back then, and this was after being there for about four and a half years. It still amazes me when I find out that people willingly go there, but I've had to tell myself to shut up about that and just advise them to "get in, get paid, and get the hell out".

Seriously, get in, take their money, and go. Whatever tech darling status they had was gone a LONG time ago. I dare say that I watched it curl up and die from the inside. I can't even imagine what could possibly be left inside there now that so much time has passed.

It occurs to me that sufficiently young people who are just now entering the industry fresh out of school (or whatever) have no idea what it used to be like. They've only known the current versions of things, and probably figure it's as bad as everywhere else, so why not, right? I guess it's hard to argue with that. Just never make the mistake of thinking that it's special somehow. Those days are gone gone gone and they aren't coming back. Companies which are that massive just can't deliver that kind of environment.

...

In response to the WPA3 stuff and badness happening after 11 hours, Ewen (and a few other people) wrote in and said that I should have been looking at minutes, not seconds. 39960 / 60 gives 666 minutes. Oops. Yeah, I guess I missed the forest for the trees there. 666 minutes would do the job, for sure. \m/ rock and roll?

...

Other people said they were diffing far more than just the list of unixnames in LDAP. They used it to detect people getting promoted when their titles changed, and other things like that. I honestly didn't care about detecting that, and I don't think that the LDAP (really AD behind the scenes) system I was poking at even stored such things.

A fair number of companies expose only a glossed-over view of job titles in their permission systems, one that doesn't reflect whatever HR has for those same people. Everyone in LDAP might be a "software engineer", but in the actual HR system they might have 100 different varieties for "new grad" and "testing" and "server" and "app" and all of these other dumb things that they think they need. That means you might not see anything change when people get promoted, change teams, or shift around between different parts of the company.

While I'm talking about titles, I will mention one thing: it's interesting that certain companies talk about why they hide levels for random "ICs" (individual contributors, i.e., not managers), but then go ahead and make a big deal out of managerial titles.

Seriously, one company in particular had everyone be some sort of Software Engineer or Production Engineer or something like that without saying that this person was a 3, or a 4, or a 5, or whatever on up the line.

Meanwhile, that same company let you see that a given person was a Manager (5, 6), a Director (7, 8), or a VP (9, 10) with just a glance at their profile page.

The same sort of visibility was not afforded to the ICs at those same higher levels. You had to "just know" that so and so was "one of the 10s" or whatever.

Also, for anyone who hasn't already seen my thoughts on the matter somehow: you are not your level, and your level is NOT an indication of basically anything more than how much they like you. It is only loosely linked to your ability at the bottommost rungs of the "career ladder", and only when management is being forced to adhere to it. If the right people like you, your level will rise. If they don't, it will stagnate. Your technical abilities are *almost* completely disconnected from there.

There is one notable exception I can mention though: if you are somehow able to do something that nobody else can/will do, they will "put up with you" as long as the amount of whatever you bring in outweighs the costs of you being, well, you. But, once you start asking for things or try to do stuff that goes against what they personally want, the balance will tilt, and once it goes past center into the other side, they won't give a damn about what you can deliver any more.

I should note this has little to do with what the business needs at that point. They're probably in it for themselves, and they don't care that chasing someone out is not the right thing for the business. Indeed, they will probably bail out for greener pastures a few years later.

...

Some people from a few very large tech companies that are currently doing layoffs have pointed out that their "epitaphs" or equivalent isn't always accurate. There are groups of people who will be "off limits" and so won't come up in the reports. Obviously, once management has gotten to that level of involvement with the day to day operations of such a tool, it can be considered compromised.

That's pretty much a given: things start out as a simple hack, then grow into a small community that knows about it, and sometimes end up becoming well-known and even legitimized. But, more often than not, these same systems will be co-opted by whoever's running the show in terms of hiring and firing, and they'll stop providing useful data.

At that point, I guess you have a choice: you can try to build another thing from scratch for yourself, or you can admit that the company is too far down the road of corporate lockdown hell, and live with the fact that it'll never be accessible the way it had been.

...

Finally, there's at least one person who was visibly annoyed by the fact that I said "uid" and "(unix account name)" in the same breath. Guess what? That particular LDAP (again, really AD) system I was dumping DID use a field named "uid" to put in the unixnames. Sure, there were also *numeric* uids in the Unix sense, but that lived elsewhere in other fields, like, oh, uidNumber. Surprise surprise.

I just love it when there's this assumption that I must be screwing it up by default. To that person: ask yourself if you'd assume that about every random post on people's web sites that might mention such a thing, or only certain ones. If it's only certain ones, I bet you know which ones, and why.

It's obvious. You're making it clear that you're part of the problem. Knock that shit off.

And, as for me making mistakes, hell yes I make mistakes. I screw up all kinds of stuff and have to go back and fix it and/or explain what happened where appropriate. Anyone who follows the feed will tell you that old posts will "mysteriously" spring back to life with a recent modified time and a handful of tweaks applied. Most of those fixes come about from people sending feedback and going "hey I think X might be Y instead". It happens all the time, and I'm talking about 13 years of posts here.

If you think I made a mistake back in that post by saying "uid" and "unix account name" instead of "numeric unix account number", you could just hit the feedback button and say as much. But you know that it's way more impactful to assume that I'm a dumbass and don't know what the hell I'm talking about and do that on a big orange sewer.

You do know what "projection" means, right?

Okay, enough of that.

Figure out who's leaving the company: dump, diff, repeat

One common element of the larger places where I've worked is that they tend to have a directory service of some sort that keeps track of who's an employee and who isn't. You can learn some interesting things by periodically dumping that list and then running comparisons against the previous dump.

A certain company had this rolled up into an internal service called "epitaphs" where an entry for a person would appear a day or two after they "disappeared from LDAP" - meaning, they left the company. Then other people who still worked there could add comments like "went back to school", "moved to Idaho to raise sheep", that kind of thing.

This had an interesting side-effect: you couldn't write your own "epitaph", because by definition you had to already be gone from the company for your page to exist. Someone else who knew you had to add it. I actually received an e-mail to that effect one time: "I'm leaving, so when it shows up, please add XYZ". I was pleased that they trusted me to do that, and a few days later, I pasted it in as requested.

Another place I worked didn't have anything quite like this. There was the "internal profile" where you could see that so and so worked at the company from <date> to <date>, but there wasn't any sort of periodic update available. I decided to roll my own. It didn't take much in the way of effort, really. A cron job on my dev server (a physical box in a datacenter with access to my home directory) woke up a couple of times every day and dumped the entire list to a file. Then it compared it to the last one, crunched it down to just the uid (unix account name) field, and appended the results to a log file.

Over time, various other people learned about this, and since I had left it world-readable, they were able to leave up a "tail -f <path>" to keep tabs on it, and sometimes something surprising would show up during the day. People would sometimes just vanish. Other times, there were bizarre things going on that added a bit of context.

The log entries looked like this:

Thu Feb 08 18:26:42 PST 2024 : uid: <someone>

That was enough to let you go digging and find out more if you actually gave a damn about why that particular person no longer worked there. Otherwise, it didn't flood you with useless data.

One time, I pasted in a line like that into an IRC channel and that <someone> popped up and said "yeah, I don't work here any more". It turned out their account had been deactivated, but they still had a client connected. When I mentioned their account name, they got a notification, flipped to that window, and replied. We had a few minutes to chat about it.

It was weird saying farewell to someone that way. Normally, the electronic lines of communication are severed early on. I think what happened here is that the IRC servers only checked auth at connect-time, and then nothing went back to make sure that sessions remained associated with current employees. (It's a bit of a hard problem.)

Another time, some manager type said they were going to be late for a meeting because of some "dumb manager thing" they had to do. Sure enough, a few minutes into that meeting, a line scrolled across showing the deactivation of an account of one of their direct reports. Obviously, they had to go into one of those HR meetings where they showed someone the door.

I'd say the best time to start doing this is when you start at a company, or when that company grows big enough to actually have LDAP or whatever. That means the second-best time would be today.

Incidentally, the 'comm' tool is great for this sort of thing.

comm -2 -3 <(grep ^uid: old | sort) <(grep ^uid: new | sort)

... and there you go: -2 and -3 suppress the lines unique to the new dump and the lines common to both, so all that's left is the uids which disappeared since last time.

Now, this sort of thing is not perfect. If you don't catch errors, the first time it fails to dump and yet diffs a full list against an empty list, it'll look like everyone quit. This is not what you want. Also, once you work at a big enough company, there WILL be days when some automation will run amok and "fire" everyone, and every account will be deactivated. This will happen more than once if you stay there long enough.
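
A cheap guard against that failure mode looks something like this (a sketch, with hypothetical paths):

#!/bin/bash
# Sketch: refuse to diff when the fresh dump is empty or missing.
OLD=$HOME/ldap/old
NEW=$HOME/ldap/new

if [ ! -s "$NEW" ]; then
  echo "dump failed; not diffing against an empty list" >&2
  exit 1
fi

comm -2 -3 <(grep ^uid: "$OLD" | sort) <(grep ^uid: "$NEW" | sort)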

Incidentally, if someone gets mad about you running this sort of thing, you probably don't want to work there anyway. On the other hand, if you're able to build such tools without IT or similar getting "threatened" by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don't tend to last.

Feedback: lots more WPA3, and then some

It's time for me to respond to some recent feedback. As usual, this is a mix of topics and the responses are pretty much off the cuff, so strap in and hold on tight.

...

At least one person mentioned the 11 hour WPA3 problem on my Raspberry Pis and asked if I was experiencing clock drift. This is kind of funny to me since I've been picky about keeping clocks synced in my personal and professional lives these past few years. So, no, not really. All of those Pis have chrony installed, and it's doing a great job of keeping their clocks disciplined.

I was the crazy person who spent $300 of my own money to buy a GPS-to-NTP box when the unmaintained corporate infrastructure at a certain job was down to its last "proper" time server and was in danger of failing itself. It never came to that, but we got mighty close that winter. If it had fallen over, then I would have "backfed" time into production from the corporate network using a little GPS antenna puck in the window by my desk. How crazy is that?

I've gone to lengths to make things right, put it that way.

...

Also regarding the WPA3/Pi stuff, someone said "if it is truly cursed, it will be 11 hours and 360 seconds". That would be, what... 11:06:00? I don't think that joke landed with me. I was hoping it would involve "666" somehow, but (11 * 3600) + 360 is 39960.

They did mention that an 11:06:40 period is a rollover from 9999999 to 10000000 jiffies with Hz set at 250. That is, if you tick at 250 Hz, after 11 hours, you'll be at 9900000 ticks, and another 100000 ticks past that is 400 seconds, hence the 06:40 part.

Of course, for that to break something, some clown would have to be expressing the time as ASCII digits, and then breaking when it got "too wide". I mean, it happens. It happened to KDE back in September 2001 when time_t went from 999999999 to 1000000000. That was a "fun" one to deal with.

...

Niels asks how I keep my posts "so level-headed". I guess that's in the eye of the beholder, to be honest. I've had situations where I've deliberately taken every bit of emotion out of my actions, and *still* had people calling it out after the fact.

After a few decades of this, I've concluded that a lot of these results were decided long before I opened my mouth or started writing. They basically have a problem with the overall concept of me saying stuff, and exactly what gets said doesn't matter a whole lot. They just use the words I choose in order to sprinkle it throughout their takedowns.

The way you can know this is happening is when the same words come from someone else who has a different *ahem* "presence" in the same space, and then nothing bad happens - or perhaps it's even taken as good. That happens too, and it's really irritating, since it confirms that the 2020s may be here, but that's just a number as far as people being narrow-minded and generally pig-headed goes.

...

A reader writes that I should try wpa_supplicant 2.10 on my RPI, and, well, I'm sorry to say that I've done that and then some, and it didn't help. Check it out - wpa_supplicant as built from upstream (hostap's git repo) *does* bring the link up... for about 10 seconds. Then it throws an error and kills it. It looks like this:

Jan 28 13:32:24 rpi5b NetworkManager[847]: <info>  [1706477544.7638] device (p2p-dev-wlan0): supplicant management interface state: associating -> associated
...
Jan 28 13:32:34 rpi5b wpa_supplicant[1209]: wlan0: Authentication with <AP> timed out.

It's a consistent 10 seconds. I was feeling like torturing myself that afternoon, so I started screwing around with the code. It took a while to find where things were happening, but I finally just extended the timeout. It would hold on there a bit longer, then it would die after the new timeout.

So, finally, I just commented out the part where it complains about that and tears down the link, and you know what? It stayed up. So, I went back a little bit, and just let it bring the link up, then suspended it with ^Z and went about my day.

Three hours later, it was fine. Two hours after that, same thing. I kept checking on it into the night. Then, finally, it died early the next morning, even with wpa_supplicant *still suspended*. Here's what it looked like from the AP side:

Mon Jan 29 01:19:23 2024 daemon.info hostapd[6371]: ath1: STA <pi 5's mac address> IEEE 802.11: disassociated

And yeah, that's about 11 hours after I started the last experiment. (I should mention that having to wait 11 hours to verify things absolutely sucks.)

What I gather from this and what a few people have told me so far is that the actual association is kept running by the device itself, and whatever you have running on the host machine (iwd or wpa_supplicant) is basically along for the ride.

Also, I could swear I found a patch from someone to one of those projects that basically says "it's gonna die after 12 hours anyway, so just let it happen and restart the connection then". I can't find it while writing this, but it doesn't matter. Here's what's wrong with "just let it fail": NetworkManager has a limit for exactly how much crap it'll accept, and after the default of 3 retries, it'll just leave the link down.

Sure, you can override this, but now you're stuck with Yet Another Behavioral Patch for any of your machines which might be affected by this. Have fun with that.
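
If you go looking, the knob appears to be the per-connection autoconnect-retries property, where 0 means "keep trying forever":

nmcli connection modify <name> connection.autoconnect-retries 0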

Oh, and, obviously, forget about actually DOING anything over that link while it's being restarted. If you care about reliability, this is not a tenable situation.

Once again, if you care about having decent "modern" (as in 2020) wifi on a Pi, go get yourself a little USB barnacle that's supported upstream and go on with life. Or, better yet, just use hardwired Ethernet and forget that the thing even has a radio on board.

I'll pour one out for the sanity of anyone who doesn't have a choice in the matter. At least you're not suffering by yourself.

Stamping production binaries with build info

As my assortment of dumb little home-grown utility programs grew over the years, I found myself needing to know when a given binary was built. Sometimes things exist outside the realm of a packaging system, and so the binary itself needs to convey that metadata from build time.

I've seen some places solve for this by having giant placeholder strings baked into their binaries that they then reach in and "stamp" later, turning the "XXXXXXXX" or whatever into "Built by foo@foo.blah.evilcorp on ...". While that approach mostly worked, it was too spooky for me and I decided to stay away from it.

My system is something that uses a little nuance of C++ that I've mentioned a couple of times already. It's not the cleanest thing and it does involve a bit of groaning, but it works. In case anyone else wants to try it in their projects, here's how I set it up.

First, I have this buildinfo/base.h, and in it, I define a struct called Details, and it has all of the fields I care about - times, hostnames, usernames, commit hashes, that kind of thing.

There's also this:

extern std::optional<Details> details_;

Yes, that is globally visible, but it's inside a namespace so the crap factor is reduced somewhat. It's a necessary evil.
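
To make that concrete, here's a minimal sketch of the header. The field names are hypothetical stand-ins for whatever you actually care about:

// buildinfo/base.h (sketch; field names are hypothetical)
#pragma once

#include <ctime>
#include <optional>
#include <string>

namespace buildinfo {

struct Details {
  std::time_t build_time;
  std::string build_host;
  std::string build_user;
  std::string commit_hash;
};

// Populated (maybe) before main() by an overlay object; see below.
extern std::optional<Details> details_;

}  // namespace buildinfo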

I also have a buildinfo/base.cc and it actually creates that variable:

std::optional<Details> details_;

There's also a GetBuildDetails function which will return the value of details_ if one exists, or a suitable error if not.

Now, you might be saying "it'll never have a value, so it'll always be an error", and you're mostly right. Just from what I've described so far, that in fact is the case. buildinfo/base.{cc,h} rolls up into buildinfo/base.o, and that gets linked into my programs during a normal development type build. If one of those programs calls the GetBuildDetails() function, then yes, it gets the "sorry, nobody home" error response.

But, I have a way to inject the build info when I do a "production" build. This kind of build has slightly different config settings in my build system, and one of them tells it to "stamp" the binary.

The way this works is where the evil starts slipping in. On stamped builds, my build system writes out a file called buildinfo/overlay.cc. This file #includes "buildinfo/base.h" to pick up the definition of Details and the 'extern' for details_ itself. Then it rattles off a bunch of variables and their values (build time, build host, build user, ...), and then it defines a class called Overlay.

Overlay's constructor has one job: it reaches into details_ and populates a bunch of fields with the values from those earlier variables.

Then the thing that actually makes it run shows up, and here's the "spooky action at a distance":

static Overlay overlay;

Just having that line in the file causes the program to create an instance of that class shortly before it reaches main() - as long as that file is linked into the final binary. When that constructor runs, it will populate details_, and then any code run from the rest of the program will see the build info.
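
Putting it all together, a generated overlay.cc might look something like this. It's a sketch: the values are made up, and the real file is spat out by the build tool.

// buildinfo/overlay.cc (sketch; generated at build time)
#include "buildinfo/base.h"

namespace buildinfo {

// Deliberately not static or const so they keep external linkage and
// remain visible to debuggers and other binary-inspection tools.
std::time_t kBuildTime = 1707186971;
std::string kBuildHost = "buildbox.example.com";
std::string kBuildUser = "someone";

class Overlay {
 public:
  Overlay() {
    Details d;
    d.build_time = kBuildTime;
    d.build_host = kBuildHost;
    d.build_user = kBuildUser;
    details_ = d;
  }
};

// The spooky part: constructing this static object populates details_
// before main() runs... but only if this .o gets linked in.
static Overlay overlay;

}  // namespace buildinfo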

This is convoluted, so I'll restate it here for clarity: it's the difference between linking "a.o", "b.o" and "c.o" into "prog", or "a.o", "b.o", "c.o" *and* "overlay.o" into "prog". If you don't link in that extra object, the sneaky stuff never happens, and it stays unpopulated. Using a std::optional wrapper saves us from the jankiness of using a bare pointer... or worse.

There are some bonuses from using the intermediate variables instead of just having a bunch of .field = "val" type things in the part where details_ gets initialized. For one, if those variables are not set to static, then they'll be visible to things like debuggers and certain other tools. Then you can do something like this:

$ gdb lavalamp_server -q
Reading symbols from lavalamp_server...
(gdb) print buildinfo::kBuildTime
$1 = 1707186971
(gdb) 

That's pretty neat, right? Analysis of a binary at rest? You can even do this without going through a debugger if you really want to.

Finally, how does the build tool handle this? It's a bit more of the special-case stuff for something that isn't just an ordinary build. If "stamping" of binaries has been requested for the active build type, then it generates a fresh overlay.cc and compiles it to overlay.o.

Then, in the link stage, if it's in "stamp mode", it inserts that object file as just another dependency of the build target as if it had been discovered by way of a '#include "some_dir/some_lib.h"' or whatever. This adds it to the list of objects passed to the linker, and a fresh binary pops out a few moments later.

I'm a fan of this technique since it only really adds two places in the build system where things go off and act a little strangely: the init sequence of the build tool when it's first built, and the link sequence of the target(s) to make sure it gets "injected".

For anyone who's worrying about "repeatable builds" or somesuch, I will point out that nothing's stopping you from having yet another build type which is otherwise as described above but which puts known placeholder data into the details_ variable. In that world, you should be able to go through the entire process and still get the same output, even on a different date, on a different box, and as another username.

I no longer wonder about which version of a binary is in "prod".

Hold on there: WPA3 connections fail after 11 hours

What a night. I hit upon something that got WPA3 working on some Raspberry Pi systems and excitedly put up a post to share the good news. Then I went away for a while, and this morning found something new: the damn things won't stay connected for more than 11 hours. All three of them failed in the same order that I changed them over.

The timeline is something like this:

01:03:51 NetworkManager [...]: new IWD device state is connected
[...]
12:04:39 iwd[...]: Received Deauthentication event, reason: 0, from_ap: false

Now that's something, considering none of the tunable aspects of the WPA3/SAE setup on my network are set to 11 hours. But, what's this? You do a search for "received deauthentication event" and "11 hours" and what do you find but a cursed Infineon developer community post on that very topic.

It's from October 2021 (!) and it's from someone who's using the same sort of chipset (Broadcom/Cypress/Infineon CYW stuff) NOT on a Pi, and they get the same "deauth" thing when on a WPA3 network. If they then drop down to WPA2, it stops. The thread runs for close to a year, and then just stops cold in August 2022 with no resolution.

So, what'll it be? WPA2 mode and not being able to get onto any networks that have gone to 6E or 7, or WPA3 mode and having it fall off the network every 11 hours?

Here, I'll make a prediction: someone will say "it works most of the time, so that's fine, we won't be fixing this". Considering the level of crap people put up with in their tech these days, that'll probably be the steady-state.

My conclusion: this entire ecosystem is deeply cursed.

WPA3 on Raspberry Pi 3B+, 4B and 5B with iwd (or not...)

Okay, it's been several months since I last wrote about WPA3 on Raspberry Pi hardware. Now I have some good news: it mostly works, assuming you're willing to do a little tinkering. You no longer have to wrangle custom firmwares and binary blobs into place. That's been done for you.

One important thing here: I'm only talking about Raspbian/Raspberry Pi OS here, and then only bookworm (12). If you're running something else, none of this may apply. For all I know, it might have been working all along if your distribution figured it out sooner.

So then, if you have a 3B+, 4B or 5B on bookworm, get ready to rock.

Every so often I look at the list of package updates coming down the pipe through apt for my systems, and usually groan at the usual round of CVE patches. But, this time I saw something rather different. It's a change to "firmware-brcm80211", and what's this? Something specific to the CYW43455 that the 3B+ and later use? Oh, that's interesting, right?

  * brcm80211: cypress: Use a generic CYW43455 firmware
    - Version: 7.45.234 (4ca95bb CY) CRC: 212e223d Date: Thu 2021-04-15 03:06:00 PDT Ucode Ver: 1043.2161 FWID 01-996384e2

I wondered if this would stop the madness, and so applied it and rebooted. "iw phy" showed the good news - at the very bottom, it now supports both SAE_OFFLOAD and SAE_OFFLOAD_AP. This means it can actually do SAE... but you're going to have to say goodbye to wpa_supplicant in favor of iwd.

If you switched over to NetworkManager for consistency with the rest of the bookworm changes, this is not a big deal. If you're running your network some other way, you get to figure this out yourself.

The steps work out like this: apt update, then apt upgrade so you get the firmware-brcm80211 from January 17th (or later, once this post gets old, I guess). Then apt install iwd.

Assuming you've been running your Pi in a crappy non-WPA3 network, go into nmtui or equivalent and disable that network. In a minute or two, you're not going to need it any more.

Then disable wpa_supplicant and drop a bit of config into /etc/NetworkManager/conf.d/iwd.conf:

[device]
wifi.backend=iwd
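
In case "disable wpa_supplicant" sounds vague, on a stock bookworm install that's presumably just the usual systemd dance (the exact unit name may vary by setup):

sudo systemctl disable --now wpa_supplicant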

Reboot ... or do equivalent wrangling of drivers and binary crap to get it to unload and reload the fresh stuff. Then run "iw phy" again. If it says SAE_OFFLOAD and SAE_OFFLOAD_AP, you're ready to proceed. If not, well, something's wrong, and you should turn on the crappy old non-WPA3 network again.

Assuming it worked, then go into nmtui or whatever and tell it to activate the actual WPA3-only network. Paste in your PSK and let it go, and a few seconds later you should be in business.

That's it. That's all it takes.

Now then, what about the 3B, you might ask. It's on a different chip that doesn't even do 5 GHz, so changing the 43455 firmware wouldn't help any. It seems to be a 43430 instead, and I have no idea if there's any chance of getting a similar firmware change for it. Obviously, if this changes, I'll post something about it, too.

I can't comment on the other models, like the Zeroes and the other weird little boards they have. I only have access to these relatively normal models, and that's what I was able to work on.

Go forth and have better networks!


January 24, 2024: This post has an update.

C++ time_point wackiness across platforms

It's a new year, so let's talk about some more time-related shenanigans. This one comes from the world of writing C++ for multiple platforms. A couple of weeks ago, I was looking at some of my code and couldn't remember why it did something goofy-looking. It's a utility function that runs stat() on a target path and returns the mtime as a std::chrono::system_clock::time_point. This is nicer than using a time_t since it has sub-second precision.

The trick is getting it out of a "struct stat" and into that time_point. The integer part is simple enough: you use from_time_t on the tv_sec field. But then you have to get the nanoseconds (tv_nsec) from that struct into your time_point. What do you do?

The "obvious" answer sounds something like this: add std::chrono::nanoseconds(foo.tv_nsec) to your time_point. It even works in a few places! It just doesn't work everywhere. On a Mac, it'll blow up with a nasty compiler error. Good luck trying to make sense of this the first time you see it:

exp/tp.cc:14:6: error: no viable overloaded '+='
  tp += std::chrono::nanoseconds(2345);
  ~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__chrono/time_point.h:65:73: note: candidate function not viable: no known conversion from 'duration<[...], ratio<[...], 1000000000>>' to 'const duration<[...], ratio<[...], 1000000>>' for 1st argument
    _LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_SINCE_CXX17 time_point& operator+=(const duration& __d) {__d_ += __d; return *this;}

Nice, right? It tells you that there's something wrong, but the chances of someone figuring that out quickly are pretty slim. For the benefit of anyone else who encounters this, it's basically this: a system_clock::time_point on that platform isn't fine enough to represent nanoseconds, and they're keeping you from throwing away precision.

To make it happy, you have to jam it through a duration_cast and just accept the lack of precision - you're basically shaving off the last three digits, so instead of something like 0.111222333 seconds, your time will appear as 0.111222 seconds. The nanoseconds are gone.

I assume you might find other platforms out there which don't support microseconds or even milliseconds, and so you'd hit the same problem with trying to "just add" them to a system_clock time point.

At any rate, here's a little bit of demo code to show what I'm talking about. As-is, it'll run on Linux boxes and Macs, and it'll show slightly different results.

#include <stdio.h>

#include <chrono>

int main() {
  std::chrono::system_clock::time_point tp =
      std::chrono::system_clock::from_time_t(1234567890);

  // Okay.
  tp += std::chrono::milliseconds(1);

  // No problem here so far.
  tp += std::chrono::microseconds(1);

  // But... this fails on Macs:
  // tp += std::chrono::nanoseconds(123);

  // So you adapt, and this works everywhere.  It slices off some of that
  // precision without any hint as to why or when, and it's ugly too!

  tp += std::chrono::duration_cast<std::chrono::system_clock::duration>(
      std::chrono::nanoseconds(123));

  // Something like this swaps the horizontal verbosity for vertical
  // stretchiness (and still slices off that precision).

  using std::chrono::duration_cast;
  using std::chrono::system_clock;
  using std::chrono::nanoseconds;

  tp += duration_cast<system_clock::duration>(nanoseconds(123));

  // This is what you ended up with:

  auto tse = tp.time_since_epoch();

  printf("%lld\n", (long long) duration_cast<nanoseconds>(tse).count());

  // Output meaning when split up:
  //
  //        sec        ms  us  ns
  //
  // macOS: 1234567890 001 001 000  <-- 000 = loss of precision (246 ns)
  //
  // Linux: 1234567890 001 001 246  <-- 246 = 123 + 123 (expected)
  //

  return 0;
}

To bring this full-circle, that's why I have that ugly thing in my code to handle the addition of the tv_nsec field. Without it, the code doesn't even compile on a Mac.
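
For the curious, the shape of that utility function ends up something like this. It's a sketch: error handling is elided, the name is made up, and on a Mac the stat field is spelled st_mtimespec rather than st_mtim.

#include <sys/stat.h>

#include <chrono>

// Sketch: return a file's mtime with sub-second precision.
std::chrono::system_clock::time_point MtimeOf(const char* path) {
  struct stat st{};
  stat(path, &st);  // error handling elided

  auto tp = std::chrono::system_clock::from_time_t(st.st_mtim.tv_sec);

  // The ugly part: squeeze the nanoseconds into whatever resolution
  // this platform's system_clock actually supports.
  tp += std::chrono::duration_cast<std::chrono::system_clock::duration>(
      std::chrono::nanoseconds(st.st_mtim.tv_nsec));
  return tp;
}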

Stuff like this is why comments can be very important after the fact.

A year-end wrapup of responses to reader feedback

It's time for some end of the year feedback. I get a bunch of comments and questions from people through my contact page, and sometimes this is the only way to reply. Other times, a response is also suitable for a wider audience.

...

Igor asks:

Have you played around with cling? Seems you may have the knowledge to break it or request the developers to enhance it with some useful feature(s). Like suppose doing scripting in it?

I didn't even know what cling was until I read this and went looking. It seems to be a clang/LLVM-based C++ interpreter that's interactive. I guess I don't have a need for something like that. As far as deliberately breaking things, I have no doubt that my luck would lead to any number of bad outcomes. However, I have enough of that in my life already without seeking out new stuff just for the sake of having something to break.

Now, if there was a good reason for me to do it, that would be another story. (I am available for mercenary purposes, put it that way.)

...

An anonymous reader asks:

What's the specs of this new web server?

It's only new to me. It's actually fairly old. I think it's roughly 2014 vintage. I bought it from one of those vendors who resell old servers. I figured "it's a server, not a toothbrush" and so the notion of using some random old box was not a problem for me. It runs Linux and is plenty quick, so I'm happy enough.

I guess this is a good point to tell the story of actually installing this thing. Despite working at places with millions of Linux boxes over the years, I'd never hung a server. I'd done routers and dialup boxes, but those were all relatively short. They'd hang from the front posts and were plenty happy with life that way.

flicker, on the other hand, is a monster. It's so long the cabinet doors almost didn't close. I had no idea that length was a dimension that might be a problem. I clearly didn't realize just what I was getting when that order was placed, and when it showed up in a giant box, I started thinking "what have I done?"...

Initially, the machine barely fit into the cabinet with the power cord being squished on one end and the network cables being squished on the other. I came back about a month later and swapped it to a new power cord that has a 90 degree bend built into the server-side plug. This clawed back about an inch on that side and let the whole thing slide back a little bit which took the pressure off the Ethernet cables.

I took the now-permanently-crimped (and damaged) power cable, cut it in half so nobody would find it and try to use it, and tossed it.

It was the sort of thing that anyone with experience would point at and go "ha, you screwed up", and indeed, I did. It turned out to not be fatal to the project of moving to colocation, but it was mighty close.

Next time, I'll pay attention to this sort of thing.

Lesson learned: go see the cabinet and measure it before ordering something that's almost three feet long!

(It's an Ivy Bridge flavored dual-socket hex core Xeon with 128 GB of memory and a couple of SSDs. It also has a far faster pipe than the previous box. It's a complete monster for how I use it. Oh, and it's not in Texas, which has turned out to be very important.)

...

Regarding my "char buf[1048576]" thing from the other day where I blew the stack, a reader says:

You can use boost::thread instead. It lets you specify the stack size.

Honestly though, I don't *want* to specify the stack size. I'm fine with defaults. Also, any answer which involves "boost" means I'm asking the wrong question. I don't ever want to get stuck with something like that. There's been too much drama in my life from boost dependencies in years gone by.

The obvious way around the "problem" for me is to do something like a vector<char>, push it out to whatever size, and then let read() use it as a buffer. This is pretty much what I end up doing any time I have to deal with old-school C library functions from my C++ code.
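
In sketch form, that looks something like this (names and sizes are arbitrary):

#include <unistd.h>

#include <string>
#include <vector>

// Sketch: a heap-backed buffer for read() instead of a giant stack array.
std::string ReadChunk(int fd) {
  std::vector<char> buf(1048576);  // lives on the heap, not the stack
  const ssize_t n = read(fd, buf.data(), buf.size());
  if (n <= 0) return "";           // error handling elided
  return std::string(buf.data(), static_cast<size_t>(n));
}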

Like, okay, mkdtemp(). It wants a char* template, like "/tmp/test_thingy.XXXXXXXX". It *reaches into* that space and alters it, changing the Xs to some random gunk that turned out to be unique - this is how you avoid /tmp race attacks. This is more annoying than it sounds.

You might think "oh I know, I'll use a string, then hand the .c_str() pointer from it to mkdtemp". But, no, c_str() gives you a *const* pointer, and mkdtemp is going to violate that const-ness. Your compiler will stop you from doing this. You then must either lie to it and do some nasty casting, or you decide that's not going to work and find another way.

This is why I have a bunch of stupid shims to deal with these scenarios. It's just that I never really shimmed read() before - the old hacky way of having a buffer on the stack was never a problem.
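
A mkdtemp() shim, for instance, can be as simple as this (a sketch; the real thing wants actual error reporting):

#include <stdlib.h>

#include <string>
#include <vector>

// Sketch: hand mkdtemp() writable storage, then copy the result back out.
std::string MakeTempDir(const std::string& tmpl) {
  std::vector<char> buf(tmpl.begin(), tmpl.end());
  buf.push_back('\0');             // mkdtemp wants a NUL-terminated string
  if (mkdtemp(buf.data()) == nullptr) return "";
  return std::string(buf.data());  // the Xs were rewritten in place
}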

As for "why is that buffer so big", well, I like to cut down on the number of syscalls required to inhale a file: a bigger buffer eats the whole thing in a single go. A tiny buffer would need to keep coming back until it finished. This sort of thing used to matter more when machines were much slower.

Besides, I hate spammy syscall situations. read() or write() with a single byte? Argh!

...

From another reader:

hey are you still running wg barebones on your mac? I just made the mistake of installing and got the 'cant uninstall' behavior with additional 'nothing happens when you dbl click on app icon' . just wanted to compare notes

Yep, I still am doing this on my personal machine and also those of family members (in my duties as "holiday sysadmin"), having removed the "app store" version a while back. I get feedback from random people who find that three year old post about once a quarter, and they all seem to have the same problems. I take that as a pretty strong signal to not try it again just yet.

I should note that if you're using Macports, upgrade to Sonoma, and then try to build wireguard-tools or several other packages, it'll fail due to various Apple-related stupidity. I'll do another post about that eventually.

...

Finally, a bunch of people wrote in about the "clang + C++ + original armv6 Raspberry Pi = unusable binaries" to say that it explained some oddities they had been seeing on their own stuff, so I think that turned out okay. That's really what I aim for in writing these posts: visibility into problems, and confirmation that you aren't alone in experiencing something. Sometimes it's technical, but other times it's about squishy meatspace stuff. Both are valid.

Smashing the stack for pain and misery

I need to remind people how easy it is to forget just one of the many gotchas of working on this ridiculous computer stuff. One missed nugget of data at a critical moment can leave you scratching your head and going "WTF" for longer than would otherwise seem reasonable.

Here's something that happened to me last week. I was working on a stupid little utility that runs on my machines and lets me keep tabs on what's going on with systemd. If it gets unhappy because any of the services have stopped running, then this thing will let me know about it. For the handful of systems I have to worry about, it gets the job done.

Now, since I'm in "holiday mode", I'm largely working on my laptop instead of sshing back to a Linux box somewhere else. This laptop is a Mac, so it's mostly compatible with what I'm doing. Obviously, it doesn't run systemd, but that wouldn't stop me from tidying up a tool in test mode. I was working on this thing, and noticed it started blowing up in strange places. Also, it was a really strange "bus error". To me, that says "binaries on NFS" or "unaligned access on some architectures". I'm not doing either sort of thing here.

gdb was not really an option at that moment for various annoying reasons so I resorted to "debug via printf" - putting little notes to say "I got here" and whatnot. They kept changing. I'd think I had it nailed down, and it would move!

Eventually, I got it down to something truly odd: it was blowing up in a worker thread, and it was the point where that thread started up and read in a config file from the disk. The line of code looked something like this, where I call into one of my own helper libraries:

auto raw = file::ReadFileToString(kDefaultConfigPath);

Okay, I said to myself, let's find out what's going on in that function and started sprinkling my "I got here" notes into there. One of those notes was at the very top of that function and just said "got into ReadFileToString". It never ran.

I removed the call to that function. It stopped crashing.

So, what's in that function that's so spooky? Well, it opens a file descriptor, does the usual sanity checks on it, and then creates a buffer that it'll pass to read()... and herein lies the problem:

  char buf[1048576];

Yep, just having that there was blowing the stack, and the bus error is how it manifested in that particular arrangement of function calls within the worker thread.

That's right, if you're already pressed for stack space and then enter a function with something like that, you might just explode. Here's a contrived example with an even bigger buffer to demonstrate it with just a single innocent-seeming function call:

mac$ cat bs.cc 
#include <stdio.h>

#include <memory>
#include <thread>

static void do_thing() {
  char buf[1048576 * 8];
  buf[0] = '\0';
}

int main(int argc, char** argv) {
  if (argc != 1) {
    printf("running in worker thread\n");

    auto worker = std::make_unique<std::thread>(&do_thing);
    worker->join();
    return 0;
  }

  printf("running in main\n");
  do_thing();
  return 0;
}

The fun part is that on a Mac, the flavor of error changes between "bus error" and "segmentation fault" just by shoveling it into a thread.

mac$ ./bs 
running in main
zsh: segmentation fault  ./bs
mac$ ./bs foo
running in worker thread
zsh: bus error  ./bs foo
mac$ 

Nice, right? Further complicating matters is that on a boring old x86_64 Linux box, it gets reported as a segmentation fault both ways.

linux$ ./bs
running in main
Segmentation fault
linux$ ./bs foo
running in worker thread
Segmentation fault

A simple twiddling of the ulimits will change the behavior ever so slightly:

linux$ ulimit -s unlimited
linux$ ./bs
running in main
linux$ ./bs foo
running in worker thread
Segmentation fault
linux$ 

Fun fun fun. Obviously, I need to rethink the way I manage my buffers.

Patching around a C++ crash with a little bit of Lua

[Photo: a Seattle Mariners mug with a note inside: "Celebrating the pursuit of EXCELLENCE in the face of repeated disappointment"]

Sometimes, inside a company, you find someone who's just so good at what they do and who has the fire burning inside them to always do the right thing. It's easy to fall into a spot where you just go to them first instead of chasing down the actual person holding the pager that week. At some point you have to tell yourself to at least *try* "going through channels" to give them a break.

But still, you wind up with some tremendous stories from when they came through and saved the day. I love collecting these stories and I periodically share them here. This is another one of those times.

Back up about a decade. There was something new happening where people were starting to get serious about compressing their HTTP responses to cut down on bandwidth and latency both: fewer packets = fewer ACKs = less waiting in general. You get the idea.

A new version of the app for one particular flavor of mobile device had just been built which could handle this particular flavor of compression. It was going out to alpha and then beta testers, so it wasn't full-scale yet. When it made a request, it included a HTTP header that said "hey, web server, I can handle the new stuff, so please use it when you talk back to me".

On my side of the world, we didn't know this right away. All we knew was that our web servers had started dying. It was one here, then one there, then some more over in this other spot, and a few more back in the first place, and it was slowly creeping up. This wasn't great.

We eventually figured out that it was crashing in this new compression code. It had been compiled into the web server's binary at some point earlier, and it obviously had a problem, but I don't think we had a good way to turn it off from our side. So, every time one of these new clients showed up with a request, their header switched on the new code for that response, and when it ran, the whole thing blew up.

When the web server hit the bad code, it not only killed the request from the alpha/beta app, but it also took down every other one that same machine was serving at that moment. Given that these systems could easily be doing dozens of requests simultaneously, this was no small thing! Lots of people started noticing.

That's when one of those amazing people I mentioned earlier stepped in. He knew how to wrangle the proxies which sat between the outside world and our web servers. It had a scripting language which could be used to apply certain transforms to the data passing through it without going through a whole recompile & redeploy process for the actual proxies.

What he did was quick and decisive: it was a rule to drop the "turn on the new compression" header on incoming HTTP requests. With those stripped from the request, the web server wouldn't go down the branch into the new (bad) code, and wouldn't explode. We stopped losing web servers, and we were now in a situation where the pressure was off and we could work on the actual crash problem.

I should mention that we were unable to just switch off the new feature in the clients. The way that clients found out what features to run in the first place was by talking to the web servers. They'd get an updated list of what to enable or disable, and would proceed that way. But, if the web server crashed every time they talked to it, they would never get an update.

That's why this little hack was so effective. It broke the cycle and let us regain control of the situation. Otherwise, as the app shipped out to more and more people, we would have had a very bad day as every query killed the web servers.

And yes, we do refer to such anomalies as a "query of death". They tend to be insidious, such that when they show up, they take down a whole multitenant node and all of the other requests too. Then they inevitably get retried, find another node and nuke that one too. Pretty soon, you have no servers left.

To those who were there even when they weren't on call, thank you.

clang now makes binaries an original Pi B+ can't run

I have a bunch of Raspberry Pi systems all over the place, goofy things that they are. They do dumb and annoying jobs in strange locations. I even have one of the older models, which is called just the B+. You can think of it as the "1B+" but apparently it was never officially branded the 1.

If you have one of these, or perhaps an original Pi Zero hanging around, you might find that C++ programs built with clang don't work any more. I ran into this as soon as I started trying to take binaries from my "build host" (a much faster Pi 4B) to run them on this original beast. It throws an illegal instruction.

This used to work in the old version (bullseye). It now breaks in the current one (bookworm). I figured, okay, maybe it's doing some optimization because it was built on the 4B. So, I went and did a build on the B+ natively. It also broke.

So I backed off another level to a much simpler reproduction case: just declare main() and return. That still broke.

Looking this up, there are a bunch of screwy dead-end forum posts where people go back and forth asserting this package is installed and that's making the compiler go stupid, or it's because they did the "lite" install vs. the "recommended" install, or who knows what.

I wanted to do better than that, so this afternoon I picked up a brand new SD card, blew the whole "desktop + recommended" OS image onto it, booted *that*, then installed clang, and...

raspberrypi:~/prog$ cat t.cc
#include <stdio.h>

int main() {
  return 0;
}
raspberrypi:~/prog$ clang++ -Wall -o t t.cc
raspberrypi:~/prog$ ./t
Illegal instruction

Awesome. It can compile something it can't even run. What's the bad instruction? gdb will answer that in a jiffy.

(gdb) disassemble
Dump of assembler code for function main:
   0x004005a4 <+0>:	sub	sp, sp, #4
=> 0x004005a8 <+4>:	movw	r0, #0

movw. That's not in armv6l, apparently. So yeah, this compiler is effectively cross-compiling for armv7 (or something) by default. That's not very useful.

You can work around this by grabbing the compiler by the lapels and saying "build for armv6, punk", and it will give you a working binary:

raspberrypi:~/prog$ clang++ --target=armv6-unknown-linux-gnueabihf -Wall -o t t.cc
raspberrypi:~/prog$ ./t
raspberrypi:~/prog$ 

How and why did it get to that point? I can only imagine it's some default that got bumped from version 11 to version 12, and somehow nobody noticed? I guess nobody still runs these old things anywhere?

So weird.

That time Verisign typo-squatted all of .com and .net

A little over 20 years ago, Verisign did something mighty evil: they effectively typosquatted every single unregistered domain in the .com and .net top-level domains. They could do this because they controlled those from the registry side of things, and it was trivial to slam something that would make it resolve.

Reactions from people like me who had systems to run and spam to block were swift and universally negative due to all of the collateral damage it caused. Here's one situation it created: sometimes you'd have a user who thought they were being clever, and they'd put something like "nospam" in the from address in their e-mail client. Thus, they'd try to send mail as luser@nospam-example.com instead of just @example.com.

Before Verisign pulled that crap, the mail servers of the day would have just rejected it as an invalid domain that didn't resolve (in DNS). Once it was online, it WOULD resolve, and so the mail system would accept it. Utter chaos.

Or, how about all of the random people who would mistype something? Instead of getting a failure to resolve error from their browser and/or organization's web proxy, it would just pop out to that stupid Verisign site.

It should surprise nobody that a bunch of us sour sysadmin types did things about it rather quickly over the next day or so. I still have the snippet of cruft to drop into a sendmail.cf in my notes from back then. I got this from Usenet, apparently:

Local_check_mail
R$*                     $: $>canonify $1
R$*<@$*.>               $: $1<@$2> strip the trailing . if present
R$*<@$+>                $: $(verisign $2 $)
R64.94.110.11           $#error $: "550 Real domain name required for sender address"

Incidentally, if any of you can still parse that syntax in your heads just by reading it, I'm so sorry. Take it from me - that ability does go away eventually, but you have to stop supplying it with new data.

Anyway, what that did was to exploit the limitless programmability of sendmail's config language to resolve the supplied domain. If it came back to 64.94.110.11, it would reject it right then and there. That was the IP address they were using for their little marketdroid-fueled fever dream, and I bet there are *still* systems blocking it 20 years later (and they probably have no idea why).

Over the next couple of days, a bunch of other things happened. ISC wrote a "delegation-only" feature for BIND (aka named, the occasional remote sudo implementation). You could use this to say that the "com." or "net." zones were only allowed to provide delegation to other zones. That is, within the bailiwick of "com.", it could say that "example.com." has a nameserver at <foo> with IP address <bar>, but that's it. It couldn't come right out and say that something had an A record outright.

Now, this worked great, but it wouldn't have been terribly difficult for them to sidestep it. They could have had it delegate all of those things down to some other level which then would have had a blanket answer for any incoming questions. Fortunately, this did not happen.

This whole bout of stupidity lasted about three weeks, and then it disappeared. They've never done it again, but plenty of other providers of recursive resolver action have pulled it since then as a matter of course. Screw up the way DNS works in order to fellate the advertisers? Brilliant!

Remember this? That was 2019.

I find it strangely fulfilling that this event has garnered a Wikipedia article in the years since. Hopefully people will never forget what happened back in 2003 with the DNS.

My rants about TP-Link Omada networking products

Perhaps you've been running Ubiquiti stuff for a while, and you've been disappointed by their stock issues, their goofy software issues, and the general lack of quality all the way around. Maybe you turned your eyes to the TP-Link Omada ecosystem. I'm here to warn you that the grass is not greener on that side of the fence. It may in fact be spray-painted.

First, some context. I'm the family sysadmin - not by choice, but because nobody else would do it. When I visit family, I have to fix their stuff. There are some gearhead types and I do my best to make them happy. Various ISPs are starting to sell services that are well above 1 Gbps. This is typically symmetric fiber stuff.

That's the situation with one of the sites I support, and their existing Ubiquiti stuff from years gone by became a bottleneck once they had that installed. Obviously, they want to get this greater-than-gig performance wherever possible. That means a derpy Windoze box or two, and that brought on a whole hellscape of dealing with resource conflicts the likes of which I hadn't seen in 20 years.

But no, this isn't about that. This is about TP-Link. I was pointed at this ecosystem as a possible escape from the clowntown that is Ubiquiti, so that's what I bought this time around: one of their gateway boxes (calling it a router would be too kind), a switch, and a hardware controller for local control - none of that cloud crap here, thanks.

It's been a new bit of stupid every week with this stuff. First of all, the switch is really best suited for a closet at a business, not anywhere in someone's home. It has dinky little fans that run pretty hard all the time, with all the noise that entails. People who replace them invariably get fan errors and then the thing eats itself within a year. (Maybe the switches fail by themselves either way - the jury is still out on that.)

The latest build of their controller software flat out does not work on Safari. I mean, sure, it loads up, and then the browser starts doing something indicative of a horrible big-O blowup factor somewhere in their Javascript. It'll hang for a minute at a time any time you move the pointer around. Or, it'll prompt you to download a file CONSTANTLY. Like, WTF kind of content-type brain damage are you doing? It doesn't happen in Firefox or Chrome, apparently, but it still goes to show that they gave zero fucks about even TRYING TO LOG IN from Safari when they were developing it. You know, the browser that every Mac and iOS device ships with?

So, you have to roll back the controller to get out of this mess. Doing that wipes your config. Fortunately for me, I discovered this during my early shakedown testing at my own residence before hauling it out to the site, and there was no actual config to lose.

Next up, their NAT implementation is just plain obnoxious. Typically with this kind of stuff, if you fire a packet from the inside to the outside, the source address gets changed from RFC 1918 or whatever you're using internally to whatever you have on the outside. That much works. What also happens here on the TP-Link ecosystem is that they mangle your source port, too. This affects both UDP and TCP.

Why does this matter? It makes NAT hole-chopping tricks much harder to pull off. Normally, you can do such fun things as configuring WireGuard to punch through from either side by lining up the ports exactly. This will let two sites connect to each other without going through a third fixed spot. This is very handy if that third spot goes down and you need an OMFG backdoor into your networks!
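
Here's roughly what lining up the ports looks like in a WireGuard config - keys elided, addresses and hostnames invented for the example. Site A pins its ListenPort and points at site B's fixed port; site B mirrors it:

[Interface]
PrivateKey = (site A's private key)
ListenPort = 51820

[Peer]
PublicKey = (site B's public key)
Endpoint = site-b.example.com:51820
AllowedIPs = 10.99.0.2/32
PersistentKeepalive = 25

As long as both NAT boxes leave the source port alone, each side's keepalives hold the other side's mapping open, and neither end needs a third party to introduce them.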

This does not work if the source ports change. At that point, you have to resort to all kinds of nasty birthday paradox type stuff to figure it out, and that requires Actual Work to pull it off and keep it working. Me, I don't want to put Tailscale everywhere. But I digress.

Last week, something very bad happened that I haven't managed to troubleshoot since I'm remote and can only do limited things from here. HomeKit stuff stopped working. By that, I mean that viewing the home from off the local wifi said the usual "no hubs online" thing. But, stranger still, HomeKit *clients* on that wifi also couldn't connect *outward* to other spots! They, too, got the same "no hubs online" thing for those other HomeKit locations... even when those locations were actually fine and worked for other people.

The only commonality was crossing that Omada-powered network. I had some luck in this case since there's a Mac out there which I can hop into and beat into submission, and beat I did. I figured maybe it was something goofy about the routing to Apple's cloud stuff, and started shunting all of the traffic through a tunnel. Nothing helped ... until I also switched DNS resolution on that Mac to something I controlled instead of using whatever resolver is inside the TP-Link gateway box.

Once I did that, it started working again. Even after I turned off the tunneling, it kept going. This was enough for me. I stood up unbound on a couple of Raspberry Pis out there and changed the DHCP config to make sure clients would resolve things through them instead of the ER8411 gateway. It took a while, but eventually, everything stopped being stupid.
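
The unbound side of that is tiny. A minimal sketch of the kind of config I mean, reusing this site's address range (your interface and netblock will differ):

server:
    interface: 0.0.0.0
    access-control: 172.25.0.0/16 allow
    # recurse from the roots ourselves; nothing gets forwarded
    # through the gateway box's resolver

With the DHCP change on top of that, clients never touch the gateway's resolver again.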

Now, big caveat here: I don't know 100% that it was the resolver in the thing. I wasn't on site, and could only do so much without kicking myself out of the network, since my access came in through those very devices. Also, my troubleshooting abilities are limited with this crap for yet another reason I'll get to later.

Then there's what happened this morning. One of my Pis behind this setup decided it wasn't going to run one of its WireGuard links. The other link on the same interface (going to another external host) was fine. The other link on the other interface was fine. The other Pi's two links were also fine.

It was just this one particular association that wasn't working. So, into tcpdump I went yet again, looking at it from both sides of the link. The exchange I saw from inside looked like this over and over:

their_internal_ip.AAAAA -> my_external_box.BBBBB: udp data
(no reply)

But, from the outside world, it looked like this:

their_external_ip.CCCCC -> my_external_box.BBBBB: udp data
my_external_box.BBBBB -> their_external_ip.CCCCC: udp data
their_external_ip -> my_external_box: ICMP port CCCCC unreachable

So yeah, even though it had JUST sent traffic to me from that port, upon reply, the gateway box was rejecting it. This to me says "really terrible IP connection tracking setting and/or implementation that dropped the association and is somehow not picking it back up".

This WG link has a keepalive on both ends. There's no excuse for this. It should be established in the firewall as soon as a packet goes out, as one did above. But the ICMP error indicates otherwise.

Note that the port-unreachable error is not coming from the Pi itself. The Pi was only sending actual traffic and had no idea why it wasn't getting any responses. WG won't switch source ports by itself, so it just keeps smacking its head into the wall ... over and over and over.

And this brings me to the final point of frustration: I wanted to ssh into the damn gateway to see what they were doing to screw things up so badly. It took a while to find the knob to enable ssh, and once that was on, I found the ultimate insult: it's a completely neutered interface. You can't do anything useful. It's busybox, the "ip" utility, and something that apparently lets you point it at a controller for when the adoption process doesn't work.

su? sudo? Forget about it. You don't even have /proc in there - so no ps, no w. You can't run dmesg because it doesn't exist (and they probably lock down the kernel ring buffer anyway). You are a luser, and you will never be able to do anything useful from this setup.

When pressed, tech support tells people that such things are unsupported when using the controller - that is, the dumb pointy clicky web-based UI that takes the setup and pushes it out to the devices. You know, the one that broke on Safari in the latest version. They're locking you out _on purpose_.

Finally, I haven't run into this one yet since the ISP for this site is still in the dark ages in terms of providing access to the ENTIRE Internet, but it sounds like they don't do any sort of IPv6 firewalling. So, if your ISP switches that on and you suddenly get an allocation, look out world! It's the wild west on your network!

So, let's recap the suckiness here.

0. The switch is stupidly noisy.

1. Their latest version of the controller just does not work in Safari.

2. You can't easily roll back the controller when it does suck. You'd better save the config from the old version before you upgrade, just in case you ever have to go back. And, if you never ran that particular old version, you're doubly screwed.

3. Their NAT implementation mangles source ports needlessly. Sure, some scenarios call for it. They do it constantly.

4. *Something* broke HomeKit comms really badly, and switching recursive DNS services for clients away from whatever the gateway box provides fixed it. It's probably some terrible DNS forwarder implementation but I have no way to be sure at this point.

5. The NAT apparently dropped an assoc this morning and never put it back. I couldn't get my tunnel going until I restarted it to pick a new source port on the client. Completely ridiculous.

6. Forget about ssh to troubleshoot things. The hood is welded shut. You will never know what's really going on when one of the other items decides to rear up and bite you in a sensitive place.

7. They apparently have no IPv6 firewalling based on what other people have reported in various places. (This is the only one I haven't actually encountered myself... yet.)

So now what? I'm honestly looking at returning to what I was doing in the 90s: building my own Linux boxes with enough horsepower to handle the networks in question. It worked then and it'll work again now. Things will still break, but at least I'll be able to use my actual experience to do something useful about it. Right now, I can do nothing. My hands are tied.

Why did I think these clowns had any idea what to do? I've been both inside and outside of this world, and it's pretty clear that they do not. Just look at how awful these products really are.

Asahi Linux folks are doing us a solid with WPA3 fixes

Thanks to a bit of anonymous feedback this morning, I have some good news about the Raspberry Pi WPA3 thing. Apparently the good folks over at the Asahi Linux project have taken up the cause of fixing the upstream kernel situation. It seems this will happen by throwing out some existing implementation that didn't work anyway, and perhaps my posts were confirmation that they were in fact crap.

This is great! Hopefully this will lower the barriers for regular people who just want things to work and don't want to patch kernels and drivers and maintain their own forks of things.

So, yay for that person cleaning up the mess of some dumb big companies. Thanks for pushing on this and not letting the suck stop you.

...

For the record, I've tried this on a 3B, 3B+, 4B, and now a 5B. Did I buy the 5 expecting the wifi to suck? I sure did. Did I expect to write a post bagging on it? You know it. I did my research to see if anyone else had mentioned it, and when that came up empty, I hit the store and picked one up, then came home, tested it, and wrote the post.

While waiting for this fix to come down the pipe to be usable on your systems, there are any number of alternatives. Unfortunately, they all amount to a barnacle that consumes one of your USB ports, but they do work. They tend to actually behave better with tools like Kismet, they do WPA3, and they don't make the kernel panic when you look at them funny!

If this is you, hit the USB-Wifi main menu and start digging around. Note in particular the "plug and play" list. I grabbed some weird $40 Alfa thing I had never heard of before based on a recommendation from the list. It worked great for sniffing things and generally screwing around. It also ran WPA3 as a client just fine.

This is the Linux experience I remember from the 90s: poring over compatibility lists and making sure you buy the right thing every time. That's why it's so vexing that the Pi people would keep shipping this thing in this state. You don't want your customers to keep buying these wifi barnacles, do you?


January 24, 2024: This post has an update.

Still no love for WPA3 on the Raspberry Pi 5

About a year ago, I wrote about trying to make WPA3 (wireless security) work on a Raspberry Pi 4, among other things. It didn't work then and it still doesn't work now.

In the past couple of weeks, they released the Pi 5, and it's been making the rounds through the usual people, but somehow, nobody's talking about whether it'll do WPA3 or not. So, I'll break the silence and save everyone a lot of work: it still has the same CYW43455 wifi+bluetooth chip as the Pi 4, so it has the same limitations: no WPA3 support, at least, not right now. Maybe some day, someone will do something about the driver situation, but given that nothing has changed in almost a year since my last post, I'm not going to hold my breath.

This Broadcom / Infineon / Cypress wifi situation has other ramifications beyond just the Raspberry Pi ecosystem. Let's say you are like me and you bought one of the "early 2015" Macbook Pros in 2017 since the then-current ones had the terrible new keyboard and still hadn't figured out the whole USB-C thing yet. You didn't get Ventura (13) and are stuck on Monterey (12), never mind Sonoma (14).

So, maybe you thought "I know, I'll install Linux on this thing since it still has some life left in it". Once you do that, you will discover that you are in the same crappy wifi situation. You won't be able to join a WPA3 network in Linux, either. The hardware is clearly able to support it since it worked as a macOS install, but it's just not going to happen on Linux. It's the same sort of software problem.

None of this is news, but the updates about this mess seem to not get enough visibility. Note: "seriously unmaintained and years behind on features and firmware integration". Wonderful.

This puts the Raspberry Pi squarely in the same bucket as "random Internet of Shit devices" when it comes to wifi compatibility. You are going to have to keep a terrible auxiliary wireless network around for a very long time in order to support them as-is.

Going forward, you can just ignore the built-in hardware and buy a dumb little USB wifi adapter which is actually supported by regular kernels. Then you just plug it in, configure things, and go on with your life. (Or, you know, pull actual cabling to it, and live life with the stability of hardwired Ethernet.)

It's kind of amazing that this situation persists, given how the Pi is supposedly intended for entry-level people who need a low-cost platform to learn about stuff. What happens when they encounter a WPA3-only network? They do exist, and they will only become more numerous over time.


November 7, 2023: This post has an update.

getaddrinfo() on glibc calls getenv(), oh boy

There are more than a few bear traps in the larger Unix environment that continue to catch people off-guard. One of the perennial favorites is thread safety, particularly as it applies to the environment manipulation functions under glibc. The usual warning is that if you run multiple threads, you'd best not call setenv, because if someone else calls getenv, you run a decent chance of segfaulting.

The last time I talked about this was 2017 and I said something goofy like "see you in 2022" at the bottom. Well, it's a little late, but it's come up again, and this time it's biting people who aren't even using C or C++!

This one is in Go!

Last time, I said that sometimes people run afoul of this by using mktime() which calls getenv()... and then sometimes pick *that* up by using libzip. Here, it's a little different. getaddrinfo() calls getenv(). Did you know that? Before a few minutes ago, I sure didn't! Check out the resolv.conf(5) man page - it looks at LOCALDOMAIN and RES_OPTIONS.

getaddrinfo() is pretty much required if you want to connect to anything beyond mere legacy IPv4 gunk, so it's not like you can avoid it. You're probably going to call it quite often if you are opening connections over the IPv6 Internet.
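
To make the failure mode concrete, here's a hypothetical repro sketch - glibc assumed, and not guaranteed to blow up on any given run, but these are the two racing parties:

#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

// Thread 1: mutates the environment. setenv()/unsetenv() may
// reallocate and move the environ array, with no locking at all.
static void *mutator(void *arg) {
  (void)arg;
  for (;;) {
    setenv("RES_OPTIONS", "ndots:1", 1);
    unsetenv("RES_OPTIONS");
  }
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, mutator, NULL);

  // Thread 2: getaddrinfo() reads LOCALDOMAIN and RES_OPTIONS via
  // getenv(), and can chase a pointer into freed memory if environ
  // moved underneath it.
  for (;;) {
    struct addrinfo *res = NULL;
    if (getaddrinfo("localhost", "80", NULL, &res) == 0)
      freeaddrinfo(res);
  }
}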

If you're on Linux, and you're using glibc, you're probably a passenger on this boat. Try not to drill any more holes.

Thanks to Evan for the tip on this one (and good luck with the fix).

Administrivia: new page/feed generator is now live

Back in March, I wrote about how the HTML generation worked for all of these posts. In short, it was mostly "loose approximations of HTML via printf", and it was terrible. It generally worked for the past 12 years, but I knew how wrong it was.

One thing I didn't mention in that post was just how bad the feed generation had become. I was doing a CDATA thing and was just spewing out the HTML inside of that. In theory, if I had put a "]]>" in a post, it probably would have broken the entire feed.

Then there are the finer points of the metadata for the feed. It was using an id of "tag:rachelbythebay.com,writing-2011" which I could have sworn was fine at the time I picked it, but which turns out to actually be illegal for that scheme. I fixed that for both the feed and the entries themselves, so if you see duplicates of the last 100 posts, or the entire feed shows up somewhere else, that might be why.

Those and many more things caused the w3c Atom validator to scream *quite* loudly about it being broken. A lot of people sent me feedback about this over the past few months.

That's all gone. A few minutes ago, I threw the last switch to finally cut over the entirety of the /w/ files to the new stuff. This meant that every single index.html has been regenerated. Quite a few corrections have been applied at the same time. It took me a very long time to go through all of these posts and convert my raw HTML shenanigans into meaningful commands that will be parsed by the generator.

Every view of things now actually makes sense. I can now write <foo> as an example in a post and it will come out escaped properly on the output side. I can put an & in the post without having to literally type in &amp;. Yes, I'd been having to manually do &lt; and &gt; and all of this... if I remembered. If not, well, there'd be a "live" tag hanging out in the post!

Or, there'd be a broken tag. Last week's post about ASCII protocol buffers and config files actually had a "<pre" without a ">" in it, and then it just went into the contents. I bet you didn't see "syntax = proto2" in that thing as a result, but trust me, it was there!

There are a bunch of other little stylistic changes in here. The feed icon in the banner of both the top index and the individual posts no longer links to the feed itself, but rather a page that explains what to do next. This is a cheesy way to sidestep the "someone clicked on the feed in their browser and used up their unconditional request" thing for a little bit.

Anyone who was using a phone or other smaller device probably noticed that if you loaded a sufficiently old post, it wouldn't quite fit the screen properly whereas newer ones would. This is because I had added a little "viewport" magic in the headers at some point, but had only applied it to the template file (!), and never rebuilt all of the old posts. This also meant the post footers were all slightly different, depending on when it was last rebuilt. Now all of the posts have that viewport thing.
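
For anyone wondering, the "viewport magic" is the usual one-liner, something like:

<meta name="viewport" content="width=device-width, initial-scale=1">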

I also adjusted the line-height based on some feedback from at least one reader who commented that it would make it easier to read. I find it hard to argue with that, and didn't see anything bad about it, so that's in there too.

You should also notice that preformatted blocks now sport a different background color and a border to set them apart from the rest of the post.

There are probably some other things I've forgotten, too.

I'm sure there are going to be some anomalies, so if you see something that seems broken to you, go ahead and fire off some feedback. I do appreciate it.

ASCII protocol buffers as config files

While I don't go on the Orange Site any more, I still make enough trips through the larger space of similar sites to get some idea of what people are talking about. Last week, the topic of interest seemed to be YAML and how evil it is. I can't argue with that. Every time I've crossed paths with it, I've been irritated by both it and whoever decided to use it for their stuff.

The discussions invariably start talking about alternatives, and frequently end up on JSON. This is unfortunate.

I've mentioned this before in passing, but have never given it a whole post. Today, it graduates to having a whole post about the topic.

The topic is: ASCII-form protocol buffers used as config files.

This was a tip given to me something like 17 years ago when I was "on the inside", and it's turned out very well. Protocol buffers have a canonical ASCII representation, and it accepts comments, too! You get the benefits of not having to write a scanner or lexer combined with a system in which everything is explicitly specified, right down to the data types.

Here's an example of such a file:

# **** Contains auth data: must be thermo:thermo 0660 or better ****

db_conninfo: "host=localhost dbname=foo user=xyz_role password= ...

# barn (hardwired)
server_info {
  host: "172.25.161.10"
  port: "18099"
}

# barn (backup wireless on IoS network)
server_info {
  host: "172.25.225.10"
  port: "18099"
}

# office (broken 20230825)
# server_info {
#   host: "172.25.161.17"
#   port: "18099"
# }

sensor_location {
  name: "loft"
  model: "Acurite-Tower"
  id: "1563"
  channel: "A"
}

sensor_location {
  name: "entry"
  model: "Acurite-Tower"
  id: "2375"
  channel: "B"
}

There. That's not terrible, right? It has a bunch of common stuff that gets repeated as needed for my different servers and sensors. There's also a string that gets handed to Postgres to connect to the database. And yes, notice the comments everywhere.

Over in protobuf-land, this is what the .proto file looks like for that config format:

syntax = "proto2";

package thermo;

message LoggerConfig {
  message ServerInfo {
    required string host = 1;         // 192.168.31.67
    required string port = 2;         // 18099
  }

  message SensorLocation {
    required string name = 1;             // room
    required string model = 2;            // Acurite-Tower
    required string id = 3;               // 1015
    required string channel = 4;          // C
  }

  required string db_conninfo = 1;

  repeated ServerInfo server_info = 2;
  repeated SensorLocation sensor_location = 3;
}

There's one important bit here: I'm using "required" since this is a config file format and NOT something that will be passed around over the network. It lets me cheat on the field presence checks, and this is the one case where it's acceptable to me.

If you're using protobuf for anything that gets handed around to something else (RPC, files that get written by the program, ...), whether across space *or time* (i.e., future instances of yourself), use optional and explicitly test for the presence of fields you need in your own code. You have been warned.

How does the program use it? First, it reads the entire config file into a single string. Then it creates a LoggerConfig (the outermost message) and tells the TextFormat flavor of protobuf to ParseFromString into that new message. If that returns true, then we're in business.
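
In code, that whole dance is only a few lines. A C++ sketch - the generated header name and the function wrapper are my assumptions, not a quote of the actual program:

#include <fstream>
#include <sstream>
#include <string>

#include <google/protobuf/text_format.h>

#include "thermo.pb.h"  // generated from the .proto above (name assumed)

// Slurp the whole file into a string, then let TextFormat parse it.
bool LoadConfig(const std::string& path, thermo::LoggerConfig* config) {
  std::ifstream in(path);
  if (!in) return false;

  std::stringstream buf;
  buf << in.rdbuf();  // the entire config file as one string

  return google::protobuf::TextFormat::ParseFromString(buf.str(), config);
}

If that returns false, bail out; running with a half-parsed config helps nobody.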

I can now do things like hand config.db_conninfo() to Postgres or iterate over config.server_info() or config.sensor_location() to figure out who to talk to and what sensors to care about.

Is it perfect? Definitely not. It's software, which means it will never truly stop sucking, like all other software. It's a dependency that will now follow you, your code, and your binaries around like an albatross. It's yet another shared library that has to be installed wherever you want to run.

But, hey, if you're already paying the price of using protobuf in your projects for some other reason, then why not use it for config storage, too?

I screwed something up in that last post about Hue

In short: I should have looked at my notes instead of relying purely on my memory of a random event from four years ago.

Right, so, the other day I wrote a post eviscerating the Philips (Signify) Hue situation, in which they are heading full steam into enshittification. I said that I didn't want to use Home Assistant because of Javascript and a "curl | sh" attitude.

Yeah, that's where I screwed up. That's not them. That's actually Homebridge, aka homebridge.io.

People have been commenting that HA is Python, not JS, or something like that. I had to go back to my notes from January 2019 to set this straight, and, well, they go like this:

I was trying to find a way to run some existing "real" security cameras without going full-on cloud mayhem. The users in question wanted to see it in the HomeKit ecosystem, so I went looking for solutions of that sort. The idea was to get this thing running, document the protocol, and then figure out if I could do it some other way.

So I end up on their wiki for installing this on a Raspberry Pi. At that point, I had one just sitting around collecting dust, and figured "what's the harm in trying". First thing up, they wanted me to "curl some-url | sudo bash".

Hell no, I'm not doing that. There are so many things wrong with that philosophy. The whole point of cutting actual releases is that you get people to cluster around a handful of known "artifacts" (you know, tarballs and the like), and then you can work up some kind of reputation based on that. If that actual release ends up in some distribution like Debian, you can be sure that exact version is being seen by a fair number of people.

curl | sh basically says "I don't give a damn" and "give me whatever you want" at the same time.

I figured I could at least grab the script, read it, and parse it myself, then run the commands by hand. This meant I had to add their apt repository. Ugh. First up? "apt-get install -y nodejs" ... oh boy.

But wait, no, then it went on from there.

I tried "npm install -g homebridge" but that wasn't happening. It wouldn't go until I added "--unsafe-perm". Oh, gee, that's not sketchy *at all*.

At this point I was glad this was all happening on what was effectively a throwaway machine. It did get installed, and started up, then displayed a QR code, and once past a few warnings, it did in fact show up in Homekit.

Of course, without plugins, it wouldn't do anything, so that meant going back for MORE node stuff. I found 72 pages of them on npmjs.com.

I picked something really stupid to export a temperature value as a test. I put a config stanza in config.json to try to activate it, and nope, it started dying. So I took the config-sample.json and copied that in its place, and that much worked, and told me what I needed to do: the blob of JSON crap from the plugin's page is supposed to be an entry in the accessories array of the config.json. That was not at all obvious.

I didn't want to do much more with this, and that's about where it ended.

So, yeah, bagging on Home Assistant for being JS and making you install it with sketchy curl pipelines? That was a mistake. I got it mixed up in my head with Homebridge based on something I did four years ago.

I loaded up the HA page on github, and oh hey, Python.

Mmm, yeah. Righto. Okay then.

How about them Knicks?

The Philips Hue ecosystem is collapsing into stupidity

If you've gotten into the home automation thing in the past few years, it's possible you set up some Philips Hue devices along the way. This was an ecosystem which had a bunch of bulbs, switches, outlets and a hub that spoke Zigbee on one side and Ethernet on the other. It was pretty much no-nonsense, never dropped commands, and just sat there and worked. Also, it integrated with the Apple Homekit ecosystem perfectly.

Unfortunately, the idiot C-suite phenomenon has happened here too, and they have been slowly walking down the road to full-on enshittification. I figured something was up a few years ago when their iOS app would block entry until you pushed an upgrade to the hub box. That kind of behavior would never fly with any product team that gives a damn about their users - want to control something, so you start up the app? Forget it, we are making you placate us first! How is that user-focused, you ask? It isn't.

Their latest round of stupidity pops up a new EULA and forces you to take it or, again, you can't access your stuff. But that's just more unenforceable garbage, so who cares, right? Well, it's getting worse.

It seems they are planning on dropping an update which will force you to log in. Yep, no longer will your stuff Just Work across the local network. Now it will have yet another garbage "cloud" "integration" involved, and they certainly will find a way to make things suck even worse for you.

If you ever saw the South Park episode where they try to get the cable company to do something on their behalf and the cable company people just touch themselves inappropriately upon hearing the lamentations of their customers, well, I suspect that's what's going on here. The management of these places are fundamentally sadists, and they are going to auger all of these things into the ground to make their short-term money before flying the coop for the next big thing they can destroy.

What can you do about it? Before you say "Home Assistant" (make that Homebridge), let me stop you right there. Javascript plus a "curl | sudo sh" attitude to life equals "yeah no, I am never touching this thing".

Instead, I have a simpler workaround, assuming you just have lights and "smart outlets" in your life. Get a hold of an Ikea Dirigera hub. Then delete the units from the Hue Hub and add them to the Ikea side of things. It'll run them just fine, and will also export them to HomeKit so that much will keep working as well.

I will warn you that Ikea isn't perfect here, either. They won't plumb through the Hue light/motion/temp sensors or the remote controllers to HomeKit. This means you lose any motion sensor data, the light level, and the temperature of that room. You also lose the ability to do custom behaviors with those buttons, like having one turn something on and then automatically switch it off a few minutes later. (Don't laugh - this is perfect for making kitchen appliances less sketchy when unattended.)

Also, there's no guarantee that Ikea won't hop on the train to sketchville and start screwing over their users as well.

My hope is that someone with good taste and some sensibility in terms of their technology choices will make something that does Zigbee on one side, Homekit on the other, and is at least as flexible as the Hue setup that existed originally. Until then, it's going to be yet another shit show.

And people wonder why I don't trust these things.


October 3, 2023: This post has an update.

Expressing my laziness in concrete ways

I'm a lazy programmer sometimes. Let me tell you a story about something I wrote earlier this year that's not exactly the finest set of programs ever produced. It's about the whole feed rate limiting thing on my web server.

I've been getting reports of people who run into the block even when they didn't do anything wrong. They didn't start up something that polled every two seconds and pulled the full ~500K feed every single time, for example (and yes, this has happened at least once).

No, the problem goes like this - someone sees the little orange feed icon up there (on the web view, that is) and clicks on it and gets a screenful of XML. My server also goes "okay, you just got the feed". Then they take the URL, hand it over to their feed reader, and it reaches out and tries to make the same request. My server says "hey wait a minute you clown, you JUST GOT IT", and rejects it with a 429.

See the problem? It can't tell the difference between a pairing of a one-time human request + their feed reader's startup sequence and someone who's actively hammering the thing. It's because the thing is relatively stupid. It knows about IP addresses, request types (conditional or not), and elapsed times. That's it.

In order to support some kind of "you're going to make a handful of closely-spaced unconditional requests at startup but will be good thereafter" leniency, it would have to actually have some thought put into it. Now you're talking about more of a "token bucket" system, or something else of that sort where it does some time-based accounting and allows for "bursty" behavior at first. That means tracking a lot more than just "the last time you got a full copy of the feed".
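
The core arithmetic, at least, is textbook stuff - something like this sketch, which is emphatically not what's running here (the real work is all the per-IP state tracking and eviction around it):

#include <algorithm>  // std::min
#include <ctime>      // time_t

// Textbook token bucket: allow a burst up front, then refill at a
// steady rate. You'd keep one of these per client IP, e.g.
// {5.0, 5.0, 1.0 / 1800, now} = five quick fetches, then one per
// half hour.
struct TokenBucket {
  double tokens;      // current balance
  double burst;       // capacity: how many startup fetches to forgive
  double per_second;  // steady-state refill rate
  time_t last;        // when we last refilled

  bool Allow(time_t now) {
    tokens = std::min(burst, tokens + (now - last) * per_second);
    last = now;
    if (tokens < 1.0) return false;  // out of budget: 429 time
    tokens -= 1.0;                   // spend one token on this fetch
    return true;
  }
};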

But you know what? That's work. It's not fun, it's not interesting, and it doesn't do me any favors besides avoiding receiving feedback messages from confused users of feed readers. So, I've been lazy, and I haven't done it. I've instead done a bunch of other things which also had to be done and had slightly better contexts.

I'll admit something else: I don't have a ready solution to this. I've never written a burst-handling inflow system before. It would be different if I could just reach back into my head and go "oh yeah, this is just one of those things from XYZ project". But nope, this time there's nothing in the past to "borrow" in the present.

Also, this feels more like a "moving average" type of problem, which then means *actual math*, and that's just not my bag, normally. So, I find reasons to do something else. Repeat as necessary.

Again, most feed readers and their users are doing just fine. This is something I have to do in order to deal with the pathological cases who are small in number but large in impact. I suspect that just a handful of them take up way more resources than all of the normal, happy, good people put together.

As with so many technologies, they would all be unnecessary if not for the people who are causing the problems. It's why some of us get wistful for the "old days" when the net was far smaller and the amount of bad behavior was accordingly tiny - eternal September and all that.

Do I want to write rate-limiters? Hell no. I'd rather do anything else.

The customer stuck due to a hurricane who needed ssh

One problem with working in a customer support environment is that you tend to lose track of just how many tasks you've completed. After a few hours, most of them get pretty fuzzy, and by the end of the week, only the most notable ones stand out. A month later, it's even worse than that. This is just how it goes when there's so much quantity passing by.

This is why I tried to take notes about a handful of them as they happened. After a certain point, the memories start losing "cohesion" (whatever) and then it might as well be "fiction inspired by real life".

A fair number of my posts are sourced from these notes. It's how I can still give some details all these years later without making them up.

Here's something that came in one night almost 20 years ago while working web hosting tech support.

A customer wrote in. They opened an "emergency: emergency" ticket, which is usually reserved for "OMFG my server is down please fix0r" type events. It actually had a HTML blink tag baked into the very string so it would blink in our browsers. It was hard to miss.

It was a Monday night. What they said, more or less: "Three things. I have a Tuesday deadline. I'm stuck in (some airport) because of the weather problems from Hurricane Jeanne in Atlanta. I can't connect to port 22 because the wireless in the airport seems to firewall it off."

"So, if not for that, I wouldn't call this 'emergency'. Also, I can't get to webmin to add another port myself. So, can you open up another sshd on port NNNN (since I know that gets through) so I can get to the machine?"

They ended this with a "Thank you" with a bunch of exclamation points and even a 1. (Whether they were trying to be KIBO or B1FF, I may never know.)

They opened this ticket at 7:40. About five minutes later, one of our frontline responders saw it in the queue (probably noticed the *blinking*), mentioned it out loud, and after a short discussion assigned it to one of the people on the floor.

At 7:50, we responded, stating that some iptables magic (shown in the ticket) had been done to let sshd answer on port NNNN in addition to port 22. Also, there was a note added to clarify that this was made persistent, such that it would persist across reboots. The customer was then asked to try connecting and to let us know if that didn't work out.

Why iptables instead of a second ssh daemon? It was way faster, for one thing, and time was of the essence. You could run the two commands: one to add the rule, and one to make it persistent, and then the customer is good to go. Standing up a second sshd instance on a separate port back in those days would have meant wrangling init scripts to make a second version that points at a slightly different config file. Also, it would create something of a maintenance issue down the road as that forked config would quickly become forgotten.
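
Reconstructing from memory of that era's RHEL boxes, the two commands would have been something along these lines, with NNNN standing in for the customer's port as above:

iptables -t nat -I PREROUTING -p tcp --dport NNNN -j REDIRECT --to-ports 22
service iptables save

The first redirects anything arriving on port NNNN to the local sshd already listening on 22; the second writes the running rules out so they survive a reboot.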

Sure, someone could have run "sshd -p NNNN", but then they'd have to make sure it kept running, and if it got whacked somehow (reboot?), the customer would be screwed again with their deadline looming.

Also, in terms of cleanup, the customer could just flip the -I (insert) in the iptables command to -D (delete) and save it to make it disappear for good later. Tidying the second-sshd thing would have been more fiddly.

In any case, the customer came back a few minutes later, thanked us for the work, and promised to clean it up when they were clear of the problem. We didn't hear back, so things apparently worked out.

I hope they made their deadline.

Memories of a really goofy phone from the late 80s

I had this really bizarre telephone for some years in the 90s and 2000s. While mine is long gone now, I figured I'd talk about it a little to establish that yes, this thing did exist, and to also hopefully inspire some Youtube types to find one and dissect it in a video.

It was called the FV 1000, dubbed a "Freedom Phone" model by Southwestern Bell, and it was a giant plastic piece of awful. It had certainly *sounded* cool when it was described in that electronics clearance catalog (Damark, maybe), but actually using the thing was another story entirely.

You see, it was supposed to be a "voice phone" ... as in voice-activated. While I got mine somewhere around 1990, I've been able to find evidence of it existing as far back as December 1987. So, imagine how mind-blowing that was back then: "wow! dialing the phone with my voice! In the 80s!".

Yeah well, Siri it was not.

Here's how it worked. It had no number buttons and no dial (you know, the spinny bit on a rotary phone). On the front, it just had some cursor keys (left, right, down, up) and a "store" button. The actual handset was unexpectedly lightweight and had a button at the top behind the earpiece. (I think there was also a reset button under a flip-up door, for what it's worth.)

Oh, and even more confusingly, you didn't get a dial tone when you picked it up. Picking it up off the very flimsy "hookswitch" presented you with a locally-generated tone that meant "okay, I'm waiting for you to talk to me now".

What you had to do was push the button down and say a command word like "DIAL", then wait for it to do the "ke-bwoop" confirmation noise and show "DIAL" on the single-line display. Then you'd read out the numbers one by one, waiting after each one for it to confirm. "1" *wait* "2" *wait* "0" *wait* "2" *wait* "4" *wait* "5" *wait* "6" ... you get the idea.

Then at the end, I think you just released the button and it would then "execute" the "command" you had just painstakingly built one word at a time. You'd hear the dial tone at last, and it would actually dial the number, and then it would connect things through and in theory you could talk like normal.

If this description is making you think "this thing sounds really slow", you'd be right. So, okay, naturally it had some memory features, right? Of course it did. You could store things like "Home 1" or "Neighbor 1" or "Office 1". Oh, and by the way, those names were immutable. You couldn't call it "Mrs. Brown" or "Mr. Chilman". It was "Neighbor 1" and "Neighbor 2" for you. Hope you remembered who was who! Enjoy flipping up that little door to see the labels you hand-wrote!

They did let you adjust the on-screen display for any given memory location, so while it might need to be TOLD "Neighbor 1", you could make it *display* "Mrs. Brown" or whatever... if it would fit.

It didn't always hear you correctly. When that happened, you had to say "BACKSPACE" and wait for it to acknowledge with another *ke-bwoop* noise. If you wanted to cancel, you had to be careful how you went about it, since letting go of the button would make it execute whatever you had told it so far.

In a stunning preview of today's event-driven half-assed GUI programs, you could actually get it out of sync with the "on hook" / "off hook" state that it maintained locally. I mentioned that it made that "I'm ready" tone when it's off the hook, right? Well, if you jostled it with just the right timing, you could manage to get it to be very much on the hook and yet still making the damn noise from the earpiece.

It wasn't super loud, but late at night when everything else was quiet, you could hear that tone coming from the phone as it was just waiting for you to push the button and give it some commands. The "fix" was to jostle it some more until it realized that you were not in fact trying to make it do something on your behalf.

One side note: for those thinking "this must have been amazing for people who can hear but can't see"... probably not so much. It didn't read back the numbers you told it to dial. So, if it mis-heard one digit as another, you'd never know. It only showed it on the display, so if you couldn't see it... oops.

Apparently the list price for this thing at the end of 1987 was $450... or about $1200 today. Imagine spending that much cash on some tech and then realizing it was annoying, flimsy, and generally unreliable.

Oh, wait, I guess we all do that pretty much constantly now. Never mind.

Add extra stuff to a "standard" encoding? Sure, why not.

I've built more than a few projects which use protocol buffers somewhere in them to store data or otherwise schlep it around - in files, over the network, and that kind of thing. A friend heard about this and wanted to write an implementation in another language and so I supplied the details. Everything seemed to be going fine, but then we started getting *really weird* errors when he tried to point his new client at my server process.

Just trying to get the outermost "envelope" thing to pass would fail. This made no sense. We finally had to get down to individual bytes from the network dump to try to sort it out. Then we tried to encode "the same thing" and got two different results. His end was generating "1f 0a 0b (string)" and mine was doing "0a 0b (string)".

Where was this extra 1f coming from? We started trying to unravel it according to the rules of protobuf: the tag of a record is a varint which comes from the field number and wire type and blah blah blah... and I won't even bother with the details here since that was also a dead end. It decoded to "field 3, type 7" but there isn't a type 7. There are just 0-5. So, again, WTF? What is this "invalid wire type 7" thing? (And yes, that string in this post is entirely deliberate.)

My friend is good at this sort of thing, and so started digging in deeper... and it started looking like a length byte. It's like, wait, what? Hold on. protobufs do not work that way! They don't have their own framing. That's why recordio was invented, and countless other ways to bundle them up so you know what type they are, how long they are, and all of that other stuff. The actual binary encoding of the protobuf itself is bare bones! So what's up with this length byte?

So then we started looking at this protobuf library he had selected, and sure enough, the author decided it was a good idea to prepend the message with the message length encoded as a varint.

WHY? Oh, why?!

And yes, it turns out that other people have noticed this anomaly. It's screwed up encoding and decoding in their projects, unsurprisingly. We found a (still-open) bug report from 2018, among others. They all manifest slightly differently, so not everyone realizes that it's all from the same root cause.

The fix was dubious, but it did work: you skip the "helper" function that's breaking things. That gives you just the proper bytes, and then everything is happy.
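
For the C++ protobuf folks, the distinction his library blurred is the one between bare serialization and the explicitly length-delimited helpers. Something like this, borrowing the message type from the config post above purely for illustration:

#include <sstream>
#include <string>

#include <google/protobuf/util/delimited_message_util.h>

#include "thermo.pb.h"  // any generated message works for this demo

void Demo(const thermo::LoggerConfig& msg) {
  // Bare encoding: exactly the protobuf wire bytes, no framing.
  std::string bare;
  msg.SerializeToString(&bare);  // 0a 0b ...

  // Length-delimited: a varint byte count prepended, on purpose.
  std::ostringstream framed;
  google::protobuf::util::SerializeDelimitedToOstream(msg, &framed);
  // framed starts with a varint length byte - his library's mystery "1f".
}

Both forms are legitimate. The sin is doing the second one silently inside a helper that looks like the first.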

That's how I got both a "second source" for speaking my goofy RPC language and another story about wacky broken libraries at the same time.

Feedback: the feed seems just fine...

Earlier, someone wrote in saying "the RSS feed seems to be broken", but didn't leave any contact info. All I have is an IP address and whatever the web server logged. Let's just see what we got here, and see what's broken and what isn't.

xx:01:29 : requests feed over http with no conditional headers (If-Modified-Since, If-None-Match). Receives entire feed as a result. Expected behavior for first fetch.

xx:03:53 : requests feed over http again... with no conditional headers. It's been a hair over two minutes. The server rejects it with a 429 "Too Many Requests". Strong signal for a broken feed reader.

xx:20:37 : third request over http with no conditional headers (so it wants the unchanged feed again, 19 minutes later). Rejected the same way.

Then things pivot and hit the "secure" side of the site. I treat them differently for the purposes of throttling, or very bad things would happen to feed aggregator places which subscribe to both versions of the feed.

xx:21:46 : requests feed over https. Sends no conditional headers. Receives entire feed since it's specifically rigged to not care about the earlier http traffic.

xx:22:04 : requests feed over https again. Sends no conditional headers despite receiving entire feed 18 seconds earlier. Is rejected with a 429.

xx:22:11 : requests feed over https a third time, again with no conditional headers. Is rejected with a 429 again.

xx:22:18 : fourth request over https, again no conditional headers, is again thrown in the bit bucket with a 429.

xx:23:13 : fifth request over https, still no conditional headers. Gets 429.

I'd say things are working perfectly... here. There's a feed reader involved which doesn't send conditional requests, doesn't throttle on a 429, and doesn't surface HTTP failure codes to the user, but I have no control over that.
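
For contrast, a well-behaved reader's second fetch looks something like this (ETag value invented):

GET /w/atom.xml HTTP/1.1
Host: rachelbythebay.com
If-None-Match: "(the ETag from the last 200 response)"

... and the answer is a tiny 304 Not Modified instead of another full copy of the feed.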

HTTP 429 means slow your roll.

Feedback: I try to answer "how to become a systems engineer"

I got some anonymous feedback a while back asking if I could do an article on how to become a systems engineer. I'm not entirely sure that I can, and part of that is the ambiguity in the request. To me, a "systems engineer" is a Real Engineer with actual certification and responsibilities to generally not be a clown. That's so far from the industry I work in that it's not even funny any more.

Seriously though, if you look up "systems engineering" on Wikipedia, it talks about "how to design, integrate and manage complex systems over their life cycles". That's definitely not my personal slice of the world. I don't think I've ever taken anything through a whole "life cycle", whatever that even means for software.

In the best case scenario, I suppose some of my software has gotten to where it's "feature complete" and has nothing obviously wrong with it. Then it just sits there and runs, and runs, and runs. Then, some day, I move on to some other gig, and maybe it keeps running. I've never had something go from "run for a long time" to "be shut down" while I was still around.

This is not to say that I haven't had long-lived stuff of mine get shut down. I certainly have. It's just that it's all tended to happen long enough after I left that it wasn't me managing that part of the "life cycle", so I heard about it second- or third-hand and much much later.

If anything, some things have lived far too long. My workstation at the web hosting support gig started its life with me in 2004 as a pile of parts that had formerly been a dedicated server. It had a bunch of dumb tools that I wrote and other people found useful. It should have been used to inspire the "real" programmers at that company to code up replacements, but seemingly did not. That abomination lived until *at least* 2011, or five years after I moved on from that company. None of that stuff was intended to run long-term, but someone kept tending it for years and years. It was awful.

But, okay, let's be charitable here. Maybe the feedback isn't asking for that exact definition, but rather something more like "how to get a job sort-of like the things I've done over the years". That's the kind of thing I definitely could take a whack at answering, assuming you like caveats.

I think it goes something like this: you start from the assumption that when you see something, you wonder why it is the way it is. Then maybe you observe it and maybe do a little research to figure out how it came to be the thing you see in front of you. This could go for just about anything: a telephone, a scale, a crusty old road surface, a forgotten grove of fruit trees, you name it. By research, I mean maybe you go poking around: try to open that scale with a screwdriver, get out of the car and walk down the old road, or turn over some of the dirt in the field to see if you can find any identifying marks.

I should also point out that this goes for trying to understand how people and groups of people came to be the way they are, too, but most tend to not respond well to being opened with screwdrivers, walked on, or turned over in the dirt. (And if they do, well, don't yuck their yum.)

Anyway, if you start from this spot, then maybe you start coming up with some hypotheses for how something happened, and then sort of mentally file that away for later. Or, maybe you even write it down. Then as more data comes down the pipe over the years, you revisit those thoughts and notes and refine them. Some notions are discarded (and noted as to why), but others are reinforced and evolved.

Do this for a while, and sooner or later you might have some working models. They might not necessarily be the actual explanation for why something is the way it is, but it gives you a starting point.

Then, one day, something breaks, and you end up getting involved. It might be a high-level system that's new to you, but it has some low-level stuff deep inside, and you recognize some of that. One of those low-level things had a history of doing a certain thing, and that never changed. They might've built a whole obscure system over top of it, but the fundamentals are still there, and they still break the same way. You go and look, and sure enough, some obscure thing has happened. Nobody else saw something like this before, and so when you point it out and flip it back to sanity to restore the rest of the system, they look at you like you just pulled off some deep magic.

The question is: did you, really? It's all relative. If you've been poking and prodding at things and have remembered the results of these experiments from over the years, it's not really new to you. It's just one of many events and might not be anything particularly special by itself. It just happened to be important on this occasion.

Some people will accept this explanation. Others will refuse it and will insist that you are a magician for fixing "the unfixable". A few others will know exactly what you did because they did it themselves once upon a time.

Then there are the one or two in every sufficiently large crowd who will see that you are being celebrated for knowing and utilizing some obscure factoid, and they will make it their mission to wreck your world. Basically, they have to make your random happenstance about them somehow, and so they make it about how it hurt them and how they need to get back at you. If this sounds pathological, it's because it is, and unfortunately you will encounter this at any company which doesn't have the ability to screen out the psychos.

This also goes for the web as a whole. Having something you've done be (temporarily!) elevated to a point of visibility somewhere public will just set these people off. This, too, is enabled by having forums which don't notice this and deal with their pests.

Now, for some examples of obscure knowledge that paid off, somehow.

pid = fork(); ... kill(pid, SIGKILL); ... but they didn't check for -1. "kill -9 -1" as root nukes everything on the box. This takes down the cat pictures for a couple of hours one morning because it turns out you need web servers to run a web site. Somehow, the bit in the kill(1) man page about "it indicates all processes except the kill process itself and init" stuck in my head. Also, the bit in the fork(2) man page that says "on failure, -1 is returned in the parent".
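
In code, that trap is small enough to miss in review. A sketch:

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

void spawn_worker(void) {
  pid_t pid = fork();
  if (pid == 0) {
    _exit(0);  // child: real work would happen here
  }
  if (pid > 0) {
    kill(pid, SIGKILL);  // fine: this is a real child pid
  }
  // pid == -1 means fork failed. Without the check above, that -1
  // goes straight into kill(), and kill(-1, SIGKILL) as root signals
  // everything on the box except init and the killer itself.
}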

malloc(1213486160) is really malloc(0x48545450) is really malloc("HTTP"). I think this came from years of digging around in hex dumps and noticing that the letters in ASCII tend to bunch together (this is entirely deliberate). Seeing four of them in a row in the same range with nothing going over 0x7f suggested SOME WORD IN ALL CAPS. It was.

The fact I had seen some of this stuff before is just linked to some chance events in my life, combined with doing this kind of ridiculous work for a rather long time now. There are plenty of other times when something broke (or was generally flaky) and I had no idea what it could possibly be, and had to work up from first principles.

For someone who's just getting started, it's a given that you haven't seen many of these events yet. Don't feel too badly about it. If you keep doing it, you'll build up your own library of wacky things that could only be earned by slogging away at the job for years and years.

Also, if you think this is nuts and choose another path, I don't blame you. This *is* nuts, and it's entirely reasonable to seek something that doesn't require years of arcane experiences to somehow become effective.

Administrivia: new web hosting arrangements

Welcome to the new hosting situation. Over the past month or so, I've been working to move this web page and some of my other stuff to a new spot. As of this morning, it's done, and this is being served from the new machine. Say hello to flicker.rachelbythebay.com.

So, what happened? Well, a cute little company called SoftLayer turned into a massive monster called IBM. They still had acceptable rates and actually offered IPv6 (barely), but their corporate brain damage only got worse every passing year.

They had definite "left hand, right hand" moments, like when I went to turn up a new machine in February 2020 and they didn't offer a kickstart of RHEL 8. It's like, hello, you bought Red Hat six months before. RHEL 8 itself had been out for nearly a year at that point, and indeed, had made it to 8.1 by then. So, I had to do CentOS 8, and then they hosed us all royally that year. That's when I stuck more pins in my IBM voodoo doll and migrated to Rocky.

Then there was the day in January 2022 when I was doing some work on the machine and noticed that it needed a firmware update or something. I figured, okay, fine, I'll take the downtime and let their automatic doodad do exactly that. It's really late and nobody should care. I queued it up and powered it down (per their instructions).

I watched from the remote screen monitor as the automatic updater powered it up and got it to boot over the network into Windows (!) in order to run some nasty thing that popped up CMD windows and worse. I went off to do something else to distract myself. One hour turned into two, then into three, and support started saying "oh, it'll be a total of four hours". Great. The worst part was the complete lack of updates during this process. They just kept flailing.

I finally said "please just abort this and put my machine back up". I told them that it failed, and they should not attempt to troubleshoot their automation system on my machine. They should admit that it failed, put me back up, and leave me alone for a while until I can figure out what happens next. They finally got someone who was paying attention to do exactly this, and the machine went back up.

We scheduled it to happen the next night during another four-hour window. They started it, worked for about an hour, then called it and decided to go with a chassis swap. Yep, they pulled my drive out and jammed it into another box (and I was fine with this). Since I'm not a complete clown, it came back up by itself and figured everything out and kept going. How about that.

So, if you noticed multiple hours of the site being down on January 3rd, 4th and 5th of 2022, that's why!

What else with them? Their customer support is completely boneheaded sometimes. They had this "VPN" thing so you could tunnel into your "privatenet" which has the IPMI/remote KVM interface for your server(s). I would do that when doing a kernel upgrade in case I screwed up and needed to rescue things. I'd get that working *first* before doing the reboot just out of paranoia. I've yet to need it, but old habits die hard.

One day, it just stopped working. I filed a ticket asking them what I should be doing, since their documentation web page (and I provided the URL) said to use X, but X wasn't working. Is there a new hostname, or can you fix the thing?

They came back and said, oh, use this documentation web page.

It was the same page I had put in the request, unchanged.

Several days went by. Finally, I "thanked" them for "providing the same URL that I had provided them in the first place", and closed the ticket with a thumbs-down.

In the meantime, I had managed to find another way in by guessing how their hostname scheme worked, and got my work done and rebooted into the new kernel. They never really fixed the docs as far as I know, and they are probably still pointing people at a long-dead VPN endpoint.

But no, that wasn't it, either. The machine was physically in Texas. That particular hive of hate and villainy is talking about making ISPs restrict access to certain kinds of web pages. That's obviously about consumer-side stuff, but they could probably find ways to extend that to the *hosting* side of it, too. Also, screw them and feeding their tax base. I started looking for replacement options in other locales.

At this point, I noticed that all IBM would sell me was something that was much less box for much more money. I'm talking a slower processor, less memory, and all of that stuff, and the monthly bill would go up. Screw. That.

And then I got my final sign from the universe: they're "modernizing" and so DAL05 (my location) will be shutting down in April 2024. I didn't even notice this until I happened to be in their "portal" to do some unrelated work. Did they mail me? No. Did they call me? No. I just happened to notice it while in there one day.

Well, that's the last sign I needed, and I pulled the trigger on a colocation cabinet a few days later. That then started the whole crazy mess of getting a server, pulling together the network equipment, installing it *physically* (this was hard!), installing it *logically*, and then migrating everything.

Late Friday night into Saturday morning, I started flipping things over and kept an eye on them. A few minutes ago, I turned off the web server on the old machine. I figure if your DNS provider is crazy enough to clamp my 900 second TTL up to something over 12 hours, you deserve to talk to a brick wall of RSTs for a while.

So here we are. I now have a server I can physically lay hands on, albeit with a little driving involved. I got it used, and it's a real beast, but it does work. I'm also hearing from early testers that it's significantly faster for them. I thought it was just because I moved it about 40 milliseconds closer to me, but it just seems snappier for them, too. How about that?

I probably screwed up at least one thing with this migration like I did last time, so if you spot something amiss, please do holler. All of the URLs should still be working and all of that stuff. I already know the mtimes all reset, so a bunch of pages look new when they have the same content - that was unavoidable.

That's the story of one more bird in the flock.

Fulfilling a reader's request for my "dot files"

I got a bit of feedback the other day from Nate asking if I had dot files. I certainly do. I assume what they meant is if I have particular customizations, and then if I would care to share them. I definitely have a bunch of particular changes, and as for sharing them, why not. It lets me get a bunch of shots in at things that have become annoying over the years, and that means it's perfect for stirring up the hornets' nests with a Friday night post.

Starting on my daily driver box that runs Debian, then:

I have a .bashrc that has a bunch of dumb two- or three-letter aliases which amount to 'ssh (otherbox)'. For some reason that is lost to time, they all start with the letter m, and then the second letter sometimes reflects the name of the target system - "mm" takes me to my Mac Mini (which also runs Debian), for instance.

The stock PS1 bugged me a bit, so I mangled it down to this:

export PS1='\h:\w\$ '

... which turns into "hostname:/some/path$ ", in other words.

I think I've had a prompt like that on my personal machines basically forever - probably back to 1994 if not before. That's fine when I'm just running things as myself. If I run sudo, I get the stock setting which ends up looking like this:

root@hostname:/some/path#

... and that's fine, too. Making it look a bit different when rootly powers are in force is a good thing.

The next one is switching off another annoyance:

alias ls='/bin/ls -N'

The -N switch to ls says "print entry names without quoting" ... and it's the difference between having just the filename shown, spaces or no, and having it 'wrapped like this'. The way I see it, if you're printing quotes there, they'd better be part of the damn name. It reminds me of the time they started doing crazy UTF-8 "smart quotes" in their error messages and I didn't know it had changed. Cue me going "WTF is this gunk in this filename?" and thinking we had major corruption in the system somewhere.

I'd probably put up with the quoting if it didn't bump everything else out to the right another column. Two spaces between the time and the filename? Heresy!

The next two are filed under "everyone sucks at setting colors in Unix tools so stop adding it to everything". The first one is for sar:

export S_COLORS=never

... and the second one is something I found a little later which seems to be something that might work across multiple programs (assuming they've been patched to recognize it):

export NO_COLOR="eat flaming death you [elided]"

You can guess what the rest of it says. The actual value doesn't matter. Just having it set does the job. The value I put there is just to make me feel better every time I have to fight my way back to the perfectly working setup I used to have.

That's it for .bashrc. Next, I have .gitconfig which is mostly boring. There's a [user] section which has name= and email=, and those are set to about what you would expect.

I have pull.rebase set to true because that's always what I would use anyway when doing a pull, and it started whining at some point. So I put this in here to make it keep doing what I wanted. This is because I don't do branches and other goofiness and just want a nice simple continuous timeline for my commits.

I also have init.defaultBranch set to main because, eh, why not? I've designed enough systems based on the old broken naming schemes and don't need any more.
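
Put together, the whole file amounts to something like this (a sketch - the name and email are obviously placeholders):

[user]
        name = Somebody
        email = somebody@example.com
[pull]
        rebase = true
[init]
        defaultBranch = main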

I have a .gdbinit. Why? Same old story: the default now sucks. It has one line:

set style_enabled off

It's amazing just how awful it is when it changes colors every time it hits a ( or " or whatever. How do people deal with that stuff? So bad. It's so nasty.

Next up, .nanorc, and this one is a three-ring circus. Basically, for the longest time, I didn't need one of these. Now, I add about one line on average every three or four years because - again - things keep changing for the worse.

Here's where things are now:

syntax "all" ".*"
color yellow "^$"
unset locking
set emptyline
set breaklonglines

The first two have been with me for quite a while now, and serve to disable syntax highlighting across the board. Again, not my thing.

Line three stops it from pooping out stupid ~ files everywhere. Not wanted, not needed, didn't ask for it, was forced upon me, had to murder it with a setting.

Lines four and five just put back behaviors that they dumped in 4.0: the blank line right below the status bar at the top, and the wordwrap that happens when you hit a certain column. I use that all the time, like, well, *right now* writing this post. It hard-wraps at 72, because OF COURSE it does.

Next is my .Xresources which provides a way to disable some obnoxious behavior in urxvt without having to recompile it. For the longest time, I'd chop it out and drop a custom binary into my bin directory. Then I realized it could be tamed without such mangling, and here we are:

URxvt.perl-ext:
URxvt.perl-ext-common:

This has the effect of making it so a double-click highlights the whole word, and a third click highlights the whole line *even if* someone's holding a LISP convention on that particular row of the terminal.

Then I have a .xsessionrc which needs to exist because I now log in through xdm, and the window manager (fluxbox) ends up inheriting *that* environment. Yep, it doesn't get a .bashrc type thing applied to it. (Not gonna lie - this took a while to figure out. Quick, which of .bashrc, .bash_profile, .profile et al get run for any given type of login you do to a box? Text mode, X *and* ssh all matter.) Anyway, that means I have to twiddle my PATH in there, or the commands that fluxbox runs for me won't find anything in those extra directories.

That is, I like my .fluxbox/menu entries to be short and sweet, like "term". That's a small stupid script in my bin directory. If that's not in my PATH then I'd have to spell out the whole /home/blahblah thing, and that's just idiotic.
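
So the working part of my .xsessionrc boils down to about one line, something like this (the exact directory doesn't matter):

export PATH=$HOME/bin:$PATH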

Speaking of fluxbox, that has a dot directory, and a startup script in there to set a few things up properly.

xset b off
xset r rate 250 30
xset dpms 1900 2000 2100
xscreensaver &

Line one turns off the console beep - not that my machine has a PC squeaker any more, but I think some things try to be "helpful" by sending a beep into the system audio path. That can be really obnoxious, like when I'm deliberately holding down a key for whatever reason and get to the beginning of the line.

Line two is about getting that key-repeat going at a speed I like. If I end up on a machine where that's not fast enough, it becomes obvious pretty quickly, and I have to go adjust things. Not every situation allows for things like ^W to eat a word or ^U to eat the whole line, and so holding down backspace to change the wording of something is what I want.

Likewise, if I want to put a "-------------" divider somewhere, I don't want to wait for it to get going. It looks like that means "wait 250 ms before repeating, and then repeat at 30 Hz", but I had to look it up because it's been set like that for as long as I can recall.

Or maybe I want to hold down the cursor key to scroll something, or just move somewhere else on the line. Same thing.

Annoyingly, this seems to be set in the keyboard itself and not anywhere local to the machine, so if I have to replug the keyboard for some reason, I have to run that again or it'll be stuck in stock molasses mode. This feels like a regression from the PS/2 days, but I haven't bothered plugging in one of my old Model Ms to verify it.

Line three just sets up the power-saver specifics on the monitor. Those don't usually matter too much since I have a hotkey that explicitly locks things and then forces it to go to sleep right away, and I push that when I'm done using this thing.

Line four, well, that's my dose of jwz, and that's what actually keeps the screen locked, as opposed to the legions of craptacular also-ran "lock" programs that always end up sucking and failing open. I can't imagine how many years in total my screens have been protected by xscreensaver in "lock" mode.

The rest of that file just starts my three Window Maker-era widgets and those aren't important or even interesting. There's a clock/calendar, the CPU load, and something to twiddle the system volume for when I have speakers or headphones connected.

That's about it. I don't use .plan or .project files any more since I haven't run fingerd for decades, and besides, my machines are all just me and nobody else, and so a local finger is also not a thing. (Oh, get your minds out of the gutter. It's the "ratting out to the cops" sense of "finger".)

Want to see the last time I used that stuff? Here's the file in my homedir archive from the last machine which had that running:

-r-------- 1 rkroll rkroll 34 Apr 28  1996 .plan

See, told you it's been decades. All I did was rip off a line that I had seen in someone else's file that was intended to sow confusion:

Segmentation fault (core dumped).

The idea is that you'd think the far-end finger process had crashed - or maybe even *your local finger client* - and then you'd run around trying to figure it out. Then you'd eventually realize what was going on and shoot a nerf dart at whoever wasted your time.

Ah, the '90s.

Escalating via post-it note just to get some health checks

I used to work at a place that had an internal task tracking system. Big deal, you think. Lots of places do that. Well, at this particular company, it was sometimes a pit of sorrow into which you would issue a request and never hear from it again... unless you went to some lengths.

Let's back up a little to set the stage. It's June of some year quite a while back, and it's about 9:30 at night. I guess I'm on call for the "last line of defense" debugging team, and I get pinged by the manager type who's wrangling an outage. It seems this team did some kind of code push and now they were completely down in some cluster: "0 online users" type of thing.

The incident manager asked me to find out what the load balancers were doing to healthcheck the systems in question, so I went looking. It turned out they were getting a small HTTP request on port 80, sort of like this: "GET /status HTTP/1.0".

But... the ones in the broken cluster weren't listening on port 80.

I asked if they did port takeover stuff (where one process can actively hand the listening socket to the one that's going to replace it), but then noticed they were running Java, and figured not. That kind of stuff was only really seen in some of the C++ backend processes at this gig.

I asked if maybe they had restarted in such a way that they tried to bind port 80 with the new process before the old one had shut down. Crickets.

Anyway, lacking a response from the engineer in question, I kept going, and found that poking another one of their instances in another (healthy) location would get a "I am alive" type response. To me, that seemed like a smoking gun: no response to HC = no love from the load balancer = no users.

A few minutes had gone by with no reply from the engineer, so I used my Magic Powers on the box to kick one of the java instances in the head and see what would happen. A few minutes later, it restarted, and now it was listening on port 80 and answering the health checks. Unsurprisingly, it started receiving requests from the load balancer, and the number of users started creeping upward.

I suggested a follow-up task: their stuff should kill itself when it can't get all of the ports it needs. Also, the thing that runs tasks should be set to check all of the ports too, and if it can't get satisfaction, *it* should kill the task.

Now, okay, granted, this is hacky. The program should be able to survive port 80 not being immediately available. It should also have some way to "hand off" from the other process. Or, you know, it could bind to a new port every time and just update its entry in the service directory. There are lots of ways to go about fixing this. However, considering the context, I went for the lowest-hanging fruit. You don't want to ask people to boil the ocean when they're having trouble making tea.

Anyway, they started restarting tasks and the service slowly returned to normal. I dropped offline and went to bed.

Two months went by. They kept having outages related to not having healthchecks. They were all preventable. They'd happen right when I was getting ready to go home, and so I'd miss my bus and have to wait an hour for the next one. That kind of crap.

I started counting the HC-related outages. Two, three, four.

At some point, I was over it, and dropped a post-it note on the desk of the head of engineering, pointing at the task to fix this thing and pleading for them to get involved. I was through being the "bad cop" for this particular one. It was time for management to deal with it.

Another month went by. Then, one day in late September, someone popped up in our IRC channel saying that they had turned on health checks and now had to test them, and could we help? I happened to be there, grabbed one of their machines, and promptly screwed up one of their java processes so it would just hang. (I forget what I did, but SIGSTOP seems plausible.)

The task runner thing noticed and set it to unhealthy. About three minutes later, it killed it, and then restarted it. Four minutes after that, it was still restarting, and maybe another four minutes after that, it was finally alive again and taking requests.

I informed this person that it did in fact work, but it took something like ten minutes to cycle back up to being useful again. They thanked me for checking and that was the last I heard about it. Apparently they were fine with this.

Considering this method of accessing things sent more people to the service than the *web site* did, you'd think it'd be kind of important. But, no, they were just rolling along with it.

Then today, I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.

And I went... hey, I think I know that particular circus!

Turns out - same circus, same elephants, different shit.

Administrivia: HTML generation and my general clowniness

I've been kind of quiet these past few weeks. Part of that has been from plowing a bunch of work into getting serious about how all of the /w/ posts get generated. I figure if I'm going to start leaning on people to not do goofy things with their feed readers, the least I can do is make sure I'm not sending them broken garbage.

To really explain this, I need to back up to 2011 when this whole thing was just getting off the ground. I started writing here in order to keep the momentum from the writing I had been doing inside the company I was about to leave. I figured that anything was better than nothing, and so those posts were all done by hand. The posts themselves were hand-formatted: I'd type it up, slap on the header and footer, and then it'd get a link from the top-level index page.

Then people asked for an Atom feed, and I delivered on that too... ALSO doing it by hand at first. Yeah, that was about as awful as you can possibly imagine. Obviously that could not stand, but it did get me through the first couple of days and posts, and then my little generator came together and it picked up most of the load for me.

But there's a dirty little secret here: this generator has been little more than a loop that slaps HTML paragraph ("p") tags around everything. It doesn't really understand what's going on, and any time it sees a blank line, it assumes one ended and another one just began.
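
If you want to picture it, the guts of it are about this smart. This is a reconstruction, not the actual code, but it captures the spirit:

#include <stdio.h>
#include <string.h>

int main() {
  char line[4096];
  int in_para = 0;
  while (fgets(line, sizeof(line), stdin)) {
    line[strcspn(line, "\n")] = '\0';
    if (line[0] == '\0') {            // blank line: the paragraph "ended"
      if (in_para) { puts("</p>"); in_para = 0; }
      continue;
    }
    if (!in_para) { fputs("<p>", stdout); in_para = 1; }
    puts(line);                       // no escaping, no understanding, nothing
  }
  if (in_para) puts("</p>");
  return 0;
}

Note that a line with just a single space in it doesn't count as "blank" to this thing. That detail comes back in a moment.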

If you've ever looked at the source of some of the more complicated posts with embedded tables, audio players, PRE blocks or anything else of the sort, you've probably wondered what kind of crazy I was smoking. Now you know why. The only reason it works at all is because the web as a whole is terrible and browsers have had to adapt to our collective human clownery. HTML parsers tend to ignore the botched tags, and it generally looks right anyway.

I still find myself doing stupid things to work around the nuances of the ridiculous state machine that I created. If you've seen PRE blocks where for some reason there are lines with a single space in them, this is why! A blank line would trip the "stick on a /p and then a p" thing, but a line with a single space would not. So, I've been doing that.

Worse still, see how I'm calling it /p and p? I'm not using the actual angle brackets? Yeah, that's because there's no entity encoding in this thing at the moment. I'd have to manually do the whole "ampersand l t semicolon" thing... and HAVE been doing this all this time. I don't feel like doing that at the moment. (Because I'd have to fix it when it's time to convert this very post, but I'm getting ahead of myself.)

Both publog (the thing that is responsible for what you're seeing now) and my own diary software share a similar heritage, and I've been bitten by the lack of proper handling of this stuff over the years. For whatever reason, I decided it was time to do something about it, and finally got traction with an approach around the time of the new year.

Here's what's coming: every single post will be run through a generator that actually functions like a "real" parser - tokens and rules and put_backs and all of this! It's not just a "am I in a paragraph right now" state machine. It'll accumulate text, and when it's ready to emit a paragraph, it will do that with all of the rules it's been told about, like how to handle attributes, their values, AND when (and what) to escape/encode in the actual body of the tag/container.

This also goes for some of the "commands" that have been part of the input files all this time. When I include an image, I've been doing a special little thing that says "generate the IMG SRC gunk with the right path for this file with this height and width". This lets me ensure that the http and https feeds don't get cross-protocol URLs, among other things. The "this post has an update" lines and the backwards links to older posts also work this way.

This HAD been working with a bunch of nasty stuff that was basically building HTML from strings. You know the type, right? You print the left bracket, IMG SRC=, then you have to do a \" to get a literal " in there without ending the string... and then you end the string. Then you add the filename, and start another string and put a \" in it to cap off the SRC attribute of the IMG tag, and so on and so forth...
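
A hypothetical reconstruction, for flavor:

#include <stdio.h>
#include <string>

std::string img_tag(const std::string& path, int h, int w) {
  std::string s = "<IMG SRC=\"";     // open the tag, escape a quote...
  s += path;                         // ...paste in the filename...
  s += "\" HEIGHT=\"";               // ...cap off SRC, start HEIGHT...
  s += std::to_string(h);
  s += "\" WIDTH=\"";
  s += std::to_string(w);
  s += "\">";                        // ...and hope nothing needed encoding
  return s;
}

int main() {
  printf("%s\n", img_tag("some/pic.jpg", 480, 640).c_str());
  return 0;
}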

This kind of crap!

I'm kind of wondering who's reading this and thinks I'm a clown vs. how many people are reading this and are just nodding their heads like "yeah, totally, that's how we do HTML all over the place". But I digress.

Now, actually doing this has meant coding it up, but it's also meant going back and converting all of the damn posts, too. Any place where I had raw HTML shenanigans going on (like doing my own "ampersand + l + t + semicolon" stuff) had to be found and changed back to the actual character I want there. The program itself will do that encoding for me now. It's nice to have it, but it's a chore to go and do it without breaking anything, like a place where I WANT the literal gunk there.

With almost 5.5 MB of input text across 1400 posts, that was a non-trivial amount of work. I would not be surprised if I missed things that will pop up down the road and which will need to be hammered back down.

So yes, for a while, it will be "same clown, different circus". But, at least this time, I'll be trying to emit the right stuff.

I haven't set a date or anything for this. There's this possibility of also trying to solve some other dumb problems that also vex certain (broken) feed readers at the same time, and I haven't decided whether to block the rollout of the one thing on the rollout of the other one. This matters because I'd rather not rewrite every single /w/YYYY/MM/DD/whatever/index.html page multiple times. Ideally, they'll only change the one time. (What can I say, I care about these things.)

...

While waiting on that, if you're a feed reader author, you can at least check on a few things. You aren't honestly taking the "updated" time from inside the feed and using that in the HTTP transaction (If-Modified-Since), right? Right?? You know those are two totally different things from different layers of the stack, and aren't interchangeable, right? The IMS value should come from the "Last-Modified" header I sent you in the first place.

Right, Akregator? Right, NextCloud-News?

It's crazy how long it took me to figure out why they were sending me reasonable-looking "IMS" values that I had never handed out. It wasn't until I looked inside the actual feed that the penny dropped.

Want to know how the sausage is made and why this happens? Okay, settle in.

The web pages and the feed files (yep, plural: http and https) are made by running the generator on my laptop. The wall time on that system winds up being used in the "updated" fields in the XML gunk that is the Atom feed. The files also get a mtime that's about the same... on the laptop. More on that in a bit.

This writes to a directory tree that's a git repo, and a few moments later there's a git add + git commit + git push that captures the changes and schleps it off to my usual git storage space.

Later on, I jump on snowgoose (that's my current web server machine) and have it pull from that same git storage space into a local directory and then rsync the new stuff out of that tree into the various document roots - there are multiple web sites on this box.

If you didn't know this already, git does not preserve mtimes. The mtimes on files it writes out are just "now", whatever that may be. It's usually a minute or two later than when I did the generation on my laptop, just because I don't usually push to "production" right away. I usually eyeball things on an internal machine first.

Now, rsync DOES preserve mtimes, but it's preserving values that aren't particularly interesting. They are just the time when "git pull" ran on the web server and brought in the new/updated versions of the files. It's not the same time that the actual feed was updated on my laptop.

Apache uses the mtime on the files, so it's handing out "Last-Modified: (whatever)" based on that "git pull". This is not going to match the "updated" XML blob in the feed itself.

So, what I get to consider is whether I want to go nuclear on this and come up with something that will actually *SET* the mtimes explicitly and make sure they stay set all the way to the document root, no matter where it is.
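
The mechanics of that part are easy enough - a sketch, with the error handling mostly elided, and with the hard part (where the "real" time comes from) left as the exercise it actually is:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

// Force a file's mtime to a known value (say, the feed's actual
// "updated" time) so Last-Modified survives the git-pull-plus-rsync
// trip.  Leaves atime alone.
int set_mtime(const char *path, time_t when) {
  struct timespec times[2];
  times[0].tv_sec = 0;
  times[0].tv_nsec = UTIME_OMIT;   // atime: don't touch
  times[1].tv_sec = when;
  times[1].tv_nsec = 0;            // mtime: set it explicitly
  return utimensat(AT_FDCWD, path, times, 0);
}

int main(int argc, char **argv) {
  if (argc != 3) {
    fprintf(stderr, "usage: %s <file> <unix-time>\n", argv[0]);
    return 1;
  }
  return set_mtime(argv[1], (time_t)atoll(argv[2])) == 0 ? 0 : 1;
}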

Besides the broken feed fetchers, there's another reason to care about this sort of thing. What if I get a second web server, and put it behind a load balancer? Requests could be served by one or the other. Imagine if the two web heads did their "git pull" at two different times. Clients would get one Last-Modified value from server #1 and another value from server #2. Chaos! Madness! Insanity!

Now, I don't have a second web server, and in fact have no plans to do that unless people want to start throwing a LOT of money at me to run one in a colocation rack somewhere. But, it's the principle of the thing: controlling important values explicitly instead of leaving them to chance, *especially* since I'm expecting other people to do their part with those same values.

It's funny, right? I never thought I'd miss XHP until I started doing this project, and I didn't even do that many (internal) web pages at FB - just the ones I absolutely needed because nothing else would do.

Load 'em up and throw 'em under the bus

In recent times, I've been realizing more and more just how much a screwed up management situation can lead to screwed up technical situations. I've written a bit about this in the past few months, and got to thinking about a specific anecdote from not too long ago.

I was working on a team which was supposed to be the "last line of defense" for outages and other badness like that. We kept having issues with this one service run by this team which ran on every system in the fleet and was essential for keeping things going (you know, the cat pics). We couldn't figure out why it kept happening.

Eventually, I wound up transferring from my "fixer" team and into the organization which contained the team in question, and my first "tour of duty" was to embed with that team to figure out what was going on. What I found was interesting.

The original team had been founded some years before, but none of those original members were still there. They had moved on to other things inside the company. There was one person who had joined the team while the original people were still there, and at this point, he was the only one left who had "overlapped" with the original devs.

What I found was that this one person who had history going back to when the "OGs" were still around was basically carrying the load of the entire team. Everyone else was very new, and so it was up to him.

I got to know him, and found out that he wasn't batshit or even malicious. He was just under WAY too much load, and was shipping insanity as a result. Somehow, we managed to call timeout and got them to stop shipping broken things for a while. Then I got lucky and intercepted a few of the zanier ideas while he was still under the stupid-high load, and we got some other people to step up and start spreading the load around.

I pitched in too, like trying to help some of the irked customers of the team and do some general "customer service" work. My thinking was that if I could do some "firewall" type work on behalf of the team, it would give them some headroom so they could relax and figure out how to move forward.

This pretty much worked. The surprise came later, when the biannual review cycle started up and the "calibration sessions" got rolling. They wanted to give this person some bullshit sub-par rating. I basically said that if they give him anything less than "meets expectations", I would be royally pissed off, since it wasn't his fault.

What's kind of interesting is that they asked the same question of one of my former teammates (who had also been dealing with the fallout from these same reliability issues), and he said the same thing! We didn't know we had both been asked about it until much later. We hadn't even discussed the situation with the overloaded engineer. It was just apparent to both of us.

With both of us giving the same feedback, they took it seriously, and didn't hose him over on the review. He went on to do some pretty interesting stuff for monitoring and other new stuff (including bouncing it off the rest of the team first), and eventually shoved off for (hopefully) happier shores.

The service, meanwhile, got way better at not breaking things. The team seemed to gel in a way that it hadn't before. It even pulled through a truly crazy Friday night event that you'd think would have caused a full site outage, but didn't. Everyone came together and worked the problem. The biggest impact was that nobody internally could ship new features for a couple of hours while we figured it out and brought things back to normal. The outside world never noticed.

Not long after that event, I considered the team "graduated" and that I no longer needed to embed with them, and went off to the next wacky team in that particular slice of the company's infra organization.

This was never a tech problem. It was one guy with 3 or 4 people worth of load riding on his shoulders who was doing his very best but was still very much human and so was breaking down under the stress. They tried to throw him under the bus post-facto, but we wouldn't stand for it. This was a management problem for letting it happen in the first place.

See how it works?

More than five whys and "layer eight" problems

I saw a post about a year ago talking about the "five whys" technique of trying to figure out what caused something to fail. It was using a car scenario for an example, and it went something like this:

The car didn't start... because the battery is dead... because the alternator wasn't charging it... because the alternator belt broke... because the belt was beyond its useful life but wasn't replaced... because it wasn't maintained according to recommended schedule.

That's about five levels, and it pretty much stopped there. I figure, well, you can go beyond that, and in the case of the infra stuff at a big enough company, you probably need to if you intend to actually try to fix something.

So, that's been my life: trying to roll back through the series of actions (or lack of actions) to see how things happened, and then trying to do something about it. The problem is that if you do this long enough, eventually the problems start leaving the tech realm and enter the squishy human realm.

Perhaps you've heard of the OSI model of networking, where you have seven layers as a way to talk about what's going on in the "stack". I've seen some brilliantly snarky T-shirts that talk about "layer eight" and sometimes beyond as things like "corporate politics" and "management" and all of that good stuff.

It turns out that when you start doing this root-cause analysis and really keep after it, the "squishy human realm" is actually the no-longer-hypothetical "layer eight" from those T-shirts.

In our "car" example, you might discover that management is forcing people to ignore the maintenance schedule while saying things like "it'll work, trust me". Or, they're doing even worse things, like ignoring safety codes that have been written in blood.

For those of us in tech, we tend to get off much more lightly than people who do Actual Stuff in the Real World (like cars). Chasing down our problems means you start getting into things like "empire-building manager is hiring anyone with a pulse in order to look more important by having more direct reports". Maybe you chase that one down and you get to "manager of manager is also into this whole thing, and benefits from the equation".

That might lead into "the entire company is obsessed with hiring even though the tech equivalent of the Drake equation says there is no way they can find anywhere near that many qualified people in the entire world".

What does that look like? Well, some people have no business working on certain kinds of systems, whether as a transient situation, or a permanent one. Transient situations are a lack of training. Permanent ones might come from attitudes or a genuine lack of ability for whatever reason. Having the wrong person on the job is supposed to be noticed and handled by the manager. If they don't, that's a failure.

Now, the team's manager (M1) also has a manager (M2) of some kind, and M2 is supposed to be making sure M1 can actually, well, manage! If they can't tell if that's happening or not, that too is a failure.

In some situations, you come to realize that a whole bunch of bad things happen due to non-technical causes, and they are some of the hardest things that you might ever need to remove from an organization. Unlike the line workers, management is in a whole different world in which the "reality distortion field" matters most. You either generate a big enough one yourself, or you slot into someone else's. If you are opposed to it, you are rejected.

I guess this is my way of warning anyone who fancies themselves a troubleshooter and who really, truly, wants to get to the bottom of things. If you do this long enough, expect to start discovering truly unsatisfying situations that cannot be resolved.

Also, I will remind anyone who wants to try to tilt at such a windmill that if you are given responsibility without the power to make any changes, then you have just become the scapegoat. I said this in a post way back in February 2013, and I *still* fell into that damn trap in 2017 within a particularly broken organization.

Finally, in this same vein, I wanted to share something that a reader sent to me a while back, and that I found to be brilliant and amazing (I still do, but I did then, too): People can read their manager's mind.

In particular, pay attention to where it says corollary 1 and starts talking about the "insane employee". The whole "personal offense" thing? Yeah, if you have the ability to not become that person, try to avoid it. Alternatively, if you're cursed with the tendency to fall into those things, try not to give yourself a hard time when someone terrible takes advantage of you for the nth time.

Hang in there.

Determine durations with monotonic clocks if available

Sometimes, on a lazy weekend afternoon, I use apt-get to pull down the source of something and start grepping it for things that are bound to be interesting. This is one of those afternoons, and I found something silly. While looking for uses of time_t in bash, I found a variable called "time_since_start". Uh oh.

bash supports a dynamic variable called "SECONDS" (bet you didn't know that - I didn't), and it's documented as "the number of seconds since shell invocation". I'm sorry to say, that's not quite true. You can totally make it go negative since it's based on wall time. Just set the system clock back.

root@rpi4b:/tmp# systemctl stop chrony
root@rpi4b:/tmp# echo $SECONDS
11
root@rpi4b:/tmp# date -s "2023-01-01 00:00:00Z"
Sat 31 Dec 2022 04:00:00 PM PST
root@rpi4b:/tmp# echo $SECONDS
-2500987

That's an extreme demonstration, but backwards-going wall time happens every time we have a leap second. Granted, we're in a long dry spell at the moment, but it'll probably happen again in our lifetimes. The difference there is just one second, but it could break something if someone relies on that value in a shell script.

Or, how about if the machine comes up with a really bad time for some reason (did your hardware people cheap out on the BOM and leave off the 25 cent real-time clock on the brand new multi-thousand-dollar server?), the shell gets going, and later chrony (or whatever) fixes it? Same deal, only then it might not be a second. It might be much more.

In the case where the machine comes up with a past date and then jumps forward, SECONDS in a shell that was already running before the fix will be far bigger than it should be. I'm pretty sure every Raspberry Pi thinks it's time=0 for a few moments when it first comes up because there's no RTC on the board. Run "last reboot" on one to see what I mean.

I should also mention that bash does other similar things to (attempt to) see how much time has passed. Have you ever noticed that it'll sometimes say "you have new mail", for those rare people who actually use old-school mail delivery? It only checks when enough time has elapsed. I imagine a "negative duration" would mean no more checks.

The lesson here is that wall time is not to be used to measure durations. Any time you see someone subtracting wall times (i.e., anything from time() or gettimeofday()), worry. Measure durations with a monotonic clock if your device has one. The actual values are a black box, but you can subtract one from the other and arrive at a count of how many of their units have elapsed... ish.
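
In C terms, the pattern is a minimal sketch like this:

#include <stdio.h>
#include <time.h>

int main() {
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);

  // ... the thing being timed goes here ...

  clock_gettime(CLOCK_MONOTONIC, &end);
  double elapsed = (end.tv_sec - start.tv_sec) +
                   (end.tv_nsec - start.tv_nsec) / 1e9;
  printf("elapsed: %.9f seconds\n", elapsed);
  return 0;
}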

Be sure to pay attention to which monotonic clock you use if you have a choice and there's any possibility the machine can go to sleep. "Monotonic time I have been running" and "monotonic time since I was booted" are two different things on such devices.

Here's today's bonus "smash head here" moment. From the man pages for clock_gettime on a typical Linux box:

CLOCK_MONOTONIC: "This clock does not count time that the system is suspended."

Here's the same bit on a current (Ventura) Mac:

CLOCK_MONOTONIC: "...and will continue to increment while the system is asleep."

Ah yes, portability. The cause of, and solution to, all of life's software issues.

Tonight's rabbit hole: time math and 32 bit longs

I find some funny rabbit holes sometimes. Tonight, it went like this. Ubiquiti released a new version of the software for their USG devices because they had this thing where their dhcpv6-pd implementation could be exploited to run arbitrary commands by someone sitting in the right spot on the network (i.e., out your "WAN" port).

It's been a good while since they put out a new build for these devices, and I wanted to know what else changed. I find that when companies supposedly ship a "just a security fix" patch, they usually end up shipping far more, and probably break stuff too. (I'm still bitter about the 2020-002 "security update" for Macs.)

Anyway, it got me thinking: can you diff these things? Turns out, you sure can. It's two squashfs filesystems, so mount 'em and diff 'em, and dig through the results, and... hey.

/opt/vyatta/sbin/dhcpv6-pd-response.pl:
 
         if (defined $domain) {
             $domain =~ s/\.\s+$//;
+            $domain =~ s/[^A-Za-z0-9.]+/-/g;
             $dn = $domain;
         } else {
             $dn = "";

Yeah. That's what changed. So there's that. *facepalm*. Then I got bored and kept looking through the output to see what else happened. That's when I saw that the entirety of /etc/shadow changed. A bunch of numeric values changed, like this:

-root:!:18920:0:99999:7:::
+root:!:19369:0:99999:7:::

I had to look it up to be sure, but that's the "date of last password change". Divide them by 365 and you'll realize one of them is about 51 years, and the other one is about 53 years. So, 2021, and 2023 - the dates of the previous release and the new release, respectively. Their release process obviously rebuilds the shadow file.

But that's not the end of the rabbit hole. Thinking of last week's time post, I started looking at that number. It's so small - it fits into 16 bits with room to spare. I wondered what sort of type they were using to hold it. Into the shadow source I went.

The first thing I found was something called strtoday(). It looks like this (adjusted a bit to fit here):

long strtoday (const char *str) {
        time_t t;
[...]
        t = get_date (str, NULL);
        if ((time_t) - 1 == t) {
                return -2;
        }
        /* convert seconds to days since 1970-01-01 */
        return (long) (t + DAY / 2) / DAY;
}

Uh huh. It returns a long. On a 32 bit machine, a long is 4 bytes, and it's still going to be 4 bytes even after glibc does their "time_t is now 64 bits" thing that's coming down the pipe eventually. longs aren't going to change.

So, when does this break? It turns out... 12 hours BEFORE everything else blows up. "DAY" is defined in the source as (24L*3600L), so 86400 - the number of seconds in a day. It's taking half of that (so 43200 - 12 hours worth of seconds) and is adding it to the value it gets back from get_date. That makes it blow up 12 hours early.

2038-01-18 15:14:08Z is when that code will start returning negative numbers. That'll be fun and interesting.

Remember, the actual "end times" for signed 32 bit time_t is 12 hours later: 2038-01-19 03:14:08Z.

The lesson here is: if you take a time and do math on it and shove it into another data type, you'd better make sure it won't overflow one of those types that *won't* be extended between now and then.

...

$ cat t.cc
#include <stdio.h>
#include <sys/time.h>
 
#include <cinttypes>
 
#define DAY (24L*3600L)
 
// Needs a 4-byte "long" (and time_t) to show the wrap, so on a
// 64 bit box, build it 32 bit (e.g. g++ -m32).
long strtoday_tt(time_t t) {
  return (long) (t + DAY / 2) / DAY;
}
 
int main() {
  printf("2147440447 -> %ld\n", strtoday_tt(2147440447));
  printf("2147440448 -> %ld\n", strtoday_tt(2147440448));
  return 0;
}
$ ./t
2147440447 -> 24855
2147440448 -> -24855

Reader feedback: "bad" names, !main(), and Mastodon

Questions, questions, questions. Sometimes I have answers. Here we go with more reader feedback.

...

M.D. asks if I could share a story about when a company "did something really wrong".

OK, how about yet another case of "real names" gone wrong? I'm talking about the thing where some derpy programmer writes a filter to "exclude naughty words" and ends up rejecting people with names that HAPPEN to match a four-letter English word. My canonical example is "Nishit" because that's what actually happened at another job back around 2009.

But that's old news. I'm talking about "yet another case". This one is from right at the end of 2019. It seems like someone decided they were going to "move a metric" from the support queue. There were a TINY NUMBER of customers who had deliberately signed up with obviously offensive names. They were being handled by reasonable bags of mostly water (i.e., people) who could look at it and figure out what to keep and what to purge.

Well, the derpy programmer with a goal to hit by the end of the quarter apparently struck again, and they wrote a thing to ban anyone who matched a short list. Then they ran it, and - surprise - it banned a bunch of real people who weren't doing anything wrong. Of course, those people have probably had many problems on other services, and now THIS company was the latest one to show how damn stupid and unfeeling it could be.

"My last name really is 'Cocks'. How would you like me to proceed?"

Unsurprisingly, this pissed off a bunch of people and generated a blip in the news cycle. Internally, it was brought to the weekly review meeting that I was somehow still running at that point. Someone was there and presented the case, and it was pretty clear they were going through the motions because we called them on the carpet.

For some really stupid reason (literally every other senior engineer and manager was at some offsite planning thing that morning, and *someone* had to run this meeting), I was the most senior person in the room, and so I felt I had to ask them the question as "the voice of the company" (whatever that even means):

"Can I get you to promise to never do this again?"

They wouldn't commit to it. I got no reply. They just looked at me. Conclusion: this will definitely happen again. Nobody gave a damn about what happened to the customers, and how bad the whole thing looked.

Afterwards, I talked to some friends who had worked in the trenches in customer support. They knew what was happening in terms of the "trouble reports" that would come in from people using the app. They had a good feel for what was actually a problem and what was clearly "OKR scamming".

Near as we can figure, they decided to code this up because it would let them claim to have automated some class of tickets that were being filed. It's like, sure, it would in fact remove the handful of tickets that get filed about this. It would also generate a godawful amount of hurt (and bad PR and so on) a few hours or days later, and would have to be turned off. But, the person managed to ship the feature, and so they can get their bonus, or promotion, or whatever.

Of course, karma is a bitch. A few months later, COVID hit and the company started laying people off in droves. I bet all of those people are gone now. Unfortunately, this also means anyone who learned a lesson from this event is probably gone, too. Hmph.

For anyone who's in today's "lucky 10,000" set, have some additional reading on this topic.

...

A reader responded to my "mainless" program post and asked if you could avoid the segfault by putting "_exit()" into the destructor or similar. They're totally right. If you make the program bail out instead of trying to carry along into uncharted territory, it will in fact exit cleanly. You already have unistd.h in my example, so go for it.
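
Concretely, that means tweaking the example to bail out before it can fall off the end of the world - for instance, right in the constructor:

#include <unistd.h>

class wat {
 public:
  wat() {
    write(1, "wat\n", 4);
    _exit(0);  // bail out here instead of wandering into uncharted territory
  }
};

static wat wat_;

// still no main, but now it exits cleanly

Same build incantation as the post in question; the segfault is replaced by a clean exit.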

...

Another reader commented on the same post, asking just "The file name is a bit worrying. Is everything alright?". I guess they missed the link to knowyourmeme.com, or are generally unfamiliar with the notion of "things that aren't really trolling but are sort of funny kind of".

When it comes to that kind of stuff, that lizard gets me.

Hhhhhehe.

...

Jeff asks if I'm on Mastodon. I am not using that (or any other aspect of the "Fediverse"). I am also not on Twitter. I used to have a "business" account out there for a while, but never really used it, and deleted it last year when it became apparent where Twitter was headed. I stand by that decision.

I should mention that there are some dubious parts of the whole Fediverse thing, as someone who runs a site that occasionally gets links shared around. Posting a link to one of these things basically summons a giant swarm of (bot) locusts who all want to generate a link preview at the same time. I was going to write a post about this a while back, but it would be short and kind of threadbare, so I'll just mention it here instead.

Now, since all of this /w/ stuff is just a bunch of flat files on disk, all of it gets served out of memory and it's like ho hum. I can't run out of database connections or anything like that. I'm basically limited by the level of bandwidth I pay for on my switch ports. Eventually, they all have what they came for and it stops. But, for a minute or two, it can be interesting to watch.

There's a certain "cadence" to a web server's activity logs as seen in "tail -f". When this happens, it scrolls so fast you can barely read it. You definitely notice.

This is far from a new issue. A report that it can collectively be used "as a DDOS tool" was filed back in August 2017, when it was a far smaller problem.

It should not surprise anyone who's been doing this kind of thing for a while that the bug report in question (which I won't link here, but which people will still look up and try to brigade) has been closed since 2018.

Did anyone ever see the classic Simpsons episode where someone falls down a well, and at the end of the episode they "solve the problem" by sticking a "CAUTION WELL" sign in the ground?

"That should do it." - Groundskeeper Willie

I can hear his voice in my head any time I see a bug like that.

Who needs main() anyway?

Want to inflict terrible things on other programmers who show up later to do maintenance work? Write C++ code that doesn't need a main(). Then write C++ code that doesn't *have* a main(). Yes.

I mentioned this quite a while back, calling it "spooky action at a distance" in code, but looking at that now, it seems like it was a very long and drawn out demonstration.

Instead, tonight, I present a far simpler version. Usual disclaimers apply: may summon Ancient Ones who will haunt your soul. Probably won't work on all systems or compilers. It didn't work for me until I gave it -O2, and even then, it still gives a magnificent segfault.

At any rate, enjoy the evil.

$ cat hhhehehehe.cc
#include <unistd.h>
 
class wat {
 public:
  wat() { write(1, "wat\n", 4); }
};
 
static wat wat_;
 
// no main.  nothing else.
$ g++ -O2 -Wall -nostartfiles -o hhhehehehe hhhehehehe.cc
/usr/bin/ld: warning: cannot find entry symbol _start; [...]
$ ./hhhehehehe 
wat
Segmentation fault
$ 

That's it.

Setting the clock ahead to see what breaks

Given that we're now within 15 years of the signed 32-bit time_t craziness, I decided to start playing around with my own stuff to see how things are doing. I wanted to see what would break and what would work.

One thing I particularly wanted to see was how my smaller systems would work. It's basically a given that my 64 bit Linux boxes are going to be fine since time_t is already wider, and it won't explode in 2038. But that's far from the whole story. 32 bit machines still exist, and are more common than some would think thanks to the existence of things like Raspberry Pis.

Unless you deliberately install the 64-bit flavor of Raspbian, you're going to get a 32-bit system. With the version of glibc it's currently running, you will hit the wall. It's easy enough to try - you'll notice that you can't actually set the clock that far ahead:

root@rpi4b:/tmp# date -s "2038-01-19 03:14:08 UTC"
date: invalid date ‘2038-01-19 03:14:08 UTC’

So, okay, put on your "time to do evil" hat, set it one second earlier, and wait for the fun to happen. Starting from scratch again, it does this:

root@rpi4b:/tmp# systemctl stop chrony
root@rpi4b:/tmp# date -s "2038-01-19 03:14:07 UTC"
Mon 18 Jan 2038 07:14:07 PM PST
root@rpi4b:/tmp# 
Message from syslogd@rpi4b at Jan 18 19:14:07 ...
 systemd[1]: Failed to run main loop: Invalid argument

Broadcast message from systemd-journald@rpi4b (--- XXXX-XX-XX XX:XX:XX):

systemd[1]: Failed to run main loop: Invalid argument


Message from syslogd@rpi4b at Jan 18 19:14:07 ...
 systemd[1]: Freezing execution.

Broadcast message from systemd-journald@rpi4b (--- XXXX-XX-XX XX:XX:XX):

systemd[1]: Freezing execution.

Yee haw! Look at that sucker burn. I particularly dig the XX-XX stuff. It's like a cartoon character who's been knocked out.

Now, before you whip out the pitchforks, keep in mind that systemd is just the messenger here. It's just working with what it's been given.

Also, the system is actually still up here. systemd has just basically checked out and is not going to do much more for you. It's not even going to take an ordinary "reboot" since that's really just a request to init (pid 1, so systemd again) to reboot the box. You're going to need to use "reboot -f" and suffer whatever badness might happen to stuff on the box. It's like pulling the plug, so have fun with that.

What happened? If you dig around in the remains, you will find that an assertion in systemd fired. It's refusing to continue unless clock_gettime() returns 0. Clearly, it returned something else. systemd saw this not-zero value and decided to protect itself by effectively stopping.

So you think "I know, I'll try this again, and strace pid 1 this time, and see what was in fact returned". You get something like this right before it croaks:

clock_gettime64(CLOCK_REALTIME, {tv_sec=2147483648, tv_nsec=898182}) = 0

... what? It returned 0? Yes... and no. Look at it closely.

clock_gettime*64* returned 0. But systemd called clock_gettime. strace is showing you the system call... but that system call happens by way of a C library function which in this case is being provided by glibc 2.31. If you were to open up glibc's source code and go digging around for clock_gettime(), you'd find this:

  ret = __clock_gettime64 (clock_id, &tp64);

  if (ret == 0)
    {
      if (! in_time_t_range (tp64.tv_sec))
        {
          __set_errno (EOVERFLOW);
          return -1;
        }

First, call the (64-bit capable) syscall. Then, assuming that succeeds (and it does, per strace), see if it'll fit in a (32-bit) time_t. It won't, so set errno to EOVERFLOW and return -1.

That's what systemd gets, and so it blows up.

glibc is saying "I can't fit this into that, so I'm failing this call".

This is wrapped in a bunch of preprocessor #if tests such that it only runs when __TIMESIZE isn't set to 64, but guess what? On this particular combination of hardware and software, __TIMESIZE is in fact 32. Grovel around in the headers if you like and follow the bouncing ball starting here:

./arm-linux-gnueabihf/bits/timesize.h:#define __TIMESIZE __WORDSIZE

... or just write something dumb to printf(..., __TIMESIZE) and see.
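
Something dumb like this does it (glibc-specific, naturally):

#include <stdio.h>
#include <time.h>  // any glibc header drags in the bits/ machinery

int main() {
  printf("__TIMESIZE = %d, sizeof(time_t) = %zu\n",
         __TIMESIZE, sizeof(time_t));
  return 0;
}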

To be clear, this is glibc 2.31 on the 32 bit build of Raspbian/Raspberry Pi OS 11 (bullseye) on a Pi 4B. Newer versions of the OS will almost certainly not behave this way, since glibc itself is marching down the road to having 64-bit time even on 32-bit machines. Once that's done and rolled up into a release, expect this to go away.

...

And yes, NetBSD and OpenBSD tore off this band-aid about 10 years ago, and it's a done deal now. I know. Cheers to that.

Feeds, updates, 200s, 304s, and now 429s

In the past, I've written a few complaints about poorly-behaved feed fetchers. It's been a little over a year, and the situation is about the same. There are still a few people out there who think it's cool to poll every minute, or every 2 minutes, or whatever. It's not cool. It's useless. I don't update this thing anywhere near that often, so what's the point of wasting those resources?

There have been some bright spots. At least one person switched on If-Modified-Since headers and even put a little comment in their User-Agent header to let me know about it. That was above and beyond, so thank you to whoever that is.

But, there are still plenty of misbehaving feed readers out there, so it's time to talk about carrots and sticks.

The carrot basically is: if you have a well-behaved feed reader, you will continue to be able to discover a new post on my feed in a reasonable amount of time. This is most people. Most people do it right. Thank you for that.

The stick is: if you do not, you will not. It will take considerably longer to notice something's different out here.

What constitutes a well-behaved feed reader? My primary concern is about not having to serve the full feed to someone who has no reason to pull it again. This means making conditional requests - your client tells my server the last version of things it saw, and my server goes "okay, nothing's different" or (once in a while, after an update) "oh cool, here's the latest".

How do you do this? Ideally, you just run your feed reader and it figures it out. But, trust me, from looking at the feature requests and code bases for far too many of these things this past week, it seems like that's not very common.

This is how the tech part of it works, lest anyone claim it's too hard to implement. My server sends out a number of headers when you fetch the feed. Two of them are potentially applicable here. Right now, they look something like (but not exactly like) this:

Last-Modified: Fri, 06 Jan 2023 00:00:00 GMT
ETag: "xxxxx-yyyyyyyyyyyy"

Well-behaved HTTP clients can store those values when they do a fetch, and then return either or both of them in their subsequent requests. The first one turns into If-Modified-Since, and the other one turns into If-None-Match. Note that the second one actually requires the "" around it or it won't work. (Yeah, I know. Not my doing.)

If-Modified-Since: Fri, 06 Jan 2023 00:00:00 GMT

... and/or...

If-None-Match: "xxxxx-yyyyyyyyyyyy"

Now, your HTTP client software should take this as some kind of argument to some well-defined setting and you should probably not be setting headers directly, but we're still smashing rocks together for a protocol that's 30+ years old. But I digress.

(Side note: this means your feed reader has to maintain some state per feed. You can't just statelessly fetch a URL until the end of time. That's incredibly boneheaded.)

Just take what you got before and hand it back as shown above. If nothing's changed, you'll get a 304 HTTP code back, and that means "nothing new". It's a short, simple transaction, and uses very little in the way of resources.

If the feed has been updated, say, because I wrote a new post, or did an update or typo fix or whatever to an existing one, then you'll automatically get that returned as a 200 along with a new set of headers. It's your feed reader's responsibility to remember one or both of those fields and then use them later on.
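
If your feed reader is built on libcurl, for instance, the If-Modified-Since half of this is already built in. A sketch - the URL and the saved timestamp are placeholders:

#include <curl/curl.h>
#include <stdio.h>

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *c = curl_easy_init();
  if (!c) return 1;

  curl_easy_setopt(c, CURLOPT_URL, "https://example.com/feed/atom.xml");

  // Replay the Last-Modified value saved from a previous fetch as
  // If-Modified-Since (stored here as a unix time).
  curl_easy_setopt(c, CURLOPT_TIMEVALUE, 1672963200L);
  curl_easy_setopt(c, CURLOPT_TIMECONDITION, (long)CURL_TIMECOND_IFMODSINCE);

  if (curl_easy_perform(c) == CURLE_OK) {
    long code = 0;
    curl_easy_getinfo(c, CURLINFO_RESPONSE_CODE, &code);
    printf("HTTP %ld%s\n", code, code == 304 ? " - nothing new" : "");
  }
  curl_easy_cleanup(c);
  curl_global_cleanup();
  return 0;
}

(With CURLOPT_FILETIME switched on, you can read the new Last-Modified back out of a 200 via CURLINFO_FILETIME and stash it for next time.)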

From my point of view, a request with a proper "IMS" or "INM" header is considered a conditional request. I look relatively kindly upon those. Those tend to come from people who want to do the right thing.

A request with neither "IMS" nor "INM" headers is unconditional, and I'm not such a fan of those. I understand that everyone's going to fetch something "fresh" now and then. That's a given. You have to prime the pump somehow. I don't care about that.

But when someone requests the full feed and makes no attempts to conserve, and they do it over and over again, like every 2 seconds? That's when I sit down and start coding. And code I did. That's why I'm writing this post. Poorly-behaved feed readers will no longer get timely updates.

I should note that one particular feed reader sends "Wed, 01 Jan 1800 00:00:00 GMT" and that's utter bullshit. You made that up and you know it. Nobody ever served you a page with that value. See, this is actually known pathological behavior. Sending that does not count as conditional.

Bad clients get a 429. That means slow your roll.

Bonus note for pedants: yes, it's still possible to be abusive with perfectly-formed conditional requests. Please don't try to find out where that point is. Just remember, I don't post that often. You don't need to poll that often.

S p a m m y s y s c a l l s in strace dumps

I was doing some light nerd reading at lunch the other day and ran across someone who had encountered trouble with a program that was using TCP_NODELAY when perhaps it shouldn't. TCP_NODELAY is the socket option that turns off Nagle's algorithm, which exists to batch up a bunch of small writes so you don't spam the network with tons of tiny packets. (If this sounds familiar to long-time readers, it's because it starred in a post that made the rounds in the fall of 2020.)

All of those packets have overhead. It's not quite the same problem that it was when we had 10 megabit shared-media networks with collisions out the wazoo, but it's still not great to just waste bandwidth and CPU time on things that aren't latency-sensitive.

The problem comes when you have a program that has a bunch of stuff to put on the wire, and yet it does it with individual calls to write(). Instead of pushing (say) ~2 KB at the network with a single call, it spins through the buffer, writing each byte individually. Now you have 2000 packets flying around, all with their headers and everything else as overhead. Having the kernel batch this up is basically saving the world from broken code.
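
To make that concrete, here's a tiny sketch of the difference from the writing side (the payload is made up):

#include <string.h>
#include <unistd.h>

int main(void) {
    const char *msg = "{\"id\" : 12345, \"channel\" : 678}\n";

    /* The sad way: one syscall (and potentially one packet) per byte. */
    for (size_t i = 0; i < strlen(msg); i++)
        write(STDOUT_FILENO, &msg[i], 1);

    /* The sane way: one syscall for the whole buffer. */
    write(STDOUT_FILENO, msg, strlen(msg));
    return 0;
}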

I saw this and it reminded me of a similar bit of damage in my own life. I have some projects where I am forced to wrap another program and listen to its stdout. It doesn't have a library form, so the only way to make use of it is to go through this whole rigamarole. I get to create a pipe, then fork and have the child connect stdout to that pipe and exec the program in question. The parent process then sits there listening to the pipe for updates.
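
For anyone who hasn't had the pleasure, the rigamarole looks something like this - a bare-bones sketch, with "chatty-program" standing in for the real thing and most of the error handling you'd actually want left out:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                    /* child: point stdout at the pipe, then exec */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execlp("chatty-program", "chatty-program", (char *) NULL);
        _exit(127);                    /* only reached if the exec failed */
    }

    close(fds[1]);                     /* parent: sit on the read end */
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t) n, stdout);
    close(fds[0]);
    return 0;
}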

I realized that my program (the reader) was waking up FAR too often. I should be getting updates every 30-45 seconds, but it would wake up a couple of thousand times in that interval. WTF? Well, it turns out that for whatever reason, it writes to stdout (more or less) a byte at a time.

Seriously. I had to see this for myself, and attached to it with strace. It pretty much looked like this:

708589 22:46:24.174856 write(1, "\"", 1) = 1 <0.000039>
708589 22:46:24.175018 write(1, "i", 1) = 1 <0.000041>
708589 22:46:24.175187 write(1, "d", 1) = 1 <0.000040>
708589 22:46:24.175339 write(1, "\"", 1) = 1 <0.000041>
708589 22:46:24.175506 write(1, " : ", 3) = 3 <0.000048>
708589 22:46:24.175666 write(1, "12345", 5) = 5 <0.000041>
708589 22:46:24.175814 write(1, ", ", 2) = 2 <0.000041>
708589 22:46:24.175981 write(1, "\"", 1) = 1 <0.000041>
708589 22:46:24.176138 write(1, "c", 1) = 1 <0.000039>
708589 22:46:24.176279 write(1, "h", 1) = 1 <0.000040>
708589 22:46:24.176443 write(1, "a", 1) = 1 <0.000041>
708589 22:46:24.176596 write(1, "n", 1) = 1 <0.000040>
708589 22:46:24.176732 write(1, "n", 1) = 1 <0.000040>
708589 22:46:24.176875 write(1, "e", 1) = 1 <0.000043>
708589 22:46:24.177045 write(1, "l", 1) = 1 <0.000070>
708589 22:46:24.177331 write(1, "\"", 1) = 1 <0.000030>
708589 22:46:24.177454 write(1, " : ", 3) = 3 <0.000029>

That's 17 lines from a much longer log. Those 17 lines alone were displayed in about three milliseconds, and had many many more above and below. Here I was, trying to see what kind of data it was sending to me, and it was spamming me with syscalls.

If you look closely, you can see it's not quite one byte per write() but it's pretty close. Numbers go all at once for whatever reason, and those " : " strings are another curiosity.

I had pretty much forgotten about this until the TCP_NODELAY post crossed my path and reminded me of it. Clearly, short writes are pretty common.

I wonder though, do people not strace programs any more? If this was my project and I was trying to figure something out, all of the vertical scrolling would drive me crazy. When it's spewing out more than my scrollback buffer will let me access, something is wrong! I'd go to lengths to try to batch it up a little.

The one part of this process I control is the reader, so that side has some sanity-enabling hacks added to it. I wait for up to a second between checks, and even then, I call select() with a 250 msec timeout. This gives the syscall-spamming writer program a chance to finish writing to the pipe before I go and read it. This raises the chances that I'll get the whole thing from a single read() call. Their program can spin making thousands of syscalls per event. Mine makes about four: futex, futex, select, read.
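
The reader side of that is roughly this shape (a sketch, not the actual code):

#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to 250 msec for the pipe to have something, then grab it all. */
static ssize_t patient_read(int fd, char *buf, size_t len) {
    fd_set rfds;
    struct timeval tv = { 0, 250000 };    /* 250 msec */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return 0;                         /* timeout (or error): try again later */
    return read(fd, buf, len);
}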

...

Full disclosure: I thought about writing a program to spit the body of this post out as a series of individual bytes handed to write(), and then the post itself would have just been the output from running it in strace. Everyone would have had to read it vertically, and I bet it would have been seriously annoying for basically anyone, except the rare people who have lived through it and would find it oddly hilarious.

But, I'm trying to make a point here, so I decided to make it accessible to people who tend to read things in terms of words and sentences. I left the crazy up in the first two words of the title instead.

Another look at the steps for issuing a cert

Oh boy. Yesterday's post has riled people up. A lot of them don't like what I said about the ACME protocol. I get the impression a few of these people haven't looked at the problem that needs to be solved in a different light.

How about we give that a shot now? Let's just go through the steps for getting a secure web site going, and ignore the specifics of the protocol for the moment.

First, the baseline assumptions: there's a key. There's a certificate signing request which references that key. Then there's the certificate itself with a signature which attaches it to the "web of trust" (ehhh...) that is largely accepted by most clients. Okay?

1. You generate a key. It'll eventually need to be installed on the web server somewhere. It's just a blob of gunk that's been encoded in ASCII with that whole --- BEGIN/END blah blah --- stuff at either end. You should probably generate it with a certain algorithm and with a certain complexity (i.e., number of bits).

2. You generate the CSR. It needs to read from that key in order to pick up its identity, but the CSR itself doesn't contain sensitive key material. The CSR has a bunch of fields that almost nobody uses: country, organization name and unit name, an e-mail address, and so on. Some of them used to mean something in a different age, but you'll probably find that a CA ignores all of them and generates something else entirely in the resulting certificate. Case in point: the cert for my site has CN = rachelbythebay.com and that's it.

3. You tell the CA that you want a cert by sending them that CSR.

4. The CA says "ok, prove you own this domain, punk" and gives you a URL or two that you can populate with some magic string in order to do exactly that. Or, it gives you some DNS entries for that domain that you can create with a magic string - maybe the same string as the URL, maybe not.

5. You stand up the magic URL with the magic data, or drop in a DNS entry with the magic data.

6. You either poke the CA to say "ok, go look", or (much more likely) you just sit there feeling stupid until they happen to retry and notice. They usually don't provide much in the way of being able to trigger this on-demand, so this can add a bunch of delay to the process.

7. The CA eventually sees the right thing in the right place and says "okay, this can't be chance, so they must actually control the document root and/or DNS or whatever", and that's enough for them. They issue a cert for that key and domain name. (In the old days, they'd start looking at DUNS numbers and stuff like that instead as verification.)

8. You check back and notice it is in fact available. It would be nice if the CA poked you somehow, but odds are, they won't, or at least, they won't do it faster than you could check for it. You download the cert, which is yet another --- BEGIN --- ... --- END --- ... blob of gunk.

9. You install the key and cert on the web server and swing the config around to reference it.

10. You kick the web server in whatever fashion it requires to make it start using the new key/cert and stop using the old one - reload, restart, reconfigure, whatever.

11. You load the site up and verify that it actually works, and that it's using the new cert and not the old one.

Now, when you're talking to the CA, it's likely they are going to want you to authenticate yourself. Notice though, at no point did I say that you should now throw down and start doing your own crypto (*cough* JWT) in order to generate some kind of "proof" of some "claim" to convince them that you are in fact who you say you are.

You could just, you know, pass them an opaque token that they issued to you when you logged in... just like people have been doing with the web for years and years. Ever call Stripe from inside a program? You get an API key. How about Wavefront? Another API key. Gandi? API key. It gets set in the headers. Now they know who you are. Done.

When you're working in these terms, it's a matter of building up a request in a format they desire. It'll probably be a POST because sane people don't create things with just a GET. So you fire some form data containing the CSR at them, or maybe you send them a blob of JSON, and stick your auth token in a header. This goes over https to them, and so it's as secure as anything else you'll be doing with this stuff.
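
In libcurl terms, the whole thing is about this exciting (a sketch; the endpoint, token, and JSON field are all made up):

#include <curl/curl.h>

int main(void) {
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs = NULL;
    hdrs = curl_slist_append(hdrs, "Authorization: Bearer MY-OPAQUE-TOKEN");
    hdrs = curl_slist_append(hdrs, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "https://ca.example/v1/certs");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
        "{\"csr\": \"-----BEGIN CERTIFICATE REQUEST-----\\n...\"}");

    CURLcode rc = curl_easy_perform(curl);    /* it's https; that's your transport security */

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return (rc == CURLE_OK) ? 0 : 1;
}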

See how much more accessible that is? See how evil rolling your own crypto can be?

It's evil. Very evil. Don't enable their goofy auth schemes and don't roll your own crypto.

No, really.

How long since the last alg=none JWT vulnerability? It's 14 days as I write this, and this is January 2023. 2023!

BTW, to the person that said these protocols are "... an opportunity to pull in as many 'standards' as one can ..."? You nailed it. That's what I meant when I said "web kool-aid" in the original post.

Why I still have an old-school cert on my https site

People sometimes ask me why I don't use Let's Encrypt, and it's a long story. It has a lot to do with just how damn evil the protocol is. It looks like it was created by people who had been drinking FAR too much of the web kool-aid, since it's chock full of terrible things. It should be a small amount of drama to start a process, receive a magic string, sock it away somewhere at a magic path, then poke the validator and say "go for it". Then you just check back and see whether it worked or not.

Actually doing this with the ACME stuff is terrifying. I first looked at this several years ago, and not only was the protocol bad, but the implementations I checked out were also miserable. One of them had line widths in excess of 200 characters. Many puppies paid the price when that was created, and the fact that nobody else cares just boggles the mind.

But, I kept reading. This thing winds up running openssl (as in, the CLI toolset in /usr/bin or whatever) and then grovels around in the output with regexes. Then it turns it into a jwk, and this is where the "web kool-aid" thing shows up. It's part of the protocol, so it's not like they had any choice, but it's just one more awful thing you now have to worry about supporting.

But somehow this gets turned into JSON, and then that gets a SHA-256 hash, and then the base64 encoding of that turns into a thumbprint? So it's a SHA of a text representation of something that can be reordered or reformatted, and this is supposed to be useful?

Then it runs openssl again to read the CSR, and so on and so forth.

Anyway, that's about where I got with it after first encountering it in 2018, and then again after reconsidering it in 2020 when I found myself with a bunch of extra time on my hands due to the lockdown.

But what about now? This isn't about Let's Encrypt. This is about me finding a supposed alternative and going through the same process of due diligence to understand just what I'd be getting into. Somehow, a couple of weeks ago, I found this other site which claimed to be better than LE and which used relatively simple HTTP requests without a bunch of funny data types.

I went through all of their API docs. You call an endpoint and tell it which domain or domains you want, and feed in the CSR. You also tell it how long it should last: 3 or 12 months, more or less. Once you do this, it tells you what to do in order to perform verification. They do the usual techniques: put this magic blob in DNS, put a magic blob in your document root somewhere, or confirm that you can receive e-mail at a certain address.

Your end goes and sets this up, then calls back and says "okay, I'm ready, go look". This starts the process rolling, and assuming it all checks out, in theory you'll get a certificate issued pretty soon after that point.

Sounds great, right? I thought so, and created an account at this point to take a whack at it. That's when it started asking questions about what kind of account I wanted. Did I want a free account, or to pay this much, or that much, or this other even bigger amount? What? They didn't mention this before.

This is when the fine print finally appeared. This service only lets you mint 90 day certificates on the free tier. Also, you can only do three of them. Then you're done. 270 days for one domain or 3 domains for 90 days, and then you're screwed. Isn't that great?

Oh, and finally? You can't do the less-insane API on the free tier. Yep. You have to pay up for that. Gotcha, sucker!

I immediately deleted my account and marked the experience as one worthy of warning others.

At least now I can point at this post when people ask why I'm still using an old-school certificate on my site. It's deliberate.


January 4, 2023: This post has an update.

Not quite a successful prediction about tracking Apple stuff

Just a hair over 10 years ago, I wrote a post lamenting the fact that my WiFi-only original iPad (which was new at the time) was probably a mistake. After all, if I took it outside my house and away from the one wireless network it knew, it was now cut off from the Internet. If someone stole it or if I left it behind somewhere, there would be no way to track it down or wipe it.

Well, times change, and now that's no longer a concern. Pretty much anything that Apple sells nowadays that is intended to be portable is also trackable, even if it's "off" (whatever that even means now). Obviously I'm talking about AirTags, but the phones, watches, iPads, and yes, even laptops have the ability to be found in the same way if they are new enough.

When I thought about this at the time, I figured maybe they'd do some kind of one-way wifi and satellite-based "push" system where you could send "kill signals" for stolen devices. Yeah, that was a pretty bad call. It didn't end up even close to that. Instead, now just about every other Apple device in the vicinity has the ability to hear a beacon from a missing device and report where it was found.

That plus the whole activation lock thing where a device won't let you use it unless the previous owner releases it from their iCloud account hopefully puts a pretty big dent in the utility of stealing these things. Even with that, you still hear about people knocking over Apple stores and stealing armloads of devices for some reason. It must be for the non-serialized parts, since the rest of them will surely tattle if they are ever connected to the "mothership" again.

Looking at it another way, I no longer worry about this kind of thing. It's as if they saw something that might keep people from wanting to buy a certain class of device and then did something about it. How about that?

Unintentionally BREAKing a serial console

I heard about a neat bug once that was caused by the interaction of some hardware that was missing some electronics and some software which was just doing what it was told. It had to do with the "access of last resort" you'd use on a machine that was otherwise dead to the world: the console.

Imagine a datacenter with tens of thousands of Linux boxes running. Sometimes, they break and fall off the network. Fortunately, they have a "mini-me" type thing attached which then allows you access to a serial console. It's not quite the same as being there with a monitor and keyboard plugged into the box, but it's frequently enough to dig out of a real mess without getting in a car (or worse).

It seemed that people had been trying to fire up the console on their systems and weren't getting the expected results. What's supposed to happen is that they connect, hit ENTER once or twice, and it should pop up something like this in reply:

Linux x.y.z (something-arcane.foo.bar.company.example)

login:

They'd hit ENTER and at best, nothing would happen. Sometimes, it would just be a jumbled mess. Obviously, if the machine was unreachable over the network, we couldn't dig into it, so it took a bit to find one that had a broken console but which was still available over the network.

What we found was interesting. The thing that actually puts up that login prompt is a process called getty (or some variant, like "agetty"). Its job is just to sit there and handle the serial line, read your login name, and get you connected to a login process to carry on from there.

For this to work, agetty and the serial port on the host have to agree with the serial port on the client in terms of baud rates (and other things too, but let's keep this story simple). If you get one out of sync, the other end will have no idea what you're talking about.

If you've never crossed paths with this before, imagine you're a dog that can only hear whistles at a specific set of pitches: one high, one low. Someone who uses the wrong set of frequencies won't make much sense to you. Baud rates are a little like that.

Somehow, the getty on these machines had gotten into a state where it wasn't running at the same baud rate as the actual system which was providing remote access to the serial console. We knew there was a feature in getty that would look for a "serial break" (imagine a really long low whistle in the dog analogy) and it would cause it to rotate through a list of baud rates.

This feature was probably intended to avoid a chicken-and-egg situation where you plug a terminal into a serial port on some Unix box and can't talk to it because it's at some rate that you can't change to. So, you keep poking it with BREAKs until it comes around to something that you can reach, and then you proceed from there.

We didn't have the ability to jam a BREAK down the line from the remote console client system, so what gives? These were two subsystems that were part of a much larger rackmounted beast, so it's not like there was an old-school serial cable running between them. They were probably just traces on a board somewhere. Something didn't add up.

This is when I heard some really neat troubleshooting from someone who actually understood this stuff (i.e., not me): they knew that on other hardware, they had installed buffers between the two systems to keep the electrical low state at boot time from triggering the BREAK behavior.

Unfortunately, they hadn't done this same thing on this particular type of hardware. It was missing the necessary electronics magic (they called it a "pullup") to keep things from getting out of hand when the controller restarted. Oops.

Their solution was to disable that behavior in their getty config. Since the server was hardwired to the only client it would ever have, there was no reason for it to honor a BREAK to do that sort of thing.

That was it. The machines probably still have the same electrical situation to this day and send all kinds of wild crap down the line when their controllers reboot, but at least their gettys won't care.
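
If you've never seen the knob in question: the rotate-on-BREAK behavior comes from handing getty a list of baud rates. An old inittab-style line like this (the rates, port, and terminal type are just examples) will cycle through the list on BREAK, while listing exactly one rate gives it nowhere to go:

S0:2345:respawn:/sbin/agetty -L ttyS0 115200,38400,9600 vt102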

If you're someone who's never done serial stuff on a vaguely Unixy box and you're bored over the holidays, maybe this is your time to check it out. Find a box with a serial port (good luck!), plop a getty on it, then wire it up to another box with a serial port (more luck to you!) and see if you can get them talking to each other.

Failing that, check out the magic of someone who already did that and then some. Enjoy!

WPA3: no go on Raspberry Pi (plus some Mac gotchas)

If you've been doing the wifi thing for a while, you've probably followed the successive rounds of "security" that get layered on top. Back more than 20 years now, it was WEP, the so-called "wired equivalent privacy". That claimed to be 64 or 128 bits, but was closer to 40 or 104 due to the whole 24-bit "IV" thing, and a whole bunch of dumb problems with the crypto generally meant it was weaker than that in practice. Collect enough packets and burn some CPU power and the network is yours.

Then we got WPA, and then WPA2, and now the new hotness is WPA3. You might have noticed this last one in the settings of your newer network stuff and thought "hey, maybe I can benefit from it". Maybe you can, but a lot of it comes down to just how much you're willing to abandon.

Perhaps you have a thing for Raspberry Pis. They gained the ability to do 2.4 GHz wifi natively when the 3B came out, and picked up the 5 GHz band with the 3B+, so now you can have reasonable connectivity anywhere you can find power. The trouble is that the stock hardware and software absolutely will not do a true WPA3 network.

By "true WPA3", I mean a network that's only speaking WPA3 in SAE mode, which requires protected management frames (802.11w), and which does not support any kind of WPA2 fallback. This is a network that you can scan with something like Kismet and it'll say "WPA3-SAE" and nothing else. A stock RPi will absolutely fail to connect to them. This has been known for years and yet still persists.

If you spend far too much time digging around through the bug reports and forum posts, you may discover the angle of starting from the Linux kernel source, applying Infineon patches, fixing compilation errors, and then installing new Cypress firmware as well. Assuming you're willing to go through all of that, then yes, you may find yourself able to join it up to a WPA3-only network.

Wonderful. You now get to track this abomination of a kernel yourself, since you'll now be off whatever upstream decides to push out - security fixes, bug fixes, new features, or whatever else. Have fun!

In the Apple ecosystem, things are a little better. Support is pretty good for such things, and you should find that any Mac or iPhone made in the past few years should work just fine with a WPA3-only network. Even a first generation HomePod can handle it.

But, there's a catch, at least on the Macs. This assumes you are running in normal mode, i.e., booting from your SSD or whatever and running macOS in the usual way. If something happens to your machine and you need network recovery mode, it'll just fail to associate with the WPA3-only wireless network.

At that point, you'd better hope you have another network around that still has WPA2 mode available. Otherwise, you're kind of stuck. These (laptop) machines haven't had built-in Ethernet ports for many many years so that's not an easy option, either.

I should point out that if you get the bright idea to plug your ailing Mac into a Thunderbolt 3 dock with an Ethernet port with the intent of having it "phone home" for recovery mode that way, you will find that it does not work. It seems that whatever drivers are necessary to notice and/or use that NIC just don't exist in that world, just like how WPA3 support is also somehow missing.

If you have an old Apple Thunderbolt Ethernet adapter for some reason, and also have the requisite USB-C TB3 to mini-DP type TB2 dongle, then you just got lucky. That much will actually be recognized in recovery mode, and you can bootstrap into network recovery mode without standing up a WPA2 network.

Some day, these things will be fixed and this whole post will be a sour footnote in history, but for the moment I figured I'd warn people before they blew too much time trying to make this stuff work.

Systems design and being bitten by edge-triggering

Let's try a thought experiment: we're going to design a little program that provides a service on a vaguely Unix-flavored box. It's designed to periodically source information over the Internet from hosts that may be close or far away, and then it keeps a local copy for itself and others to use.

You might have it use some kind of config file where it is told the hostnames of the servers it's going to access. Maybe you've set up a pool, such that any given attempt at resolving foo.service.example yields a different IP address every time, and there are bunches of them.

server 0.foo.service.example
server 1.foo.service.example
server 2.foo.service.example
server 3.foo.service.example

When would you make it resolve the host down to an IP address? It seems like you might want it to happen when your program starts up. Given the above config, it would find four entries, would turn that into four IP addresses, and then would get busy trying to sync data from them.

But, I haven't told you the whole story. What if you designed your program in a day and age where the network was just assumed to "always be there"? There was no such thing as consumer-grade Internet and home connections. You'd probably write it to do the name-to-IP resolution stuff once and then never again.

Consider what happens when a system with that design runs into the reality of running on goofy consumer-grade hardware with goofy consumer-grade Internet connections, raw crappy power from the local utility, and all of the other entropy sources you can think of. It's probably not going to behave well.

Such a system would start when the machine started and would attempt to get its IP addresses. Then it would take the success or failure and would use whatever it happened to get. If it got nothing, then that's it. It would just sit there staring at its own shoes for eternity, or at least until the next wonky utility power situation restarted the cycle.

This is what happens when you run ntpd on a dumb little consumer "router" for home Internet connections. Chances are good that the router box and the cable modem, DSL bridge (or whatever else) will both restart at the same time. It's also a good bet that the router might manage to boot and start ntpd before the actual Internet connection comes up.

That means ntpd will find itself on a network with no routing to the outside world, and then it will try to resolve things and will fail. Then it will just sit there being useless until something or someone comes along and kicks it.

This happens on Unifi gateway devices, and it will bite you *right now* if the order of things happens to line up as described above.

So, if you find yourself with a machine that's attempting to run, say, systemd-timesyncd against a local USG or something like that and it's not syncing, you probably fell into this trap. Nothing in ntpd is going to wake it up and try to rectify the situation.

The Unifi + ntpd situation is effectively edge-triggered: the "rising edge" of the box starting up sends it off to do a bunch of setup stuff. If it works, you're good, but if it fails, you're screwed.

Let's try a different approach, then. You are a server. Your job is to talk to other servers periodically. You have been given some config directives to help you find them. Until you have "enough" servers to talk to, you keep trying to add more. This means attempting DNS resolution, and then if that succeeds, trying to talk to them and see if they are sane. If they are, then you keep them around and potentially use them as a source of data. If they aren't, you evict them and start the process over to get another one.

This situation is more of a level-triggered one. The system in question is going to keep trying to get to where it needs to be. It's able to start up in a broken environment and then eventually recover once the rest of the world starts doing its job again. It won't just go on vacation because everyone else hasn't shown up for work yet. Now, obviously it needs a little care, because retrying in a tight loop is also bad. There's an art to doing retries (backoff, jitter, that sort of thing), and that needs to be part of the design, too.
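
A minimal sketch of that level-triggered shape, backoff and jitter included (the hostname comes from the example config above; a real daemon would do this per server and keep watching each one's health afterward):

#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_DGRAM;       /* NTP is UDP port 123 */

    unsigned delay = 1;                   /* seconds */
    while (getaddrinfo("0.foo.service.example", "123", &hints, &res) != 0) {
        sleep(delay + (rand() % (delay + 1)));   /* backoff plus jitter */
        if (delay < 64) delay *= 2;              /* capped exponential backoff */
    }

    /* ... talk to the server; if it goes sour later, evict it and re-resolve ... */
    freeaddrinfo(res);
    return 0;
}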

It's a big difference in how things work, and once you start thinking about systems this way, you'll start noticing all of the little race conditions and timing anomalies which trip up edge-triggered stuff in everyday life. Any time you've had to reset something in "the right order" or otherwise run something back through a series of other states in order to make it all "sync up", you probably were fighting with that.

Isn't it nice when systems know what they're supposed to be doing, and then keep working towards it until they succeed?

TL;DR use chrony.

Run it XOR use it, part two

If you read back through some of my posts from 2021, you might discover something which basically says "run an IRC network or get involved with the chatting on it, but try not to do both". This was a reflection on my own youthful stupidity, and a plea to others to not make the same mistakes (as many of my posts tend to be).

Imagine my surprise when I got to thinking about some of the recent bits of the tech / politics / malignant narcissist news cycle of late, and realized that it could apply there too.

"Hey (owner), what about X?"

"Raar!"

[keyboard furiously clicking in the background]

*service unavailable*

Sound familiar?

44 billion dollar tech companies: run them or use them but not both.

A reader asks how to avoid working for evil

This one came in as a request from a reader. They want to know my feelings about trying to "... avoid a company contributing to the downfall of humanity". This one's tough, particularly given my own history.

I worked for a web hosting company that had a dubious history of keeping spammers around far too long. Then while I was there, they had the so-called "adware" vendor. They got mad if you called it "spyware". I honestly thought it was random trash that people were installing on their own machines, and so they were getting what they'd asked for. I only found out recently that it apparently was distributed by way of Internet Exploder drive-by ActiveX/whatever shenanigans. So, if you ran that cursed browser and landed on a page with their stuff in it, you got owned.

Now, that customer didn't last forever. They got whacked by AUP after a bit, but they were still there for a good... six months or so? And we definitely got bonuses in our paychecks when they upgraded their configs because we had managed to solve a bunch of their scaling problems. Yes, we made them more efficient, and *they got bigger* as a result, and those of us on the support teams directly benefited in a paycheck or two.

Then I worked for a place that was doing web search and had gotten into the business of providing free web-based e-mail that was pretty good. They had also started doing a few other things. They had a few simple *well delineated* ads on their result pages (and maybe a few other places), and that was it. Lots of people were like "you should go work there", so I tried it, and somehow I got in.

During my tenure there, they went and ate a company that I had a real beef with as a spam-fighting sysadmin for a bunch of users before the web hosting job. I'm convinced it's actually karma: eight years before, I had dinner with some people, including someone I had never met before. When I found out where he worked, I asked him something like "what's it like working for an evil company like Doubleclick". Yeah, I actually said that. *facepalm*

When the legalities of the merger were finished in 2008, I too worked for that evil company by extension. By absorbing it instead of killing it, we became them (see also: Collabra). The name was different, but the internal damage was done. This led to all kinds of other crazy shit that came down the line, all in the name of fellating the advertisers, like Emerald Sea, aka Google Plus. That whole thing.

They were trying to do all kinds of crazy stuff, like you'd be browsing around and it'd say "hey, this looks like your Twitter page, so would you like to link it to your profile?" - and it's like holy crap, the company has crossed the line, then dug it up and set the pit on fire. Just because you CAN make a dossier on someone with your damn crawling infra doesn't mean you DO IT. That's where they were going. Full on creeper land, with the immense power of their infrastructure.

Then I decided to go somewhere else that (as far as I could tell) existed because people willingly put their data there. They uploaded pics and posted about going places and doing things. All of the data was sent to the site. The site didn't go out and scrape it off the web. I was okay with this. I didn't use the site myself, but I figured that made me the weirdo, not the (then) billion-something people who did. Clearly, they find it useful, so what do I care?

Of course, while I toiled in the infra mines at this company, all kinds of truly evil shit was going on, including the installation of a fascist regime in my country, the apparent genocide in at least one other country, and so on. It's like, someone even asked me about supporting the not-quite-UTF-8 language stuff they used in that country. Now I wonder exactly what all was enabled by virtue of being able to support that encoding! (Seriously, you know who you are. Is that what happened? Did that work let the bad people break loose out there?)

Then there's the joint which tried to look like they were all about smarter use of cars, but which probably added to overall congestion. They didn't want the key people who actually do the real work to be employees and went to the mat with heavy lobbying to make it happen during an election cycle. They also pulled out of a good-sized urban area in a very large state when the city put up requirements for background checks.

This is just the obvious stuff. I haven't even mentioned any of the "how they treat their employees" incidents from these places. Every company has at least a couple of these that I've actually witnessed, and far more that I heard about from trustworthy sources.

Sometimes I think about the fact that I've made some bad things more reliable so they can go about doing evil more efficiently, quickly, or just at all. It sucks.

I said this in 2013: "If your resources or reputation could be used to harm people, you owe it to them to jealously guard it lest it fall into the wrong hands." I still think this has happened too many times.

However, I no longer think that people are capable of guarding it to keep the vampires out. The only way to keep something with great power from being exploited might be to keep it from existing in the first place.

But what do I know, right?

Short stories from outage reports

Not all of my stories are long. I have plenty of shorter ones.

...

Did you hear the one about the company that posted a bunch of videos to advertise something, and the backend screwed it up? Somehow, one of them got mixed up with someone else's video upload.

They were trying to advertise a sports league. They got something else entirely.

"... the video was switched to a dolphin swimming in a pool."

But it gets better: the customer asked the response team to not fix it, and instead to leave it alone because "engagement with their [ad] was going up".

This sounds hilarious, but what if it hadn't been another public video? Then it would have been a privacy disaster.

Crosslinked videos? That's bad news.

...

Or, how about the time a rare lightning storm rolled through, and a bolt hit some of the nearby electric infrastructure? It knocked half the campus offline, and everyone was forced to find other places to be, including the chefs who moved dinner to the one cafe building which still had power.

Someone put "zeus" in the root cause.

This one is actually funnier if you worked there, because there was a real service called zeus, and it caused all kinds of outages once upon a time.

...

Backhoes love to find fiber optic cables. Sysadmins know that the best thing to bring with you on a trip into the wild (in which you may get stranded somewhere) is a length of fiber. That way, you can bury it, and when the backhoe arrives to dig it up, you can hitch a ride back to civilization with the operator.

One fine day, a backhoe found a nice fat fiber run somewhere in the world. The updates from the scene were not encouraging.

"Cable is thirteen feet down and beside a creek. Water keeps filling up the space. Working to find a shallower access point. In the mean time, a larger backhoe has been requested. ETA 30 minutes."

You know the line from Jaws? We're gonna need a bigger boat? They're gonna need a bigger backhoe.

Twenty five thousand dollars of funny money

I used to work at a place that sold ads. One of the things this company wanted was for the employees to try it out and see what it was like to actually use the ads product themselves. It's the usual "dogfooding" thing you hear about sometimes.

To that end, they issued a $250 credit every month. You just had to go to a certain internal web page and click a button, and it would credit it to your account. Every time the calendar rolled over to a new month, you could go click it again.

They told us all about this during our first day or two of classes - the infernally-named "onboarding". I noticed something during this: our presenter hadn't claimed their credit yet, so they went and did it for real right in front of us. They went to load up the page and it bombed - something in the code blew up and it didn't work. They reloaded it and then it worked, and they now had $250 of virtual ad money in their account.

Some weeks later, a new month started and I wanted to get in there and give it a shot. I went to start it up, and it blew up, just like what happened in my class. But hey, this time I had a computer of my own, and access to the source code, and even a tiny bit of experience poking at frontend stuff courtesy of some of the introductory tasks they assigned to new employees. Why not take a whack at it? This place is supposed to be all about fixing random stuff even if it's "not yours" - the "nothing is someone else's problem" posters all over the place implied it, at least.

I loaded it up on my dev environment and got cracking. Sure enough, something was wrong with it, and the first time through, it would blow up. It was something dumb like the code was throwing an exception but the exception handling path was making the wrong sort of log call so that would then blow up the whole request. I fixed the logging so we'd actually get to see what the exception was, and that'd give us a chance to fix any real problems. Simple enough, right? I sent the change to the last person to touch the code... who had just touched it that morning, oddly enough. They thanked me and it was applied.

Then I tried to get my credit, and this time it blew up again, but now it logged what was wrong. I could see this on the dev environment. It was something about calling some function with the wrong number of parameters.

The code itself did something like this:

if (condition) old_func(a, b, c, d, e); else new_func(a, b, c, d, e);

The problem is that new_func didn't take 5 arguments. It took 4. I read through the code and found that it didn't need a "d" argument any more, and so I just changed the arg list to (a, b, c, e). I figured it was a simple oversight by the person who had just changed it.

Then I ran it for myself, clicked the button, got the "your credit is now in your account" message, and was pleased. I asked a friend to try it too and it worked for them as well.

It turned out this very if-then-else part was what had been added that morning, and so I again sent that person the code for review, and they again thanked me and accepted it. I went off to go do other not-frontendy things, and the code went out to the internal web servers a little while later.

A few hours later, someone reached out online: we have to turn off the ads credit thing. It's giving away WAY too much money. How much? Twenty-five thousand dollars. $25,000. Not $250.

What happened? The thing had been passing the credit amount as pennies to "old_func", so it was passing in 25000, because 25000 pennies is in fact 250 dollars. But... new_func took dollars, not pennies. So, 25000 in that context was 25 thousand dollars!

I had been at the company something like six weeks and had changed a line of source code to fix a bug (logging), to uncover another bug (wrong argument count), to enable yet another bug (wrong units, and zero type safety) that gave 25 grand worth of funny money to anyone who clicked! And I had clicked! And I got a friend to click! And other people got it too!

What happened? They just turned off the feature until they could fix it. Those of us who had way too much credit in our accounts turned off our ads so as not to actually consume any of the "bad money", and kept them off until they reversed it out of our accounts. Then we were clear to go back to dogfooding.

And no, nobody was fired for this.

This is yet another reason why I say bare numbers can be poison in a sufficiently complicated system. If that function had demanded a type called "dollars" and the caller had another one called "pennies", it simply would not have passed the type checker/compiler. But, this was before those days, so it sailed right through.
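
Even in plain C, a couple of single-member structs are enough to make the units part of the signature. A sketch (the function names here are made up):

struct pennies { long v; };
struct dollars { long v; };

static struct dollars to_dollars(struct pennies p) {
    struct dollars d = { p.v / 100 };
    return d;
}

static void grant_credit(struct dollars amount) { (void) amount; /* ... */ }

int main(void) {
    struct pennies credit = { 25000 };
    /* grant_credit(credit); <- would be a compile error, which is the point */
    grant_credit(to_dollars(credit));    /* 250 dollars, as intended */
    return 0;
}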

The night of 1000 alerts (but only on the Linux boxes)

Here's another story from way back that somehow hasn't been told yet. It's from fairly early in my days of working tech support for a web hosting company. I had been there less than two months and one night, things got pretty interesting.

It was around midnight, and our monitoring system went crazy. It popped up hundreds of alerts. We noticed and started looking into it. The boxes were all answering pings, but stuff like ssh was really slow. FTP and SMTP and similar things would connect but wouldn't yield a banner.

Someone realized they were all in one datacenter, so we called up networking to see what was up. They said none of their alarms had gone off. So, uh, great.

Somehow during this, the question of DNS was raised. One customer's box was grabbed randomly, and my usual "run w after login" was showing the IP address of the support network's external (NAT) interface instead of the usual nat-vlanXYZ.company.domain thing (which comes from a PTR). That was weird, and it was an early hint of what was wrong. Running DNS queries from there also failed - "host" this, "dig" that. Even forcing the queries to the two recursive nameservers that customers were supposed to use from that datacenter didn't work.

Next, it was time to see what this had to do with the daemons. With tcpdump running, I'd poke port 25 and watch as sendmail (or whatever) lobbed a bunch of queries to the usual DNS resolvers but got no reply. This would go on for a while, and if you waited long enough, it would eventually stop attempting those lookups, and you'd finally get the usual SMTP banner. The same applied for other services which worked more or less like that.

This made me suspect those resolver boxes, and sure enough, they couldn't be reached with traceroute or ping or really anything else for that matter. Our manager called down to someone again and told them what was happening, but somehow they didn't get on it right away.

Some time passed, and the customers started noticing - the phone calls started and lots of tickets were being created. Eventually, someone who ran the internal infrastructure responded and things started clearing out.

This was my first time in that sort of situation, and I regret to say that I participated in a "maybe it was a transient error, since things seem fine again now" response storm. I mean, it *was* transient in that it happened and then it stopped, and things DID seem fine again then, but it just feels so wrong and dishonest.

One interesting customer during all of this had the whole thing figured out while it was still going on. They managed to do the same thing we had, and noticed that both of the recursive nameserver boxes for that datacenter were toast. I'm sorry to say they got the same form-letter response.

What's even more amazing is that this customer came back with "hey cool, it's up now, just wanted to mention it in case it helped out", and was mostly shocked by the speed of our response. I guess we "got away with it" in that sense.

I found out much later that there were just two physical boxes for that whole datacenter, and apparently they had no monitoring. Nice, right?

Now, telling the customers that? That would have been epic, and it would have started a conflagration of credit memos the likes of which hadn't been seen since the time our friend "Z" "slipped" with the cutters while cleaning up the racks.

I was new on the job. I went with it. Later on, I would try to find ways to convey information without resorting to such slimy tactics. Mostly, I tried to get out of tech support, and eventually succeeded.

So, to that specific customer out there from 18 years ago: you were right. To all of those technical customers out there who think they're being told a line, this is to say that sometimes you are in fact being told a line. They're afraid of sharing the truth.

If you're in a spot where you can tell the truth about what happened and not get in trouble for that, even to a customer, consider yourself lucky.

...

Side note: the reason the alerting blew up was that the poller would only wait so long for the daemons to send their usual banner. The daemons, meanwhile, were waiting on their DNS resolution attempts to fail. The monitoring system's poller timeout was shorter than the DNS timeout, so when DNS went down, everything went into an alert status.

While this was going on, the Windows techs (as in, the people who supported customers running Windows boxes) were giving us grief because "only the Linux boxes are showing up with alerts". Apparently the Windows machines didn't tend to do blocking DNS calls in line with the banner generation, or timed things out sooner, or who knows what, but it allowed their monitoring polls to succeed.

"Windows boxes are fine... seems like Linux can't get the job done!"

They were mostly joking about it (as we tended to do with each other), but it was an interesting difference at the time.

Reliability stuff worth reading

Just a short note: I'm not Mosquito Capital, and I'm not sure who it is. I can make some educated guesses based on the lingo, and I'm pretty sure I worked with them in the past.

It doesn't really matter who they are though, since what they are saying is spot-on and you should pay attention if you're in this kind of business.

Go read. It's good. Be sure to hit "more replies" when you get to #29.

Missing the point completely

[Image: presentation slide with "clownpenis.fart" on the screen]

Today, I'll share a story about someone from marketing who was trying to make things happen and really stepped in it. It goes like this: a bunch of us engineer types are invited to a meeting with someone from marketing. We don't know the first thing about marketing in general, or even the particulars of it at this company. But they want to chat and hey, it's an excuse to not "work", so why not go see what they want?

We'll call this person A. A gets up there and it's pretty clear they're the sort of loose, goofy and hopefully funny... almost hippie type... you've heard about. It makes sense. You need to be creative for that job, and they're the creative type. So far, so good.

A proceeds to tell a story about this big "tent-pole" movie that was coming out. All of the nerds were going to want to see this latest movie in a series that went back a couple of decades. Plus, it'd successfully slopped over into non-nerd life, such that a bunch of regular people would also be excited about it coming out.

Marketing saw this coming and proceeded to reach out to the company that worked on the movies. They wanted to do a partnership where the company's product would change for the weekend of the movie's opening. Instead of having the usual little icons for whatever it is they did (dog walking, pizza delivery, you know the type of company), there'd be icons of characters and certain well-known and well-loved vehicles from the franchise zipping around.

All they had to do was get it written and out the door in advance of the movie. The marketing folks sat down with the engineering peeps and laid it out. They were told right then and there "no can do". Even though they had all kinds of time in which to do this kind of "theming" of the app, they knew they couldn't do it. That's how bad it was at this company.

Anyway, A told us this story, and used it as a basis for asking if anyone could help solve the fundamental disconnect that kept the company from doing cool things. It was basically a "this is why we can't have nice things" story, but at the same time it was a cry for help. Since they were at the mercy of the software people who had to add the actual icons and toggles and features into the app (and backend stuff), they were stuck.

Towards the end, A told the audience a story about what happens if you have a product that might be good in theory but has terrible marketing. During that part of the talk, they put up a slide. You might've been wondering what the hell was with the image I put at the top of this post. Well, that's what was on the slide.

For people who are using text to speech or otherwise can't see the image, I'll describe it here: it's a bog-standard conference room with a projector screen, and the only thing on the screen is "clownpenis.fart" - like a URL, only not. There are also some white blocks where I cut out a few people who would otherwise be identifiable (including the speaker).

When I saw this, I didn't get it. My own reaction was along the lines of "heh" from my inner 12 year old and also "what?" and "I guess I'm missing the reference". Someone quickly informed me that it was a callback to a Saturday Night Live skit that aired around 2000. I had stopped watching SNL by then, and so that's why it was off my radar. Easy enough.

I had to go back and watch it - you can still find it online if you're interested. It's about an investment firm that is solid and has been around forever, but is only now (2000, remember) "getting online" - establishing a "web presence", if you will. They waited too long and so got the "last domain name available" - the aforementioned clownpenis.fart. This ran on national TV, albeit late on a Saturday night some 22 years ago. (And yes, that's the voice of Jerry from Rick and Morty.)

Basically, the presenter was trying to get people to connect the dots by using something funny that they had seen in the past. It didn't work. Oh, did it ever not work.

For the next month, all you could hear about in the company was that A did the wrong thing, and shouldn't have said that, and they should apologize, and what are we going to do about this sort of communications issue, and so on.

Personally, I filed it under "stupid", not "hostile", but that was me at that point in time. I had plenty of "buffer space" in my life.

What's kind of amazing is that nobody ever mentioned the "we can't have nice things because eng is fundamentally unable to do the simplest, stupidest adjustments to our 'chrome' which would surprise and delight our customers". That went COMPLETELY unnoticed and was largely forgotten.

The company continued to be unable to have nice things.
