Wikidata Query Service graph database reload at home, 2025 edition

By: Adam Baso

This post is about importing Wikidata into the graph database technology used for hosting the Wikidata Query Service (WDQS). It includes details on how you can perform your own full Wikidata import into Blazegraph in about a week on a nice desktop computer, which turned out to be one of the nicer takeaways from the analysis.

System utilization around file 20 of 2583 of Wikidata import to local WDQS

Graph databases and Wikidata

Graph databases are a useful technology for data mining relationships between all kinds of things and for enriching knowledge seeking via retrieval-augmented generation (“RAG”) and other AI. Within the Wikimedia content universe, we have a powerful graph database offering called the Wikidata Query Service (“WDQS”) which is based on a mid-2010s technology called Blazegraph.

Wikidata community members model topics you might find on Wikipedia, and this modeling makes it possible to answer all kinds of questions after importing Wikidata’s data into WDQS. Our colleague Trey wrote a nice post describing WDQS that you should check out.

The Wikidata and WDQS architecture spans a number of components and technologies.

Wikidata and Wikidata Query Service high level diagram as it pertains to data flows that may be involved in consumption

Big data growing pains

As Wikidata has grown, the WDQS graph database has become pretty big: about 16.6 billion records (known as triples) as of this writing, with many intricate relationships between those records that ultimately result in large and complex data structures on disk and in memory. Unfortunately, the WDQS graph database has also become unstable as a result, and this seems to be getting worse as the database gets larger. The last time a data corruption occurred, it rippled through the infrastructure, and it took about 60 days to reload the graph database to a healthy state across all WDQS servers (part of this had to do with repeated failed imports; hopefully the techniques in this post are instructive to others encountering failed imports).

The long recovery time was a prompt to further enhance the data reload mechanisms and to figure out a way to manage the growth in data volume. Over the course of the last year, the Search Platform Team, which is part of the Data Platform Engineering unit at the Wikimedia Foundation, worked on a project to improve things.

As part of its goal setting, the team determined it should make it possible to support more graph database growth (up to 20 billion rows in total) while being able to recover more reliably and more quickly in the event of a database corruption (within 10 days). The idea is that complex queries are useful – WDQS is one of the most important tools in the Wikidata system – but only if the system is up!

In order to support more database growth, it was pretty clear that either the backend graph database would need to be completely replaced or it would be necessary to split the graph database to buy some time, as the clock had run out on the graph database being stable. A full backend graph database replacement is necessary, but this is a rather complex undertaking and would push timelines out considerably; the replacement is an area for further analysis.

A stopgap solution seemed best. So, the team pursued the approach of splitting the graph database from one monolithic database into separate databases partitioned by two coarse grained knowledge domains: (1) scholarly article entities and (2) everything else. As fate would have it, these two knowledge domains are roughly equivalent in size.

After many changes to data ingestion, streaming, network topology, and server automation, the migration of the WDQS servers to the split-based architecture is happening now, in the spring of 2025 (around the time of this blog post, coincidentally).

While working through the split of the graph database, initial testing suggested that it should be possible to reload a graph of 10 billion rows within ten days, and that reloads for both knowledge domains could run in parallel (thus allowing for 20 billion rows in total). Still, there wasn't a lot of room for error. What happens if a graph database corruption happens right when the weekend starts? What if some other sort of server maintenance blocks the start of a reload for a day or two? We wanted to be certain that we could reload and still have some breathing room to stay within 10 days.

Cumulative Wikidata dump import time using the approach detailed in this post on publicly available dump data

Hardware to the rescue?

From previous investigations it seemed that more powerful servers could speed up data reloads. Obvious, right?

Well, yes and no. It’s a little more complicated. People have tried.

The legacy Blazegraph wiki has some nice guidance on Blazegraph performance optimization, I/O optimization, and query optimization (some of this knowledge is evident in a configuration ticket from as early as 2015 involving one of the original maintainers of Blazegraph). Some of it is still useful and seems to apply, although some changes backported in the JDK plus the sheer scale of Wikidata make some of the settings harder to reason about in practice. Data reloads have become so big and time consuming (think on the order of weeks, not hours) that it is impractical (and expensive) to profile every permutation of hardware configuration and Blazegraph, Java, and operating system configuration.

That said, after noticing that a personal gaming-class machine I bought in 2018 for a machine learning workflow (cross-compiling and applying transfer learning, ultimately for an offline Raspberry Pi application) could do much faster WDQS imports than we were seeing on our data center servers, I wanted to understand whether advances in CPU, memory, and disk in the wild might point the way to even faster data reloads, and whether any software configuration variables could yield bigger performance gains.

Wikidata dump segment import times using the approach detailed in this post

This was explored in T359062, where you’ll find an analysis and running log of import performance on various AWS EC2 configurations, a MacBook Pro (2019 Intel-based), my desktop (2018 Intel-based), our bare metal data center servers, and Amazon Neptune. The takeaways from that analysis were that:

  • Cloud virtual machines are sufficiently fast for running imports. They may be an option in a pinch.
  • Removal of CPU governor limits on data center class bare metal servers significantly improved performance. In other words, allowing the CPUs to run at their maximum published clock rates sped up imports.
  • Removal of CPU governor limits didn’t confer an advantage on prosumer grade computers.
  • A Blazegraph buffer configuration variable increase significantly improved import speed.
  • Higher grade hard drives (fast consumer NVMe at home and data center class RAIDed SSDs in production) confer a noticeable performance advantage.
  • The Amazon Neptune service was by far the fastest option for import. It's unclear whether the free or near-free data ingestion observed during the free cloud credit period would extend to additional future imports, though. It is a viable option for imports, but requires additional architectural consideration after an import.
  • The N-Triples file format (.nt) dramatically improved import speed. It should be (and now, is) used instead of the more complicated Turtle (.ttl) format for imports.
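
To make that last point concrete, here is the same Wikidata statement ("Douglas Adams is an instance of human") in both serializations. This is just an illustration with a well-known item; the N-Triples form is verbose, but each line is a complete, self-contained triple, which is what lets the line-oriented split and sort --unique commands later in this post work.

Turtle (assuming the wd: and wdt: prefixes are declared earlier in the file):
  wd:Q42 wdt:P31 wd:Q5 .
N-Triples:
  <http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .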

Computing configuration and initial setup

My 2018 personal gaming-class machine has a 6-core CPU (up to 4.6 GHz turbo boost) and, after several years of upgrades, 64 GB of DDR4 RAM and a 4 TB NVMe drive.

A full Wikidata graph import into Blazegraph took 5.22 days with this configuration and our optimized N-Triples files in August 2024.

I had the benefit of pre-split N-triples files produced from our Spark cluster as part of an Airflow DAG that runs weekly, where there are no duplicate lines in the files and there are some additional simplifications compared to the N-triples files produced by legacy jobs in our data dumps infrastructure. If you’re doing this at home without a large Spark cluster, though, you can fetch wikidata-<YYYYMMDD>-all-BETA.nt.bz2 from a datestamped directory in the Wikidata dumps and run some shell commands to prepare files to achieve something similar (do note that the data is less optimized, but it works).

You can at present import somewhat reliably and performantly with one 4 TB NVMe internal drive and one 2 TB external (or SATA) SSD drive if you are willing to script some file compression to avoid running out of disk. In the example that follows, I assume that you have three drives, though: one 4 TB NVMe drive (let's say this is your primary drive), one SATA or external 2+ TB SSD (that's /media/ubuntu/EXTERNAL_DRIVE in the example), and another SATA or external 2+ TB SSD (that's /media/ubuntu/SOME_OTHER_DRIVE in the example).

The commands

Here are the commands you’ll need to download the Wikidata dump, break it up into smaller files that Blazegraph can handle, and import within a reasonable timeframe.

Note that you’ll need to have a copy of the logback.xml file downloaded to your home directory.

# Download some dependencies
sudo apt update
sudo apt install bzip2 git openjdk-8-jdk-headless screen
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf
cd rdf
./mvnw package -DskipTests
sudo mkdir /var/log/wdqs
mkdir /home/ubuntu/wtemp
# Your username and group may differ from ubuntu:ubuntu
sudo chown ubuntu:ubuntu /var/log/wdqs
touch /var/log/wdqs/wdqs-blazegraph.log
cd dist/target/
tar xzvf service-0.3.*-SNAPSHOT-dist.tar.gz
cd service-0.3.*-SNAPSHOT/
cd /media/ubuntu/EXTERNAL_DRIVE
mkdir wd
cd wd
# Run the next multiline command before the weekend - be sure to verify
# that your computer will stay awake without reboot. The server throttles
# somewhat, so the download takes a while. And the deflate and split-sort-split
# also take a while. There are faster ways, but this is easy enough. In case
# you were wondering, wget seems to work more reliably than other options.
# Torrents do exist for dumps, but be sure to verify their checksums against
# dumps.wikimedia.org and verify the date of a given dump. In the following
# command pipeline we just print out the checksum for manual verification later,
# as it's nice to let this run over a weekend and come back on a Monday to
# verify instead of potentially having to wait longer; it usually works fine.
date && \
wget https://dumps.wikimedia.org/wikidatawiki/entities/20241216/wikidata-20241216-all-BETA.nt.bz2 && \
date && \
wget https://dumps.wikimedia.org/wikidatawiki/entities/20241216/wikidata-20241216-sha1sums.txt && \
grep wikidata-20241216-all-BETA.nt.bz2 wikidata-20241216-sha1sums.txt && \
sha1sum wikidata-20241216-all-BETA.nt.bz2 && \
date && \
bzcat wikidata-20241216-all-BETA.nt.bz2 | split -d --suffix-length=4 --lines=7812500 --additional-suffix='.nt' - 'wikidata_full_with_duplicates.' && \
date && \
sort wikidata_full_with_duplicates.*.nt --unique --temporary-directory=/home/ubuntu/wtemp | split -d --suffix-length=4 --lines=7812500 --additional-suffix='.ttl.gz' --filter='gzip > $FILE' - 'wikidata_full.' && \
date
# Let's head back to where you were:
cd ~/rdf/dist/target/service-0.3.*-SNAPSHOT/
mv ~/logback.xml .
# Using runBlazegraph.sh like production, change heap from 16g to 31g and
# point to logback.xml by updating HEAP_SIZE and LOG_CONFIG to look like so,
# without the # comment symbols, of course.
# HEAP_SIZE=${HEAP_SIZE:-"31g"}
# LOG_CONFIG=${LOG_CONFIG:-"./logback.xml"}
vim runBlazegraph.sh
# Modify the buffer in RWStore.properties so it looks like this (1M, not 100K),
# without the # comment symbol, of course.
# com.bigdata.rdf.sail.bufferCapacity=1000000
vim RWStore.properties
# Let's get Blazegraph running in the background.
screen
# Wait a few seconds after running the next command to ensure it's good.
./runBlazegraph.sh
# Then CTRL-a-d to leave screen session running in background.
# You can chain the following commands together with && \ if you like.
# Let's import the first file to make sure it's working (takes about 1 minute).
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 0 -e 0 -f 'wikidata_full.%04d.ttl.gz' 2>&1 | tee -a loadData.log
# If it worked, let's import another 9 files (maybe another ~10 minutes).
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 1 -e 9 -f 'wikidata_full.%04d.ttl.gz' 2>&1 | tee -a loadData.log
# Let's see how long it took to import the first ten files, just sum and then
# divide by 1000 for seconds (sum / 1000 / 60 / 60 / 24 for days).
grep COMMIT loadData.log | cut -f2 -d"=" | cut -f1 -d"m"
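# Optionally, let awk do the summing and the unit conversion in one go:
grep COMMIT loadData.log | cut -f2 -d"=" | cut -f1 -d"m" | \
  awk '{ total += $1 } END { printf "%.2f hours (%.2f days)\n", total/1000/3600, total/1000/86400 }'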
# Now let's handle the rest of the files. This could take a week or so - again
# be sure to verify that your computer will stay awake, without reboot.
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 10 -f 'wikidata_full.%04d.ttl.gz' 2>&1 | tee -a loadData.log
# Hopefully that worked. Go to http://localhost:9999/bigdata/#query and run the
# following query:
# SELECT (count(*) as ?ct) WHERE { ?s ?p ?o }
# For this example it was 19,827,410,787 with the non-optimized dump.
# As of March 2025 you might expect 16.6B for an optimized dump, as here:
# https://query.wikidata.org/#select%20%28count%28%2a%29%20as%20%3Fct%29%20where%20%7B%3Fs%20%3Fp%20%3Fo%7D
# Celebrate!
# Let's close Blazegraph and make a backup of the Blazegraph journal.
screen -r
# CTRL-c to stop Blazegraph
exit
# Okay, screen session ended, let's look at the size of the file
ls -alh wikidata.jnl
cp wikidata.jnl /media/ubuntu/SOME_OTHER_DRIVE/

You’ll notice here I don’t take time to make intermediate backups of the Blazegraph journal file. It’s a good exercise for the reader!

Production, in practice

We were a little surprised that my desktop could perform faster imports than our data center servers. Our colleague Brian King in Data Platform SRE had a hunch, which turned out to be correct, that we could adjust the CPU governor on the production servers. This helped dramatically, and coupled with the graph split it makes recovery much faster. We don't need the buffer size configuration trick described above in production, but we have it as an option should it become necessary.
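
If you want to poke at the governor on a Linux machine of your own, it is visible through sysfs; the following is a generic illustration using the standard cpupower tool, not the exact change our server automation applies in production.

# Show the current CPU frequency governor for each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Switch every core to the "performance" governor (package: linux-tools / cpupower)
sudo cpupower frequency-set -g performance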

Considerations

It would be nice to have no hardware limitations, but there are some practical limitations.

CPU: Although CPU speed increases are still being observed with each new generation of processor, many of the advances in computing have to do with parallelizing computation across more cores. And although WDQS's graph database holds up relatively well in parallelizing queries across multiple cores, it's difficult to optimize data import for a many-core architecture.

Memory: Although more memory is commonly beneficial to large data operations and intuitively you might expect a graph database to work better with more memory, the manner in which memory is used by running programs can drive performance in surprising ways, ranging from good to bad. WDQS runs on Java technology, and configuration of the Java heap is notoriously challenging for achieving performance without long garbage collection (“GC”) pauses. We deliberately use a 31 GB heap in production for our Blazegraph instances. It’s also important to remember that a large Java heap requires a lot of RAM, which can become expensive.
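
For reference, the 31 GB figure is the same HEAP_SIZE edit shown in the commands above; keeping the heap just below 32 GB lets the JVM keep using compressed ordinary object pointers, which is the usual rationale for that particular number.

# In runBlazegraph.sh, as edited earlier in this post:
HEAP_SIZE=${HEAP_SIZE:-"31g"}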

Nevertheless, more memory can be helpful for filesystem paging operations. Taking the hardware configuration guidance at face value suggests that we would need about 12 TB of memory for the scale of data we have today for an ideal server configuration (we have about 1200 GB of data with about 16.6 billion records). We’re getting by with 128 GB of memory per server, which is much less than 12 TB of memory. We’ve also heard of people using several hundred GB of memory and having reasonable success. More memory would be nice, but today it’s too expensive in a multi-node setup built for redundancy across multiple data centers.

Disk: NVMe disks have brought increased speed to data operations, but bottlenecks in CPU or memory can mask the gains that speedier NVMe throughput would otherwise deliver. NVMes did show a material performance gain during testing, although presently in production we're thankfully doing okay with RAIDed data center class SSDs (6 TB). NVMes would most likely be an improvement in the data center in the future, but data center quality devices are priced higher, whereas prosumer grade NVMes for personal computers are reasonably priced; due to the risks of hardware failure we prefer to avoid prosumer grade NVMes in the data center.

Caveats

A few things to remember if you’re running Blazegraph at home:

  • Be mindful of SERVICE wikibase:mwapi syntax, as it uses external Wikimedia APIs; be sure to avoid rapid repeat queries with this syntax.
  • Beware of exposing your instance on the network: it doesn't have the same load balancing, firewall arrangement, and other security controls as the real Wikidata Query Service.

Conclusion

If you are looking to host your own Blazegraph database of Wikidata data without having two graph partitions (i.e., if you want to have the full graph in one partition) you might try the following:

  1. Get a desktop with the fastest CPU possible and acquire a speedy 4 TB NVMe plus 64 GB or more of DDR5 RAM; get a couple of larger internal SATA SSDs or faster-throughput external SSDs if you can, too. As of this writing, consumer grade 4 TB NVMes can be had with reasonable price-performance tradeoffs; perhaps 6 TB or 8 TB NVMes with the same level of performance will become available in the next year or two.
  2. Import using the N-Triples format split into multiple files.
  3. Consider scripting the batch import operation to make a backup copy of the graph database for every 100 files imported. That way, if your graph database import fails at some point, you can troubleshoot and resume from the point of backup. The intermediate backup will slow things down a little but it may save you many days in the end (a minimal sketch follows this list).
  4. If you can’t build or upgrade a desktop of your own, consider use of a cloud server to perform the import, then copy the produced graph database journal file to a more budget friendly computer; remember that in addition to cloud compute and storage costs, there may be data transfer costs.

As you saw up above, there are a few variables in configuration files that you need to update in order to speed an import along.

Production

After splitting the Wikidata graph database in two and removing CPU throttling for Wikidata Query Service production data center nodes, we’re now able to import the WDQS database and catch Blazegraph up to the Wikidata edit stream in less than a week. The way Wikidata updates are applied in the production environment is an interesting topic unto itself, but this diagram gives you an idea of how it works.

Production Wikidata Query Service Kafka Flink-based updater

Acknowledgments

Thank you for reading this post. I’d like to thank the wonderful colleagues in Search Platform, Data Platform SRE, Infrastructure Foundations, Traffic, and Data Center Operations for the solid work on the graph split, and more specifically regarding this post, the support in exploring opportunities to improve performance. I’d like to especially express my gratitude to David Causse (WDQS tech lead and systems thinker), Peter Fischer (Flink-Kafka graph splitter extraordinaire), Erik Bernhardson (thank you for the Airflow environment niceties), Ryan Kemper & Brian King & Stephen Munene & Cathal Mooney (cookbook puppeteers who balance networks and servers with a keyboard), Andrew McAllister (thank you for your analysis of query patterns!), Haroon Shaikh & Renil Thomas (much appreciated on the AWS configs) and Willy Pao & Rob Halsell & Sukhbir Singh (thank you for helping investigate NVMe options). Thank you to my Engineering Director, Olja Dimitrijevic, for encouraging this post, as well as to Tajh Taylor for review. And as ever, major thanks to Guillaume Lederrey, Luca Martinelli, and Lydia Pintscher (WMDE) for partnership in WDQS and Wikidata and its amazing community.

High level overview of Wikidata editing data flows and how it relates to components involved in production

2025-04-06 Airfone

We've talked before about carphones, and certainly one of the only ways to make phones even more interesting is to put them in modes of transportation. Installing telephones in cars made a lot of sense when radiotelephones were big and required a lot of power; and they faded away as cellphones became small enough to have a carphone even outside of your car.

There is one mode of transportation where the personal cellphone is pretty useless, though: air travel. Most readers are probably well aware that the use of cellular networks while aboard an airliner is prohibited by FCC regulations. There are a lot of urban legends and popular misconceptions about this rule, and fully explaining it would probably require its own article. The short version is that it has to do with the way cellular devices are certified and cellular networks are planned. The technical problems are not impossible to overcome, but honestly, there hasn't been a lot of pressure to make changes. One line of argument that used to make an appearance in cellphones-on-airplanes discourse is the idea that airlines or the telecom industry supported the cellphone ban because it created a captive market for in-flight telephone services.

Wait, in-flight telephone services?

That theory has never had much to back it up, but with the benefit of hindsight we can soundly rule it out: not only has the rule persisted well past the decline and disappearance of in-flight telephones, in-flight telephones were never commercially successful to begin with.

Let's start with John Goeken. A 1984 Washington Post article tells us that "Goeken is what is called, predictably enough, an 'idea man.'" Being the "idea person" must not have had quite the same connotations back then; it was a good time for Goeken. In the 1960s, conversations with customers at his two-way radio shop near Chicago gave him an idea for a repeater network to allow truckers to reach their company offices via CB radio. This was the first falling domino in a series that led to the founding of MCI and the end of AT&T's long-distance monopoly. Goeken seems to have been the type who grew bored with success, and he left MCI to take on a series of new ventures. These included an emergency medicine messaging service, electrically illuminated high-viz clothing, and a system called the Mercury Network that built much of the inertia behind the surprisingly advanced computerization of florists [1].

"Goeken's ideas have a way of turning into dollars, millions of them," the Washington Post continued. That was certainly true of MCI, but every ideas guy had their misses. One of the impressive things about Goeken was his ability to execute with speed and determination, though, so even his failures left their mark. This was especially true of one of his ideas that, in the abstract, seemed so solid: what if there were payphones on commercial flights?

Goeken's experience with MCI and two-way radios proved valuable, and starting in the mid-1970s he developed prototype air-ground radiotelephones. In its first iteration, "Airfone" consisted of a base unit installed on an aircraft bulkhead that accepted a credit card and released a cordless phone. When the phone was returned to the base station, the credit card was returned to the customer. This equipment was simple enough, but it would require an extensive ground network to connect callers to the telephone system. The infrastructure part of the scheme fell into place when long-distance communications giant Western Union signed on with Goeken Communications to launch a 50/50 joint venture under the name Airfone, Inc.

Airfone was not the first to attempt air-ground telephony---AT&T had pursued the same concept in the 1970s, but abandoned it after resistance from the FCC (unconvinced the need was great enough to justify frequency allocations) and the airline industry (which had formed a pact, blessed by the government, that prohibited the installation of telephones on aircraft until such time as a mature technology was available to all airlines). Goeken's hard-headed attitude, exemplified in the six-year legal battle he fought against AT&T to create MCI, must have helped to defeat this resistance.

Goeken brought technical advances, as well. By 1980, there actually was an air-ground radiotelephone service in general use. The "General Aviation Air-Ground Radiotelephone Service" allocated 12 channels (of duplex pairs) for radiotelephony from general aviation aircraft to the ground, and a company called Wulfsberg had found great success selling equipment for this service under the FliteFone name. Wulfsberg FliteFones were common equipment on business aircraft, where they let executives shout "buy" and "sell" from the air. Goeken referred to this service as evidence of the concept's appeal, but it was inherently limited by the 12 allocated channels.

General Aviation Air-Ground Radiotelephone Service, which I will call AGRAS (this is confusing in a way I will discuss shortly), operated at about 450MHz. This UHF band is decidedly line-of-sight, but airplanes are very high up and thus can see a very long ways. The reception radius of an AGRAS transmission, used by the FCC for planning purposes, was 220 miles. This required assigning specific channels to specific cities, and there the limits became quite severe. Albuquerque had exactly one AGRAS channel available. New York City got three. Miami, a busy aviation area but no doubt benefiting from its relative geographical isolation, scored a record-setting four AGRAS channels. That meant AGRAS could only handle four simultaneous calls within a large region... if you were lucky enough for that to be the Miami region; otherwise capacity was even more limited.

Back in the 1970s, AT&T had figured that in-flight telephones would be very popular. In a somewhat hand-wavy economic analysis, they figured that about a million people flew in the air on a given day, and about a third of them would want to make telephone calls. That's over 300,000 calls a day, clearly more than the limited AGRAS channels could handle... leading to the FCC's objection that a great deal of spectrum would have to be allocated to make in-flight telephony work.

Goeken had a better idea: single-sideband. SSB is a radio modulation technique that allows a radio transmission to fit within a very narrow bandwidth (basically by suppressing half of the signal envelope), at the cost of a somewhat more fiddly tuning process for reception. SSB was mostly used down in the HF bands, where the low frequencies meant that bandwidth was acutely limited. Up in the UHF world, bandwidth seemed so plentiful that there was little need for careful modulation techniques... until Goeken found himself asking the FCC for 10 blocks of 29 channels each, a lavish request that wouldn't really fit anywhere in the popular UHF spectrum. The use of UHF SSB, pioneered by Airfone, allowed far more efficient use of the allocation.

In 1983, the FCC held hearings on Airfone's request for an experimental license to operate their SSB air-ground radiotelephone system in two allocations (separate air-ground and ground-air ranges) around 850MHz and 895MHz. The total spectrum allocated was about 1.5MHz in each of the two directions. The FCC assented and issued the experimental license in 1984, and Airfone was in business.

Airfone initially planned 52 ground stations for the system, although I'm not sure how many were ultimately built---certainly 37 were in progress in 1984, at a cost of about $50 million. By 1987, the network had reportedly grown to 68. Airfone launched on six national airlines (a true sign of how much airline consolidation has happened in recent decades---there were six national airlines?), typically with four cordless payphones on a 727 or similar aircraft. The airlines received a commission on the calling rates, and Airfone installed the equipment at their own expense. Still, it was expected to be profitable... Airfone projected that 20-30% of passengers would have calls to make.

I wish I could share more detail on these ground stations, in part because I assume there was at least some reuse of existing Western Union facilities (WU operated a microwave network at the time and had even dabbled in cellular service in the 1980s). I can't find much info, though. The antennas for the 800MHz band would have been quite small, but the 1980s multiplexing and control equipment probably took a fair share of floorspace.

Airfone was off to a strong start, at least in terms of installation base and press coverage. I can't say now how many users it actually had, but things looked good enough that in 1986 Western Union sold their share of the company to GTE. Within a couple of years, Goeken sold his share to GTE as well, reportedly as a result of disagreements with GTE's business strategy.

Airfone's SSB innovation was actually quite significant. At the same time, in the 1980s, a competitor called Skytel was trying to get a similar idea off the ground with the existing AGRAS allocation. It doesn't seem to have gone anywhere; I don't think the FCC ever approved it. Despite the concept being an obvious one, Airfone pretty much launched as a monopoly, operating under an experimental license that named them alone. Unsurprisingly there was some upset over this apparent show of favoritism by the FCC, including from AT&T, which vigorously opposed the experimental license.

As it happened, the situation would be resolved by going the other way: in 1990, the FCC established the "commercial aviation air-ground service" which normalized the 800 MHz spectrum and made licenses available to other operators. That was six years after Airfone started their build-out, though, giving them a head start that severely limited competition.

Still, AT&T was back. AT&T introduced a competing service called AirOne. AirOne was never as widely installed as Airfone but did score some customers including Southwest Airlines, which only briefly installed AirOne handsets on their fleet. "Only briefly" describes most aspects of AirOne, but we'll get to that in a moment.

The suddenly competitive market probably gave GTE Airfone reason to innovate, and besides, a lot had changed in communications technology since Airfone was designed. One of Airfone's biggest limitations was its lack of true roaming: an Airfone call could only last as long as the aircraft was within range of the same ground station. Airfone called this "30 minutes," but you can imagine that people sometimes started their call near the end of this window, and the problem was reportedly much worse. Dropped calls were common, adding insult to the injury that Airfone was decidedly expensive. GTE moved towards digital technology and automation.

1991 saw the launch of Airfone GenStar, which used QAM digital modulation to achieve better call quality and tighter utilization within the existing bandwidth. Further, a new computerized network allowed calls to be handed off from one ground station to another. Capitalizing on the new capacity and reliability, the aircraft equipment was upgraded as well. The payphone-like cordless stations were gone, replaced by handsets installed in seatbacks. First class cabins often got a dedicated handset for every seat; economy might have one handset on each side of a row. The new handsets offered RJ11 jacks, allowing the use of laptop modems while in-flight. Truly, it was the future.

During the 1990s, satellites were added to the Airfone network as well, improving coverage generally and making telephone calls possible on overseas flights. Of course, the rise of satellite communications also sowed the seeds of Airfone's demise. A company called Aircell, which started out using the cellular network to connect calls to aircraft, rebranded to Gogo and pivoted to satellite-based telephone services. By the late '90s, they were taking market share from Airfone, a trend that would only continue.

Besides, for all of its fanfare, Airfone was not exactly a smash hit. Rates were very high, $5 a minute in the late '90s, giving Airfone a reputation as a ripoff that must have cut a great deal into that "20-30% of fliers" they hoped to serve. With the rise of cellphones, many preferred to wait until the aircraft was on the ground to use their own cellphone at a much lower rate. GTE does not seem to have released much in the way of numbers for Airfone, but it probably wasn't making them rich.

Goeken, returning to the industry, inadvertently proved this point. He aggressively lobbied the FCC to issue competitive licenses, and ultimately succeeded. His second company in the space, In-Flight Phone Inc., became one of the new competitors to his old company. In-Flight Phone did not last for long. Neither did AT&T AirOne. A 2005 FCC ruling paints a grim picture:

Current 800 MHz Air-Ground Radiotelephone Service rules contemplate six competing licensees providing voice and low-speed data services. Six entities were originally licensed under these rules, which required all systems to conform to detailed technical specifications to enable shared use of the air-ground channels. Only three of the six licensees built systems and provided service, and two of those failed for business reasons.

In 2002, AT&T pulled out, and Airfone was the only in-flight phone left. By then, GTE had become Verizon, and GTE Airfone was Verizon Airfone. Far from a third of passengers, the CEO of Airfone admitted in an interview that a typical flight only saw 2-3 phone calls. Considering the minimum five-figure capital investment in each aircraft, it's hard to imagine that Airfone was profitable---even at $5 a minute.

Airfone more or less faded into obscurity, but not without a detour into the press via the events of 9/11. Flight 93, which crashed in Pennsylvania, was equipped with Airfone and passengers made numerous calls. Many of the events on board this aircraft were reconstructed with the assistance of Airfone records, and Claircom (the name of the operator of AT&T AirOne, which never seems to have been well marketed) also produced records related to other aircraft involved in the attacks. Most notably, flight 93 passenger Todd Beamer had a series of lengthy calls with Airfone operator Lisa Jefferson, through which he relayed many of the events taking place on the plane in real time. During these calls, Beamer seems to have coordinated the effort by passengers to retake control of the plane. The significance of Airfone and Claircom records to 9/11 investigations is such that 9/11 conspiracy theories may be one of the most enduring legacies of Claircom especially.

In an odd acknowledgment of their aggressive pricing, Airfone decided not to bill for any calls made on 9/11, and temporarily introduced steep discounts (to $0.99 a minute) in the weeks after. This rather meager show of generosity did little to reverse the company's fortunes, though, and it was already well into a backslide.

In 2006, the FCC auctioned the majority of Airfone's spectrum to new users. The poor utilization of Airfone was a factor in the decision, as was Airfone's relative lack of innovation compared to newer cellular and satellite systems. In fact, a large portion of the bandwidth was purchased by Gogo, who years later would use it to deliver in-flight WiFi. Another portion went to a subsidiary of JetBlue that provided in-flight television. Verizon announced the end of Airfone in 2006, pending an acquisition by JetBlue, and while the acquisition did complete, JetBlue does not seem to have continued Airfone's passenger airline service. A few years later, Gogo bought out JetBlue's communications branch, making them the new monopoly in 800MHz air ground radiotelephony. Gogo only offered telephone service for general aviation aircraft; passenger aircraft telephones had gone the way of the carphone.

It's interesting to contrast the fate of Airfone with its sibling, AGRAS. Depending on who you ask, AGRAS refers to the radio service or to the Air Ground Radiotelephone Automated Service operated by Mid-America Computer Corporation. What an incredible set of names. This was a situation a bit like ARINC, the semi-private company that for some time held a monopoly on aviation radio services. MACC had a practical monopoly on general aviation telephone service throughout the US, by operating the billing system for calls. MACC still exists today as a vendor of telecom billing software and this always seems to have been their focus---while I'm not sure, I don't believe that MACC ever operated ground stations, instead distributing rate payments to private companies that operated a handful of ground stations each. Unfortunately the history of this service is quite obscure and I'm not sure how MACC came to operate the system, but I couldn't resist the urge to mention the Mid-America Computer Corporation.

AGRAS probably didn't make anyone rich, but it seems to have been generally successful. Wulfsberg FliteFones operating on the AGRAS network gave way to Gogo's business aviation phone service, itself a direct descendant of Airfone technology.

The former AGRAS allocation at 450MHz somehow came under the control of a company called AURA Network Systems, which for some years has used a temporary FCC waiver of AGRAS rules to operate data services. This year, the FCC began rulemaking to formally reallocate the 450MHz air ground allocation to data services for Advanced Air Mobility, a catch-all term for UAS and air taxi services that everyone expects to radically change the airspace system in coming years. New uses of the band will include command and control for long-range UAS, clearance and collision avoidance for air taxis, and ground and air-based "see and avoid" communications for UAS. This pattern, of issuing a temporary authority to one company and later performing rulemaking to allow other companies to enter, is not unusual for the FCC but does make an interesting recurring theme in aviation radio. It's typical for no real competition to occur, the incumbent provider having been given such a big advantage.

When reading about these legacy services, it's always interesting to look at the licenses. ULS has only nine licenses on record for the original 800 MHz air ground service, all expired and originally issued to Airfone (under both GTE and Verizon names), Claircom (operating company for AT&T AirOne), and Skyway Aircraft---this one an oddity, a Florida-based company that seems to have planned to introduce in-flight WiFi but not gotten all the way there.

Later rulemaking to open up the 800MHz allocation to more users created a technically separate radio service with two active licenses, both held by AC BidCo. This is an intriguing mystery until you discover that AC BidCo is obviously a front company for Gogo, something they make no effort to hide---the legalities of FCC bidding processes are such that it's very common to use shell companies to hold FCC licenses, and we could speculate that AC BidCo is the Aircraft Communications Bidding Company, created by Gogo for the purpose of the 2006-2008 auctions. These two licenses are active for the former Airfone band, and Gogo reportedly continues to use some of the original Airfone ground stations.

Gogo's air-ground network, which operates at 800MHz as well as in a 3GHz band allocated specifically to Gogo, was originally based on CDMA cellular technology. The ground stations were essentially cellular stations pointed upwards. It's not clear to me if this CDMA-derived system is still in use, but Gogo relies much more heavily on their Ku-band satellite network today.

The 450MHz licenses are fascinating. AURA is the only company to hold current licenses, but the 246 expired and cancelled licenses reveal the scale of the AGRAS business. Airground of Idaho, Inc., until 1999 held a license for an AGRAS ground station on Brundage Mountain near McCall, Idaho. The Arlington Telephone Company, until a 2004 cancellation, held a license for an AGRAS ground station atop their small telephone exchange in Arlington, Nebraska. AGRAS ground stations seem to have been a cottage industry, with multiple licenses to small rural telephone companies and even sole proprietorships. Some of the ground stations appear to have been the roofs of strip mall two-way radio installers. In another life, maybe I would be putting a 450MHz antenna on my roof to make a few dollars.

Still, there were incumbents: numerous licenses belonged to SkyTel, which after the decline of AGRAS seems to have refocused on paging and, then, gone the same direction as most paging companies: an eternal twilight as American Messaging ("The Dependable Choice"), promoting innovation in the form of longer-range restaurant coaster pagers. In another life, I'd probably be doing that too.

[1] This is probably a topic for a future article, but the Mercury Network was a computerized system that Goeken built for a company called Florists' Telegraph Delivery (FTD). It was an evolution of FTD's telegraph system that allowed a florist in one city to place an order to be delivered by a florist in another city, thus enabling the long-distance gifting of flowers. There were multiple such networks and they had an enduring influence on the florist industry and broader business telecommunications.

2025-03-10 troposcatter

I have a rough list of topics for future articles, a scratchpad of two-word ideas that I sometimes struggle to interpret. Some items have been on that list for years now. Sometimes, ideas languish because I'm not really interested in them enough to devote the time. Others have the opposite problem: chapters of communications history with which I'm so fascinated that I can't decide where to start and end. They seem almost too big to take on. One of these stories starts in another vast frontier: northeastern Canada.

It was a time, rather unlike our own, of relative unity between Canada and the United States. Both countries had spent the later part of World War II planning around the possibility of an Axis attack on North America, and a ragtag set of radar stations had been built to detect inbound bombers. The US had built a series of stations along the border, and the Canadians had built a few north of Ontario and Quebec to extend coverage north of those population centers. Then the war ended and, as with so many WWII projects, construction stopped. Just a few years later, the USSR demonstrated a nuclear weapon and the Cold War was on. As with so many WWII projects, freshly anxious planners declared the post-war over and blew the dust off of North American air defense plans. In 1950, US and Canadian defense leaders developed a new plan to consolidate and improve the scattershot radar early warning plan.

This agreement would become the Pinetree Line, the first of three trans-Canadian radar fences jointly constructed and operated by the two nations. For the duration of the Cold War, and even to the present day, these radar installations formed the backbone of North American early warning and the locus of extensive military cooperation. The joint defense agreement between the US and Canada, solidified by the Manhattan Project's dependence on Canadian nuclear industry, grew into the 1958 establishment of the North American Air Defense Command (NORAD) as a binational joint military organization.

This joint effort had to rise to many challenges. Radar had earned its place as a revolutionary military technology during the Second World War, but despite the many radar systems that had been fielded, engineers' theoretical understanding of radar and RF propagation was pretty weak. I have written here before about over-the-horizon radar, the pursuit of which significantly improved our scientific understanding of radio propagation in the atmosphere... often by experiment, rather than model. A similar progression in RF physics would also benefit radar early warning in another way: communications.

One of the bigger problems with the Pinetree Line plan was the remote location of the stations. You might find that surprising; the later Mid-Canada and DEW lines were much further north and more remote. The Pinetree Line already involved stations in the far reaches of the maritime provinces, though, and to provide suitable warning to Quebec and the Great Lakes region stations were built well north of the population centers. Construction and operations would rely on aviation, but an important part of an early warning system is the ability to deliver the warning. Besides, ground-controlled interception had become the main doctrine in air defense, and it required not just an alert but real-time updates from radar stations for the most effective response. Each site on the Pinetree Line would require a reliable real-time communications capability, and as the sites were built in the 1950s, some were a very long distance from telephone lines.

Canada had only gained a transcontinental telephone line in 1932, seventeen years behind the United States (which by then had three different transcontinental routes and a fourth in progress), a delay owing mostly to the formidable obstacle of the Canadian Rockies. The leaders in Canadian long-distance communications were Bell Canada and the two railways (Canadian Pacific and Canadian National), and in many cases contracts had been let to these companies to extend telephone service to radar stations. The service was very expensive, though, and the construction of telephone cables in the maritimes was effectively ruled out by the huge distances involved and by uncertainty about the technical feasibility of underwater cables to Newfoundland, given the difficult conditions and extreme tides in the Gulf of St. Lawrence.

The RCAF had faced a similar problem when constructing its piecemeal radar stations in Ontario and Quebec in the 1940s, and had addressed it by applying the nascent technology of point-to-point microwave relays. This system, called ADCOM, was built and owned by RCAF to stretch 1,400 miles between a series of radar stations and other military installations. It worked, but the construction project had run far over budget (and major upgrades performed soon after blew the budget even further), and the Canadian telecom industry had vocally opposed it on the principle that purpose-built military communications systems took government investment away from public telephone infrastructure that could also serve non-military needs.

These pros and cons of ADCOM must have weighed on Pinetree Line planners when they chose to build a system directly based on ADCOM, but to contract its construction and operation to Bell Canada [1]. This was, it turned out, the sort of compromise that made no one happy: the Canadian military's communications research establishment was reluctant to cede its technology to Bell Canada, while Bell Canada objected to deploying the military's system rather than one of the commercial technologies then in use across the Bell System.

The distinct lack of enthusiasm on the part of both parties involved was a bad omen for the future of this Pinetree Line communications system, but as it would happen, the whole plan was overcome by events. One of the great struggles of large communications projects in that era, and even today, is the rapid rate of technological progress. One of ADCOM's faults was that the immense progress Bell Labs and Western Electric made in microwave equipment during the late '40s meant that it was obsolete as soon as it went into service. This mistake would not be repeated, as ADCOM's maritimes successor was obsoleted before it even broke ground. A promising new radio technology offered a much lower cost solution to these long, remote spans.

At the onset of the Second World War, the accepted theory of radio propagation held that HF signals could pass the horizon via ground wave propagation, curving to follow the surface of the Earth, while VHF and UHF signals could not. This meant that the higher-frequency bands, where wideband signals were feasible, were limited to line-of-sight or at least near-line-of-sight links... not more than 50 miles with ideal terrain, often less. We can forgive the misconception, because this still holds true today, as a rule of thumb. The catch is in the exceptions, the nuances, that during the war were already becoming a headache to RF engineers.

First, military radar operators observed mysterious contacts well beyond the theoretical line-of-sight range of their VHF radar sets. These might have been dismissed as faults in the equipment (or the operator), but reports stacked up as more long-range radar systems were fielded. After the war, relaxed restrictions and a booming economy allowed radio to proliferate. UHF television stations, separated by hundreds of miles, unexpectedly interfered with each other. AT&T, well into deployment of a transcontinental microwave network, had to adjust its frequency planning after it was found that microwave stations sometimes received interfering signals from other stations in the chain... stations well over the horizon.

This was the accidental discovery of tropospheric scattering.

The Earth's atmosphere is divided into five layers. We live in the troposphere, the lowest and thinnest of the layers, above which lies the stratosphere. Roughly speaking, the difference between these layers is that the troposphere becomes colder with height (due to increasing distance from the warm surface), while the stratosphere becomes warmer with height (due to decreasing shielding from the sun) [2]. In between is a local minimum of temperature, called the tropopause.

The density gradients around the tropopause create a mirror effect, like the reflections you see when looking at an air-water boundary. The extensive turbulence and, well, weather present in the troposphere also refract signals on their way up and down, making the true course of radio signals reflecting off of the tropopause difficult to predict or analyze. Because of this turbulence, the effect has come to be known as scattering: radio signals sent upwards, towards the troposphere, will be scattered back downwards across a wide area. This effect is noticeable only at high frequencies, so it remained unknown until the widespread use of UHF and microwave, and was still only partially understood in the early 1950s.

The loci of radar technology at the time were Bell Laboratories and the MIT Lincoln Laboratory, and they both studied this effect for possible applications. Presaging one of the repeated problems of early warning radar systems, by the time Pinetree Line construction began in 1951 the Lincoln Laboratory was already writing proposals for systems that would obsolete it. In fact, construction would begin on both of the Pinetree Line's northern replacements before the Pinetree Line itself was completed. Between rapid technological development and military planners in a sort of panic mode, the early 1950s were a very chaotic time. Underscoring the ever-changing nature of early warning was the timeline of Pinetree Line communications: as the Pinetree Line microwave network was in planning, the Lincoln Laboratory was experimenting with troposcatter communications. By the time the first stations in Newfoundland completed construction, Bell Laboratories had developed an experimental troposcatter communications system.

This new means of long-range communications would not be ready in time for the first Pinetree Line stations, so parts of the original ADCOM-based microwave network would have to be built. Still, troposcatter promised to complete the rest of the network at significantly reduced cost. The US Air Force, wary of ADCOM's high costs and more detached from Canadian military politics, aggressively lobbied for the adoption of troposcatter communications for the longest and most challenging Pinetree Line links.

Bell Laboratories, long a close collaborator with the Air Force, was well aware of troposcatter's potential for early warning radar. Bell Canada and Bell Laboratories agreed to evaluate the system under field conditions, and in 1952 experimental sites were installed in Newfoundland. These tests found reliable performance over 150 miles, far longer than achievable by microwave and---rather conveniently---about the distance between Pinetree Line radar stations. These results suggested that the Pinetree Line could go without an expensive communications network in the traditional sense, instead using troposcatter to link the radar stations directly to each other.

Consider a comparison laid out by the Air Force: one of the most complex communications requirements for the Pinetree Line was a string of stations running not east-west like the "main" line, but north-south from St. John's, Newfoundland to Frobisher Bay, Nunavut. These stations were critical for detection of Soviet bombers approaching over the pole from the northeast, otherwise a difficult gap in radar coverage until the introduction of radar sites in Greenland. But the stations covered a span of over 1,000 miles, most of it in formidably rugged and remote arctic coastal terrain. The proposed microwave system would require 50 relay stations, almost all of which would be completely new construction. Each relay's construction would have to be preceded by the construction of a harbor or airfield for access, and then establishment of a power plant, to say nothing of the ongoing logistics of transporting fuel and personnel for maintenance. The proposed troposcatter system, on the other hand, required only ten relays. All ten would be colocated with radar stations, and could share infrastructure and logistical considerations.

Despite the clear advantages of troposcatter and its selection by the USAF, the Canadian establishment remained skeptical. One cannot entirely blame them, considering that troposcatter communications had only just been demonstrated in the last year. Still, the USAF was footing most of the bill for the overall system (and paying entirely for the communications aspect, depending on how you break down the accounting) and had considerable sway. In 1954, well into construction of the radar stations (several had already been commissioned), the Bell Canada contract for communications was amended to add troposcatter relay in addition to the original microwave scheme. Despite the weaselly contracting, the writing was on the wall and progress on microwave relay stations almost stopped. By the latter part of 1954, the microwave network was abandoned entirely. Bell Canada moved at incredible speed to complete the world's first troposcatter long-distance route, code named Pole Vault.

One of the major downsides of troposcatter communications is its inefficiency. Only a very small portion of the RF energy reaching the tropopause is reflected, and of that, only a small portion is reflected in the right direction. Path loss from transmitter to receiver for long links is over -200 dB, compared to say -130 dB for a microwave link. That difference looks smaller than it is; dB is a logarithmic comparison and the decrease from -130 dB to -200 dB is a factor of ten million.
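
As a quick check of that factor, using nothing beyond the definition of the decibel, the 70 dB gap works out to:

10^((200 - 130) / 10) = 10^7 = 10,000,000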

The solution is to go big. Pole Vault's antennas were manufactured as a rush order by D. S. Kennedy Co. of Massachusetts. 36 were required, generally four per site for transmit and receive in each direction. Each antenna was a 60' aluminum parabolic dish held up on edge by truss legs. Because of the extreme weather at the coastal radar sites, the antennas were specified to operate in a 120 knot wind---or a 100 knot wind with an inch of ice buildup. These were operating requirements, so the antenna had not only to survive these winds, but to keep flexing and movements small enough to not adversely impact performance. The design of the antennas was not trivial; even after analysis by both Kennedy Co. and Bell Canada, after installation some of the rear struts supporting the antennas buckled. All high-wind locations received redesigned struts.

To drive the antennas, Radio Engineering Laboratories of Long Island furnished radio sets with 10 kW of transmit power. Both D. S. Kennedy and Radio Engineering Laboratories were established companies, especially for military systems, but were still small compared to Bell System juggernauts like Western Electric and Northern Electric. They had built the equipment for the experimental sites, though, and the timeline for construction of Pole Vault was so short that planners did not feel there was time to contract larger manufacturers. This turn of events made Kennedy Co. and REL the leading experts in troposcatter equipment, which became their key business in the following decade.

The target of the contract, signed in January of 1954, was to have Pole Vault operational by the end of that same year. Winter conditions, and indeed spring and fall conditions, are not conducive to construction on the arctic coast. All of the equipment for Pole Vault had to be manufactured in the first half of the year, and as weather improved and ice cleared in the mid-summer, everything was shipped north and installation work began. Both militaries had turned down involvement in the complex and time-consuming logistics of the project, so Bell Canada chartered ships and aircraft and managed an incredibly complex schedule. To deliver equipment to sites as early as possible, the icebreaker CCGS D'Iberville was chartered. C-119 and DC-3 aircraft served alongside numerous small boats and airplanes.

All told, it took about seven months to manufacture and deliver equipment to the Pole Vault sites, and six months to complete construction. Construction workers, representing four or five different contractors at each site and reaching about 120 workers to a site during peak activity, had to live in construction camps that could still be located miles from the station. Grounded ships, fires, frostbite, and of course poor morale led to complications and delays. At one site, Saglek, project engineers recorded a full 24-hour day with winds continuously above 75 miles per hour, and then weeks later, a gust of 135 mph was observed. Repairs had to be made to the antennas and buildings before they were even completed.

In a remarkable feat of engineering and construction, the Pole Vault system was completed and commissioned more or less on schedule: amended into the contract in January of 1954, commissioning tests of the six initial stations were successfully completed in February of 1955. Four additional stations were built to complete the chain, and Pole Vault was declared fully operational in December of 1956 at a cost of $24.6 million (about $290 million today).

Pole Vault operated at various frequencies between 650 and 800 MHz, the wide range allowing for minimal frequency reuse---interference was fairly severe, since each station's signal scattered and could be received by stations further down the line in ideal (or as the case may be, less than ideal) conditions. Frequency division multiplexing equipment, produced by Northern Electric (Nortel) based on microwave carrier systems, offered up to 36 analog voice circuits. The carrier systems were modular, and some links initially supported only 12 circuits, while later operational requirements led to an upgrade to 70 circuits.

Over the following decades, the North Atlantic remained a critical challenge for North American air defense. It also became the primary communications barrier between the US and Canada and European NATO allies. Because Pole Vault provided connections across such a difficult span, several later military communications systems relied on Pole Vault as a backhaul connection.

An inventory of the Saglek site, typical of the system, gives an idea of the scope of each of the nine primary stations. This is taken from "Special Contract," a history by former Bell Canada engineer A. G. Lester:

(1) Four parabolic antennas, 60 feet in diameter, each mounted on seven mass concrete footings.
(2) An equipment building 62 by 32 feet to house electronic equipment, plus a small (10 by 10 feet) diversity building.
(3) A diesel building 54 by 36 feet, to house three 125 KVA (kilovolt amperes) diesel driven generators.
(4) Two 2500 gallon fuel storage tanks.
(5) Raceways to carry waveguide and cables.
(6) Enclosed corridors interconnecting buildings, total length in this case 520 feet.

Since the Pole Vault stations were colocated with radar facilities, barracks and other support facilities for the crews were already provided for. Of course, you can imagine that the overall construction effort at each site was much larger, including the radar systems as well as cantonment for personnel.

Pole Vault would become a key communications system in the maritime provinces, remaining in service until 1975. Its reliable performance in such a challenging environment was a powerful proof of concept for troposcatter, a communications technique first imagined only a handful of years earlier. Even as Pole Vault reached its full operating capability in late 1956, other troposcatter systems were under construction. In much the same way, and not unrelated, other radar early warning systems were under construction as well.

The Pinetree Line, for all of its historical interest and its many firsts, ended as a footnote in the history of North American air defense. More sophisticated radar fences were already under design by the time Pinetree Line construction started, leaving some Pinetree stations to operate for just four years. It is a testament to Pole Vault that it outlived much of the radar system it was designed to support, becoming an integral part of not one, or even two, but at least three later radar early warning programs. Moreover, Pole Vault became a template for troposcatter systems elsewhere in Canada, in Europe, and in the United States. But we'll have to talk about those later.

[1] Alexander Graham Bell was Scottish-Canadian-American, and lived for some time in rural Ontario and later Montreal. As a result, Bell Canada is barely younger than its counterpart in the United States and the early history of the two is more one of parallel development than the establishment of a foreign subsidiary. Bell's personal habit of traveling back and forth between Montreal and Boston makes the early interplay of the two companies a bit confusing. In 1956, the TAT-1 telephone cable would conquer the Atlantic Ocean to link the US to Scotland via Canada, incidentally making a charming gesture to Bell's personal journey.

[2] If you have studied weather a bit, you might recognize these as positive and negative lapse rates. The positive lapse rate in the troposphere is a major driver in the various phenomena we call "weather," and the tropopause forms a natural boundary that keeps most weather within the troposphere. Commercial airliners fly in the lower part of the stratosphere, putting them above most (but not all) weather.

2025-03-01 the cold glow of tritium

I have been slowly working on a book. Don't get too excited, it is on a very niche topic and I will probably eventually barely finish it and then post it here. But in the meantime, I will recount some stories which are related, but don't quite fit in. Today, we'll learn a bit about the self-illumination industry.

At the turn of the 20th century, it was discovered that the newfangled element radium could be combined with a phosphor to create a paint that glowed. This was pretty much as cool as it sounds, and commercial radioluminescent paints like Undark went through periods of mass popularity. The most significant application, though, was in the military: radioluminescent paints were applied first to aircraft instruments and later to watches and gunsights. The low light output of radioluminescent paints had a tactical advantage (being very difficult to see from a distance), while the self-powering nature of radioisotopes made them very reliable.

The First World War was thus the "killer app" for radioluminescence. Military demand for self-illuminating devices fed a "radium rush" that built mines, processing plants, and manufacturing operations across the country. It also fed, in a sense much too literal, the tragedy of the "Radium Girls." Several self-luminous dial manufacturers knowingly subjected their women painters to shockingly irresponsible conditions, leading inevitably to radium poisoning that disfigured, debilitated, and ultimately killed. Today, this is a fairly well-known story, a cautionary tale about the nuclear excess and labor exploitation of the 1920s. That the situation persisted into the 1940s is often omitted, perhaps too inconvenient to the narrative that a series of lawsuits, and what was essentially the invention of occupational medicine, headed off the problem in the late 1920s.

What did happen after the Radium Girls? What was the fate of the luminous radium industry?

A significant lull in military demand after WWI was hard on the radium business, to say nothing of a series of costly settlements to radium painters despite aggressive efforts to avoid liability. At the same time, significant radium reserves were discovered overseas, triggering a price collapse that closed most of the mines. The two largest manufacturers of radium dials, Radium Dial Company (part of Standard Chemical who owned most radium mines) and US Radium Corporation (USRC), both went through lean times. Fortunately for them, the advent of the Second World War reignited demand for radioluminescence.

The story of Radium Dial and USRC doesn't end in the 1920s---of course it doesn't, luminous paints having had a major 1970s second wind. Both companies survived, in various forms, into the current century. In this article, I will focus on the post-WWII story of radioactive self-illumination and the legacy that we live with today.

During its 1920s financial difficulties, the USRC closed the Orange, New Jersey plant famously associated with Radium Girls and opened a new facility in Brooklyn. In 1948, perhaps looking to manage expenses during yet another post-war slump, USRC relocated again to Bloomsburg, Pennsylvania. The Bloomsburg facility, originally a toy factory, operated through a series of generational shifts in self-illuminating technology.

The use of radium, with some occasional polonium, for radioluminescence declined in the 1950s and ended entirely in the 1970s. The alpha radiation emitted by those elements is very effective in exciting phosphors but so energetic that it damages them. A longer overall lifespan, and somewhat better safety properties, could be obtained by the use of a beta emitter like strontium or tritium. While strontium was widely used in military applications, civilian products shifted towards tritium, which offered an attractive balance of price and half life.

USRC handled almost a dozen radioisotopes in Bloomsburg, many of them due to diversified operations during the 1950s that included calibration sources, ionizers, and luminous products built to various specific military requirements. The construction of a metal plating plant enabled further diversification, including foil sources used in research, but eventually became an opportunity for vertical integration. By 1968, USRC had consolidated to only tritium products, with an emphasis on clocks and watches.

Radioluminescent clocks were a huge hit, in part because of their practicality, but fashion was definitely a factor. Millions of radioluminescent clocks were sold during the '60s and '70s, many of them by Westclox. Westclox started out as a typical clock company (the United Clock Company in 1885), but joined the atomic age through a long-lived partnership with the Radium Dial Company. The two companies were so close that they eventually became physically close as well: Radium Dial's occupational health tragedy played out in Ottawa, Illinois, a town Radium Dial had chosen as its headquarters due to its proximity to Westclox in nearby Peru [1].

Westclox sold clocks with radioluminescent dials from the 1920s to probably the 1970s, but one of the interesting things about this corner of atomic history is just how poorly documented it is. Westclox may have switched from radium to tritium at some point, and definitely abandoned radioisotopes entirely at some point. Clock and watch collectors, a rather avid bunch, struggle to tell when. Many consumer radioisotopes are like this: it's surprisingly hard to know if they even are radioactive.

Now, the Radium Dial Company itself folded entirely under a series of radium poisoning lawsuits in the 1930s. Simply being found guilty of one of the most malevolent labor abuses of the era would not stop free enterprise, though, and Radium Dial's president founded a legally distinct company called Luminous Processes just down the street. Luminous Processes is particularly notable for having continued the production of radium-based clock faces until 1978, making them the last manufacturer of commercial radioluminescent radium products. This also presents compelling circumstantial evidence that Westclox continued to use radium paint until sometime around 1978, which lines up with the general impressions of luminous dial collectors.

While the late '70s were the end of Radium Dial, USRC was just beginning its corporate transformation. From 1980 to 1982, a confusing series of spinoffs and mergers led to USR Industries, parent company of Metreal, parent company of Safety Light Corporation, which manufactured products to be marketed and distributed by Isolite. All of these companies were ultimately part of USR Industries, the former USRC, but the org chart sure did get more complex. The Nuclear Regulatory Commission expressed some irritation in their observation, decades later, that they weren't told about any of this restructuring until they noticed it on their own.

Safety Light, as expressed by the name, focused on a new application for tritium radioluminescence: safety signage, mostly self-powered illuminated exit signs and evacuation signage for aircraft. Safety Light continued to manufacture tritium exit signs until 2007, when they shut down following some tough interactions with the NRC and the EPA. They had been, in the fashion typical of early nuclear industry, disposing of their waste by putting it in a hole in the ground. They had persisted in doing this much longer than was socially acceptable, and ultimately seem to have been bankrupted by their environmental obligations... obligations which then had to be assumed by the Superfund program.

The specific form of illumination used in these exit signs, and by far the most common type of radioluminescence today, is the Gaseous Tritium Light Source or GTLS. GTLS are small glass tubes or vials, usually made with borosilicate glass, containing tritium gas and an internal coating of phosphor. GTLS are simple, robust, and due to the very small amount of tritium required, fairly inexpensive. They can be made large enough to illuminate a letter in an exit sign, or small enough to be embedded into a watch hand. Major applications include watch faces, gun sights, and the keychains of "EDC" enthusiasts.

Plenty of GTLS manufacturers have come and gone over the years. In the UK, defense contractor Saunders-Roe got into the GTLS business during WWII. Their GTLS product line moved to Brandhurst Inc., which had a major American subsidiary. It is an interesting observation that the US always seems to have been the biggest market for GTLS, but their manufacture has increasingly shifted overseas. Brandhurst is no longer even British, having gone the way of so much of the nuclear world by becoming Canadian. A merger with Canadian company SRB created SRB Technologies in Pembroke, Ontario, which continues to manufacture GTLS today.

Other Canadian GTLS manufacturers have not fared as well. Shield Source Inc., of Peterborough, Ontario, began filling GTLS vials in 1987. I can't find a whole lot of information on Shield Source's early days, but they seem to have mostly made tubes for exit signs, and perhaps some other self-powered signage. In 2012, the Canadian Nuclear Safety Commission (CNSC) detected a discrepancy in Shield Source's tritium emissions monitoring. I am not sure of the exact details, because CNSC seems to make less information public in general than the US NRC [2].

Here's what appears to have happened: tritium is a gas, which makes it tricky to safely handle. Fortunately, the activity of tritium is relatively low and its half life is relatively short. This means that it's acceptable to manage everyday leakage (for example when connecting and disconnecting things) in a tritium workspace by ventilating it to a stack, releasing it to the atmosphere for dilution and decay. The license of a tritium facility will specify a limit for how much radioactivity can be released this way, and monitoring systems (usually several layers of monitoring systems) have to be used to ensure that the permit limit is not exceeded. In the case of Shield Source, some kind of configuration error with the tritium ventilation monitoring system combined with a failure to adequately test and audit it. The CNSC discovered that during 2010 and 2011, the facility had undercounted their tritium emissions, and in fact exceeded the limits of their license.
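
To make the accounting concrete, here is a loose sketch of how a stack release tally works in principle: integrate the monitored concentration over the volume of air exhausted and compare the running total against the license limit. All of the numbers, units, and the weekly sampling interval below are my own invented examples, not Shield Source's actual license terms or monitoring design; the point is only that a miscalibrated concentration reading silently corrupts the whole total.

    # Hypothetical stack release accounting -- all figures are illustrative.
    ANNUAL_LIMIT_BQ = 1.0e15  # assumed license limit on tritium released per year

    def weekly_release_bq(avg_concentration_bq_per_m3, stack_flow_m3_per_hr, hours=168):
        # activity released = average concentration in the stack air * volume of air exhausted
        return avg_concentration_bq_per_m3 * stack_flow_m3_per_hr * hours

    # one (concentration, flow) sample pair per week; a bad calibration factor on the
    # concentration channel would understate every term in this sum
    weekly_samples = [(2.0e6, 5000.0)] * 52
    total = sum(weekly_release_bq(c, f) for c, f in weekly_samples)
    print(total, total <= ANNUAL_LIMIT_BQ)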

Air samplers located around the facility, some of which were also validated by an independent laboratory, did not detect tritium in excess of the environmental limits. This suggests that the excess releases probably did not have an adverse impact on human health or the environment. Still, exceeding license terms and then failing to report and correct the problem for two years is a very serious failure by a licensee. In 2012, when the problem was discovered, CNSC ordered Shield Source's license modified to prohibit actual tritium handling. This can seem like an odd maneuver but something similar can happen in the US. Just having radioisotope-contaminated equipment, storing test sources, and managing radioactive waste requires a license. By modifying Shield Source's license to prohibit tritium vial filling, the CNSC effectively shut the plant down while allowing Shield Source to continue their radiological protection and waste management functions. This is the same reason that long-defunct radiological facilities often still hold licenses from NRC in the US: they retain the licenses to allow them to store and process waste and contaminated materials during decommissioning.

In the case of Shield Source, while the violation was serious, CNSC does not seem to have anticipated a permanent shutdown. The terms agreed in 2012 were that Shield Source could regain a license to manufacture GTLS if it produced for CNSC a satisfactory report on the root cause of the failure and actions taken to prevent a recurrence. Shield Source did produce such a report, and CNSC seems to have mostly accepted it with some comments requesting further work (the actual report does not appear to be public). Still, in early 2013, Shield Source informed CNSC that it did not intend to resume manufacturing. The license was converted to a one-year license to facilitate decommissioning.

Tritium filling and ventilation equipment, which had been contaminated by long-term exposure to tritium, was "packaged" and disposed of. This typically consists of breaking things down into parts small enough to fit into 55-gallon drums, "overpacking" those drums into 65-gallon drums for extra protection, and then coordinating with transportation authorities to ship the materials in a suitable way to a facility licensed to dispose of them. This is mostly done by burying them in the ground in an area where the geology makes groundwater interaction exceedingly unlikely, like a certain landfill on the Texas-New Mexico border near Eunice. Keep in mind that tritium's short half life means this is not a long-term geological repository situation; the waste needs to be safely contained for only, say, fifty years to get down to levels not much different from background.
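
As a back-of-the-envelope check on that timescale: tritium's half-life is about 12.3 years, so the fraction of activity remaining after a given interval is just 0.5 raised to the number of half-lives elapsed. A minimal sketch (the half-life is the standard published value; the time points are arbitrary):

    HALF_LIFE_YEARS = 12.3  # tritium

    def fraction_remaining(years, half_life=HALF_LIFE_YEARS):
        # simple exponential decay: each half-life cuts the activity in half
        return 0.5 ** (years / half_life)

    for t in (12.3, 25, 50, 100):
        print(t, round(fraction_remaining(t), 4))
    # after ~50 years roughly 6% of the original activity remains; after a century, under 1%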

I don't know where the Shield Source waste went; CNSC says only that it went to a licensed facility. Once the contaminated equipment was removed, drywall and ceiling and floor finishes were removed in the tritium handling area and everything left was thoroughly cleaned. A survey confirmed that remaining tritium contamination was below CNSC-determined limits (for example, in-air concentrations that would lead to a dose of less than 0.01 mSv/year for 9-5 occupational exposure). At that point, the Shield Source building was released to the landlord they had leased it from, presumably to be occupied by some other company. Fortunately, tritium cleanup isn't all that complex.

You might wonder why Shield Source abruptly closed down. I assume there was some back-and-forth with CNSC before they decided to throw in the towel, but it is kind of odd that they folded entirely during the response to an incident that CNSC seems to have fully expected them to survive. I suspect that a full year of lost revenue was just too much for Shield Source: by 2012, when all of this was playing out, the radioluminescence market had seriously declined.

There are a lot of reasons. For one, the regulatory approach to tritium has become more and more strict over time. Radium is entirely prohibited in consumer goods, and the limit on tritium activity is very low. Even self-illuminating exit signs now require NRC oversight in the US, as discussed shortly. Besides, public sentiment has increasingly turned against the Friendly Atom in consumer contexts, and you can imagine that people are especially sensitive to the use of tritium in classic institutional contexts for self-powered exit signs: schools and healthcare facilities.

At the same time, alternatives have emerged. Non-radioactive luminescent materials, the kinds of things we tend to call "glow in the dark," have greatly improved since WWII. Strontium aluminate is a typical choice today---the inclusion of strontium might suggest otherwise, but strontium aluminate uses the stable natural isotope of strontium, Sr-88, and is not radioactive. Strontium aluminate has mostly displaced radioluminescence in safety applications, and for example the FAA has long allowed it for safety signage and path illumination on aircraft. Keep in mind that these luminescent materials are not self-powered. They must be "charged" by exposure to light. Minor adaptations are required, for example a requirement that the cabin lights in airliners be turned on for a certain period of time before takeoff, but in practice these limitations are considered preferable to the complexity and risks involved in the use of radioisotopes.

You are probably already thinking that improving electronics have also made radioluminescence less relevant. Compact, cool-running, energy-efficient LEDs and a wide variety of packages and form factors mean that a lot of traditional applications of radioluminescence are now simply electric. Here's just a small example: in the early days of LCD digital watches, it was not unusual for higher-end models to use a radioluminescent source as a backlight. Today that's just nonsensical: a digital watch needs a power source anyway, and even in the cheapest Casios a single LED offers a reasonable alternative. Radioluminescent digital watches were very short lived.

Now that we've learned about a few historic radioluminescent manufacturers, you might have a couple of questions. Where were the radioisotopes actually sourced? And why does Ontario come up twice? These are related. From the 1910s to the 1950s, radioluminescent products were mostly using radium sourced from Standard Chemical, who extracted it from mines in the Southwest. The domestic radium mining industry collapsed by 1955 due to a combination of factors: declining demand after WWII, cheaper radium imported from Brazil, and a broadly changing attitude towards radium that led the NRC to note in the '90s that we might never again find the need to extract radium: radium has a very long half life that makes it considerably more difficult to manage than strontium or tritium. Today, you could say that the price of radium has gone negative, in that you are far more likely to pay an environmental management company to take it away (at rather high prices) than to buy more.

But what about tritium? Tritium is not really naturally occurring; there technically is some natural tritium but it's at extremely low concentrations and very hard to get at. But, as it happens, irradiating water produces a bit of tritium, and nuclear reactors incidentally irradiate a lot of water. With suitable modifications, the tritium produced as a byproduct of civilian reactors can be concentrated and sold. Ontario Hydro has long had facilities to perform this extraction, and recently built a new plant at the Darlington Nuclear Station that processes heavy water shipped from CANDU reactors throughout Ontario. The primary purpose of this plant is to reduce environmental exposure from the release of "tritiated" heavy water; it produces more tritium than can reasonably be sold, so much of it is stored for decay. The result is that tritium is fairly abundant and cheap in Ontario.

Besides SRB Technologies, which packages tritium from Ontario Hydro into GTLS, another major manufacturer of GTLS is the Swiss company mb-microtec. mb-microtec is the parent of watch brand Traser and GTLS brand Trigalight, and seems to be one of the largest sources of consumer GTLS overall. Many of the tritium keychains you can buy, for example, use tritium vials manufactured by mb-microtec. NRC documents suggest that mb-microtec contracts a lot of their finished product manufacturing to a company in Hong Kong and that some of the finished products you see using their GTLS (like watches and fobs) are in fact white-labeled from that plant, but unfortunately don't make the original source of the tritium clear. mb-microtec has the distinction of operating the only recycling plant for tritium gas, and press releases surrounding the new recycling operation say they purchase the rest of their tritium supply. I assume from the civilian nuclear power industry in Switzerland, which has several major reactors operating.

A number of other manufacturers produce GTLS primarily for military applications, with some safety signage side business. And then there is, of course, the nuclear weapons program, which consumes the largest volume of tritium in the US. The US's tritium production facility for much of the Cold War actually shut down in 1988, one of the factors in most GTLS manufacturers being overseas. In the interim period, the sole domestic tritium supply was recycling of tritium in dismantled weapons and other surplus equipment. Since tritium has such a short half-life, this situation cannot persist indefinitely, and tritium production was resumed in 2004 at the Tennessee Valley Authority's Watts Bar nuclear generating station. Tritium extracted from that plant is currently used solely by the Department of Energy, primarily for the weapons program.

Finally, let's discuss the modern state of radioluminescence. GTLS, based on tritium, are the only type of radioluminescence available to consumers. All importation and distribution of GTLS requires an NRC license, although companies that only distribute products that have been manufactured and tested by another licensee fall under a license exemption category that still requires NRC reporting but greatly simplifies the process. Consumers that purchase these items have no obligations to the NRC. Major categories of devices under these rules include smoke detectors, detection instruments and small calibration sources, and self-luminous products using tritium, krypton, or promethium. You might wonder, "how big of a device can I buy under these rules?" The answer to that question is a bit complicated, so let me explain my understanding of the rules using a specific example.

Let's say you buy a GTLS keychain from massdrop or wherever people get EDC baubles these days [3]. The business you ordered it from almost certainly did not make it, and is acting as an NRC exempt distributor of a product. In NRC terms, your purchase of the product is not the "initial sale or distribution," that already happened when the company you got it from ordered it from their supplier. Their supplier, or possibly someone further up in the chain, does need to hold a license: an NRC specific license is required to manufacture, process, produce, or initially transfer or sell tritium products. This is the reason that overseas companies like SRB and mb-microtec hold NRC licenses; this is the only way for consumers to legally receive their products.

It is important to note the word "specific" in "NRC specific license." These licenses are very specific; the NRC approves each individual product including the design of the containment and labeling. When a license is issued, the individual products are added to a registry maintained by the NRC. When evaluating license applications, the NRC considers a set of safety objectives rather than specific criteria. For example, and if you want to read along, we're in 10 CFR 32.23:

In normal use and disposal of a single exempt unit, it is unlikely that the external radiation dose in any one year, or the dose commitment resulting from the intake of radioactive material in any one year, to a suitable sample of the group of individuals expected to be most highly exposed to radiation or radioactive material from the product will exceed the dose to the appropriate organ as specified in Column I of the table in § 32.24 of this part.

So the rules are a bit soft, in that a licensee can argue back and forth with the NRC over means of calculating dose risk and so on. It is, ultimately, at the NRC's discretion whether or not a device complies. It's surprisingly hard to track down original licensing paperwork for these products because of how frequently they are rebranded, and resellers never seem to provide detailed specifications. I suspect this is intentional, as I've found some cases of NRC applications that request trade secret confidentiality on details. Still, from the license paperwork I've found with hard numbers, it seems like manufacturers keep the total activity of GTLS products (e.g. a single GTLS sold alone, or the total of the GTLS in a watch) under 25 millicuries.

There do exist larger devices, of which exit signs are the largest category. Self-powered exit signs are also manufactured under NRC specific licenses, but their activity and resulting risk are too high to qualify for exemption at the distribution and use stage. Instead, all users of self-powered safety signs operate under a general license issued by the NRC (a general license meaning that it is implicitly issued to all such users). The general license is found in 10 CFR 31. Owners of tritium exit signs are required to designate a person to track and maintain the signs, to inform the NRC of that person's contact information and of any changes in that person, and to inform the NRC of any lost, stolen, or damaged signs. General licensees are not allowed to sell or otherwise transfer tritium signs, unless the signs are remaining in the same location (e.g. when a building is sold), in which case they must notify the NRC and disclose NRC requirements to the transferee.

When tritium exit signs reach the end of their lifespan, they must be disposed of by transfer to an NRC license holder who can recycle them. The general licensee has to notify the NRC of that transfer. Overall, the intent of the general license regulations is to ensure that they are properly disposed of: reporting transfers and events to the NRC, along with serial numbers, allows the NRC to audit for signs that have "disappeared." Missing tritium exit signs are a common source of NRC event reports. It should also be said that, partly for these reasons, tritium exit signs are pretty expensive. Roughly $300 for a new one, and $150 to dispose of an old one.

Other radioluminescent devices you will find are mostly antiques. Radium dials are reasonably common: anything with a luminescent dial made before, say, 1960 is probably radium, and Westclox products specifically likely used radium through 1978. The half-life of radium-226 is 1,600 years, so these radium dials have the distinction of often still working, although the paints have usually held up more poorly than the isotopes they contain. These items should be handled with caution, since the failure of the paint creates the possibility of inhaling or ingesting radium. They also emit radon as a decay product, which becomes hazardous in confined spaces, so radium dials should be stored in a well-ventilated environment.

Strontium-90 has a half-life of 29 years, and tritium 12 years, so vintage radioluminescent products using either have usually decayed to the extent that they no longer shine brightly or even at all. The phosphors used for these products will usually still fluoresce brightly under UV light and might even photoluminesce for a time after light exposure, but they will no longer stay lit in a dark environment. Fortunately, the decay that makes them not work also makes them much safer to handle. Tritium decays to helium-3, which is quite safe; strontium-90 decays to yttrium-90, which quickly decays to zirconium-90. Zirconium-90 is stable and only about as toxic as any other heavy metal. You can see why these radioisotopes are now much preferred over radium.

And that's the modern story of radioluminescence. Sometime soon, probably tomorrow, I will be sending out my supporter's newsletter, EYES ONLY, with some more detail on environmental remediation at historic processing facilities for radioluminescent products. You can learn a bit more about how US Radium was putting their waste in a hole in the ground, and also into a river, and sort of wherever else. You know Radium Dial Company was up to similar abuses.

[1] The assertion that Ottawa is conveniently close to Peru is one of those oddities of naming places after bigger, more famous places.

[2] CNSC's whole final report on Shield Source is only 25 pages. A similar decommissioning process in the US would produce thousands of pages of public record typically culminating in EPA Five Year Reviews which would be, themselves, perhaps a hundred pages depending on the amount of post-closure monitoring. I'm not familiar with the actual law but it seems like most of the difference is that CNSC does not normally publish technical documentation or original data (although one document does suggest that original data is available on request). It's an interesting difference... the 25-page report, really only 20 pages after front matter, is a lot more approachable for the public than a 400 page set of close-out reports. Much of the standard documentation in the US comes from NEPA requirements, and NEPA is infamous in some circles for requiring exhaustive reports that don't necessarily do anything useful. But from my perspective it is weird for the formal, published documentation on closure of a radiological site to not include hydrology discussion, demographics, maps, and fifty pages of data tables as appendices. Ideally a bunch of one-sentence acceptance emails stapled to the end for good measure. When it comes to describing the actual problem, CNSC only gives you a couple of paragraphs of background.

[3] Really channeling Guy Debord with my contempt for keychains here. During the writing of this article, I bought myself a tritium EDC bauble, so we're all in the mud together.

2025-02-17 of psychics and securities

Psychic Friends Network TV spot

September 6th, 1996. Eddie Murray, of the Baltimore Orioles, is at bat. He has hit 20 home runs this season and 499 in his career. Anticipation for the 500th had been building for the last week. It would make Murray only the third player to reach 500 home runs and 3000 hits. His career RBI total would land him in the top ten hitters in the history of the sport; his 500th home run was a statistical inevitability. Less foreseeable was the ball's glancing path through one of the most famous stories of the telephone business.

Statistics only tell you what might happen. Michael Lasky had made a career, a very lucrative one, of telling people what would happen. Lasky would have that ball. As usual, he made it happen by advertising.

Clearing right field, the ball landed in the hands of Dan Jones, a salesman from Towson, Maryland. Despite his vocation, he didn't immediately view his spectacular catch in financial terms. He told a newspaper reporter that he looked forward to meeting Murray, getting some signatures, some memorabilia. Instead, he got offers. At least three parties inquired about purchasing the ball, but the biggest offer came far from discreetly: an ad in the Baltimore Sun offering half a million dollars to whoever had it. Well, the offer was actually for a $25,000 annuity for 20 years, with a notional cash value of half a million but a time-adjusted value of $300,000 or less. I couldn't tell for sure, but given events that would follow, it seems unlikely that Jones ever received more than a few of the payments anyway. Still, the half a million made headlines, and NPV or not the sale price still set the record for a public sale of sports memorabilia.
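
The gap between the headline number and the real value is just the time value of money. A small sketch of the present value of that $25,000-a-year, 20-year annuity, using the standard ordinary-annuity formula and a couple of discount rates I picked for illustration (the actual terms of the deal aren't public as far as I know):

    def annuity_pv(payment, years, rate):
        # present value of an ordinary annuity: payment * (1 - (1 + r)^-n) / r
        return payment * (1 - (1 + rate) ** -years) / rate

    for rate in (0.05, 0.07):
        print(rate, round(annuity_pv(25_000, 20, rate)))
    # roughly $310,000 at 5% and $260,000 at 7% -- hence "a time-adjusted value of $300,000 or less"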

Lasky handled his new purchase with his signature sense of showmanship. He held a vote, a telephone vote: two 1-900 numbers, charging $0.95 a call, allowed the public to weigh in on whether he should donate the ball to the Babe Ruth Birthplace museum or display it in the swanky waterfront hotel he part-owned. The proceeds went to charity, and after the museum won the poll, the ball did too. The whole thing was a bit of a publicity stunt, Lasky thrived on unsubtle displays and he could part with the money. His 1-900 numbers were bringing in over $100 million a year.


Lasky's biography is obscure. Born 1942 in Brooklyn, he moved to Baltimore in the 1960s for some reason connected to a conspicuous family business: a blood bank. Perhaps the blood bank was a grift, it's hard to say now, but Lasky certainly had a unique eye for business. He was fond of horse racing, or really, of trackside betting. His father, a postal worker, had a proprietary theory of mathematics that he applied to predicting the outcome of the race. This art, or science, or sham, is called handicapping, and it became Lasky's first real success. Under the pseudonym Mike Warren, he published the Baltimore Bulletin, a handicapping newsletter advertising sure bets at all the region's racetracks.

Well, there were some little details of this business, some conflicts of interest, a little infringement on the trademark of the Preakness. The details are neither clear nor important, but he had some trouble with the racing commissions in at least three states. He probably wouldn't have tangled with them at all if he weren't stubbornly trying to hold down a license to breed racehorses while also running a betting cartel, but Lasky was always driven more by passion than reason.

Besides, he had other things going. Predicting the future in print was sort of an industry vertical, and he diversified. His mail-order astrology operation did well, before the Postal Service shut it down. He ran some sort of sports pager service, probably tied to betting, and I don't know what came of that. Perhaps on the back of a new year's resolution, he ran a health club, although it collapsed in 1985 with a bankruptcy case that revealed some, well, questionable practices. Strange that a health club just weeks away from bankruptcy would sell so many multi-year memberships, paid up front. And where did that money go, anyway?

No matter, Lasky was onto the next thing. During the 1980s, changes had occurred that would grow Lasky's future-predicting portfolio into a staple of American media. First, in 1984, a Reagan-era FCC voted to end most regulation of television advertising. Gone was the limit of 16 minutes per hour of paid programming. An advertiser could now book entire half-hour schedule slots. Second, during the early '80s AT&T standardized and promoted a new model in telephone billing. The premium-rate number, often called a "1-900 number" after the NPA assigned for their use, incurred a callee-determined per-minute toll that the telco collected and paid on to the callee. It's a bit like a nascent version of "Web 3.0": telephone microtransactions, an innovative new way to pay for information services.

It seems like a fair assumption that handicapping brought Lasky to the 1-900 racket, and he certainly did offer betting tip lines. But he had learned a thing or two from the astrology business, even if it ran afoul of Big Postal. Handicapping involved a surprising amount of work, and its marketing centered around the supposedly unique insight of the handicapper. Fixed recordings of advice could only keep people on a telephone line for so long, anyway. Astrology, though, involved even fewer facts, and even more opportunity to ramble. Best of all, there was an established industry of small-time psychics working out of their homes. With the magic of the telephone, every one of them could offer intuitive readings to all of America, for just $3.99 a minute.

In 1990, Lasky's new "direct response marketing" company Inphomation, Inc. contracted five-time Grammy winner Dionne Warwick, celebrity psychic Linda Georgian, and a studio audience to produce a 30-minute talk-show "infomercial" promoting the Psychic Friends Network. Over the next few years, Inphomation conjoined with an ad booking agency and a video production company under the ownership of Mike Lasky's son, Marc Lasky. Inphomation spent as much as a million dollars a week on television bookings, promoting a knitting machine and a fishing lure and sports tips, but most of all psychics. The original half-hour Psychic Friends Network spot is often regarded as the most successful infomercial in history. It remade Warwick's reputation, turning her from a singer into a psychic promoter. Calls to PFN's 1-900 number, charged at various rates that could reach over $200 an hour, brought in $140 million in revenue in its peak years of the mid-1990s.

Lasky described PFN as an innovative new business model, but it's one we now easily recognize as "gig work." Telephone psychics, recruited mostly by referral from the existing network, worked from home, answering calls on their own telephones. Some read Tarot, some gazed into crystals, others did nothing at all, but the important thing was that they kept callers on the line. After the phone company's cut and Inphomation's cut, they were paid a share of the per-minute rate that automatically appeared on callers' monthly phone bills.


A lot of people, and even some articles written in the last decade, link the Psychic Friends Network to "Miss Cleo." There's sort of a "Berenstain Bears" effect happening here; as widely as we might remember Miss Cleo's PFN appearances, there is no such thing. Miss Cleo was actually the head psychic and spokeswoman of the Psychic Reader's Network, which would be called a competitor to the Psychic Friends Network except that they didn't operate at the same time. In the early '00s, the Psychic Reader's Network collapsed in scandal. The limitations of its business model, a straightforward con, eventually caught up. It was sued out of business by a dozen states, then the FTC, then the FCC just for good measure.

The era of the 1-900 number was actually rather short. By the late '80s, it had already become clear that the main application of premium rate calling was not stock quotations or paid tech support or referral services. It was scams. An extremely common genre of premium rate number, almost the lifeblood of the industry, was joke lines that offered telephonic entertainment in the voice of cartoon characters. Advertisements for these numbers, run during morning cartoons, advised children to call right away. Their parents wouldn't find out until the end of the month, when the phone bill came and those jokes turned out to have run $50 in 1983's currency. Telephone companies were at first called complicit in the grift, but eventually bowed to pressure and, in 1987, made it possible for consumers to block 1-900 calling on their phone service. Of course, few telephone customers took advantage, and the children's joke line racket went on into the early '90s when a series of FTC lawsuits finally scared most of them off the telephone network.

Adult entertainment was another touchstone of the industry, although adult lines did not last as long on 1-900 numbers as we often remember. Ripping off adults via their children is one thing; smut is a vice. AT&T and MCI, the dominant long distance carriers and thus the companies that handled most 1-900 call volume, largely cut off phone sex lines by 1991. Congress passed a law requiring telephone carriers to block them by default anyway, but of course left other 1-900 services as is. Phone sex lines were far from gone, of course, but they had to find more nuanced ways to make their revenue: international rates and complicit telephone carriers, dial-around long distance revenue, and whatever else they could think of that regulators hadn't caught up to yet.

When Miss Cleo and her Psychic Reader's Network launched in 1997, psychics were still an "above board" use of the 1-900 number. The Psychic Readers lived to see the end of that era. In the late '90s, regulations changed to make unpaid 1-900 bills more difficult to collect. By 2001, some telephone carriers had dropped psychic lines from their networks as a business decision. The bill disputes simply weren't worth the hassle. In 2002, AT&T ended 1-900 billing entirely. Other carriers maintained premium-rate billing for another decade, but AT&T had most of the customer volume anyway.

The Psychic Friends Network, blessed by better vision, struck at the right time. 1990 to 1997 were the golden age of 1-900 and the golden age of Inphomation. Inphomation's three-story office building in Baltimore had a conference room with a hand-painted ceiling fresco of cherubs and clouds. In the marble-lined lobby, a wall of 25 televisions played Inphomation infomercials on repeat. At its peak, the Psychic Friends Network routed calls to 2,000 independent psychic contractors. Dionne Warwick and Linda Georgian were famous television personalities; Warwick wasn't entirely happy about her association with the brand but she made royalties whenever the infomercial aired. Some customers spent tens of thousands of dollars on psychic advice.

In 1993, a direct response marketing firm called Regal Communications made a deal to buy Inphomation. The deal went through, but just the next year Regal spun their entire 1-900 division off, and Inphomation exercised an option to become an independent company once again. A decade later, many of Regal's executives would face SEC charges over the details of Regal's 1-900 business, foreshadowing a common tendency of Psychic Friends Network owners.

The psychic business, it turns out, was not so unlike the handicapping business. Both were unsavory. Both made most of their money off of addicts. In the press, Lasky talked about casual fans that called for two minutes here and there. What's $5 for a little fun? You might even get some good advice. Lawsuits, regulatory action, and newspaper articles told a different story. The "30 free minutes" promotion used to attract new customers only covered the first two minutes of each call; the rest were billed at an aggressive rate. The most important customers stayed on the line for hours. Callers had to sit through a few minutes of recordings, charged at the full rate, before being connected to a psychic who drew out the conversation by speaking slowly and asking inane questions. Some psychics seem to have approached their job rather sincerely, but others apparently read scripts.

And just like the horse track, the whole thing moved a lot of money. Lasky continued to tussle with racing commissions over his thoroughbred horses. He bought a Mercedes, a yacht, a luxury condo, a luxury hotel whose presidential suite he used as an apartment, a half-million-dollar baseball. Well, a $300,000 baseball, at least.

Eventually, the odds turned against Lasky. Miss Cleo's Psychic Reader's Network was just one of the many PFN lookalikes that popped up in the late '90s. There was a vacuum to fill, because in 1997, Inphomation was descending into bankruptcy.


Opinions differ on Lasky's management and leadership. He was a visionary at least once, but later decisions were more variable. Bringing infomercial production in-house through his son's Pikesville Pictures might have improved creative control, but production budgets ballooned and projects ran late. PFN was still running mainly off of the Dionne Warwick shows, which were feeling dated, especially after a memorable 1993 Saturday Night Live parody featuring Christopher Walken. Lasky's idea for a radio show, the Psychic Friends Radio Network, had a promising trial run but then faltered on launch. Hardly half a dozen radio stations picked it up, and it lost Inphomation tens of millions of dollars. While the telephone industry's crackdown on psychics was still years away, PFN already struggled with a timeless trouble of the telephone network: billing.

AT&T had a long-established practice of withholding a portion of 1-900 revenue for chargebacks. Some customers would see the extra charges on their phone bills and call in with complaints; the telephone company, not really being the beneficiary of the revenue anyway, was not willing to go to much trouble to keep it and often agreed to a refund. Holding, say, 10% of a callee's 1-900 billings in reserve allowed AT&T to offer these refunds without taking a loss. The psychic industry, it turned out, was especially prone to end-of-month customer dissatisfaction. Chargebacks were so frequent that AT&T raised Inphomation's withholding to 20%, 30%, and even 40% of revenue.

At least, that's how AT&T told it. Lasky always seemed skeptical, alleging that the telephone companies were simply refusing to hand over money Inphomation was owed, making themselves a free loan. Inphomation brokered a deal to move their business elsewhere, signing an exclusive contract with MCI. MCI underdelivered: they withheld just as much revenue, in violation of the contract according to Lasky, and besides the MCI numbers suffered from poor quality and dropped calls. At least, that's how Inphomation told it. Maybe the dropped calls were on Inphomation's end, and maybe they had a convenient positive effect on revenue as callers paid for a few minutes of recordings before being connected to no one at all. By the time the Psychic Friends Network fell apart, there was a lot of blame passed around. Lasky would eventually prevail in a lawsuit against MCI for unpaid revenue, but not until it was too late.

By some combination of a lack of innovation in their product, largely unchanged since 1991, and increasing expenses for both advertising and its founder's lifestyle, Inphomation ended 1997 over $20 million in the red. In 1998 they filed for Chapter 11, and Lasky sought to reorganize the company as debtor-in-possession.

The bankruptcy case brought out some stories of Lasky's personal behavior. While some employees stood by him as a talented salesman and apt finder of opportunities, others had filed assault charges. Those charges were later dropped, but by many accounts, he had quite a temper. Lasky's habit of not just carrying but brandishing a handgun around the office certainly raised eyebrows. Besides, his expensive lifestyle persisted much too far into Inphomation's decline. The bankruptcy judge's doubts about Lasky reached a head when it was revealed that he had tried to hide the company's assets. Much of the infrastructure and intellectual property of the Psychic Friends Network, and no small amount of cash, had been transferred to the newly formed Friends of Friends LLC in the weeks before bankruptcy.

The judge also noticed some irregularities. The company controller had been sworn in as treasurer, signed the bankruptcy petition, and then resigned as treasurer in the space of a few days. When asked why the company chose this odd maneuver over simply having Lasky, corporate president, sign the papers, Lasky had trouble recalling the whole thing. He also had trouble recalling loans Inphomation had taken, meetings he had scheduled, and actions he had taken. When asked about Inphomation's board of directors, Lasky didn't know who they were, or when they had last met.

The judge used harsh language. "I've seen nothing but evidence of concealment, dishonesty, and less than full disclosure... I have no hope this debtor can reorganize with the present management." Lasky was removed, and a receiver appointed to manage Inphomation through a reorganization that quietly turned into a liquidation. And that was almost the end of the Psychic Friends Network.

Psychic Friends Network TV spot

The bankruptcy is sometimes attributed to Lasky's failure to adapt to the times, but PFN wasn't entirely without innovation. The Psychic Friends Network first went online, at psychicfriendsnetwork.com, in 1997. This website, launched in the company's final days, offered not only the PFN's 1-900 number but a landing page for a telephone-based version of "Colorgenics." Colorgenics was a personality test based on the "Lüscher color test," an assessment designed by a Swiss psychotherapist based on nothing in particular. There are dozens of colorgenics tests online today, many of which make various attempts to extract money from the user, but none with quite the verve of a color quiz via 1-900 number.

Inphomation just didn't quite make it in the internet age, or at least not directly. Most people know 1998 as the end of the Psychic Friends Network. The Dionne Warwick infomercials were gone, and that was most of PFN anyway. Without Linda Georgian, could PFN live on? Yes, it turns out, but not in its living form. The 1998 bankruptcy marked PFN's transition from a scam to the specter of a scam, and then to a whole different kind of scam. It was the beginning of the PFN's zombie years.

In 1999, Inphomation's assets were liquidated at auction for $1.85 million, a far cry from the company's mid-'90s valuations in the hundreds of millions. The buyer: Marc Lasky, Michael Lasky's son. PFN assets became part of PFN Holdings Inc., with Michael Lasky and Marc Lasky as officers. PFN was back.

It does seem that the Laskys made a brief second crack at a 1-900 business, but by 1999 the tide was clearly against expensive psychic hotlines. Telephone companies had started their crackdown, and attorney general lawsuits were brewing. Besides, after the buyout PFN Holdings didn't have much capital, and doesn't seem to have done much in the way of marketing. It's obscure what happened in these years, but I think the Laskys licensed out the PFN name.

psychicfriendsnetwork.com, from 2002 to around 2009, directed visitors to Keen. Keen was the Inphomation of the internet age, what Inphomation probably would have been if they had run their finances a little better in '97. Backed by $60 million in venture funding from names like Microsoft and eBay, Keen was a classic dotcom startup. They launched in '99 with the ambitious and original idea of operating a web directory and reference library. Like most of the seemingly endless number of reference website startups, they had to pivot to something else. Unlike most of the others, Keen and their investors had a relaxed set of moral strictures about the company's new direction. In the early 2000s, keen.com was squarely in the ethical swamp that had been so well explored by the 1-900 business. Their web directory specialized in phone sex and psychic advice---all offered by 1-800 numbers with convenient credit card payment, a new twist on the premium phone line model that bypassed the vagaries and regulations of telephone billing.

Keen is, incidentally, still around today. They'll broker a call or chat with empath/medium Citrine Angel, offering both angel readings and clairsentience, just $1 for the first 5 minutes and $2.99 a minute thereafter. That's actually a pretty good deal compared to the Psychic Friends Network's old rates. Keen's parent company, Ingenio, runs a half dozen psychic advice websites and a habit tracking app. But it says something about the viability of online psychics that Keen still seems to do most of their business via phone. Maybe the internet is not as much of a blessing for psychics as it seems, or maybe they just haven't found quite the right business model.

The Laskys enjoyed a windfall during PFN's 2000s dormancy. In 2004, the Inphomation bankruptcy estate settled its lawsuit against bankrupt MCI for withholding payments. The Laskys made $4 million. It's hard to say where that money went, maybe to backing Marc's Pikesville Pictures production company. Pikesville picked up odd jobs producing television commercials, promotional documentaries, and an extremely strange educational film intended to prepare children to testify in court. I only know about this because parts of it appear in the video "Marc Lasky Demo Reel," uploaded to YouTube by "Mike Warren," the old horse race handicapping pseudonym of Michael Lasky. It has 167 views, and a single comment, "my dad made this." That was Gabriela Lasky, Marc's daughter. It's funny how much of modern life plays out on YouTube, where Marc's own account uploaded the full run of PFN infomercials.


Some of that $4 million in MCI money might have gone into the Psychic Friends Networks' reboot. In 2009, Marc Lasky produced a new series of television commercials for PFN. "The legendary Psychic Friends Network is back, bigger and bolder than ever." An extremely catchy jingle goes "all new, all improved, all knowing: call the Psychic Friends Network." On PFN 2.0, you can access your favorite psychic whenever you wish, on your laptop, computer, on your mobile, or on your tablet.

These were decidedly modernized, directing viewers to text a keyword to an SMS shortcode or visit psychicfriendsnetwork.com, where they could set up real-time video consultations with PFN's network of advisors. Some referred to "newpfn.com" instead, perhaps because it was easier to type, or perhaps there was some dispute around the Keen deal. There were still echoes of the original 1990s formula. The younger Lasky seemed to be hunting for a new celebrity lead like Warwick, but having trouble finding one. Actress Vivica A. Fox appeared in one spot, but then sent a cease and desist and went to the press alleging that her likeness was used without her permission. Well, they got her to record the lines somehow, but maybe they never paid. Maybe she found out about PFN's troubled reputation after the shoot. In any case, Lasky went hunting again and landed on Puerto Rican astrologer and television personality Walter Mercado.

Mercado, who comes off as something like Liberace reimagined as a Spanish-language TV host, sells the Psychic Friends Network to a Latin beat and does a hell of a job of it. He was a recognizable face in Latin American media due to his astrology show, syndicated for many years by Univision, and he appears in a sparkling outfit that lets him deliver the line "the legend is back" with far more credibility than anyone else in the new PFN spots. He was no Dionne Warwick, though, and the 2009 PFN revival sorely lacked the production quality or charm of the '90s infomercial. It seems to have had little impact; this iteration of PFN is so obscure that many histories of the company are completely unaware of it.

Elsewhere, in Nevada, an enigmatic figure named Ya Tao Chang had incorporated Web Wizards Inc. I can tell you almost nothing about this; Chang is impossible to research and Web Wizards left no footprints. All I know is that, somehow, Web Wizards made it to a listing on the OTC securities market. In 2012, PFN Holdings needed money and, to be frank, I think that Chang needed a real business. Or, at least, something that looked like one. In a reverse merger, PFN Holdings joined Web Wizards, which was renamed Psychic Friends Network Inc., PFNI on the OTC bulletin board.

The deal was financed by Right Power Services, a British Virgin Islands company (or was it a Singapore company? accounts disagree), also linked to Chang. Supposedly, there were millions in capital. Supposedly, exciting things were to come for PFN.


Penny stocks are stocks that trade at low prices, under $5 or even more classically under $1. Because these prices are too low to qualify for listing on exchanges, they trade on less formal, and less heavily regulated, over-the-counter markets. Related to penny stocks are microcap stocks, stocks of companies with very small market capitalizations. These companies, being small and obscure, typically see minuscule trading volumes as well.

The low price, low volume, and thus high volatility of penny stocks makes them notoriously prone to manipulation. Fraud is rampant on OTC markets, and if you look up a few microcap names it's not hard to fall into a sort of alternate corporate universe. There exists what I call the "pseudocorporate world," an economy that relates to "real" business the same way that pseudoscience relates to science. Pseudocorporations have much of the ceremony of their legitimate brethren, but none of the substance. They have boards, executives, officers, they issue press releases, they publish annual reports. What they conspicuously lack is a product, or a business. Like NFTs or memecoins, they are purely tokens for speculation, and that speculation is mostly pumping and dumping.

Penny stock pseudocompanies intentionally resemble real ones; indeed, their operation, to the extent that they have one, is to manufacture the appearance of operating. They announce new products, that will never materialize, they announce new partnerships, that will never amount to anything, they announce mergers, that never close. They also rearrange their executive leadership with impressive frequency, due in no small part to the tendency of those leaders to end up in trouble with the SEC. All of this means that it's very difficult to untangle their history, and often hard to tell if they were once real companies that were hollowed out and exploited by con men, or whether they were a sham all along.

Web Wizards does not appear to have had any purpose prior to its merger with PFN, and as part of the merger deal the Laskys became the executive leadership of the new company. They seem to have legitimately approached the transaction as a way to raise capital for PFN, because immediately after the merger they announced PFN's ambitious future. This new PFN would be an all-online operation using live webcasts and 1:1 video calling. The PFN website became a landing page for their new membership service, and the Laskys were primed to produce a new series of TV spots. Little more would ever be heard of this.

In 2014, PFN Inc renamed itself to "Peer to Peer Network Inc.," announcing their intent to capitalize on PFN's early gig work model by expanding the company into other "peer to peer" industries. The first and only venture Peer to Peer Network (PTOP on OTC Pink) announced was an acquisition of 321Lend, a Silicon Valley software startup that intended to match accredited investors with individuals needing loans. Neither company seems to have followed up on the announcement, and a year later 321Lend announced its acquisition by Loans4Less, so it doesn't seem that the deal went through.

I might be reading too much between the lines, but I think there was a conflict between the Laskys, who had a fairly sincere intent to operate the PFN as a business, and the revolving odd lot of investors and executives that seem to grow like mold on publicly-traded microcap companies.

Back in 2010, a stockbroker named Joshua Sodaitis started work on a transit payment and routing app called "Freemobicard." In 2023, he was profiled in Business Leaders Review, one of dozens of magazines, podcasts, YouTube channels, and Medium blogs that exist to provide microcap executives with uncritical interviews that create an appearance of notability. The Review says Sodaitis "envisioned a future where seamless, affordable, and sustainable transportation would be accessible to all." Freemobicard, the article tells us, has "not only transformed the way people travel but has also contributed to easing traffic congestion and reducing carbon emissions."

It never really says what Freemobicard actually is, but that doesn't matter, because by the time it gets involved in our story Sodaitis had completely forgotten about the transportation thing anyway.


In 2015, disagreements between the psychic promoters and the stock promoters had come to a head. Attributing the move to differences in business vision, the Laskys bought the Psychic Friends Network assets out of Peer to Peer Network for $20,000 and resigned their seats on PTOP's board. At about the same time, PTOP announced a "licensing agreement" with a software company called Code2Action. The licensing agreement somehow involved Code2Action's CEO, Christopher Esposito, becoming CEO of PTOP itself. At this point Code2Action apparently rolled up operations, making the "licensing agreement" more of a merger, but the contract as filed with the SEC does indeed read as a license agreement. This is just one of the many odd and confusing details of PTOP's post-2015 corporate governance.

I couldn't really tell you who Christopher Esposito is or where he came from, but he seems to have had something to do with Joshua Sodaitis, because he would eventually bring Sodaitis along as a board member. More conspicuously, Code2Action's product was called Mobicard---or Freemobicard, depending on which press release you read. This Mobicard was a very different one, though. Prior to the merger it was some sort of SMS marketing product (a "text this keyword to this shortcode" type of autoresponse/referral service), but as PTOP renamed itself to Mobicard Inc. (or at least announced the intent to, I don't think the renaming ever actually happened) the vision shifted to the lucrative world of digital business cards. Their mobile app, Mobicard 1.0, allowed business professionals to pay a monthly fee to hand out a link to a basic profile webpage with contact information and social media links. Kind of like Linktree, but with LinkedIn vibes, higher prices, and less polish.

One of the things you'll notice about Mobicard is that, for a software company, they were pretty short on software engineers. Every version of the product (and they constantly announce new ones, with press releases touting Mobicard 1.5, 1.7, and 2.0) seems to have been contracted out to a different low-end software house. There are demo videos of various iterations of Mobicard, and they are extremely underwhelming. I don't think it really mattered; PTOP didn't expect Mobicard to make money. Making money is not the point of a microcap pseudocompany.

That same year, Code2Action signed another license agreement, much like the PTOP deal, but with a company called Cannabiz. Or maybe J M Farms Patient Group, the timeline is fuzzy. This was either a marketing company for medical marijuana growers or a medical marijuana grower proper, probably varying before and after they were denied a license by the state of Massachusetts on account of the criminal record of one of the founders. The whole cannabis aside only really matters because, first, it matches the classic microcap scam pattern of constantly pivoting to whatever is new and hot (which was, for a time, newly legalized cannabis), and second, because a court would later find that Cannabiz was a vehicle for securities fraud.

Esposito had a few years of freedom first, though, to work on his new Peer to Peer Network venture. He made the best of it: PTOP issued a steady stream of press releases related to contracts for Mobicard development, the appointment of various new executives, and events as minor as having purchased a new domain name. Despite the steady stream of mentions in the venerable pages of PRNewswire, PTOP doesn't seem to have actually done anything. In 2015, 2016, 2017, and 2018, PTOP failed to complete financial audits and SEC reports. To be fair, in 2016 Esposito was fined nearly $100,000 by the SEC as part of a larger case against Cannabiz and its executives. He must have had a hard time getting to the business chores of PTOP, especially since he had been barred from stock promotion.

In 2018, with PTOP on the verge of delisting due to the string of late audits, Joshua Sodaitis was promoted to CEO and Chairman of "Peer to Peer Network, Inc., (Stock Ticker Symbol PTOP) a.k.a. Mobicard," "the 1st and ONLY publicly traded digital business card company." PTOP's main objective became maintaining its public listing, and for a couple of years most discussion of the actual product stopped.

In 2020, PTOP made the "50 Most Admired Companies" in something called "The Silicon Valley Review," which I assume is prestigious and conveniently offers a 10% discount if you nominate your company for one of their many respected awards right now. "This has been a monumental year for the company," Sodaitis said, announcing that they had been granted two (provisional) patents and appointed a new advisory board (including one member "who is self-identified as a progressive millennial" and another who was a retired doctor). The bio of Sodaitis mentions the Massachusetts medical marijuana venture, using the name of the company that was denied a license and shuttered by the SEC, not the reorganized replacement. Sodaitis is not great with details.

It's hard to explain Mobicard because of this atmosphere of confusion. There was the complete change in product concept, which is itself confusing, since Sodaitis seems to have given the interview where he discussed Mobicard as a transportation app well after he had started describing it as a digital business card. Likewise, Mobicard has a remarkable number of distinct websites. freemobicard.com, mobicard.com, ptopnetwork.com, and mobicards.ca all seem oddly unaware of each other, and as the business plan continues to morph, are starting to disagree on what mobicard even is. The software contractor or staff developing the product keep changing, as does the version of mobicard they are about to launch. And on top of it all are the press releases.

Oh, the press releases. There's nary a Silicon Valley grift unmentioned in PTOP's voluminous newswire output. Crypto, the Metaverse, and AI all make appearances as part of the digital business card vision. As for the tone, the headlines speak for themselves.

"MOBICARD Set for Explosive Growth in 2024"

"MobiCard's Digital Business Card Revolutionizes Networking & Social Media"

"MOBICARD Revolutionizes Business Cards"

"Peer To Peer Network, aka Mobicard™ Announces Effective Form C Filing with the SEC and Launch of Reg CF Crowdfunding Campaign"

"Joshua Sodaitis, Mobicard, Inc. Chairman and CEO: 'We’re Highly Committed to Keeping Our 'One Source Networking Solution' Relevant to the Ever-Changing Dynamics of Personal and Professional Networking'"

"PTOP ANNOUNCES THE RESUBMISSION OF THE IMPROVED MOBICARD MOBILE APPS TO THE APPLE STORE AND GOOGLE PLAY"

"Mobicard™ Experienced 832% User Growth in Two Weeks"

"Peer To Peer Network Makes Payment to Attorney To File A Provisional Patent for Innovative Technology"

Yes, this company issues a press release when they pay an invoice. To be fair, considering the history of bankruptcy, maybe that's more of an achievement than it sounds.

In one "interview" with a "business magazine," Sodaitis talks about why Mobicard has taken so long to reach maturity. It's the Apple app store review, he explains, a story to which numerous iOS devs will no doubt relate. Besides, based on their press releases, they have had to switch contractors and completely redevelop the product multiple times. I didn't know that the digital business card was such a technical challenge. Sodaitis has been working on it for perhaps as long as fifteen years and still hasn't quite gotten to MVP.


You know where this goes, don't you? After decades of shady characters, trouble with regulators, cosplaying at business, and outright scams, there's only one way the story could possibly end.

All the way back in 2017, PTOP announced that they were "Up 993.75% After Launch Of Their Mobicoin Cryptocurrency." PTOP, the release continues, "saw a truly Bitcoin-esque move today, completely outdoing the strength of every other stock trading on the OTC market." PTOP's incredible market move was, of course, from $0.0005 to $0.0094. With 22 billion shares of common stock outstanding, that gave PTOP a valuation of over $200 million by the timeless logic of the crypto investor.

Of course, PTOP wasn't giving up on their OTC listing, and with declining Bitcoin prices their interest in the cryptocurrency seems to have declined as well. That was, until the political and crypto market winds shifted yet again. Late last year, PTOP was newly describing Mobicoin as a utility token. In November, they received a provisional patent on "A Cryptocurrency-Based Platform for Connecting Companies and Social Media Users for Targeted Marketing Campaigns." This is the latest version of Mobicard. As far as I can tell, it's now a platform where people are paid in cryptocurrency for tweeting advertising on behalf of a brand.

PTOP had to beef up their crypto expertise for this exciting new frontier. Last year, they hired "Renowned Crypto Specialist DeFi Mark," proprietor of a cryptocurrency casino and proud owner of 32,000 Twitter followers. "With Peer To Peer Network, we're poised to unleash the power of blockchain, likely triggering a significant shift in the general understanding of web3," he said.

"I have spoken to our Senior Architect Jay Wallace who is a genius at what he does and he knows that we plan to Launch Mobicard 1.7 with the MOBICOIN fully implemented shortly after the New President is sworn into office. I think this is a great time to reintroduce the world to MOBICOIN™ regardless of how I, or anyone feels about politics we can't deny the Crypto markets exceptional increase in anticipation to major regulatory transformations. I made it very clear to our Tech Team leader that this is a must to launch Mobicard™ 1.7.

Well, they've outdone themselves. Just two weeks ago, they announced Mobicard 2.0. "With enhanced features like real-time analytics, seamless MOBICOIN™ integration, and enterprise-level onboarding for up to 999 million employees, this platform is positioned to set new standards in the digital business card industry."

And how does that cryptocurrency integration work?

"Look the Mobicard™ Reward system is simple. We had something like it previously implemented back in 2017. If a MOBICARD™ user shares his MOBICARD™ 50 times in one week then he will be rewarded with 50 MOBICOIN's. If a MOBICARD user attends a conference and shares his digital business card MOBICARD™ with 100 people he will be granted 100 MOBICOIN™'s."

Yeah, it's best not to ask.


I decided to try out this innovative new digital business card experience, although I regret to say that the version in the Play Store is only 1.5. I'm sure they're just waiting on app store review. The dashboard looks pretty good, although I had some difficulty actually using it. I have not so far been able to successfully create a digital business card, and most of the tabs just lead to errors, but I have gained access to four or five real estate brokers and CPAs via the "featured cards." One of the featured cards is for Christopher Esposito, listed as "Crypto Dev" at NRGai.

Somewhere around 2019, Esposito brought Code2Action back to life again. He promoted a stock offering, talking up the company's bright future and many promising contracts. You might remember that this is exactly the kind of thing that the SEC got him for in 2016, and the SEC dutifully got him again. He was sentenced to five years of probation after a court found that he had lied about a plan to merge Code2Action with another company and taken steps to conceal the mass sale of his own stock in the company.

NRGai, or NRG4ai, they're inconsistent, is a token that claims to facilitate the use of idle GPUs for AI training. According to one analytics website, it has four holders and trades at $0.00.

The Laskys have moved on as well. Michael Lasky is now well into retirement, but Marc Lasky is President & Director of Fernhill Corporation, "a publicly traded Web3 Enterprise Software Infrastructure company focused on providing cloud based APIs and solutions for digital asset trading, NFT marketplaces, data aggregation and DeFi/Lending". Fernhill has four subsidiaries, ranging from a cryptocurrency market platform to mining software. None appear to have real products. Fernhill is trading on OTC Pink at $0.00045.

Joshua Sodaitis is still working on Mobicard. Mobicard 2.0 is set for a June 1 launch date, and promises to "redefine digital networking and position [PTOP] as the premier solution in the digital business card industry." "With these exciting developments, we anticipate a positive impact on the price of PTOP stock." PTOP is trading on OTC Pink at $0.00015.

Michael Lasky was reportedly fond of saying that "you can get more money from people over the telephone than using a gun." As it happens, he wielded a gun anyway, but he had a big personality like that. One wonders what he would say about the internet. At some point, in his golden years, he relaunched his handicapping business Mike Warren Sports. The website sold $97/month subscriptions for tips on the 2015 NFL and NCAA football seasons, and the customer testimonials are glowing. One of them is from CNN's Larry King, although it doesn't read much like a testimonial, more like an admission that he met Lasky once.

There might still be some hope. A microcap investor, operating amusingly as "FOMO Inc.," has been agitating to force a corporate meeting for PTOP. PTOP apparently hasn't held one in years, is once again behind on audits, and isn't replying to shareholder inquiries. Investors allege poor management by Sodaitis. The demand letter, in a list of CC'd shareholders the author claims to represent by proxy, includes familiar names: Mike and Marc Lasky. They never fully divested themselves of their kind-of-sort-of former company.

A 1998 article in the Baltimore Sun discussed Lasky's history as a handicapper. It quotes a former Inphomation employee, whose preacher father once wore a "Mike Warren Sports" sweater at the mall.

"A woman came up to him and said 'Oh, I believe in him, Mike Warren.' My father says, 'well, ma'am, everybody has to believe in something."

Lasky built his company on predicting the future, but of course, he was only ever playing the odds. Eventually, both turned on him. His company fell to a series of bad bets, and his scam fell to technological progress. Everyone has to believe in something, though, and when one con man stumbles there are always more ready to step in.

Psychic Friends Network TV spot

2025-02-02 residential networking over telephone

Recently, I covered some of the history of Ethernet's tenuous relationship with installed telephone cabling. That article focused on the earlier and more business-oriented products, but many of you probably know that there have been a number of efforts to run IP networking over installed telephone wiring in a residential and SOHO environment. There is a broader category of "computer networking over things you already have in your house," and some products remain pretty popular today, although seemingly less so in the US than in Europe.

The grandparent of these products is probably PhoneNet, a fairly popular product introduced by Farallon in the mid-'80s. At the time, local area networking for microcomputers was far from settled. Just about every vendor had their own proprietary solution, although many of them had shared heritage and resulting similarities. Apple Computer was struggling with the situation just like everyone else; in 1983 they introduced an XNS-based network stack for the Lisa called AppleNet and then almost immediately gave up on it [1]. Steve Jobs made the call to adopt IBM's token ring instead, which would have seemed like a pretty safe bet at the time because of IBM's general prominence in the computing industry. Besides, Apple was enjoying a period of warming relations with IBM, part of the 1980s-1990s pattern of Apple and Microsoft alternately courting IBM as their gateway into business computing.

The vision of token ring as the Apple network standard died the way a lot of token ring visions did, a victim of the late delivery and high cost of IBM's design. While Apple was waiting around for token ring to materialize, they sort of stumbled into their own LAN suite, AppleTalk [2]. AppleTalk was basically an expansion of the unusually sophisticated peripheral interconnect used by the Macintosh to longer cable runs. Apple put a lot of software work into it, creating a pretty impressive zero-configuration experience that did a lot to popularize the idea of LANs outside of organizations large enough to have dedicated network administrators. The hardware was a little more, well, weird. In true Apple fashion, AppleTalk launched with a requirement for weird proprietary cables. To be fair, one of the reasons for the system's enduring popularity was its low cost compared to Ethernet or token ring. They weren't price gouging on the cables the way it might seem today. Still, they were a decided inconvenience, especially when trying to connect machines across more than one room.

One of the great things about AppleTalk, in this context, is that it was very slow. As a result, even though the physical layer was basically RS-422, the electrical requirements for the cabling were pretty relaxed. Apple had already taken advantage of this for cost reduction, using a shared signal ground on the long cables rather than the dedicated differential pairs typical for RS-422. A hobbyist realized that you could push this further, and designed a passive dongle that used telephone wiring as a replacement for Apple's more expensive dongle and cables. He filed a patent and sold it to Farallon, who introduced the product as PhoneNet.

PhoneNet was a big hit. It was cheaper than Apple's solution for the same performance, and even better, because AppleTalk was already a bus topology it could be used directly over the existing parallel-wired telephone cabling in a typical house or small office. For a lot of people with heritage in the Apple tradition of computing, it was likely the first LAN they ever used. Larger offices even used it because of the popularity of Macs in certain industries and the simplicity of patching their existing telephone cables for AppleTalk use; in my teenage years I worked in an office suite in downtown Portland that hadn't seen a remodel for a while and still had telephone jacks labeled "PhoneNet" at the desks.

PhoneNet had one important limitation compared to the network-over-telephone products that would follow: it could not coexist with telephony. Well, it could, in a sense, and was advertised as such. But PhoneNet signaled within the voice band, so it required dedicated telephone pairs. In a lot of installations, it could use the second telephone line that was often wired but not actually used. Still, it was a bust for a lot of residential installs where only one phone line was fully wired and already in use for phone calls.

As we saw in the case of Ethernet, local area networking standards evolved very quickly in the '80s and '90s. IP over Ethernet became by far the dominant standard, so the attention of the industry shifted towards new physical media for Ethernet frames. While 10BASE-T Ethernet operated over category 3 telephone wiring, that was of little benefit in the residential market. Commercial buildings typically had "home run" telephone wiring, in which each office's telephone pair ran directly to a wiring closet. In residential wiring of the era, this method was almost unheard of, and most houses had their telephone jacks wired in parallel along a small number of linear segments (often just one).

This created a cabling situation much like coaxial Ethernet, in which each telephone jack was a "drop" along a linear bus. The problem is that coaxial Ethernet relied on several different installation measures to make this linear bus design practical, and home telephone wiring had none of these advantages. Inconsistently spaced drops, side legs, and a lack of termination meant that reflections were a formidable problem. PhoneNet addressed reflections mainly by operating at a very low speed (allowing reflections to "clear out" between symbols), but such a low bitrate did not befit the 1990s.
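
A quick back-of-envelope comparison makes the point, and it's easier to show than to narrate. The sketch below is in Python; the cable length, velocity factor, and the 10 Mbps comparison rate are assumed example values, not measurements of any particular installation.

```python
# Back-of-envelope: bit time vs. reflection round-trip time on an
# unterminated telephone stub. Illustrates why PhoneNet's low bitrate
# tolerated messy wiring. All values below are assumed examples.

PROPAGATION_M_PER_S = 2.0e8   # ~0.66c, a rough velocity factor for twisted pair (assumed)
STUB_LENGTH_M = 50.0          # a longish in-home run (assumed)

reflection_round_trip = 2 * STUB_LENGTH_M / PROPAGATION_M_PER_S  # ~0.5 microseconds

for name, bitrate in [("PhoneNet/LocalTalk-class, 230.4 kbps", 230_400),
                      ("10 Mbps Ethernet-class signaling", 10_000_000)]:
    bit_time = 1.0 / bitrate
    ratio = bit_time / reflection_round_trip
    print(f"{name}: bit time is {ratio:.1f}x the reflection round trip")
```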

A promising solution to the reflection problem came from a company called Tut Systems. Tut's history is unfortunately obscure, but they seem to have been involved in what we would now call "last-mile access technologies" since the 1980s. Tut would later be acquired by Motorola, but not before developing a number of telephone-wiring based IP networks under names like HomeWire and LongWire. A particular focus of Tut was multi-family housing, which will become important later.

I'm not even sure when Tut introduced their residential networking product, but it seems like they filed a relevant patent in 1995, so let's say around then. Tut's solution relied on pulse position modulation (PPM), a technique in which data is encoded by the length of the spacing between pulses. The principal advantage of PPM is that it allows a fairly large number of bits to be transmitted per pulse (by using, say, 16 potential pulse positions to encode 4 bits). This allowed reflections to dissipate between pulses, even at relatively high bitrates.
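
To make the pulse-position idea concrete, here's a minimal sketch of 16-slot PPM encoding and decoding in Python. The slot count and frame layout are illustrative choices of mine, not the parameters of Tut's actual scheme.

```python
# Minimal sketch of pulse position modulation (PPM): each 4-bit group
# selects one of 16 possible pulse slots within a frame. Parameters are
# illustrative, not taken from the Tut/HomePNA design.

BITS_PER_SYMBOL = 4
SLOTS = 2 ** BITS_PER_SYMBOL  # 16 candidate pulse positions per frame

def ppm_encode(data: bytes) -> list[list[int]]:
    """Turn each nibble into a frame of 16 slots containing a single pulse."""
    frames = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            frame = [0] * SLOTS
            frame[nibble] = 1  # the pulse position carries the 4-bit value
            frames.append(frame)
    return frames

def ppm_decode(frames: list[list[int]]) -> bytes:
    """Recover bytes by reading the pulse position of each pair of frames."""
    nibbles = [frame.index(1) for frame in frames]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

if __name__ == "__main__":
    message = b"PPM"
    assert ppm_decode(ppm_encode(message)) == message
```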

Following a bit of inter-corporate negotiation, the Tut solution became an industry standard under the HomePNA consortium: HomePNA 1.0. HomePNA 1.0 could transmit 1Mbps over residential telephone wiring with up to 25 devices. A few years later, HomePNA 1.0 was supplanted by HomePNA 2.0, which replaced PPM with QAM (a more common technique for high data rates over low bandwidth channels today) and in doing so improved to 10Mbps for potentially thousands of devices.

I sort of questioned writing an article about all of these weird home networking media, because the end-user experience for most of them is pretty much the same. That makes it kind of boring to look at them one by one, as you'll see later. Fortunately, HomePNA has a property that makes it interesting: despite a lot of the marketing talking more about single-family homes, Tut seems to have envisioned HomePNA mainly as a last-mile solution for multi-family housing. That makes HomePNA a bit different than later offerings, landing in a bit of a gray area between the LAN and the access network.

The idea is this: home run wiring is unusual in residential buildings, but in apartment and condo buildings, it is typical for the telephone lines of each unit to terminate in a wiring closet. This yields a sort of hybrid star topology where you have one line to each unit, and multiple jacks in each unit. HomePNA took advantage of this wiring model by offering a product category that is at once bland and rather unusual for this type of media: a hub.

HomePNA hubs are readily available, even today in used form, with 16 or 24 HomePNA interfaces. The idea of a hub can be a little confusing for a shared-bus media like HomePNA, but each interface on these hubs is a completely independent HomePNA network. In an apartment building, you could connect one interface to the telephone line of each apartment, and thus offer high-speed (for the time) internet to each of your tenants using existing infrastructure. A 100Mbps Ethernet port on the hub then connected to whatever upstream access you had available.

The use of the term "hub" is kind of weird, and I do believe that at least in the case of HomePNA 2.0, they were actually switching devices. This leads to some weird labeling like "hub/switch," perhaps a result of the underlying oddity of a multi-port device on a shared-media network that nonetheless performs no routing.

There's another important trait of HomePNA 2.0 that we should discuss, at least an important one to the historical development of home networking. HomePNA 1.0 was designed not to cause problematic interference with telephone calls but still effectively signaled within the voice band. HomePNA 2.0's QAM modulation addressed this problem completely: it signaled between 4MHz and 10MHz, which put it comfortably above not only the voice band but the roughly up-to-1MHz band used by early ADSL. HomePNA could coexist with pretty much anything else that would have been used on a telephone line at the time.

Over time, control of HomePNA shifted away from Tut Systems and towards a competitor called Epigram, who had developed the QAM modulation for HomePNA 2.0. Later part of Broadcom, Epigram also developed a 100Mbps HomePNA 3.0 in 2005. The wind was mostly gone from HomePNA's sails by that point, though, more due to the rise of WiFi than anything else. There was a HomePNA 3.1, which added support for operation over cable TV wiring, but shortly after, in 2009, the HomePNA consortium endorsed the HomeGrid Forum as a successor. A few years later, HomePNA merged into HomeGrid Forum and faded away entirely.

The HomeGrid Forum is the organization behind G.hn, which is to some extent a successor of HomePNA, although it incorporates other precedents as well. G.hn is actually fairly widely used given the near-zero name recognition it enjoys, and I can't help but suspect that that's a result of the rather unergonomic names that ITU standards tend to take on. "G.hn" kind-of-sort-of stands for Gigabit Home Networking, which is at least more memorable than the formal designation G.9960, but still isn't at all distinctive.

G.hn is a pretty interesting standard. It's quite sophisticated, using a complex and modern modulation scheme (OFDM) along with forward error correction. It is capable of up to 2Gbps in its recent versions, and is kind of hard to succinctly discuss because it supports four distinct physical media: telephone, coaxial (TV) cable, powerline, and fiber.
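
For a rough sense of what OFDM means in practice, here's a toy sketch that maps bit pairs to QPSK symbols, puts one symbol on each subcarrier, and synthesizes the time-domain waveform with an inverse FFT. The carrier count, constellation, and cyclic prefix length are arbitrary teaching values, not the actual G.9960 parameters, and real G.hn layers FEC, framing, and per-medium band plans on top.

```python
# Toy illustration of the OFDM idea: spread data across many narrow
# subcarriers and generate the time-domain signal with an inverse FFT.
# All parameters are arbitrary examples, not G.9960 values.
import numpy as np

N_CARRIERS = 64
CYCLIC_PREFIX = 16

def qpsk_map(bits: np.ndarray) -> np.ndarray:
    """Map bit pairs to QPSK constellation points."""
    b = bits.reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def ofdm_symbol(bits: np.ndarray) -> np.ndarray:
    """Build one OFDM symbol: one QPSK symbol per carrier, IFFT, cyclic prefix."""
    carriers = qpsk_map(bits)            # one complex symbol per subcarrier
    time_domain = np.fft.ifft(carriers)  # superpose all subcarriers
    return np.concatenate([time_domain[-CYCLIC_PREFIX:], time_domain])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bits = rng.integers(0, 2, size=2 * N_CARRIERS)  # 2 bits per carrier (QPSK)
    symbol = ofdm_symbol(bits)
    print(len(symbol), "samples per OFDM symbol")   # 80 with these parameters
```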

G.hn's flexibility is probably another reason for its low brand recognition, because it looks very different in different applications. Distinct profiles of G.hn involve different band plans and signaling details for each physical medium, and it's designed to coexist with other protocols like ADSL when needed.

Unlike HomePNA, G.hn was not designed with multi-family housing as a major consideration, and combining multiple networks with a "hub/switch" is unusual. There's a reason: G.hn wasn't designed by access network companies like Tut; it was mostly designed in the television set-top box (STB) industry.

When G.hn hit the market in 2009, cable and satellite TV was rapidly modernizing. The TiVo had established DVRs as nearly the norm, and then pushed consumers further towards the convenience of multi-room DVR systems. Providing multi-room satellite TV is surprisingly complex, because STV STBs (say that five times fast) actually reconfigure the LNB in the antenna as part of tuning. STB manufacturers, dominated by EchoStar (at one time part of Hughes and closely linked to the Dish Network), had solved this problem by making multiple STBs in a home communicate with each other. Typically, there is a "main" STB that actually interacts with the antenna and decodes TV channels. Other STBs in the same house use the coaxial cabling to communicate with the main STB, requesting video signals for specific channels.

Multi-room DVR was basically an extension of this same concept. One STB is the actual DVR, and other STBs remote-control it, scheduling recordings and then having the main STB play them back, transmitting the video feed over the in-home coaxial cabling. You can see that this is becoming a lot like HomePNA, repurposing CATV-style or STV-style coaxial cabling as a general-purpose network in which peer devices can communicate with each other.

As STB services have become more sophisticated, "over the top" media services and "triple play" combo packages have become an important and lucrative part of the home communications market. Structurally, these services can feel a little clumsy, with an STB at the television and a cable modem with telephone adapters somewhere else. STBs increasingly rely on internet-based services, so you end up connecting the STB to your WiFi, routing its traffic over the same cabling it already uses for television but through a different modem. It's awkward.

G.hn was developed to unify these communications devices, and that's mostly how it's used. Providers like AT&T U-verse build G.hn into their cable television devices so that they can all share a DOCSIS internet connection. There are two basic ways of employing G.hn: first, you can use it to unify devices. The DOCSIS modem for internet service is integrated into the STB, and then G.hn media adapters can provide Ethernet connections wherever there is an existing cable drop. Second, G.hn can also be applied to multi-family housing, by installing a central modem system in the wiring closet and connecting each unit via G.hn. Providers that have adopted G.hn often use both configurations depending on the customer, so you see a lot of STBs these days with G.hn interfaces and extremely flexible configurations that allow them to either act as the upstream internet connection for the G.hn network, or to use a G.hn network that provides internet access from somewhere else. The same STB can thus be installed in either a single-family home or a multi-family unit.

We should take a brief aside here to mention MoCA, the Multimedia over Coax Alliance. MoCA is a somewhat older protocol with a lot of similarities to G.hn. It's used in similar ways, and to some extent the difference between the two just comes down to corporate alliances: AT&T is into G.hn, but Cox, both US satellite TV providers, and Verizon have adopted MoCA, making it overall the more common of the two. I just think it's less interesting. Verizon FiOS prominently uses MoCA to provide IP-based television service to STBs, via an optical network terminal that provides MoCA to the existing CATV wiring.

We've looked at home networking over telephone wiring, and home networking over coaxial cable. What about the electrical wiring? G.hn has a powerline profile, although it doesn't seem to be that widely used. Home powerline networking is much more often associated with HomePlug.

Well, as it happens, HomePlug is sort of dead, the industry organization behind it having wrapped up operations in 2016. That might not be such a big practical problem, though, as HomePlug is closely aligned with related IEEE standards for data over powerline and it's widely used in embedded applications.

As a consumer product, HomePlug will be found in the form of HomePlug AV2. AV2 offers Gigabit-plus data rates over good quality home electrical wiring, and compared to G.hn and MoCA it enjoys the benefit that standalone, consumer adapters are very easy to buy.

HomePlug selects the most complex modulation the wiring can support (typically QAM with a large constellation size) and uses multiple OFDM carriers in the HF band, which it transmits onto the neutral conductor of an outlet. The neutral wiring in the average house is also joined at one location in the service panel, so it provides a convenient shared bus. On the downside, the installation quality of home electrical wiring is variable and the neutral conductor can be noisy, so some people experience very poor performance from HomePlug. Others find it to be great. It really depends on the situation.
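
That "pick the densest modulation the wiring will bear" behavior is per-carrier bit loading, and a rough sketch shows how little machinery the idea needs. The SNR thresholds, carrier counts, and symbol rate below are made-up placeholders, not values from the HomePlug AV2 specification.

```python
# Rough sketch of per-carrier bit loading: choose the largest QAM
# constellation each carrier's SNR can support, then sum up a raw
# throughput estimate. All thresholds and rates are assumed examples.

# (bits per symbol, minimum SNR in dB needed to use that constellation)
CONSTELLATIONS = [(12, 36.0), (10, 30.0), (8, 24.0), (6, 18.0),
                  (4, 12.0), (2, 6.0), (1, 3.0)]

def bits_for_snr(snr_db: float) -> int:
    """Choose the largest constellation whose SNR requirement is met."""
    for bits, required in CONSTELLATIONS:
        if snr_db >= required:
            return bits
    return 0  # carrier too noisy to use at all

def estimate_throughput(carrier_snrs_db: list[float], symbol_rate_hz: float) -> float:
    """Sum loaded bits across carriers and scale by the OFDM symbol rate."""
    total_bits = sum(bits_for_snr(snr) for snr in carrier_snrs_db)
    return total_bits * symbol_rate_hz  # raw PHY rate, before FEC and protocol overhead

if __name__ == "__main__":
    # A noisy outlet: a mix of clean, mediocre, and nearly unusable carriers.
    snrs = [35.0] * 500 + [20.0] * 300 + [5.0] * 117
    print(f"{estimate_throughput(snrs, 24_400) / 1e6:.1f} Mbit/s")
```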

That brings us to the modern age: G.hn, MoCA, and HomePlug are all more or less competing standards for data networking using existing household wiring. As a consumer, you're most likely to use G.hn or MoCA if you have an ISP that provides equipment using one of the two. Standalone consumer installations, for people who just want to get Ethernet from one place to another without running cable, usually use HomePlug.

It doesn't really have to be that way, G.hn powerline adapters have come down in price to where they compete pretty directly with HomePlug. Coaxial-cable and telephone-cable based solutions actually don't seem to be that popular with consumers any more, so powerline is the dominant choice. I can take a guess at the reason: electrical wiring can be of questionable quality, but in a lot of houses I see the coaxial and telephone wiring is much worse. Some people have outright removed the telephone wiring from houses, and the coaxial plant has often been through enough rounds of cable and satellite TV installers that it's a bit of a project to sort out which parts are connected. A large number of cheap passive distribution taps, common in cable TV where the signal level from the provider is very high, can be problematic for coaxial G.hn or MoCA. It's usually not hard to fix those problems, but unless an installer from the ISP sorts it out it usually doesn't happen. For the consumer, powerline is what's most likely to work.

And, well, I'm not sure that any consumers care any more. WiFi has gotten so fast that it often beats the data rates achievable by these solutions, and it's often more reliable to boot. HomePlug in particular has a frustrating habit of working perfectly except for when something happens, conditions degrade, the adapters switch modulations, and the connection drops entirely for a few seconds. That's particularly maddening behavior for gamers, who are probably the most likely to care about the potential advantages of these wired solutions over WiFi.

I expect G.hn, MoCA, and HomePlug to stick around. All three have been written into various embedded standards and adopted by ISPs as part of their access network in multi-family contexts, or at least as an installation convenience in single-family ones. But I don't think anyone really cares about them any more, and they'll start to feel as antiquated as HomePNA.

And here's a quick postscript to show how these protocols might adapt to the modern era: remember how I said G.hn can operate over fiber? Cheap fiber, too, the kind of plastic cables used by S/PDIF. The HomeGrid Forum is investigating the potential of G.hn over in-home passive optical networks, on the theory that these passive optical networks can be cheaper (due to small cable size and EMI tolerance) and more flexible (due to the passive bus topology) than copper Ethernet. I wouldn't bet money on it, given the constant improvement of WiFi, but it's possible that G.hn will come back around for "fiber in the home" internet service.

[1] XNS was a LAN suite designed by Xerox in the 1970s. Unusually for the time, it was an openly published standard, so a considerable number of the proprietary LANs of the 1980s were at least partially based on XNS.

[2] The software sophistication of AppleTalk is all the more impressive when you consider that it was basically a rush job. Apple was set to launch LaserWriter, and as I mentioned recently on Mastodon, it was outrageously expensive. LaserWriter was built around the same print engine as the first LaserJet and still cost twice as much, due in good part to its flexible but very demanding PostScript engine. Apple realized it would never sell unless multiple Macintoshes could share it---it cost nearly as much as three Mac 128ks!---so they absolutely needed to have a LAN solution ready. LaserWriter would not wait for IBM to get their token ring shit together. This is a very common story of 1980s computer networks; it's hard to appreciate now how much printer sharing was one of the main motivations for networking computers at all. There's this old historical theory that hasn't held up very well but is appealing in its simplicity, that civilization arises primarily in response to the scarcity of water and thus the need to construct irrigation works. You could say that microcomputer networking arises primarily in response to the scarcity of printers.

2025-01-20 office of secure transportation

I've seen them at least twice on /r/whatisthisthing, a good couple dozen times on the road, and these days, even in press photos: GMC trucks with custom square boxes on the back, painted dark blue, with US Government "E" plates. These courier escorts, "unmarked" but about as subtle as a Crown Vic with a bull bar, are perhaps the most conspicuous part of an obscure office of a secretive agency. One that seems chronically underfunded but carries out a remarkable task: shipping nuclear weapons.

The first nuclear weapon ever constructed, the Trinity Device, was transported over the road from Los Alamos to the north end of the White Sands Missile Range, near San Antonio, New Mexico. It was shipped disassembled, with the non-nuclear components strapped down in a box truck and the nuclear pit nestled in the back seat of a sedan. Army soldiers of the Manhattan Engineer District accompanied it for security. This was a singular operation, and the logistics were necessarily improvised.

The end of the Second World War brought a brief reprieve in the nuclear weapons program, but only a brief one. By the 1950s, an arms race was underway. The civilian components of the Manhattan Project, reorganized as the Atomic Energy Commission, put manufacturing of nuclear arms into full swing. Most nuclear weapons of the late '40s, gravity bombs built for the Strategic Air Command, were assembled at former Manhattan Project laboratories. They were then "put away" at one of the three original nuclear weapons stockpiles: Manzano Base, Albuquerque; Killeen Base, Fort Hood; and Clarksville Base, Fort Campbell [1].

By the mid-1950s, the Pantex Plant near Amarillo had been activated as a full-scale nuclear weapons manufacturing center. Weapons were stockpiled not only at the AEC's tunnel sites but at the "Q Areas" of about 20 Strategic Air Command bases throughout the country and overseas. Shipping and handling nuclear weapons was no longer a one-off operation, it was a national enterprise.

To understand the considerations around nuclear transportation, it's important to know who controls nuclear weapons. In the early days of the nuclear program, all weapons were exclusively under civilian control. Even when stored on military installations (as nearly all were), the keys and combinations to the vaults were held by employees of the AEC, not military personnel. Civilian control was a key component of the Atomic Energy Act, an artifact of a political climate that disfavored the idea of fully empowering the military with such destructive weapons. Over the decades since, larger and larger parts of the nuclear arsenal have been transferred into military control. The majority of "ready to use" nuclear weapons today are "allocated" to the military, and the military is responsible for storing and transporting them.

Even today, though, civilian control is very much in force for weapons in any state other than ready for use. Newly manufactured weapons (in eras in which there was such a thing), weapons on their way to and from refurbishment or modification, and weapons removed from the military allocation for eventual disassembly are all under the control of the Department of Energy's National Nuclear Security Administration [2]. So too are components of weapons, test assemblies, and the full spectrum of Special Nuclear Material (a category defined by the Atomic Energy Act). Just as in the 1940s, civilian employees of the DoE are responsible for securing and transporting a large inventory of weapons and sensitive assets.

As the Atomic Energy Commission matured, and nuclear weapons became less of an experiment and more of a product, transportation arrangements matured as well. It's hard to find much historical detail on AEC shipping before the 1960s, but we can pick up a few details from modern DoE publications showing how the process has improved. Weapons were transported in box trucks as part of a small convoy, accompanied by "technical couriers, special agents, and armed military police." Technical courier was an AEC job title, one that persisted for decades to describe the AEC staff who kept custody of weapons under transport. Despite the use of military security (references can be found to both Army MPs and Marines accompanying shipments), technical couriers were also armed. A late 1950s photo published by DoE depicts a civilian courier on the side of a road wielding a long suit jacket and an M3 submachine gun.

During that period, shipments to overseas test sites were often made by military aircraft and Navy vessels. AEC couriers still kept custody of the device, and much of the route (for example, from Los Alamos to the Navy supply center at Oakland) was by AEC highway convoy. There have always been two key considerations in nuclear transportation: first, that an enemy force (first the Communists and later the Terrorists) might attempt to interdict such a shipment, and second, that nuclear weapons and materials are hazardous and any accident could create a disaster. More "broken arrow" incidents involve air transportation than anything else, and it seems that despite the potentially greater vulnerability to ambush, the ground has always been preferred for safety.

A 1981 manual for military escort operations, applicable not only to nuclear but also chemical weapons, lays out some of the complexity of the task. "Suits Uncomfortable," "Radiation Lasts and Lasts," quick notes in the margin advise. The manual describes the broad responsibilities of escort teams, ranging from compliance with DOT hazmat regulations to making emergency repairs to contain leakage. It warns of the complexity of such operations near civilians: there may be thousands of civilians nearby, and they might panic.

Escort personnel must be trained to be prepared for problems with the public. If they are not, their problems may be multiplied---perhaps to a point where satisfactory solutions become almost impossible.

During the 1960s, heightened Cold War tensions and increasing concern about terrorism (likely owing to the increasingly prominent anti-war and anti-nuclear movements, sometimes as good as terrorists in the eyes of the military they opposed) led to a complete rethinking of nuclear shipping. Details are scant, but the AEC seems to have increased the number of armed civilian guards and fully ended the use of any non-government couriers for special nuclear material. I can't say for sure, but this seems to be when the use of military escorts was largely abandoned in favor of a larger, better prepared AEC force. Increasing protests against nuclear weapons, which sometimes blocked the route of AEC convoys, may have made posse comitatus and political optics a problem with the use of the military on US roads.

In 1975, the Atomic Energy Commission gave way to the Energy Research and Development Administration, predecessor to the modern Department of Energy. The ERDA reorganized huge parts of the nuclear weapons complex to align with a more conventional executive branch agency, and in doing so created the Office of Transportation Safeguards (OTS). OTS had two principal operations: the nuclear train, and nuclear trucks.

Trains have been used to transport military ordnance for about as long as they have existed, and in the mid-20th century most major military installations had direct railroad access to their ammunition bunkers. When manufacturing operations began at the Pantex Plant, a train known as the "White Train" for its original color became the primary method of delivery of new weapons. The train was made up of distinctive armored cars surrounded by empty buffer cars (for collision safety) and modified box cars housing the armed escorts. Although the "white train" was repainted to make it less obvious, railfans demonstrate that it is hard to keep an unusual train secret, and anti-nuclear activists were often aware of its movements. While the train was considered a very safe and secure option for nuclear transportation (considering the very heavy armored cars and relative safety of established rail routes), it had its downsides.

In 1985, a group of demonstrators assembled at Bangor Submarine Base. Among their goals was to bring attention to the Trident II SLBM by blocking the arrival of warheads on the White Train. 19 demonstrators were arrested and charged with conspiracy for their interference with the shipment. The jury found all 19 not guilty.

The DoE is a little cagey, in their own histories, about why they stopped using the train. We can't say for sure that this demonstration was the reason, but it must have been a factor. At Bangor, despite the easy rail access, all subsequent shipments were made by truck. Trucks were far more flexible and less obvious, able to operate on unpredictable schedules and vary their routes to evade protests. In the two following years, use of the White Train trailed off and then ended entirely. From 1987, all land transportation of nuclear weapons would be by semi-trailer.

This incident seems to have been formative for the OTS, which in classic defense fashion would be renamed the Office of Secure Transportation, or OST. A briefing on the OST, likely made for military and law enforcement partners, describes their tactical doctrine: "Remain Unpredictable." Sub-bullets of this concept include "Chess Match" and "Ruthless Adherence to Deductive Thought Process," the meaning of which we could ponder for hours, but if not a military briefing this is at least a paramilitary powerpoint. Such curious phrases accompanied by baffling concept diagrams (as we find them here) are part of a fine American tradition.

Beginning somewhere around 1985, the backbone of the OST's security program became obscurity. An early '00s document from an anti-nuclear weapons group notes that there were only two known photographs of OST vehicles. At varying times in their recent history, OST's policy seems to have been to either not notify law enforcement of their presence at all, or to advise state police only that there was a "special operation" that they were not to interfere with. Box trucks marked "Atomic Energy Commission," or trains bearing the reporting symbol "AEC," are long gone. OST convoys are now unmarked and, at least by intention, stealthy.

It must be because of this history that the OST is so little-known today. It's not exactly a secret, and there have been occasional waves of newspaper coverage for its entire existence. While the OST remains low-profile relative to, say, the national laboratories, over the last decade the DoE has rather opened up. There are multiple photos, and even a short video, published by the DoE depicting OST vehicles and personnel. The OST has had a hard time attracting and retaining staff, which is perhaps the biggest motivator of this new publicity: almost all of the information the DoE puts out to the public about OST is for recruiting.

It is, of course, a long-running comedy that the federal government's efforts at low-profile vehicles so universally amount to large domestic trucks in dark colors with push bumpers, spotlights, and GSA license plates. OST convoys are not hard to recognize, and are conspicuous enough that, with some patience, you can find numerous photos taken by people who had no idea what the vehicles were but found them odd enough to photograph. The OST, even as an acknowledged office of the NNSA with open job listings, still feels a bit like a conspiracy.

During the early 1970s, the AEC charged engineers at Sandia with the design of a new, specialized vehicle for highway transportation of nuclear weapons. The result, with a name only the government could love, was the Safe Secure Transporter (SST, which is also often expanded as Safe Secure Trailer). Assembly and maintenance of the SSTs was contracted to Allied Signal, now part of Honeywell. During the 1990s, the SST was replaced by the Safeguards Transporter (SGT), also designed by Sandia. By M&A, the Allied Signal contract had passed to Honeywell Federal Manufacturing & Technology (FM&T), also the operating contractor of the Kansas City Plant where many non-nuclear components of nuclear weapons are made. Honeywell FM&T continues to service the SGTs today, and is building their Sandia-designed third-generation replacement, the Mobile Guardian [3].

Although DoE is no longer stingy about photographs of the SGT, details of its design remain closely held. The SGT consists of a silver semi-trailer, which looks mostly similar to any other van trailer but is a bit shorter than the typical 53' (probably because of its weight). Perhaps the most distinctive feature of the trailers is an underslung equipment enclosure which appears to contain an air conditioner; an unusual way to mount the equipment that I have never seen on another semi-trailer.

Various DoE-released documents have given some interior details, although they're a bit confusing on close reading, probably because the trailers have been replaced and refurbished multiple times and things have changed. They are heavily armored, the doors apparently 12" thick. They are equipped with a surprising number of spray nozzles, providing fire suppression, some sort of active denial system (perhaps tear gas), and an expanding foam that can be released to secure the contents in an accident. There is some sort of advanced lock system that prevents the trailer being opened except at the destination, perhaps using age-old bank vault techniques like time delay or maybe drawing from Sandia's work on permissive action links and cryptographic authentication.

The trailers are pulled by a Peterbilt tractor that looks normal until you pay attention. They are painted various colors, perhaps a lesson learned from the conspicuity of the White Train. They're visibly up-armored, with the windshield replaced by two flat ballistic glass panels, much like you'd see on a cash transport. The sleeper has been modified to fit additional equipment and expand seating capacity to four crew members.

Maybe more obviously, they're probably the only semitrailers and tractors that you'll see with GSA "E" prefix license plates (for Department of Energy).

SGTs are accompanied on the road by a number of escort vehicles, although I couldn't say exactly how many. From published photographs, we can see that these fall into two types: the dark blue, almost black GMC box trucks with not-so-subtle emergency lights and vans with fiberglass bodies that you might mistake for a Winnebago were they not conspicuously undecorated. I've also seen at least one photo of a larger Topkick box truck associated with the OST, as well as dark-painted conventional cargo vans with rooftop AC.

If you will forgive the shilling for my Online Brand, I posted a collection of photos on Mastodon. These were all released by NNSA and were presumably taken by OST or Honeywell staff; you can see that many of them are probably from the same photoshoot. Depending on what part of the country you are in, you may very well be able to pick these vehicles out on the freeway. Hint: they don't go faster than 60, and only operate during the day in good weather.

These escort vehicles probably mostly carry additional guards, but one can assume that they also have communications equipment and emergency supplies. Besides security, one of the roles of the OST personnel is prompt emergency response, taking the first steps to contain any kind of radiological release before larger response forces can arrive. Documents indicate that OST has partnerships with both DoE facilities (such as national labs) and the Air Force to provide a rapid response capability and offer secure stopping points for OST convoys.

The OST has problems to contend with besides security and anti-nuclear activism: its own management. The OST is sort of infamously not in great shape.

Some of the vehicles were originally fabricated in Albuquerque in a motley assortment of leased buildings put together temporarily for the task; others were fabricated at the Kansas City Plant. It's hard to tell which is which, but when refurbishment of the trailers was initiated in the 2000s, it was decided to centralize all vehicle work near the OST's headquarters (also a leased office building) in Albuquerque. At the time, the OST's warehouses and workshops were in poor and declining condition, and deemed too small for the task. OST's communications center (discussed in more detail later) was in former WWII Sandia Base barracks along with NNSA's other Albuquerque offices, and they were in markedly bad shape.

To ready Honeywell FM&T for a large refurbishment project and equip OST with more reliable, futureproof facilities, it was proposed to build the Albuquerque Transportation Technology Center (ATTC) near the Sunport. In 2009, the ATTC was canceled. To this day, Honeywell FM&T works out of various industrial park suites it has leased, mostly the same ones as in the 1980s. Facilities plans released by the DoE in response to a lawsuit by an activist organization end in FY2014 but tell a sad story of escalating deferred maintenance, buildings in unknown condition because of the lack of resources to inspect them, and an aging vehicle fleet that was becoming less reliable and more expensive to maintain.

The OST has 42 trucks and about 700 guards, now styled as Federal Agents. They are mostly recruited from military special forces, receive extensive training, and hold limited law enforcement powers and a statutory authorization to use deadly force in the defense of their convoys. Under a little-known and (fortunately) little-used provision of the Atomic Energy Act, they can declare National Security Areas, sort of a limited form of martial law. Despite these expansive powers, a 2015 audit report from the DoE found that OST federal agents were unsustainably overworked (with some averaging nearly 20 hours of overtime per week), were involved in an unacceptable number of drug and alcohol-related incidents for members of the Human Reliability Program, and that a series of oversights and poor management had led to OST leadership taking five months to find out that an OST Federal Agent had threatened to kill two of his coworkers. Recruiting and retention of OST staff are poor, and this all comes in the context of an increasing number of nuclear shipments due to the ongoing weapons modernization program.

The OST keeps a low profile perhaps, in part, because it is troubled. Few audit reports, GSA evaluations, or even planning documents have been released to the public since 2015. While this leaves the possibility that the situation has markedly improved, refusal to talk about it doesn't tend to indicate good news.

OST is a large organization for its low profile. It operates out of three command centers: Western Command, at Kirtland AFB; Central Command, at Pantex in Texas; and Eastern Command, at Savannah River. The OST headquarters is leased space in an Albuquerque office building near the Sunport, and the communications and control center is nearby in the new NNSA building on Eubank. Agent training takes place primarily on a tenant basis at a National Guard base in Arkansas. OST additionally operates four or five (it was five but I believe one has been decommissioned) communications facilities. I have not been able to locate them precisely, beyond knowing that they are in New Mexico, Idaho, Missouri, South Carolina, and Maryland. Descriptions of these facilities are consistent with HF radio sites.

That brings us to the topic of communications, which you know I could go on about at length. I have been interested in OST for a long time, and a while back I wrote about the TacNet Tracker, an interesting experiment in early mobile computing and mesh networking that Sandia developed as a tactical communications system for OST. OST used to use a proprietary, Sandia-developed digital HF radio system for communications between convoys and the control center. That was replaced by ALE, for commonality with military systems, sometime in the 1990s.

More recent documents show that OST continues to use HF radio via the five relay stations, but also uses satellite messaging (which is described as Qualcomm, suggesting the off-the-shelf commercial system that is broadly popular in the trucking industry). Things have no doubt continued to advance since that dated briefing, as more recent documents mention real-time video links and extensive digital communications.

These communications systems keep all OST convoys in constant contact with the communications center in Albuquerque, where dispatchers monitor their status and movements. Communications center personnel provide weather and threat intelligence updates to convoys en route, and in the event of some sort of incident, will request assistance from the DoE, military, and local law enforcement. Some of the detailed communications arrangements emphasize the cautious nature of the OST. When requesting law enforcement assistance, communications center dispatchers provide law enforcement with codewords to authenticate themselves. An OST training video advises those law enforcement responders that, should they not have the codeword or the OST guards refuse the codeword they provide, they are to "take cover."

Paralleling a challenge that exists in the cash handling industry, the fact that law enforcement are routinely armed makes them an especially large threat to secure operations. OST may be required to use force to keep armed people away from a convoy, even when those people appear to be law enforcement. The way that this is communicated to law enforcement---that they must approach OST convoys carefully and get authorization from a convoy commander before approaching the truck---is necessarily a bit awkward. The permits and travel authorizations for the convoy are, law enforcement are warned, classified. They will not be able to check the paperwork.

The OST has assets beyond trucks, although the trucks are the backbone of the system. Three 737s, registered in the NNSA name, make up their most important air assets. Released documents don't rule out the possibility of these aircraft being used to transport nuclear weapons, but suggest that they're primarily for logistical support and personnel transport. Other smaller aircraft are in the OST inventory as well, all operating from a hangar at the Albuquerque Sunport. They fly fairly often, perhaps providing air support to OST convoys, but the NNSA indicates that they also use the OST aircraft for other related NNSA functions like transportation of the Radiological Assistance Program teams.

It should be said that despite the OST's long-running funding and administrative problems, it has maintained an excellent safety record. Some sources state that there has only been one road accident involving an OST convoy, a 1996 accident in which the truck slid off the road during an ice storm in Nebraska. I have actually seen OST documents refer to another incident in Oregon in the early '80s, in which an escort vehicle was forced off the road by a drunk driver and went into the ditch. I think it goes mostly unmentioned since only an escort vehicle was involved and there was no press attention at the time. Otherwise, despite troubling indications of its future sustainability, OST seems to have kept an excellent track record.

Finally, if you have fifteen minutes to kill, this video is probably the most extensive source of information on OST operations to have been made public. I'm pretty sure a couple of the historical details it gives are wrong, but what's new. Special credit if you notice the lady that's still wearing her site-specific Q badge in the video. Badges off! Badges!

Also, if you're former military and can hold down a Q, a CDL, EMT-B, and firearms qualifications, they're hiring. I hear the overtime is good. But maybe the threats of violence not so much.

[1] The early Cold War was a very dynamic time in nuclear history, and plans changed quickly as the AEC and Armed Forces Special Weapons Project developed their first real nuclear strategy. Many of these historic details are thus complicated and I am somewhat simplifying. There were other stockpile sites planned that underwent some construction, and it is not totally clear if they were used before strategies changed once again. Similarly, manufacturing operations moved around quite a bit during this era and are hard to summarize.

[2] The NNSA, not to be confused with the agency with only one N, is a semi-autonomous division of the Department of Energy with programmatic responsibility for nuclear weapons and nuclear security. Its Administrator, currently former Sandia director Jill Hruby, is an Under Secretary of Energy and answers to the Secretary of Energy (and then to the President). I am personally very fond of Jill Hruby because of memorable comments she made after Trump's first election. They were not exactly complimentary to the new administration and I have a hard time thinking her outspokenness was not a factor in her removal as director of the laboratory. I assume her tenure as NNSA Administrator is about to come to an end.

[3] Here's a brief anecdote about how researching these topics can drive you a little mad. Unclassified documents about OST and their vehicles make frequent reference to the "Craddock buildings," where they are maintained and overhauled in Albuquerque. For years, this led me to assume that Craddock was the name of a defense contractor that originally held the contract and Honeywell had acquired. There is, to boot, an office building near OST headquarters in Albuquerque that has a distinctive logo and the name "Craddock" in relief, although it's been painted over to match the rest of the building. Only yesterday did I look into this specifically and discover that Craddock is a Colorado-based commercial real estate firm that developed the industrial park near the airport, where MITS manufactured the Altair 8800 and Allied Signal manufactured the SSTs (if I am not mistaken Honeywell FM&T now uses the old MITS suite). OST just calls them the Craddock buildings because Craddock is the landlord! Craddock went bankrupt in the '80s, sold off part of its Albuquerque holdings, and mostly withdrew to Colorado, probably why they're not a well-known name here today.

2025-01-05 pairs not taken

So we all know about twisted-pair ethernet, huh? I get a little frustrated with a lot of histories of the topic, like the recent neil breen^w^wserial port video, because they often fail to address some obvious questions about the origin of twisted-pair network cabling. Well, I will fail to answer these as well, because the reality is that these answers have proven very difficult to track down.

For example, I have discussed before that TIA-568A and B are specified for compatibility with two different multipair wiring conventions, telephone and SYSTIMAX. And yet both standards actually originate within AT&T, so why did AT&T disagree internally on the correspondence of pair numbers to pair colors? Well, it's quite likely that some of these things just don't have satisfactory answers. Maybe the SYSTIMAX people just didn't realize there was an existing convention until they were committed. Maybe they had some specific reason to assign pairs 3 and 4 differently that didn't survive to the modern era. Who knows? At this point, the answer may be no one.

There are other oddities to which I can provide a more satisfactory answer. For example, why is it so widely said that twisted-pair ethernet was selected for compatibility with existing telephone cabling, when its most common form (10/100) is in fact not compatible with existing telephone cabling?

But before we get there, let's address one other question that the Serial Port video has left with a lot of people. Most office buildings, it is mentioned, had 25-pair wiring installed to each office. Wow, that's a lot of pairs! A telephone line, of course, uses a single pair. UTP ethernet would be designed to use two. Why 25?

The answer lies in the key telephone system. The 1A2 key telephone system, and its predecessors and successors, was an extremely common telephone system in the offices of the 1980s. Much of the existing communications wiring of the day's commercial buildings had been installed specifically for a 1A2-like system. I have previously explained that key telephone systems, for simplicity of implementation, inverted the architecture we expect from the PBX by connecting many lines to each phone, instead of many phones to each line. This is the first reason: a typical six-button key telephone, with access to five lines plus hold, needed five pairs to deliver those five lines. An eighteen button call director would have, when fully equipped, 17 lines requiring 17 pairs. Already, you will see that we can get to some pretty substantial pair counts.

On top of that, though, 1A2 telephones provided features like hold, busy line indication (a line key lighting up to indicate its status), and selective ringing. Later business telephone systems would use a digital connection to control these aspects of the phone, but the 1A2 is completely analog. It uses more pairs. There is an A-lead pair, which controls hold release. There is a lamp pair for each line button, to control the light. There is a pair to control the phone's ringer, and in some installations, another pair to control a buzzer (used to differentiate outside calls from calls on an intercom line). So, a fairly simple desk phone could require eight or more pairs.

To supply these pair counts, the industry adopted a standard for business telephone wiring: 25-pair cables terminated in Amphenol connectors. A call director could still require two cables, and two Amphenol connectors, and you can imagine how bulky this connection was. 25-pair cable was fairly expensive. These issues all motivated the development of digitally-controlled systems like the Merlin, but as businesses looked to install computer networks, 25-pair cabling remained very common.
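
If you want to play with the arithmetic, here's a rough sketch in Python of the pair-count accounting described above: one voice pair and one lamp pair per line key, plus a shared A-lead pair, a ringer pair, and an optional buzzer pair. The exact 1A2 pair assignments varied between installations, so treat this as illustrative bookkeeping rather than a wiring spec.

    # Illustrative 1A2-style pair budget, following the accounting above.
    # Real installations varied; this is bookkeeping, not a wiring spec.

    def pairs_needed(lines, buzzer=False):
        voice = lines   # one tip/ring pair per line
        lamps = lines   # one lamp pair per line key
        shared = 2      # shared A-lead pair + ringer pair
        return voice + lamps + shared + (1 if buzzer else 0)

    for phone, lines in [("3-line desk set", 3),
                         ("six-button phone", 5),
                         ("18-button call director", 17)]:
        pairs = pairs_needed(lines, buzzer=True)
        cables = -(-pairs // 25)  # 25-pair cables needed, rounded up
        print(f"{phone}: ~{pairs} pairs, {cables} x 25-pair cable(s)")

Run it and the call director comes out to roughly 37 pairs, or two 25-pair cables: exactly the bulky double-Amphenol situation described above.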

But, there is a key difference between the unshielded twisted-pair cables used for telephones and the unshielded twisted-pair we think of today: the twist rate. We mostly interact with this property through the proxy of "cable categories," which seem to have originated with cable distributors (perhaps Anixter) but were later standardized by TIA-568.

  • Category 1: up to 1MHz (not included in TIA-568)
  • Category 2: up to 4MHz (not included in TIA-568)
  • Category 3: up to 16MHz
  • Category 4: up to 20MHz (not included in TIA-568)
  • Category 5: up to 100MHz
  • Category 6: up to 250MHz
  • Category 7: up to 600MHz (not included in TIA-568)
  • Category 8: up to 2GHz

Some of these categories are not, in fact, unshielded twisted-pair (UTP), as shielding is required to achieve the specified bandwidth. The important thing about these cable categories is that they sort of abstract away the physical details of the cable's construction, by basing the definition around a maximum usable bandwidth. At that maximum bandwidth, the cable must meet defined limits for attenuation and crosstalk.

Among the factors that determine the bandwidth capability of a cable is the twist rate, the frequency with which the two wires in a pair switch positions. The idea of twisted pair is very old, dating to the turn of the 20th century and open wire telephone leads that used "transposition brackets" to switch the order of the wires on the telephone pole. More frequent twisting provides protection against crosstalk at higher frequencies, due to the shorter spans of unbalanced wire. As carrier systems used higher frequencies on open wire telephone leads, transposition brackets became more frequent. Telephone cable is much the same, with the frequency of twists referred to as the pitch. The pitch is not actually specified by category standards; cables use whatever pitch is sufficient to meet the performance requirements. In practice, it's also typical to use slightly different pitches for different pairs in a cable, to avoid different pairs "interlocking" with each other and inviting other forms of EM coupling.

Inside telephone wiring in residential buildings is often completely unrated and may be more or less equivalent to category 1, which is a somewhat informal standard sufficient only for analog voice applications. Of course, commercial buildings were also using their twisted-pair cabling only for analog voice, but the higher number of pairs in a cable and the nature of key systems made crosstalk a more noticeable problem. As a result, category 3 was the most common cable type in 1A2-type installations of the 1980s. This is why category 3 was the first to make it into the standard, and it's why category 3 was the standard physical medium for 10BASE-T.

In common parlance, wiring originally installed for voice applications was referred to as "voice grade." This paralleled terminology used within AT&T for services like leased lines. In inside wiring applications, "voice grade" was mostly synonymous with category 3. StarLAN, the main predecessor to 10BASE-T, required a bandwidth of 12MHz... beyond the reliable capabilities of category 1 and 2, but perfectly suited for category 3.

This brings us to the second part of the twisted-pair story that is frequently elided in histories: the transition from category 3 cabling to category 5 cabling, as is required by 100BASE-TX "10/100" ethernet.

On the one hand, the explanation is simple: To achieve 100Mbps, 100BASE-TX requires a 100MHz cable, which means it requires category 5.

On the other hand, remember the whole entire thing about twisted-pair being intended to reuse existing telephone cable? Yes, the move from 10BASE-T to 100BASE-TX, and from category 3 to category 5, abandoned this advantage. The path by which this happened was not a simple one. The desire to reuse existing telephone cabling was still very much alive, and several divergent versions of twisted-pair ethernet were created for this purpose.

Ethernet comes with these kind of odd old conventions for describing physical carriers. The first part is the speed, the second part is the bandwidth/position (mostly obsolete, with BASE for baseband being the only surviving example), and the next part, often after a hyphen, identifies the medium. This medium code was poorly standardized and can be a little confusing. Most probably know that 10BASE5 and 10BASE2 identify 10Mbps Ethernet over two different types of coaxial cable. Perhaps fewer know that StarLAN, over twisted pair, was initially described as 1BASE5 (it was, originally, 1Mbps). The reason for the initial "5" code for twisted pair seems to be lost to history; by the time Ethernet over twisted pair was accepted as part of the IEEE 802.3 standard, the medium designator had changed to "-T" for Twisted Pair: 10BASE-T.

And yet, 100Mbps "Fast Ethernet," while often referred to as 100BASE-T, is more properly 100BASE-TX. Why? To differentiate it from the competing standard 100BASE-T4, which was 100Mbps Ethernet over Category 3 twisted pair cable. There were substantial efforts to deploy Fast Ethernet without requiring the installation of new cable in existing buildings, and 100BASE-TX competed directly with both 100BASE-T4 and the oddly designated 100BaseVG. In 1995, all three of these media were set up for a three-way faceoff [1].

For our first contender, let's consider 100BASE-T4, which I'll call "T4" for short. The T4 media designator means Twisted pair, 4 pairs. Recall that, for various reasons, 10BASE-T only used two pairs (one each direction). Doubling the number of required pairs might seem like a bit of a demand, but 10BASE-T was already routinely used with four-pair cable and 8P8C connectors, and years later Gigabit 1000BASE-T would do the same. Using these four pairs, T4 could operate over category 3 cable at up to 100 meters.

T4 used the pairs in an unusual way, directly extending the 10BASE-T pattern while compromising to achieve the high data rate over lower bandwidth cable. T4 had one pair each direction, and two pairs that dynamically changed directions as required. Yes, this means that 100BASE-T4 was only half duplex. T4 was mostly a Broadcom project, which offered chipsets for the standard and brought 3Com on board as the principal (but not only) vendor of network hubs.

The other category 3 contender, actually a slightly older one, was Hewlett-Packard's 100BaseVG. The "VG" media designator stood for "voice grade," indicating suitability for category 3 cables. Like T4, VG required four pairs. VG also used those pairs in an unusual way, but a more interesting one: VG switched between a full-duplex, symmetric "control mode" and a half-duplex "transmission mode" in which all four pairs were used in one direction. Coordinating these transitions required a more complex physical layer protocol, and besides, HP took the opportunity to take on the problem of collisions. In 10BASE-T networks, the use of hubs meant that multiple hosts were in a collision domain, much like with coaxial Ethernet. As network demands increased, collisions became more frequent and the need to retransmit after collisions could appreciably reduce the effective capacity of the network.

VG solved both problems at once by introducing, to Ethernet, one of the other great ideas of the local area networking industry: token-passing. The 100BaseVG physical layer incorporated a token-passing scheme in which the hub assigned tokens to nodes, both setting the network operation mode and preventing collisions. The standard even layered a simple quality of service scheme onto the tokens, called demand priority, in which nodes could indicate a priority level when requesting to transmit. The token-passing system made the effective throughput of heavily loaded VG networks appreciably higher than that of other Fast Ethernet networks. Demand priority promised to make VG more suitable for real-time media applications in which Ethernet had traditionally struggled due to its nondeterministic capacity allocation.

Given that you have probably never heard of either of these standards, you are probably suspecting that they did not achieve widespread success. Indeed, the era of competition was quite short, and very few products were ever offered in either T4 or VG. Considering the enormous advantage of using existing Category 3 cabling, that's kind of a surprise, and it undermines the whole story that twisted pair ethernet succeeded because it eliminated the need to install new cabling. Of course, it doesn't make it wrong, exactly. Things had changed: 10BASE-T was standardized in 1990, and the three 100Mbps media were adopted in 1994-1995. Years had passed, and purpose-built computer network cabling had become more common. Besides, despite their advantages, T4 and VG were not without downsides.

To start, both were half-duplex. I don't think this was actually that big of a limitation at the time; half-duplex 100Mbps was still a huge improvement in real performance over even full-duplex 10Mbps, and the vast majority of 10BASE-T networks were hub-based and only half-duplex as well. A period document from a network equipment vendor notes this limitation of T4 but then describes full-duplex as "unneeded for workstations." That might seem like an odd claim today, but I think it was a pretty fair one in the mid-'90s.

A bigger problem was that both T4 and VG were meaningfully more complicated than TX. T4 used a big and expensive DSP chip to recover the complex symbols from the lower-grade cable. VG's token passing scheme required a more elaborate physical layer protocol implementation. Both standards were correspondingly more expensive, both for adapters and network appliances. The cost benefit of using existing cabling was thus a little fuzzier: buyers would have to trade off the cost of new cabling vs. the savings of using less complex, less expensive TX equipment.

For similar reasons, TX is also often said to have been more reliable than T4 or VG, although it's hard to tell if that's a bona fide advantage of TX or just a result of TX's much more widespread adoption. TX transceivers benefited from generations of improvement that T4 and VG transceivers never would.

Let's think a bit about that tradeoff between new cable and more expensive equipment. T4 and VG both operated on category 3, but they required four pairs. In buildings that had adopted 10BASE-T on existing telephone wiring, they would most likely have only punched down two pairs (out of a larger cable) to their network jacks and equipment. That meant that an upgrade from 10BASE-T to 100BASE-T4, for example, still involved considerable effort by a telecom or network technician. There would often be enough spare pairs to add two more to each network device, but not always. In practice, upgrading an office building would still require the occasional new cable pull. T4 and VG's poor reputation for reliability, or more precisely their poor reputation for tolerating less-than-perfect installations, meant that even existing connections might need time-consuming troubleshooting to bring them up to full category 3 spec (while TX, by spec, requires the full 100MHz of category 5, it is fairly tolerant of underperforming cabling).

There's another consideration as well: the full-duplex nature of TX makes it a lot more appealing in the equipment room and data center environment, and for trunk connections (between hubs or switches). These network connections see much higher utilization, and often more symmetric utilization as well, so a full-duplex option really looks 50% faster than a half-duplex one. Historically, plenty of network architectures have included the use of different media for "end-user" vs trunk connections. Virtually all consumer and SMB internet service providers do so today. It has never really caught on in the LAN world, though, where a smaller staff of network technicians are expected to maintain both sides.

Put yourself in the shoes of an IT manager at a midsized business. One option is T4 or VG, with more expensive equipment and some refitting of the cable plant, and probably with TX used in some cases anyway. Another option is TX, with less expensive equipment and more refitting of the cable plant. You can see that the decision is less than obvious, and you could easily be swayed in the all-TX direction, especially considering the benefit of more standardization and fewer architectural and software differences from 10BASE-T.

That seems to be what happened. T4 and VG found little adoption, and as inertia built, the cost and vendor diversity advantage of TX only got bigger. Besides, a widespread industry shift from shared-media networks (with hubs) to switched networks (with, well, switches) followed pretty closely behind 100BASE-TX. A lot of users went straight from 10BASE-T to switched 100BASE-TX, which almost totally eliminated the benefits of VG's token-passing scheme and made the cost advantage of TX even bigger.

And that's the story, right? No, hold on, we need to talk about one other effort to improve upon 10BASE-T. Not because it's important, or influential, or anything, but because it's very weird. We need to talk about IsoEthernet and IsoNetworks.

As I noted, Ethernet is poorly suited to real-time media applications. That was true in 1990, and it's still true today, but network connections have gotten so fast that the amount of headroom available mitigates the problem. Still, there's a fundamental limitation: real-time media, like video and audio, requires a consistent amount of delivered bandwidth for the duration of playback. The Ethernet/IP network stack, for a couple of different reasons, provides only opportunistic or nondeterministic bandwidth to any given application. As a result, achieving smooth playback requires some combination of overprovisioning of the network and buffering of the media. This buffering introduces latency, which is particularly intolerable in real-time applications. You might think this problem has gone away entirely with today's very fast networks, but you can still see Twitch streamers struggling with just how bad the internet is at real-time media.

An alternative approach comes from the telephone industry, which has always had real-time media as its primary concern. The family of digital network technologies developed in the telephone industry, SONET, ISDN, what have you, provide provisioned bandwidth via virtual circuit switching. If you are going to make a telephone call at 64Kbps, the network assigns an end-to-end, deterministic 64Kbps connection. Because this bandwidth allocation is so consistent and reliable, very little or no buffering is required, allowing for much lower latency.

There are ways to address this problem, but they're far from perfect. The IP-based voice networks used by modern cellular carriers make extensive use of quality of service protocols but still fail to deliver the latency of the traditional TDM telephone network. Even with QoS, VoIP struggles to reach the reliability of ISDN. For practical reasons, consumers are rarely able to take any advantage of QoS for ubiquitous over-the-top media applications like streaming video.

What if things were different? What if, instead of networks, we had IsoNetworks? IsoEthernet proposed a new type of hybrid network that was capable of both nondeterministic packet switching and deterministic (or, in telephone industry parlance, isochronous) virtual circuit switching. They took 10BASE-T and ISDN and ziptied them together, and then they put Iso in front of the name of everything.

Here's how it works: IsoEthernet takes two pairs of category 3 cabling and runs 16.144 Mbps TDM frames over them at full duplex. This modest 60% increase in overall speed allows for a 10Mbps channel (called a P-channel by IsoEthernet) to be used to carry Ethernet frames, and the remaining 6.144Mbps to be used for 96 64-Kbps B-channels according to the traditional ISDN T2 scheme.
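
Just to re-derive the numbers in that paragraph (nothing here beyond the figures already given), a quick Python sketch:

    # IsoEthernet channel arithmetic, using the figures from the text.
    P_CHANNEL_MBPS = 10.0    # ordinary Ethernet payload (P-channel)
    B_CHANNELS = 96          # ISDN B-channels, T2-style grouping
    B_CHANNEL_KBPS = 64

    iso_mbps = B_CHANNELS * B_CHANNEL_KBPS / 1000    # 6.144 Mbps isochronous
    total_mbps = P_CHANNEL_MBPS + iso_mbps           # 16.144 Mbps on the wire
    increase_pct = (total_mbps / P_CHANNEL_MBPS - 1) * 100

    print(f"{iso_mbps} Mbps isochronous + {P_CHANNEL_MBPS} Mbps Ethernet "
          f"= {total_mbps} Mbps, a ~{increase_pct:.0f}% bump over 10BASE-T")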

An IsoEthernet host (sadly not called an IsoHost, at least not in any documents I've seen) can use both channels simultaneously to communicate with an IsoHub. An IsoHub functions as a standard Ethernet hub for the P-channel, but directs the B-channels to a TDM switching system like a PABX. The mention of a PABX, of course, illustrates the most likely application: telephone calls over the computer.

I know that doesn't sound like that much of a win: most people just had a computer on their desk, and a phone on their desk, and despite decades of effort by the Unified Communications industry, few have felt a particular need to marry the two devices. But the 1990s saw the birth of telepresence: video conferencing. We're doing Zoom, now!

Videoconferencing over IP over 10Mbps Ethernet with multiple hosts in a collision domain was a very, very ugly thing. Media streaming very quickly caused almost worst-case collision behavior, dropping the real capacity of the medium well below 10Mbps and making even low resolution video infeasible. Telephone protocols were far more suited to videoconferencing, and so naturally, most early videoconferencing equipment operated over ISDN. I had a Tandberg videoconferencing system, for example, which dated to the mid '00s. It still provided four jacks on the back suitable for 4x T1 connections or 4 ISDN PRIs (basically just a software difference), providing a total of around 6Mbps of provisioned bandwidth for silky smooth real-time video.
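
For the curious, the "around 6Mbps" figure is just four T1s' worth of capacity. A quick sketch of the math, assuming standard T1 framing (1.544 Mbps line rate, 24 channels of 64 kbps payload each):

    # Capacity of the four-port arrangement described above,
    # assuming standard T1 framing.
    T1_LINE_MBPS = 1.544
    T1_PAYLOAD_MBPS = 24 * 64 / 1000   # 1.536 Mbps of 64-kbps channels

    print(f"4 x T1 line rate: {4 * T1_LINE_MBPS:.3f} Mbps")     # 6.176 Mbps
    print(f"4 x T1 payload:   {4 * T1_PAYLOAD_MBPS:.3f} Mbps")  # 6.144 Mbps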

These were widely used in academia and large corporations. If you ever worked somewhere with a Tandberg or Cisco (Cisco bought Tandberg) curved-monitor-wall system, it was most likely running over ISDN using H.320 video and T.120 application sharing ("application sharing" referred to things like virtual whiteboards). Early computer-based videoconferencing systems like Microsoft NetMeeting were designed to use existing computer networks. They used the same protocols, but over IP, with a resulting loss in reliability and increase in latency [2].

With IsoEthernet, there was no need for this compromise. You could use IP for your non-realtime computer applications, but your softphone and videoconferencing client could use ISDN. What a beautiful vision! As you can imagine, it went nowhere. Despite IEEE acceptance as 802.9 and promotion efforts by developer National Semiconductor, IsoEthernet never got even as far as 100BASE-T4 or 100BaseVG. I can't tell you for sure that it ever had a single customer outside of evaluation environments.

[1] A similar 100Mbps-over-category 3 standard, called 100BASE-T2, also belongs to this series. I am omitting it from this article because it was standardized in 1998 after industry consolidation on 100BASE-TX, so it wasn't really part of the original competition.

[2] The more prominent WebEx has a stranger history which will probably fill a whole article here one day---but it did also use H.320.

2024-12-21 something over New Jersey

There were thousands of reports: strange aircraft, floating through the sky. A retrospective sum of press accounts finds that some 100,000 people were reported to have witnessed aerial intruders. Despite the scant details associated with most reports, an eager press repeated the claims with fervor. The claims became more fantastical. Prominent people claimed secret knowledge of the origins of the crafts. This was 1896. The airship had just barely been invented, and already the public was seeing them everywhere they looked.

John Keel was a writer and prominent UFOlogist, although he's probably remembered most of all for his cryptozoological book, The Mothman Prophecies. Like most UFOlogists of his era, Keel was sort of a mixed bag to those readers who are at least attempting to keep a rational perspective. In some ways he was more critical than average, turning against the extraterrestrial hypothesis as impractical and always calling for a shift away from "investigating" based on lone contactee accounts. On the other hand, he was as prone as anyone to fancy and it now seems that his books took some liberties with the information he'd been given. Still, his popular newspaper articles during the 1960s shaped much of our modern parlance around UFOs. Among the terms he seems to have introduced, or at the least popularized, is the "flap."

A flap is a concentrated set of UFO reports in a specific place and time. The 1896-1897 airship flap, which started in California and eventually spread across the nation to New York City, might be called the first. Of course, there is a straightforward argument that the airship flap was the first only in that it was the first flap during which aviation was on the public mind; by this token other paranormal episodes like dancing plagues and witch trials could be considered flaps. Still, "flap" is usually reserved for those times during which the general public is seeing things in the sky: something up there.

Flaps have been a well-known phenomenon in UFOlogical circles (although not always by that name) since 1947. Widespread reports of flying saucers that year kicked off our modern UFO culture. Almost every decade had some sort of major flap until the 1990s, the decade during which UFOlogy could be said to have died. This is a more complex topic than I can explain here as preamble, and my opinion is somewhat controversial, but UFOlogy enjoyed a golden age during the '60s and '70s and by the time I came onto the scene had largely collapsed. The end of the Cold War, improving digital media, and sidelining (and often outright suppression) of serious investigations into UFOs were all factors. There was also a certain qualitative change in the UFO community: the most prominent names in UFOs were increasingly untrustworthy, forced by desperation or, more cynically, encouraged by money to become less and less careful about the ideas they endorsed.

It cannot be ignored that there are complexities, UFOlogical mysteries, to some of this decline as well. The single hardest blow to UFOlogy came in 1989, when Bill Moore stood before a MUFON conference to admit that the UFO materials he had distributed throughout the community, including the Majestic 12 papers, were fakes. This confession triggered a dramatic unraveling of the established canon of paranormal knowledge. By the early '90s, it seemed that nearly all of the major UFO news of the decade before had originated with a small number of people, often in collusion, who ranged from extremely unreliable (Bob Lazar) to admitted fabricators (Richard Doty). The fact that some of these people had connections to military intelligence, and that there remains some reason to believe they were intentionally spreading disinformation on behalf of those agencies, leaves plenty of intrigue but does nothing to resolve the basic fact that the UFOlogy of the '80s and '90s turned out to be mostly bullshit---not even of the vague cultural kind, but with specific, known authors [1].

It was this climate that led us to the 21st century, which for nearly two decades was surprisingly devoid of UFO discourse. Around 2017, though, a motley crew including such personalities as a famous rock musician, a powerful US senator, and an eccentric hotel-aerospace billionaire thrust UFOs back into the popular media. I have written before with my analysis that the late-2010s UFO "revelations" (and, moreover, the lack thereof) were most likely the result of Bigelow taking advantage of the DoD's lax contract supervision and Sen. Harry Reid's personal interest in order to fund his hobby projects. Still, the whole unfortunate affair seems to have had the upside of renewing public and political attention to the topic.

The DoD was forced to at least try to get its act together, creating a new organization (the AARO) with tighter reins and more credibility. NASA formed its own review. The government seems to now be involved in its most serious efforts to understand the UFO phenomenon since the 1960s, which we can dream will be a departure from the conflicted, shambolic, and dismissive way that it addressed strange objects in the sky for fifty years. Or, like every other such effort to date, it will collapse into a hasty effort to close the whole topic and avoid admitting the failure of the intelligence community to make any real progress on a matter of obvious military and public interest. Only time will tell.

Anyway, that is all setting the stage for what has been going on for the last month in New Jersey: people are seeing drones.

The New Jersey Drones have all of the makings of a classic UFO flap. Unmanned aircraft are a topic of widespread public attention, tied up in everything from global conflict (Ukrainian combat drones) to intelligence intrigue (Chinese spy balloons) to domestic politics (DJI import bans). The real prevalence of drones flying around is increasing as they continue to come down in price and the FAA adopts a more permissive regulatory scheme. The airship flap happened a few years after airships started to make headlines (manned flight was barely achievable at the time, but there had been promising experiments and they inspired a great deal of speculation). Similarly, the drone flap happens a few years after foreign unmanned aircraft gained widespread media attention.

And this is the simplest explanation of what is happening in New Jersey: when people look up at the sky, they see things there.

The universe is full of more than humans can comprehend, but we make our peace with that by engaging with it through the sky. There is so much that we do not know about stars, galaxies, and the myriad of objects that surround us constantly, but we do know that they exist and we can see them. Even that can quickly become hazy when you really look up, though. Perceptual psychology offers a variety of explanations. For example: when the visual field is lacking in reliable, easily distinguishable features, our eyes can lose their ability to maintain a fixed target. The stars themselves begin to wander, moving erratically, as if under the control of some unknown intelligence. They are, in a sense, but that unknown intelligence is our own visual system performing poorly in a challenging environment. When a camera aimlessly hunts for focus we understand that it is a technical problem with the observation, but when our own eyes have similar trouble we have a hard time being so objective.

And then there are those phenomena that are less common but still well known: meteors, for example, which incidentally reached their peak frequency of the year, in the northern hemisphere, during the New Jersey flap. There are satellites, some of which can reflect light from the sun beyond the horizon in odd flashing patterns, and which are becoming far more numerous as Starlink continues its bulk launches. In the good seeing conditions of the rural Southwest you can hardly look at the sky and not find a satellite, or two, or three, or four, lazily wandering between the stars. Failing to find a moving light is more unusual than looking up and having one catch your eye.

But, most of all, there are airplanes. The FAA reports that their Air Traffic Organization provides services to about 45,000 flights per day, an underestimate of the total number of aircraft operations. There are some 800,000 certificated pilots in the US. During the peak aviation hours of the mid-day to evening, there are about 5,000 IFR flights over the US at any given moment---and that's excluding the many VFR operations. The nation's busiest airports, several of which are located in the New Jersey region, handle more than one arriving or departing flight per minute.

The sky is increasingly a busy place.

When the drone flap reached its apex a few weeks ago, news stations and websites posted video montages of the "drone sightings" sent in by their viewers or, well, found on Twitter by their staff. The vast majority of the objects in these videos were recognizably commercial aircraft. Red light, left wingtip. Green light, right wingtip. White light, tail. Flashing light, location varies, usually somewhere in the middle. During approach and departure, airliners are likely to have landing lights (forward-facing) and inspection lights (pointed back at the engines and wings) turned on. If you live near an airport, you probably see this familiar constellation every day, but you aren't calling it in to the news.

And this is where the UFO phenomenon is unavoidably psychosocial.

For as long as UFOs have been observed, skeptics (and psychosocial theorists) have noted that those observations tend to follow a fashion. In the late nineteenth century, the only thing anyone had made to fly were airships, and so everyone saw airships. By the mid-20th century, the flying saucer had been introduced. The exact origin of the flying saucer is actually surprisingly complicated (having precedents going back decades in fiction), but the 1947 UFO flap solidified it as the "classic" form of UFO. For most of the golden age of UFOlogy, flying saucers were a norm punctuated only by the occasional cigar.

During the 1970s, the development of computer modeling for the radar return of flat surfaces (ironically by a Soviet physicist who seemed largely unaware of the military applications and so published his work openly) enabled the development of "stealth" aircraft. Practical matters involving the limitations of the modeling methods (the fewer vertices the better) and the low-RF-reflectivity materials known at the time meant that these aircraft were black and triangular. During the 1980s and 1990s, a wave of "black triangle" UFO sightings spanned the country, almost displacing the flying saucer as the archetypal UFO. Some of these were probably genuine sightings of the secret F-117, but far more were confirmation bias. The popular media and especially UFO newsletters promulgated this new kind of craft. People were told to look for black triangles, so they looked, and they saw black triangles.

This phenomenon is often termed "mass hysteria," but I try to avoid that language. "Hysteria" can invoke memories of "female hysteria" and a long history of dismissive and unhelpful treatment of disempowered individuals. To the extent that mass hysteria has a formal definition, it tends to refer to symptoms of such severity as to be considered illness. A flap has a different character: I am not sure that it is fair to say that someone is "hysterical" when they look in an unfamiliar place and see what they have been told everyone else is seeing.

While rather less punchy, I think that "mass confirmation bias" is a better term. "Mass susceptibility to suggestion," perhaps. "Mass priming effects." "Mass misunderstanding." "Mass surprise at the facets of our world that are always there but you seldom care to notice."

There are a surprising number of balloons in the air. Researchers launch them, weather agencies launch them, hobbyists launch them. They can drift around for days, or longer if carefully engineered. They are also just about as innocuous as an aerial object can be, rarely doing anything more nefarious than transmitting their location and some environmental measurements. And yet, when a rare sophisticated spy balloon drifts across the country, everyone starts noticing balloons for the first time. The Air Force shoots a few down. Then, cooler heads prevail, and we all end up feeling a bit silly.

There are some lessons we can learn from the Chinese spy balloon incident. First, there are strange things up there: spy balloons have a long history, having been used by the United States to observe the Soviet Union in the 1950s. That balloon program, short-lived for diplomatic reasons, laid the groundwork for a surprising number of following military and scientific developments on the part of both countries (and, in true American fashion, General Mills). From this perspective it is no surprise that the Chinese have a spy balloon program, they are treading down a proven path and once again finding that the political problems are more difficult than the technical ones (In the 1950s, the United States took the position that countries did not have a claim to control of their upper airspace, an argument that the Chinese would have a hard time making today).

Second, there are a lot of routine things up there. In the great menagerie of aerial balloons, spy balloons are probably the rarest type. Any wispy, high-altitude drifter you might see is vastly more likely to be a scientific or hobby project. Far from unusual, they are in this field the definition of "usual." Normal denizens of the sky, like airliners and satellites and stars.

Third, it is difficult to tell the difference. Even the military struggles to tell one from the other, since balloons operate at high altitudes, are small in size, and even smaller in radar cross section due to the radio transparency of the envelope. The general public has little hope. So, they interpret things as they have been primed.

Normally, people do not see balloons, because they do not look. On the occasion they happen to notice one, they dismiss it as, well, probably a weather balloon. Then, a Chinese spy balloon makes the news. Suddenly people look: they notice more balloons, and when they do, their first thought is of Chinese intelligence. They interpret things as they have just been told to.


I do most of my writing from Flying Star, and you can help pay for my posole and cake. That is a sentence that will probably only make sense to people in the immediate area. Anyway, the point is, if you enjoy my writing consider supporting me on ko-fi. I send out an occasional special newsletter, EYES ONLY, to my supporters.

I have another appeal as well: I am considering starting a separate newsletter, probably once monthly, in which I round up the UFO/UAP news with an eye towards cutting it down to just the meaningful new information. If you're vaguely aware that there keep being congressional hearings and occasionally new reports, this would bring you up to date on the important parts. Is that something you'd be interested in? Let me know, there's contact info in the footer.


If I seem to be belaboring the point, appreciate that I am trying to thread a needle. It is ridiculous, unreasonable, and frankly embarrassing for the media to disseminate "evidence" of a "drone incursion" that is plainly just a blurry video of a Southwest flight on final. I am fast to fault the media. At the same time, I am much slower to blame the people who take these videos. They are, in a sense, just doing what they were told. They started looking for the drones, and now they are seeing drones.

The media has never been a friend to serious inquiry into UFOs. For much of the 20th century, "yellow journalism," intentional sensationalism, was the main vector by which UFO reports spread. These newspaper accounts held up to no scrutiny, and the journalists that filed them were often fully aware of that fact. The papers would print pretty much anything. There was a certain wink-and-nod aspect to most UFO reporting, which both spread UFOs as a popular phenomenon and hopelessly undermined the credibility of any actual sightings.

Today, yellow journalism is mostly a thing of the past, but it has been replaced by a new practice with similar outcomes. I think it has a lot to do with the fundamental collapse of journalism as an industry: the average city newsroom seems to consist of about three half-time reporters whose main source is their Twitter feed and primary interest is keeping their jobs by producing content fast enough to stay "fresh." They hardly have time to find out what happened at the City Council meeting, much less to critically evaluate twenty different UFO tips. The papers will print just about anything.

To the workaday New Jersey reporter, the drone flap must be a bit of a godsend. News is falling right into their laps. Video---the most important form of online content, the best engagement driver, the promised beachhead of the media conglomerate into the TikTokified culture of youths---is just showing up in their inboxes. This person says they saw a drone! Just like everyone's talking about! They have video! Of course you publish it. You'd be stupid not to.

It is, of course, an airplane. Maybe the reporter knows that, I think they often do. The text somewhere around the video player, for anyone that reads it, usually has an appropriate number of weasel words cushioned in vague language. They're not saying that this person caught a drone on video, they're just saying that this person says they caught a drone on video. Please watch the video. Share it with your friends, on one of the platforms where that's still worth something.

Okay, I'll knock it off, I'm trying not to just be a doomer about the decline of the media to such an extent that no one knows what's going on anywhere except for Twitter and Ivy League universities for some reason. I have to skim the City Council meeting videos myself because there are sometimes literally zero journalists who are paid to sit through them. I once gave an impassioned speech about some homelessness project at a city meeting, and when some guy walked up to me after the meeting and introduced himself as a reporter from the Journal, I actually said "the Journal has reporters?" to his face. I thought they just syndicated from the five remaining AP writers and select Facebook pages. And I guess whatever Doug Peterson is on about, but seriously, now that I've gotten onto local issues I really need to stop before I get into Larry Barker Investigates memes.

So let's talk about the drones. Drones are in the news, in a military context, in a regulatory context, in popular media. Tensions with China continue to heighten, and it's clear that China doesn't have too many compunctions about US airspace sovereignty. I mean, I think I actually believe them that the balloon incursion into US airspace was unintentional (better to stay off the coasts, right? that's where a lot of the good military exercises are anyway, and we can imagine that the balloon's maneuvering capabilities are probably quite limited and flight planning depends a lot on wind forecasting which is not exact). But if they were really that broken up about it, they probably would have apologized via diplomatic channels before it became a major event. Clearly they were hoping it would go unnoticed.

First some items hit the news about mysterious drones. I'd love to identify a Patient Zero, but I don't think it's quite that simple, there was a confluence of a few things. Another congressional UAP hearing, reporting of drone incursions over Ramstein air base and Picatinny arsenal, and then a few random public reports of odd lights in the sky, as have always happened from time to time. But these separate incidents come together in the minds of the American public. A few people who are already inclined towards seeing strange things in the sky start looking for drones, and they see drones, or at least things that they are willing to conform to that expectation, even if only tentatively. They post on the internet. A cycle starts; it feeds on itself; more people looking, more sightings, more people looking, more sightings.

Somewhere along the way, US politics being what they are, Rep. Jeff Van Drew of New Jersey reports that he heard from "high sources" that the "drones" were coming from an "Iranian" "mothership" off the coast in the "Atlantic."

"These are from high sources. I don't say this lightly."

He added that the drones should be "shot down". [2]

Where the hell did that come from?! The thing is, it doesn't matter. Congresspeople going off on wild tangents repeating completely unconfirmed information that probably came via email from someone claiming to work for the CIA or whatever is just par for the course. I suppose it's always been true that if you want to find the truth you have to ignore the politicians, but it sure feels extra true right now. I don't think they're even exactly lying, they're just repeating whatever they hear that might serve an aim. It's almost an involuntary reflex. The entire series of congressional UAP hearings have been like this, basically devoid of any meaningful new information, but completely full of bizarre claims from unnamed sources that will never be seriously evaluated because no one thinks there's really anything to seriously evaluate.

The New Jersey Drone Flap is definitely that, a flap. Virtually everything you have heard is probably meaningless, just routine misperceptions that are both induced and amplified by the media. Politicians making solemn statements about needing to get on top of this, demanding a serious investigation, the DoD not doing enough, how we should shoot them down, are just doing what politicians do: they are Taking It Seriously, whatever It is. In a few weeks they will be Taking Something Else Seriously and American political discourse will move on without ever following up on any of it.

There's something curious about this flap, though, that I think does actually make it fundamentally different from the UFO flaps of yesteryear. It's the degree of strangeness involved. UFO enthusiasts sometimes use the phrase "high strangeness" to describe the more outlandish, the more inexplicable parts of UFO encounters. What people are claiming to see in New Jersey, though, is not high strangeness. It is not even strangeness. It's just... a little odd, at most.

The most authoritative government response to the New Jersey drones comes in the form of the "DHS, FBI, FAA & DoD Joint Statement on Ongoing Response to Reported Drone Sightings". Such a boring title gives you a degree of confidence that this is a Genuine Government Position, straight out of some ponderous subcommittee of the faceless bureaucracy. In other words, it's the real shit, too worked over by public information staff to likely contain baseless speculation or meaningless repetition of political discourse. If it's untruthful, it's at least intentionally untruthful, in some big organizational sense. It reads in part:

Having closely examined the technical data and tips from concerned citizens, we assess that the sightings to date include a combination of lawful commercial drones, hobbyist drones, and law enforcement drones, as well as manned fixed-wing aircraft, helicopters, and stars mistakenly reported as drones.

Here the government is saying: those aircraft you're seeing in the sky? Well, they're aircraft. You know, airplanes and stuff. Some of them are even drones! You know people just have drones, right? You can buy them at Costco. I don't think they have Costco in Iran, so I don't know where the Mothership gets them, but here in the god-bless-the-USA the DJI Mavic 3 Pro is $3,000 on Amazon and you can fly it all around New Jersey, at least for the moment. Probably just for the moment. If you're thinking about it I'd recommend that you buy now.

The real Fortean strangeness of the drone flap is that it is not Fortean. It's not paranormal, it's not mysterious. People are just looking at the sky and claiming to see something that is manifestly, objectively, actually a real thing that exists in the sky.

And yet they are still wrong about it most of the time.

I think that's why the government's messaging has been so weird and scattered. It's not like the Air Force is going to reassure us that there are no drones in the sky, because there are. I know people are getting really tired of the "does not pose a threat" language, but what else are they supposed to say? It's like if there was a New Jersey Bird Flap. The National Audubon Society continues to examine the data, but to date the reported sightings of birds are assessed to be lawfully operating birds, or airplanes or helicopters or stars mistaken for birds. There is no indication that they pose a danger to national security.

And after all of this, what is left? Well, as always, the mystery is left.

For every ten thousand sounding balloons, there is a Chinese Spy Balloon (these numbers are made up for the purpose of rhetoric, please do not check my math). For every ten thousand "drone sightings," there is a real drone, operating somewhere it shouldn't, for unknown reasons.

The Joint Statement again:

Additionally, there have been a limited number of visual sightings of drones over military facilities in New Jersey and elsewhere, including within restricted air space. Such sightings near or over DoD installations are not new.

The military, and airports, and other security-sensitive installations have experienced occasional drone incursions for years. It rarely gets press. Most of the time it's some clueless hobbyist who crosses a line they shouldn't have; this problem got bad enough that the FAA ended up deciding technical controls were required to make these mistakes more difficult.

There may be more afoot: weeks ago some Chinese citizen, Yinpiao Zhou, was arrested for flying a consumer drone over Vandenberg Space Force Base to take photos. He reportedly told federal investigators that the whole thing was "probably not a good idea," and it seems most likely he was just a SpaceX fan who wanted to get closeups of their facility at Vandenberg and severely didn't think things out. But there are reasons to be suspicious: a couple of months ago, five Chinese nationals who had been attending a US college were arrested for sneaking around a military exercise taking photos of sensitive equipment. Their whole sequence of activities, including lying about their travel and coordinating to destroy evidence, can succinctly be described as "very suspicious." They seem to have been fully aware that they were doing something illegal, which encourages one to speculate about their motivations even if the charges of espionage have not yet been adjudicated in court.

There is good evidence that Chinese intelligence coordinates with more or less random people that travel between the US and China to opportunistically collect information on military capabilities, so the idea that there are people operating consumer drones around military bases in service of Chinese interests is not a particularly far-fetched one. It just kind of makes sense. If you were a Chinese intelligence agent, wouldn't you give it a try? It's so low risk and low cost it could practically be some handler's side project.

Foreign adversaries do provide reasons to keep a close eye on drones, especially as they interact with sensitive sites and military operations. The DoD has an admitted inability to do so effectively, leading to a significant investment in methods of detecting and countering small drones. There is a drone problem. It's just not new, it's not specific to New Jersey, and it's not some big dramatic event, but a slow evolution of military and intelligence practice akin to the development of aviation itself.

The FAA has issued a number of temporary flight restrictions in the area, and the media has made a pretty big deal of that. But most of the flight restrictions aren't even that restrictive (they allow private operations if the FAA is notified and provided with a statement of work), and the FAA tends to reflexively issue flight restrictions when anyone gets nervous. It's probably a wise decision: all this talk of drones has, ironically, almost certainly brought the drones out. People probably are more likely to operate in an unsafe fashion near sensitive infrastructure sites. They're using their drones to look for all these drones they're hearing about! And they barely even know what drones are!

[1] One of the reasons I don't write about UFOs that often, besides the fact that it gets me more weird threatening emails than any other topic, is that it's very hard to explain a lot of the events of UFO history without providing extensive background. The beliefs of individual people in the UFO community vary widely with respect to the credibility of well-known individuals. When someone admits a hoax, there is virtually always someone else who will claim the admission of the hoax to itself be a hoax (if not CIA disinfo). Some people, like Doty, have gone through this cycle so many times that it's hard to tell which of his lies he's lying about. The point is that you can't really say anything about UFOs without someone disagreeing with you to the point of anger, and so if I'm going to say anything at all I have to sort of push through and just write what I think. I encourage vigorous debate, and historically it has often been the lack of such debate that has created the biggest problems. But, you know, please be polite. If I am a CIA shill they're not paying me much for it.

[2] Inconsistent quotation-and-punctuation style is in the original due to the BBC's internally consistent but odd looking style manual rules for putting the punctuation inside or outside of the quote. They are, incidentally, pretty close to what I usually do. See, it's not just me struggling with where to put the period.

2024-12-11 travelers information stations

Histories of radio broadcasting often pay particular attention to the most powerful stations. For historic reasons, WBCT of Grand Rapids, Michigan broadcasts FM at 320,000 watts. Many AM stations are licensed to operate at 50,000 watts, but this modern license limit represented a downgrade for some. WLW, of Cincinnati, once made 500,000. Less is made of the fun you can have under 10 watts: what we now call the Traveler's Information Station (TIS).

The TIS was not formally established as a radio service until 1977, but has much earlier precedents. The American Association of Information Radio Operators, an advocacy group for TIS, has collected some of the history of early experimental low-power radio stations. Superintendent James R. McConaghie of Vicksburg National Military Park must have been something of a tinkerer, as he built a low-power AM transmitter for his car in the mid-1950s and used it to lead auto tours. He suggested that a tape recorder might be added to provide a pre-recorded narration, and so anticipated not only the TIS but a common system of narration for group tours to this day.

During the New York World's Fair in 1964, a "leaky cable" AM system was installed on the George Washington Bridge to provide driving directions to visitors. This is the first example I can find of a low-power AM station used for traffic guidance. I can't find much information about this system except that it was the work of William Halstead, a pioneering radio engineer. Halstead is best known for developing FM stereo, but as we will see, he was a major force in TIS as well.

The National Park Service continued to innovate in radio. Low-power stations offered a promising solution to the challenge of interpreting a park to increasing numbers of visitors, especially in the era of the automobile, when rangers no longer led tour groups from place to place. In 1968, Yellowstone acquired six custom-built low power AM transmitters that were installed at fixed locations around the park. Connected to an 8-track player with a continuous loop cartridge, they broadcast park announcements and interpretive information to visitors approaching popular attractions.

As an experiment, Yellowstone installed a five-mile "auto nature trail," a road with regularly spaced AM transmitters built for the experiment by Montana State University. The notion of an "auto nature trail" confounds our modern sensibilities, but such were the 1960s, when experiencing the world from the interior of your car was an American pastime. In a 1972 article on the effort, park service employees once again pointed out applications beyond park interpretation:

Not only is this new aspect of radio communications opening interpretation of natural areas to motorists, but the idea of being able to communicate with hundreds of motorists without having them stop their cars is a patrolman's blessing.

Along these lines, the NPS article mentions that the California Department of Transportation had deployed a low-power radio station to advise travelers of a detour on I-5 following the San Fernando earthquake. I have, unfortunately, not been able to find much information about this station---but the NPS article does tell us it used equipment from Info Systems.

Info Systems, Inc. appears to have been the first vendor of purpose-built transmitters for low-power informational stations. I haven't been able to find much information about them, and I'm a little unclear on the nature of the company--- they were apparently reselling transmitters built by vendors including ITT. I'm not sure if they were built to Info Systems designs, or if Info Systems was merely a reseller of equipment originally intended for some other application. Of course, I'm not sure what that application would have been, because at the time no such radio service existed. These transmitters operated either at milliwatt power levels under Part 15 rules, or at 10w under experimental licenses. This perhaps explains why the National Park Service figures so prominently into the history of low-power radio: as a federal agency, they presumably obtained their authorization to use radio equipment from the NTIA, not the FCC. The NTIA was likely more willing (or at least faster) to issue these experimental licenses. Info Systems transmitters were extensively installed by NPS, likely over a dozen just at Yellowstone.

In 1970, the general manager of Los Angeles International Airport became frustrated with the traffic jams at the arrival and departure lanes. He hoped to find a way to communicate with approaching drivers to better direct them---a project for which he hired William Halstead. Halstead partnered with radio consultant Richard Burden to design and install the system, and we are fortunate that Burden wrote a history of the project.

In 1972, a leaky cable antenna was buried along the median of Century Boulevard as it approached the airport. A second antenna was buried along the main airport loop, and two different NAB cartridge message repeaters (tape loop players) drove two separate transmitters. Drivers would thus begin to hear a different message as they crossed the overpass at Sepulveda Boulevard. Here, the short range of the low-power transmitters and inefficient antennas became an advantage, enabling a fairly small transition area between the two signals that would otherwise interfere.

Each of the message repeaters had three different cartridges they rotated through: a list of airlines using each terminal, parking information, and traffic information. Some of these recordings, like the traffic information, had different prerecorded variations that could be used depending on the weather and traffic conditions.

An interesting detail of the LAX radio system is that it was coupled to a new signage strategy. During development of the recordings, Burden realized that it was very difficult to direct drivers to terminals, since the terminal numbers were indicated by high-up signs that weren't noticeable from road level. Brand new signs were installed that were color coded (to identify terminals or parking areas) and bore large terminal numbers and a list of airlines served. The signs from this project were apparently in use at LAX at least until 2012. There is, of course, a lesson here, in that any new interpretive or information system will be most effective when it's installed as part of a larger, holistic strategy.

LAX's new traffic radio station operated at 830 kHz under an experimental license. Unfortunately, early experience with the system showed that drivers had a hard time tuning to 830 kHz using the slider-type tuners of the era, creating a dangerous wave of distraction as they passed the signs advertising the new radio station. Burden wanted to move the station to an extreme end of the AM band, where drivers could just push the slider until it stopped. Unfortunately, 540 kHz, the bottom of the established AM band, was licensed to a Mexican clear-channel station and could not be allocated so near to the border. Instead, Burden convinced the FCC to allow an experimental license for 530 kHz: the vast majority of cars, they found, would receive 530 kHz just fine when tuned to the bottom of their range. The frequency was formally allocated for aviation NDBs, but not in use at LAX or almost any other airport. Thus we have the origin of 530 kHz as one of the two standard frequencies for TIS [1].

By 1973, the FCC had started the rulemaking process to create a 10w TIS radio service. The National Park Service, apparently wanting to take a conservative approach to equipment purchasing, chose to stop buying new low-power AM transmitters until transmitters certified under the new FCC rules were available. In practice, this would take four years, during which time the lost sales to NPS were so great that Info Systems went out of business.

During this period, a company called Audio-Sine continued to manufacture and promote Part 15 AM transmitters---but for a different application. The "talking billboard," they proposed, would improve outdoor advertising by allowing travelers to tune their radio for more information on a product they saw along the roadside. The talking billboard concept never really caught on---a prototype, in Minneapolis, advertised for the idea of the talking billboard itself. "Look for talking billboards throughout this area in the near future." At least one other was installed, in Duluth, advertising for Dean Nyquist's primary race for Minnesota Attorney General. "The Audio Sign... gives a very positive pitch for the City of Duluth..." the campaign manager said. "I would advise the city or chamber of commerce to use one or more all the time." I wonder if he was invested in Audio-Sine. A newspaper article a few days later comments that the talking billboard apparently did not work, something the same campaign manager attributed to a railroad trestle blocking the signal.

This is an obvious limitation of Part 15 AM transmitters: the power limit is very low. Audio-Sine only really claimed a range of "4-8 blocks," and today I think you would struggle to meet even that. The more powerful 10W stations, operated under experimental licenses, could reach as much as eight miles in good conditions.

Despite their limitations, the Audio-Sine milliwatt transmitters did find some use as early equivalents of TIS. This overlap does make it amusing that when the California Department of Transportation introduced their first changeable message signs around the same time, they called them "talking billboards" in the press.

There exists to this day a "microbroadcasting" hobby, of individuals who operate low-power FM and AM transmitters under Part 15 rules. To these hobbyists, who are always looking to transmit the best signal they can within the rules, the specific technical details of these early transmitters are of great interest. They remain, to this day, just about the state of the art in intentional broadcast radio transmission within Part 15 rules. In fact, the availability of these commercially-manufactured low-power AM transmitters seems to have led to a short-lived boom of "whip and mast" Part 15 AM stations that attracted the attention of the FCC---not in a good way. Various details of our contemporary Part 15, such as the 3-meter antenna, feed line, and ground lead limitation of 47 CFR 15.219, seem to have been written to limit the range of the early 1970s Info Systems and Audio-Sine transmitters, along with a few other less prominent manufacturers of the day.

There are historical questions here that are very difficult to answer, which is frustrating. The exact interpretation of the limits on Part 15 intentional radiators is of great interest to hobbyists in the pirate-radio-adjacent space of legal unlicensed broadcasting, but the rules can be surprisingly confusing. You can imagine this leads to a lot of squinting at the CFRs, the history, and what exactly the FCC intended the rules to be when they were originally written. The fact that the FCC actually enforces according to a booklet of standards that it won't release but may be based on 1970s installation practices only makes the matter more intriguing.

In 1977, the FCC promulgated Part 90 rules formally establishing the Traveler's Information Station/Highway Advisory Radio service. TIS were allocated 530 kHz and 1610 kHz, the two extremes of the American AM broadcast band at the time. Incidentally, the AM broadcast band would later be extended up to 1700 kHz, but TIS on 1610 has not been moved. 530 and 1610 remain de facto exclusively allocated to TIS today. TIS rules remain largely unchanged today, although there have been some revisions to clarify that the established practice of "ribbons" (sequences of TIS transmitters) was permissible and to allow 5 kHz of audio bandwidth rather than the former 3 kHz.

Part 90-certified TIS transmitters are now commercially available from several manufacturers, and widely installed. Power is limited primarily in terms of field strength, although there is an RF output power limit as well. Leaky cable systems are permitted up to 50 watts into a 3 km long antenna to produce a field of 2 mV/m at 60 m from the antenna; conventional antenna stations are limited to 10 watts power into a vertically polarized antenna up to 15 m high and a field strength of 2 mV/m at 1.5 km. Most TIS installations are "whip and mast" types similar to those at the genesis of the category, using a monopole antenna mounted at the top of a signpost-type mast with the transmitter in a weathertight enclosure mounted to the side of the mast. You learn to recognize them. Typical coverage for a TIS station is 3 km (indeed, that is the limit on the planned coverage area).

Searching for TIS licenses is a little odd because of the formalities of the licensing. All TIS licenses must be issued to "government entities or park districts," in part because TIS is technically part of the public safety pool. The AM frequencies allocated to TIS stations are sort of "transferred" to the public safety pool (on a primary basis for 530 kHz and secondary basis for 1600-1700 kHz). In other words, TIS licenses are best found in ULS by searching the PW (public safety pool, conventional) service for frequencies between 0.530-1.700 MHz. There are 1,218 such licenses active.

I'm not going to provide a breakdown on all thousand-plus licenses, but I did take a quick look for any "interesting" entries, and some boring ones as examples of a typical application.

Consider the very first result, KMH441, licensed to the State of Illinois for 1610 kHz. It appears to have a surprisingly large tophat antenna. It probably serves weather advisories for the nearby freeway. Rather dull, but most TIS are just like this, except with less impressive antennas. KNIP553 is licensed to the Foothill-De Anza Community College District Police in Los Altos Hills, CA, at 1610 kHz as well. It's probably on the roof of one of the campus buildings. Like most TIS, there are essentially no mentions of this station on the internet, except in listings of TIS based on licenses.

KNNN871 1610 kHz is licensed to the city of Vail, Colorado, and this one got a local news article when it was installed. There are two transmitters. WNKG901, Greater New Orleans Expressway Commission, is on 1700 kHz and has four licensed transmitters at various toll plazas. The transmitters are standard whips on masts, but this one is in an unusual place.

WNRO290, State of New Mexico, operates at 530 kHz at the St. Francis/I-25 interchange in Santa Fe. The transmitter is totally typical and shoved into a median space.

WPEZ840 is assigned to the Lower Colorado River Authority and covers 1610 or 1670 kHz at six locations, each a power plant (some of them hydroelectric, but the Lower Colorado River Authority apparently operates some coal plants). Like many emergency-oriented TIS, these stations normally rebroadcast NOAA All-Hazards Weather Radio.

While TIS are limited to government agencies, there are definitely some cases of private organizations finding a government sponsor to obtain a TIS license. For example, Meteor Crater in Arizona has signs at the freeway advising that there is attraction information on 1610 kHz. This is WQDF361, which is actually licensed to the nearby City of Winslow. Like many TIS stations, the license contact is Information Station Specialists, a company that specializes in TIS including both equipment and licensing.

Because TIS are ubiquitous low-power AM stations, some DX (long-distance receiving) enthusiasts will try to pick up very distant TIS. Historically, some TIS operators would issue QSL cards. Considering that there are quite a few TIS in service that are government-registered but seem to be physically maintained by radio clubs or amateur radio operators, there are probably still a fair number out there that will return a QSL card if you try.

Having discussed TIS, we finally need to consider the fact that there are a lot of things that look and feel like TIS but are not. Most notably, when the Low Power FM (LPFM) class was established in 2000, one of the authorized functions of LPFM stations was something that is very much like, but not quite, TIS. A notable advantage of LPFM stations for this purpose (besides the higher popularity of FM radio despite its poorer range) is that the license class explicitly allows large-area networks composed of many low-power transmitters---something that is kind-of-sort-of possible with TIS using very long "ribbon" sequences, but not encouraged. These rules mean that TIS-type LPFM networks can feasibly cover multiple towns.

A major example is in Colorado, where the state operates eleven LPFM stations such as KASP-LP, 107.9 FM Aspen. Anyone familiar with the extreme difficulty of actually getting LPFM licenses will be rather jealous of the State of Colorado for bagging eleven, but then government agencies do get preference. The Colorado stations rebroadcast NOAA All-Hazards Weather Radio with 100 W of power, mostly just allowing people to listen to them without having a tuner capable of covering the 160 MHz weather band (an unfortunately common problem).

It's hard to know what the future holds for TIS. The broad decline in AM radio suggests that TIS may fade away as well, although it appears that AM receivers will be mandated in vehicles sold in the US. Some states, such as Virginia, have significantly reduced the number of TIS in operation. Still, some TIS systems are popular enough with drivers that plans to eliminate them lead to public objections. Most TIS operators are increasingly focusing on emergency communications rather than traffic advisories, since TIS offers a very reliable option for communications that is completely under local control---very local control, considering the short range.

[1] Wikipedia suggests that an NDB on 529 kHz at Manchester, TN can be heard in many parts of the US. There's a weird lack of basic information on this NDB, such as its location or the name of the airport it is located at. It seems to have been installed at a private airport by an amateur radio operator, probably as more of a hobby project than anything. I cannot find it on contemporary charts or even find an airport that fits the description, and I don't see references to it newer than 2009, so I think at least the NDB and possibly the entire airport are gone to history.

CodeSOD: Message Oriented Database

Mark was debugging some database querying code, and got a bit confused about what it was actually doing. Specifically, it generated a query block like this:

$statement="declare @status int
        declare @msg varchar(30)
        exec @status=sp_doSomething 'arg1', ...
        select @msg=convert(varchar(10),@status)
        print @msg
        ";

$result = sybase_query ($statement, $this->connection);

Run a stored procedure, capture its return value in a variable, stringify that variable and print it. The select/print must be for debugging, right? Leftover debugging code. Why else would you do something like that?

if (sybase_get_last_message()!=='0') {
    ...
}

Oh no. sybase_get_last_message gets the last string printed out by a print statement. This is a pretty bonkers way to get the results of a function or procedure call back, especially since, if there are any results (like a return value), they'll be in the $result return value.

Now that said, reading through those functions, it's a little unclear if you can actually get the return value of a stored procedure this way. Without testing it myself (and no, I'm not doing that), we're in a world where this might actually be the best way to do this.

So I'm not 100% sure where the WTF lies. In the developer? In the API designers? Sybase being TRWTF is always a pretty reliable bet. I suppose there's a reason why all those functions are listed as "REMOVED IN PHP 7.0.0", which was rolled out through 2015. So at least those functions have been dead for a decade.
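For anyone curious what the non-bonkers pattern looks like, here's a hedged sketch of capturing a stored procedure's return status on an ADO.NET/SQL Server stack. This is an assumption for contrast only- the original is PHP talking to Sybase- and sp_doSomething is just the placeholder name from the snippet above.

using System.Data;
using System.Data.SqlClient;

public static class ReturnStatusDemo
{
    public static int CallProcedure(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("sp_doSomething", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@arg1", "arg1");

            // The driver exposes the procedure's return status directly;
            // no print statement or "last message" side channel required.
            var status = cmd.Parameters.Add("@RETURN_VALUE", SqlDbType.Int);
            status.Direction = ParameterDirection.ReturnValue;

            conn.Open();
            cmd.ExecuteNonQuery();
            return (int)status.Value;
        }
    }
}

Whether the old sybase extension ever offered anything that clean is exactly the question I don't want to research.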


A Single Mortgage

We talked about singletons a bit last week. That reminded John of a story from the long-ago dark ages when we didn't have always-accessible mobile Internet access.

At the time, John worked for a bank. The bank, as all banks do, wanted to sell mortgages. This often meant sending an agent out to meet with customers face to face, and those agents needed to show the customer what their future would look like with that mortgage- payment calculations, and pretty little graphs about equity and interest.

Today, this would be a simple website, but again, reliable Internet access wasn't a thing. So they built a client side application. They tested the heck out of it, and it worked well. Sales agents were happy. Customers were happy. The bank itself was happy.

Time passed, as it has a way of doing, and the agents started clamoring for a mobile web version that they could use on their phones. Now, the first thought was, "Wire it up to the backend!" but the backend they had was a mainframe, and there was a dearth of mainframe developers. And while the mainframe was the source of truth, and the one place where mortgages actually lived, building a mortgage calculator that could do pretty visualizations was far easier- and they already had one.

The client app was in .NET, and it was easy enough to wrap the mortgage calculation objects up in a web service. A quick round of testing of the service proved that it worked just as well as the old client app, and everyone was happy - for a while.

Sometimes, agents would run a calculation and get absolutely absurd results. Developers, putting exactly the same values into their test environment, wouldn't see the bad output. Testing the errors in production didn't help either- it usually worked just fine. There was a Heisenbug, but how could a simple math calculation that had already been tested and used for years have a Heisenbug?

Well, the calculation ran by simulation- it simply iteratively applied payments and interest to generate the entire history of the loan. And as it turns out, because the client application which started this whole thing only ever needed one instance of the calculator, someone had made it a singleton. And in their web environment, this singleton wasn't scoped to a single request, it was a true global object, which meant when simultaneous requests were getting processed, they'd step on each other and throw off the iteration. And testing didn't find it right away, because none of their tests were simulating the effect of multiple simultaneous users.

The fix was simple- stop being a singleton, and ensure every request got its own instance. But it's also a good example of misapplication of patterns- there was no need in the client app to enforce uniqueness via the singleton pattern. A calculator that holds state probably shouldn't be a singleton in the first place.
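To make the hazard concrete, here's a minimal sketch- invented for illustration, not the bank's actual code- of a stateful amortization calculator. Shared as a process-wide singleton, two simultaneous calls can clear and refill the same list mid-simulation; give each request its own instance and the problem disappears.

using System;
using System.Collections.Generic;

// Hypothetical stand-in for the bank's calculator: names and numbers are invented.
public class MortgageCalculator
{
    // Mutable state built up during the simulation. Harmless per-instance,
    // hazardous when a single instance is shared across concurrent requests.
    private readonly List<double> _balances = new List<double>();

    public IReadOnlyList<double> Amortize(double principal, double monthlyRate, int months)
    {
        _balances.Clear(); // a second concurrent caller clearing this mid-run corrupts the first
        double payment = principal * monthlyRate / (1 - Math.Pow(1 + monthlyRate, -months));
        double balance = principal;
        for (int i = 0; i < months; i++)
        {
            balance = balance * (1 + monthlyRate) - payment;
            _balances.Add(balance);
        }
        return _balances;
    }
}

public static class Demo
{
    public static void Main()
    {
        // The fix described above: one instance per request, never a shared global.
        var calculator = new MortgageCalculator();
        var schedule = calculator.Amortize(200_000, 0.004, 360);
        Console.WriteLine($"Final balance: {schedule[schedule.Count - 1]:F2}");
    }
}

In an ASP.NET context the same effect usually comes from registering the calculator with transient or per-request scope in the DI container, rather than as a singleton.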


Error'd: Sentinel Headline

When faced with an information system lacking sufficient richness to permit its users to express all of the necessary data states, human beings will innovate. In other words, they will find creative ways to bend the system to their will, usually (but not always) inconsequentially.

In the early days of information systems, even before electronic computers, we found users choosing to insert various out-of-bounds values into data fields to represent states such as "I don't know the true value for this item" or "It is impossible to accurately state the true value of this item because of a faulty constraint being applied to the input mechanism" or other such notions.

This practice carried on into the computing age, so that now, numeric fields will often contain values of 9999 or 99999999. Taxpayer numbers will be listed as 000-00-0000 or any other repetition of the same digit or simple sequences. Requirements to enter names collected John Does. Now we also see a fair share of Disney characters.

Programmers then try to make their systems idiot-proof, with the obvious and entirely predictable results.

The mere fact that these inventions exist at all is entirely due to the omission of mechanisms for the metacommentary that we all know perfectly well is sometimes necessary. But rather than provide those, it's easier to wave our hands and pretend that these unwanted states won't exist, can be ignored, can be glossed over. "Relax" they'll tell you. "It probably won't ever happen." "If it does happen, it won't matter." "Don't lose your head over it."

The Beast in Black certainly isn't inclined to cover up an errant sentinel. "For that price, it had better be a genuine Louis XVI pillow from 21-January-1793." A La Lanterne!


Daniel D. doubled up on Error'ds for us. "Do you need the error details? Yes, please."


And again with an alert notification oopsie. "Google Analytics 4 never stops surprising us any given day with how bugged it is. I call it an "Exclamation point undefined". You want more info? Just Google it... Oh wait." I do appreciate knowing who is responsible for the various bodges we are sent. Thank you, Daniel.


"Dark pattern or dumb pattern?" wonders an anonymous reader. I don't think it's very dark.


Finally, Ian Campbell found a data error that doesn't look like an intentional sentinel. But I'm not sure what this number represents. It is not an integral power of 2. Says Ian, "SendGrid has a pretty good free plan now with a daily limit of nine quadrillion seven trillion one hundred ninety-nine billion two hundred fifty-four million seven hundred forty thousand nine hundred ninety-two."



CodeSOD: A Steady Ship

You know what definitely never changes? Shipping prices. Famously static, despite all economic conditions and the same across all shipping providers. It doesn't matter where you're shipping from, or to, you know exactly what the price will be to ship that package at all times.

Wait, what? You don't think that's true? It must be true, because Chris sent us this function, which calculates shipping prices, and it couldn't be wrong, could it?

public double getShippingCharge(String shippingType, bool saturday, double subTot)
{
    double shCharge = 0.00;
    if(shippingType.Equals("Ground"))
    {
        if(subTot <= 29.99 && subTot > 0)
        {
            shCharge = 4.95;
        }
        else if(subTot <= 99.99 && subTot > 29.99)
        {
            shCharge = 7.95;
        }
        else if(subTot <= 299.99 && subTot > 99.99)
        {
            shCharge = 9.95;
        }
        else if(subTot > 299.99)
        {
            shCharge = subTot * .05;
        }              
    }
    else if(shippingType.Equals("Two-Day"))
    {
        if(subTot <= 29.99 && subTot > 0)
        {
            shCharge = 14.95;
        }
        else if(subTot <= 99.99 && subTot > 29.99)
        {
            shCharge = 19.95;
        }
        else if(subTot <= 299.99 && subTot > 99.99)
        {
            shCharge = 29.95;
        }
        else if(subTot > 299.99)
        {
            shCharge = subTot * .10;
        }              
    }
    else if(shippingType.Equals("Next Day"))
    {
        if(subTot <= 29.99 && subTot > 0)
        {
            shCharge = 24.95;
        }
        else if(subTot <= 99.99 && subTot > 29.99)
        {
            shCharge = 34.95;
        }
        else if(subTot <= 299.99 && subTot > 99.99)
        {
            shCharge = 44.95;
        }
        else if(subTot > 299.99)
        {
            shCharge = subTot * .15;
        }              
    }
    else if(shippingType.Equals("Next Day a.m."))
    {
        if(subTot <= 29.99 && subTot > 0)
        {
            shCharge = 29.95;
        }
        else if(subTot <= 99.99 && subTot > 29.99)
        {
            shCharge = 39.95;
        }
        else if(subTot <= 299.99 && subTot > 99.99)
        {
            shCharge = 49.95;
        }
        else if(subTot > 299.99)
        {
            shCharge = subTot * .20;
        }              
    }                                      
    return shCharge;
}

Next you're going to tell me that passing the shipping types around as stringly typed data instead of enums is a mistake, too!
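If you'll indulge the hypothetical, here's a hedged sketch of the boring version: an enum for the shipping type and a lookup table for the tiers. The bracket values are copied from the function above; everything else, including dropping the never-used saturday flag, is invented.

using System;
using System.Collections.Generic;

public enum ShippingType { Ground, TwoDay, NextDay, NextDayAm }

public static class Shipping
{
    // Subtotal bracket caps shared by every shipping type.
    private static readonly double[] Caps = { 29.99, 99.99, 299.99 };

    // Flat charge per bracket, plus the percentage applied above the last bracket.
    private static readonly Dictionary<ShippingType, (double[] Flat, double Pct)> Tiers =
        new Dictionary<ShippingType, (double[] Flat, double Pct)>
        {
            [ShippingType.Ground]    = (new[] {  4.95,  7.95,  9.95 }, 0.05),
            [ShippingType.TwoDay]    = (new[] { 14.95, 19.95, 29.95 }, 0.10),
            [ShippingType.NextDay]   = (new[] { 24.95, 34.95, 44.95 }, 0.15),
            [ShippingType.NextDayAm] = (new[] { 29.95, 39.95, 49.95 }, 0.20),
        };

    public static double GetShippingCharge(ShippingType type, double subtotal)
    {
        if (subtotal <= 0) return 0.00;
        var (flat, pct) = Tiers[type];
        for (int i = 0; i < Caps.Length; i++)
            if (subtotal <= Caps[i]) return flat[i];
        return subtotal * pct;
    }
}

Adding a new tier, or a carrier that actually changes its prices, becomes a data change instead of another forty-line else-if ladder.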


CodeSOD: Single or Mingle

The singleton is arguably the easiest design pattern to understand, and thus one of the most frequently implemented design patterns, even- especially- when it isn't necessary. Its simplicity is its weakness.

Bartłomiej inherited some code which implemented this pattern many, many times. None of them worked quite correctly, and all of them tried to create a singleton a different way.

For example, this one:

public class SystemMemorySettings
{
    private static SystemMemorySettings _instance;

    public SystemMemorySettings()
    {
        if (_instance == null)
        {
            _instance = this;
        }
    }

    public static SystemMemorySettings GetInstance()
    {
        return _instance;
    }

    public void DoSomething()
    {
    ...
        // (this must only be done for singleton instance - not for working copy)
        if (this != _instance)
        {
            return;
        }
    ...
    }
}

The only thing they got correct was the static method which returns an instance, but everything else is wrong. They assign _instance inside the public constructor, meaning this isn't actually a singleton, since you can construct it as many times as you like. You just can't use the extra copies.

And you can't use it because of the real "magic" here: DoSomething, which checks if the currently active instance is also the originally constructed instance. If it isn't, this function just fails silently and does nothing.

A common critique of singletons is that they're simply "global variables with extra steps," but this doesn't even succeed at that- it's just a failure, top to bottom.
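For contrast, and assuming a singleton is actually wanted here at all, the conventional C# shape is only a few lines: a private constructor so nobody can make a "working copy" in the first place, and Lazy<T> for thread-safe, on-demand construction. This is a generic sketch, not the fix applied to Bartłomiej's codebase.

using System;

public sealed class SystemMemorySettings
{
    private static readonly Lazy<SystemMemorySettings> _instance =
        new Lazy<SystemMemorySettings>(() => new SystemMemorySettings());

    public static SystemMemorySettings GetInstance() => _instance.Value;

    // Private constructor: stray copies can't be constructed, so DoSomething
    // never needs to ask whether it's running on "the real" instance.
    private SystemMemorySettings() { }

    public void DoSomething()
    {
        // ...
    }
}

No constructor race, no silent no-op, no working copies.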


CodeSOD: Insanitize Your Inputs

Honestly, I don't know what to say about this code sent to us by Austin, beyond "I think somebody was very confused".

string text;
text = "";
// snip
box.Text = text;
text = "";
text = XMLUtil.SanitizeXmlString(text);

This feels like it goes beyond the usual cruft and confusion that comes with code evolving without ever really being thought about, and ends up in some space outside of meaning. It's all empty strings, signifying nothing, but we've sanitized it.


CodeSOD: Unnavigable

Do you know what I had forgotten until this morning? That VBScript (and thus, older versions of Visual Basic) doesn't require you to use parentheses when calling a function. Foo 5 and Foo(5) are the same thing.

Of course, why would I remember that? I thankfully haven't touched any of those languages since about… 2012. Which is actually a horrifyingly short time ago, back when I supported classic ASP web apps. Even when I did, I always used parentheses because I wanted my code to be something close to readable.

Classic ASP, there's a WTF for you. All the joy of the way PHP mixes markup and code into a single document, but with an arguably worse and weirder language.

Which finally, brings us to Josh's code. Josh worked for a traveling exhibition company, and that company had an entirely homebrewed CMS written in classic ASP. Here's a few hundred lines out of their navigation menu.

  <ul class=menuMain>
        <%  if menu = "1" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/home.asp' title='Home'>Home</a></li>"
            else
                Response.Write "<li><a href='/home.asp' title='Home'>Home</a></li>"
            end if
            if  menu = "2" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/about_wc_homepage.asp' title='About World Challenge'>About us</a></li>"
            else
                Response.Write "<li><a href='/expeditions/about_wc_homepage.asp' title='About World Challenge'>About us</a></li>"
            end if
            if  menu = "3" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/book-a-school-expedition.asp' title='How to book'>How to book</a></li>"
            else
                Response.Write "<li><a href='/expeditions/book-a-school-expedition.asp' title='How to book'>How to book</a></li>"
            end if
            if  menu = "4" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/expeditions_home.asp' title='Expeditions'>Expeditions</a></li>"
            else
                Response.Write "<li><a href='/expeditions/expeditions_home.asp' title='Expeditions'>Expeditions</a></li>"
            end if 
            if  menu = "5" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safety_home.asp' title='Safety'>Safety</a></li>"
            else 
                Response.Write "<li><a href='/expeditions/safety_home.asp' title='Safety'>Safety</a></li>"
            end if 
            if  menu = "6" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/mm_what_is_mm.asp' title='Fundraising support'>Fundraising</a></li>"
            else 
                Response.Write "<li><a href='/expeditions/mm_what_is_mm.asp' title='Fundraising support'>Fundraising</a></li>"
            end if 
            if  menu = "7" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/careers_home.asp' title='Work for us'>Work for us</a></li>"
            else
                Response.Write "<li><a href='/expeditions/careers_home.asp' title='Work for us'>Work for us</a></li>"
            end if          
            if  menu = "8" then
                Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/contact_us_home.asp' title='Contact us'>Contact us</a></li>"
            else 
                Response.Write "<li><a href='/expeditions/contact_us_home.asp' title='Contact us'>Contact us</a></li>"
            end if
        Response.Write "</ul>"
        Response.Write "<ul class='menuSub'>"
               if menu = "1" then
               end if
 
               if menu = "2" then   
                   if submenu = "1" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/about_wc_who_we_are.asp'  title='Who we are'>Who we are</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/about_wc_who_we_are.asp'title='Who we are'>Who we are</a></li>"
                   end if
                   if submenu = "2" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/world_challenge_CSR.asp' title='CSR'>CSR</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/world_challenge_CSR.asp' title='CSR'>CSR</a></li>"
                   end if
 
                   if submenu = "3" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/World-Challenge-Accreditation.asp' title='Partners and accreditation'>Partners and accreditation</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/World-Challenge-Accreditation.asp' title='Partners and accreditation'>Partners and accreditation</a></li>"
                   end if
 
                   if submenu = "4" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/curriculum-links.asp' title='Curriculum links'>Curriculum links</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/curriculum-links.asp' title='Curriculum links'>Curriculum links</a></li>"
                   end if
 
                   if submenu = "5" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/expedition_advice.asp' title='Expedition advice'>Expedition advice</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/expedition_advice.asp' title='Expedition advice'>Expedition advice</a></li>"
                   end if                   
                   if submenu = "6" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/about_wc_press_and_publications.asp' title='Press resources'>Press resources</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/about_wc_press_and_publications.asp' title='Press resources'>Press resources</a></li>"
                   end if   
               end if
 
               if menu = "3" then
               Response.Write "<li></li>"
               end if
 
               if menu = "4" then
                   if submenu = "1" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_lh_dest_ca.asp' title='Central & North America'>Central and North America</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/exped_lh_dest_ca.asp'  title='Central and North America'>Central and North America</a></li>"
                   end if   
                   if submenu = "2" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_lh_dest_sa.asp' title='South America'>South America</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/exped_lh_dest_sa.asp'  title='South America'>South America</a></li>"
                   end if
                   if submenu = "3" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_lh_dest_sea.asp' title='South East Asia'>South East Asia</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/exped_lh_dest_sea.asp' title='South East Asia'>South East Asia</a></li>"
                   end if
                   if submenu = "4" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_lh_dest_asia.asp' title='Asia'>Asia</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/exped_lh_dest_asia.asp' title='Asia'>Asia</a></li>"
                   end if
                   if submenu = "5" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_lh_dest_africa.asp' title='Africa'>Africa</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/exped_lh_dest_africa.asp' title='Africa'>Africa</a></li>"
                   end if
                   if submenu = "6" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/europe_school_expeditions.asp' title='Europe'>Europe</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/europe_school_expeditions.asp' title='Europe'>Europe</a></li>"
                   end if
                   if submenu = "7" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/community-projects.asp' title='Community projects'>Community projects</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/community-projects.asp' title='Community projects'>Community projects</a></li>"
                   end if
                   if submenu = "8" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/exped_indiv_home.asp' title='Independent'>Independent</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/exped_indiv_home.asp' title='Independent'>Independent</a></li>"
                   end if
               end if
 
               if menu = "5" then
                   if submenu = "1" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safe-people.asp' title='Safe People'>Safe people</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/safe-people.asp' title='Safe People'>Safe people</a></li>"
                   end if
                   if submenu = "2" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safe-place.asp' title='Safe places'>Safe places</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/safe-place.asp' title='Safe places'>Safe places</a></li>"
                   end if
                   if submenu = "3" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safe-policies-practises.asp' title='Safe practices and policies'>Safe practices and policies</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/safe-policies-practises.asp' title='Safe practices and policies'>Safe practices and policies</a></li>"
                   end if
                   if submenu = "4" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safe-resources.asp' title='Safe Resources'>Safe resources</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/safe-resources.asp' title='Safe Resources'>Safe resources</a></li>"
                   end if
                   if submenu = "5" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/safety_ops_centre.asp'  title='Operations Centre'>Operations Centre</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/safety_ops_centre.asp' title='Operations Centre'>Operations Centre</a></li>"
                   end if
                   if submenu = "6" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/travel_safety_course.asp' title='Travelsafe course'>Travelsafe course</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/travel_safety_course.asp'  title='Travelsafe course'>Travelsafe course</a></li>"
                   end if
               end if  
            
               if menu = "6" then
 
'                  if submenu = "1" then   
'                   Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/fundraising-team.asp' title='Fundraising team'>Fundraising team</a></li>"
'                  else   
'                   Response.Write "<li><a href='/expeditions/fundraising-team.asp'  title='Fundraising team'>Fundraising team</a></li>"
'                  end if   
 
                   if submenu = "2" then   
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/mm_ideas.asp' title='Fundraising ideas'>Fundraising ideas</a></li>"
                   else   
                    Response.Write "<li><a href='/expeditions/mm_ideas.asp'  title='Fundraising ideas'>Fundraising ideas</a></li>"
                   end if                   
                   if submenu = "3" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/about_wc_events_challenger_events.asp'  title='Fundraising events'>Fundraising events</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/about_wc_events_challenger_events.asp' title='Fundraising events'>Fundraising events</a></li>"
                   end if                   
               end if
 
               if menu = "7" then
                   if submenu = "1" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/careers_leader_ops_overseas.asp' title='Lead an expedition'>Lead an expedition</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/careers_leader_ops_overseas.asp'  title='Lead an expedition'>Lead an expedition</a></li>"
                   end if
                   if submenu = "2" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/permanent_jobs_world_challenge.asp'  title='Office based positions'>Office based positions</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/permanent_jobs_world_challenge.asp' title='Office based positions'>Office based positions</a></li>"
                   end if
               end if
 
               if menu = "8" then
                   if submenu = "1" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/pages/forms-brochure.asp'  title='Request a brochure'>Request a brochure</a></li>"
                   else
                    Response.Write "<li><a href='/pages/forms-brochure.asp'  title='Request a brochure'>Request a brochure</a></li>"
                   end if
                   if submenu = "2" then
                    Response.Write "<li class='activ'><b></b><i></i><a rel='external' href='http://f.chtah.com/s/3/2069554126/signup.html'  title='Sign up for e-news'>Sign up for e-news</a></li>"
                   else
                    Response.Write "<li><a rel='external' href='http://f.chtah.com/s/3/2069554126/signup.html'  title='Sign up for e-news'>Sign up for e-news</a></li>"
                   end if
                   if submenu = "3" then
                    Response.Write "<li class='activ'><b></b><i></i><a href='/expeditions/about_wc_press_and_publications.asp'  title='Press resources'>Press resources</a></li>"
                   else
                    Response.Write "<li><a href='/expeditions/about_wc_press_and_publications.asp'  title='Press resources'>Press resources</a></li>"
                   end if
               end if %>
                  </ul>

This renders the whole menu, but based on the selected menu and submenu, it adds an activ class to the HTML elements. Which means that each HTML element is defined here twice, once with and without the CSS class on it. I know folks like to talk about dry code, but this code is SOGGY with repetition. Just absolutely dripping wet with the same thing multiple times. Moist.
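The dry (sorry) alternative is to treat the menu as data and render each entry exactly once, adding the activ class only when the index matches. Here's a hedged sketch of that idea, in C# rather than classic ASP for brevity; the first few titles and URLs are lifted from the markup above, and the rendering helper itself is invented.

using System;
using System.Collections.Generic;
using System.Text;

public static class MenuRenderer
{
    // (title, url) pairs taken from the markup above; truncated for brevity.
    private static readonly List<(string Title, string Url)> MainMenu =
        new List<(string Title, string Url)>
        {
            ("Home",        "/home.asp"),
            ("About us",    "/expeditions/about_wc_homepage.asp"),
            ("How to book", "/expeditions/book-a-school-expedition.asp"),
            ("Expeditions", "/expeditions/expeditions_home.asp"),
        };

    public static string Render(int activeIndex)
    {
        var html = new StringBuilder("<ul class='menuMain'>");
        for (int i = 0; i < MainMenu.Count; i++)
        {
            var (title, url) = MainMenu[i];
            // Each entry is written once; only the class prefix varies.
            string prefix = (i == activeIndex) ? "<li class='activ'><b></b><i></i>" : "<li>";
            html.Append($"{prefix}<a href='{url}' title='{title}'>{title}</a></li>");
        }
        return html.Append("</ul>").ToString();
    }
}

The submenus get the same treatment- one more table keyed by the parent entry- and a few hundred Response.Write calls collapse into a couple of loops.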


Error'd: Mais Que Nada

I never did explain the elusive off-by-one I hinted at, did I? A little meta, perhaps. It is our practice at Error'd to supply five nuggets of joy each week. But in episode previous-plus-one, you actually got six! (Or maybe, depending on how you count them, that's yet another off-by-one. I slay me.) If that doesn't tickle you enough, just wait until you hear what Dave L. brought us. Meanwhile...

"YATZP" scoffed self-styled Foo AKA F. Yet Another Time Zone P*, I guess. Not wrong. According to Herr Aka F., "German TV teletext (yes, we still have it!) botched the DST start (upper right corner). The editors realized it and posted a message stating as much, sent from the 'future' (i.e. correct) time zone."


Michael R. wrote in with a thought-provoker. If I'm representing one o'clock as 1:00, two o'clock as 2:00, and so forth, why should zero o'clock be the only time represented with not just one, but TWO leading zeroes? Logically, zero o'clock should be represented simply by :00, right?


Meanwhile, (just) Randy points out that somebody failed to pay attention to detail. "Did a full-scroll on Baeldung's members page and saw this. Sometimes, even teachers don't get it right."


In case Michael R. is still job-hunting, Gary K. has found the perfect position for everyone. That is, assuming the tantalizingly missing Pay Range section conforms to the established pattern. "Does this mean I should put my qualifications in?" he wondered. Run, don't walk.


And in what I think is an all-time first for us, Dave L. brings (drum roll) an audio Error'd "I thought you'd like this recording from my Garmin watch giving me turn-by-turn directions: In 280.097 feet turn right. That's two hundred eighty feet and ONE POINT ONE SIX FOUR INCHES. Accuracy to a third of a millimeter!" Don't move your hand!

Representative Line: Get Explosive

Sean sends us a one-line function that is a delight, if by delight you mean "horror". You'll be shocked to know it's PHP.

function proget(){foreach($_GET as $k=>$v){if($k=="h"){$GLOBALS["h"]=1;}$p=explode(",",$k);}return($p);} //function to process GET headers

Based on the comment, proget is a shorthand for process_get_parameters. Which is sort of what it does. Sort of.

Let's go through this. We iterate across our $_GET parameters using $k for the key, $v for the value, but we never reference the value so forget it exists. We're iterating across every key. The first thing we check is if a key "h" exists. We don't look at its value, we just check if it exists, and if it does, we set a global variable. And this, right here, is enough for this to be a WTF. The logic of "set a global variable based on the existence of a query parameter regardless of the value of the query parameter" is… a lot. But then, somehow, this actually gets more out there.

We explode the key on commas (explode being PHP's much cooler name for split), which implies… our keys may be lists of values? Which I feel like is an example of someone not understanding what a "key" is. But worse than that, we just do this for every key, and return the results of performing that operation on the last key only. Which means that if this function is doing anything at all, it's entirely dependent on the order of the keys. PHP does keep keys in the order they were added, which I take to mean the caller is expected to put the comma-separated key last, with a URL whose query params look like ?foo=1&h=0&a,b,c,d=wtf. Or, if we're being picky about encoding, ?foo=1&h=0&a%2Cb%2Cc%2Cd=wtf. The only good news here is that PHP handles the encoding/decoding for you, so the explode will work as expected.

This is the kind of bad code that leaves me with lots of questions, and I'm not sure I want any of the answers. How did this happen, and why? Those are questions best left unanswered, because I think the answers might cause more harm.

CodeSOD: Join Us in this Query

Today's anonymous submitter worked for a "large, US-based, e-commerce company." This particular company was, some time back, looking to save money, and like so many companies do, that meant hiring offshore contractors.

Now, I want to stress, there's certainly nothing magical about national borders which turns software engineers into incompetents. The reality is simply that contractors never have their client's best interests at heart; they only want to be good enough to complete their contract. This gets multiplied by the contracting firm's desire to maximize their profits by keeping their contractors as booked as possible. And it gets further multiplied by the remoteness and siloing of the interaction, especially across timezones. Often, the customer sends out requirements, and three months later gets a finished feature, with no more contact than that- and it never goes well.

All that said, let's look at some SQL Server code. It's long, so we'll take it in chunks.

-- ===============================================================================
-- Author     : Ignacius Ignoramus
-- Create date: 04-12-2020
-- Description:	SP of Getting Discrepancy of Allocation Reconciliation Snapshot
-- ===============================================================================

That the comment reinforces that this is an "SP", aka stored procedure, is already not my favorite thing to see. The description is certainly made up of words, and I think I get the gist.

ALTER PROCEDURE [dbo].[Discrepency]
	(
		@startDate DATETIME,
		@endDate DATETIME
	)
AS

BEGIN

Nothing really to see here; it's easy to see that we're going to run a query for a date range. That's fine and common.

	DECLARE @tblReturn TABLE
	(
		intOrderItemId	   INT
	)

Hmm. T-SQL lets you define table variables, which are exactly what they sound like. It's a local variable in this procedure, that acts like a table. You can insert/update/delete/query it. The vague name is a little sketch, and the fact that it holds only one field also makes me go "hmmm", but this isn't bad.

	DECLARE @tblReturn1 TABLE
	(
		intOrderItemId	   INT
	)

Uh oh.

	DECLARE @tblReturn2 TABLE
	(
		intOrderItemId	   INT
	)

Oh no.

	DECLARE @tblReturn3 TABLE
	(
		intOrderItemId	   INT
	)

Oh no no no.

	DECLARE @tblReturn4 TABLE
	(
		intOrderItemId	   INT
	)

This doesn't bode well.

So they've declared five table variables, @tblReturn through @tblReturn4, that all hold the same single-column structure.

What happens next? This next block is gonna be long.

	INSERT INTO @tblReturn --(intOrderItemId) VALUES (@_ordersToBeAllocated)

	/* OrderItemsPlaced */		

		select 		
		intOrderItemId
		from CompanyDatabase..Orders o
		inner join CompanyDatabase..OrderItems oi on oi.intOrderId = o.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate


		AND intOrderItemId Not In 
		(

		/* _itemsOnBackorder */

		select intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..Orders o on o.intOrderId = oi.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate
		and oi.strstatus='backordered' 
		)

		AND intOrderItemId Not In 
		(

		/* _itemsOnHold */

		select intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..Orders o on o.intOrderId = oi.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate
		and o.strstatus='ONHOLD'
		and oi.strStatus <> 'BACKORDERED' 
		)

		AND intOrderItemId Not In 
		(

		/* _itemsOnReview */

		select  intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..Orders o on o.intOrderId = oi.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate 
		and o.strstatus='REVIEW' 
		and oi.strStatus <> 'BACKORDERED'
		)

		AND intOrderItemId Not In 
		(

		/*_itemsOnPending*/

		select  intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..Orders o on o.intOrderId = oi.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate
		and o.strstatus='PENDING'
		and oi.strStatus <> 'BACKORDERED'
		)

		AND intOrderItemId Not In 
		(

		/*_itemsCancelled */

		select  intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..Orders o on o.intOrderId = oi.intOrderId
		where o.dtmTimeStamp between @startDate and  @endDate
		and oi.strstatus='CANCELLED' 
		)

We insert into @tblReturn the result of a query, and this query relies heavily on using a big pile of subqueries to decide if a record should be included in the output- but these subqueries all query the same tables as the root query. I'm fairly certain this could be a simple join with a pretty readable where clause, but I'm also not going to sit here and rewrite it right now; we've got a lot more query to look at.

INSERT INTO @tblReturn1

		
		/* _backOrderItemsReleased */	

		select  intOrderItemId			
		from CompanyDatabase..OrderItems oi
		inner join CompanyDatabase..orders o on o.intorderid = oi.intorderid
		where oi.intOrderItemid in (
			  select intRecordID 
			  from CompanyDatabase..StatusChangeLog
			  where strRecordType = 'OrderItem'
			  and strOldStatus in ('BACKORDERED')
			  and strNewStatus in ('NEW', 'RECYCLED')
			  and dtmTimeStamp between @startDate and  @endDate  
		)
		and o.dtmTimeStamp < @startDate
		

		UNION
		(
			/*_pendingHoldItemsReleased*/

			select  intOrderItemId					
			from CompanyDatabase..OrderItems oi
			inner join CompanyDatabase..orders o on o.intorderid = oi.intorderid
			where oi.intOrderID in (
				  select intRecordID 
				  from CompanyDatabase..StatusChangeLog
				  where strRecordType = 'Order'
				  and strOldStatus in ('REVIEW', 'ONHOLD', 'PENDING')
				  and strNewStatus in ('NEW', 'PROCESSING')
				  and dtmTimeStamp between @startDate and  @endDate  
			)
			and o.dtmTimeStamp < @startDate
			
		)

		UNION

		/* _reallocationsowingtonostock */	
		(
			select oi.intOrderItemID				   	 
			from CompanyDatabase.dbo.StatusChangeLog 
			inner join CompanyDatabase.dbo.OrderItems oi on oi.intOrderItemID = CompanyDatabase.dbo.StatusChangeLog.intRecordID
			inner join CompanyDatabase.dbo.Orders o on o.intOrderId = oi.intOrderId  

			where strOldStatus = 'RECYCLED' and strNewStatus = 'ALLOCATED' 
			and CompanyDatabase.dbo.StatusChangeLog.dtmTimestamp > @endDate and 
			strRecordType = 'OrderItem'
			and intRecordId in 
			(
			  select intRecordId from CompanyDatabase.dbo.StatusChangeLog 
			  where strOldStatus = 'ALLOCATED' and strNewStatus = 'RECYCLED' 
			  and strRecordType = 'OrderItem'
			  and CompanyDatabase.dbo.StatusChangeLog.dtmTimestamp between @startDate and  @endDate  
			)  
		)

Okay, just some unions with more subquery filtering. More of the same. It's the next one that makes this special.

INSERT INTO @tblReturn2

	SELECT intOrderItemId FROM @tblReturn 
	
	UNION

	SELECT intOrderItemId FROM @tblReturn1

Ah, here's the stuff. This is just bonkers. If the goal is to combine the results of these queries into a single table, you could just insert into one table the whole time.

But we know that there are 5 of these tables, so why are we only going through the first two to combine them at this point?

    INSERT INTO @tblReturn3

		/* _factoryAllocation*/

		select 
		oi.intOrderItemId                              
		from CompanyDatabase..Shipments s 
		inner join CompanyDatabase..ShipmentItems si on si.intShipmentID = s.intShipmentID
		inner join Common.CompanyDatabase.Stores stores on stores.intStoreID = s.intLocationID
		inner join CompanyDatabase..OrderItems oi on oi.intOrderItemId = si.intOrderItemId                                      
		inner join CompanyDatabase..Orders o on o.intOrderId = s.intOrderId  
		where s.dtmTimestamp >= @endDate
		and stores.strLocationType = 'FACTORY'
		
		UNION 
		(
	 	  /*_storeAllocations*/

		select oi.intOrderItemId                               
		from CompanyDatabase..Shipments s 
		inner join CompanyDatabase..ShipmentItems si on si.intShipmentID = s.intShipmentID
		inner join Common.CompanyDatabase.Stores stores on stores.intStoreID = s.intLocationID
		inner join CompanyDatabase..OrderItems oi on oi.intOrderItemId = si.intOrderItemId                                      
		inner join CompanyDatabase..Orders o on o.intOrderId = s.intOrderId
		where s.dtmTimestamp >= @endDate
		and stores.strLocationType <> 'FACTORY'
		)

		UNION
		(
		/* _ordersWithAllocationProblems */
    	
			select oi.intOrderItemId
			from CompanyDatabase.dbo.StatusChangeLog
			inner join CompanyDatabase.dbo.OrderItems oi on oi.intOrderItemID = CompanyDatabase.dbo.StatusChangeLog.intRecordID
			inner join CompanyDatabase.dbo.Orders o on o.intOrderId = oi.intOrderId
			where strRecordType = 'orderitem'
			and strNewStatus = 'PROBLEM'
			and strOldStatus = 'NEW'
			and CompanyDatabase.dbo.StatusChangeLog.dtmTimestamp > @endDate
			and o.dtmTimestamp < @endDate
		)

Okay, @tblReturn3 is more of the same. Nothing more to really add.

	 INSERT INTO @tblReturn4
	
	 SELECT intOrderItemId FROM @tblReturn2 WHERE
	 intOrderItemId NOT IN(SELECT intOrderItemId FROM @tblReturn3 )

Ooh, but here we see something a bit different- we're taking the set difference between @tblReturn2 and @tblReturn3. This would almost make sense if there weren't already set operations in T-SQL (EXCEPT, in this case) which would handle all of this.

Which brings us, finally, to the last query in the whole thing:

SELECT 
	 o.intOrderId
	,oi.intOrderItemId
	,o.dtmDate
	,oi.strDescription
	,o.strFirstName + o.strLastName AS 'Name'
	,o.strEmail
	,o.strBillingCountry
	,o.strShippingCountry
	FROM CompanyDatabase.dbo.OrderItems oi
	INNER JOIN CompanyDatabase.dbo.Orders o on o.intOrderId = oi.intOrderId
	WHERE oi.intOrderItemId IN (SELECT intOrderItemId FROM @tblReturn4)
END

At the end of all this, I've determined a few things.

First, the developer responsible didn't understand table variables. Second, they definitely didn't understand joins. Third, they had no sense of the overall workflow of this query and just sorta fumbled through until they got results that the client said were okay.

And somehow, this pile of trash made it through a code review by internal architects and got deployed to production, where it promptly became the worst performing query in their application. Correction: the worst performing query thus far.

CodeSOD: A Ruby Encrusted Footgun

Many years ago, JP joined a Ruby project. This was in the heyday of Ruby, when every startup on Earth was using it, and if you weren't building your app on Rails, were you even building an app?

Now, Ruby offers a lot of flexibility. One might argue that it offers too much flexibility, especially insofar as it permits "monkey patching": you can always add new methods to an existing class, if you want. Regardless of the technical details, JP and the team saw that massive flexibility and said, "Yes, we should use that. All of it!"

As these stories usually go, that was fine- for a while. Then one day, a test started failing because a class name wasn't defined. That was already odd, but what was even odder is that when they searched through the code, that class name wasn't actually used anywhere. So yes, there was definitely no class with that name, but also, there was no line of code that was trying to instantiate that class. So where was the problem?

def controller_class(name)
  "#{settings.app_name.camelize}::Controllers".constantize.const_get("#{name.to_s.camelize}")
end

def model_class(name)
  "#{settings.app_name.camelize}".constantize.const_get("#{name.to_s.camelize}")
end

def resource_class(name)
  "#{settings.app_name.camelize}Client".constantize.const_get("#{name.to_s.camelize}")
end

It happened because they were dynamically constructing the class names from a settings field. And not just in this handful of lines- this pattern occurred all over the codebase. There were other places where it referenced a different settings field, and they just hadn't encountered the bug yet, but knew that it was only a matter of time before changing a settings file was going to break more functionality in the application.

They wisely rewrote these sections to not reference the settings, and dubbed the pattern the "Caramelize Pattern". They added that to their coding standards as a thing to avoid, and learned a valuable lesson about how languages provide footguns.

Since today's April Fool's Day, consider the prank the fact that everyone learned their lesson and corrected their mistakes. I suppose that has to happen at least sometimes.

CodeSOD: Nobody's BFF

Legacy systems are hard to change, and even harder to eliminate. You can't simply do nothing though; as technology and user expectations change, you need to find ways to modernize and adapt the legacy system.

That's what happened to Alicia's team. They had a gigantic, spaghetti-coded, monolithic application that was well past drinking age and had a front-end to match. Someone decided that they couldn't touch the complex business logic, but what they could do was replace the frontend code by creating an adapter service; the front end would call into this adapter, and the adapter would execute the appropriate methods in the backend.

Some clever coder named this "Backend for Frontend" or "BFF".

It was not anyone's BFF. For starters, this system didn't actually allow you to just connect a UI to the backend. No, that'd be too easy. This system was actually a UI generator.

The way this works is that you feed it a schema file, written in JSON. This file specifies what input elements you want, some hints for layout, what validation you want the UI to perform, and even what CSS classes you want. Then you compile this as part of a gigantic .NET application, and deploy it, and then you can see your new UI.

No one likes using it. No one is happy that it exists. Everyone wishes that they could just write frontends like normal people, and not use this awkward schema language.

All that is to say, when Alicia's co-worker stood up shortly before lunch, said, "I'm taking off the rest of the day, BFF has broken me," it wasn't particularly shocking to hear- or even the first time that'd happened.

Alicia, not heeding the warning inherent in that statement, immediately tracked down that dev's last work, and tried to understand what had been so painful.

    "minValue": 1900,
    "maxValue": 99,

This, of course, had to be a bug. Didn't it? How could the maxValue be lower than the minValue?

Let's look at the surrounding context.

{
    "type": "eventValueBetweenValuesValidator",
    "eventType": "CalendarYear",
    "minValue": 1900,
    "maxValue": 99,
    "isCalendarBasedMaxValue": true,
    "message": "CalendarYear must be between {% raw %}{{minValue}}{% endraw %} and {% raw %}{{maxValue}}{% endraw %}."
}

I think this should make it perfectly clear what's happening. Oh, it doesn't? Look at the isCalendarBasedMaxValue field. It's true. There, that should explain everything. No, it doesn't? You're just more confused?

The isCalendarBasedMaxValue says that the maxValue field should not be treated as a literal value, but instead, is the number of years in the future relative to the current year which are considered valid. This schema definition says "accept all years between 1900 and 2124 (at the time of this writing)." Next year, that top value goes up to 2125. Then 2126. And so on.
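If the validator does what the flag's name implies, the code behind it presumably resolves the bound at validation time. Here is a hypothetical C# sketch of that behavior; the names and structure are invented, and only the semantics described above come from the schema.

using System;

// Hypothetical sketch of how an "eventValueBetweenValuesValidator" might resolve
// its bounds; this is a guess at the semantics, not the actual BFF code.
class CalendarYearValidatorDemo
{
    static int ResolveMaxValue(int maxValue, bool isCalendarBasedMaxValue) =>
        // With the flag set, maxValue is an offset in years from the current year,
        // not a literal year: 99 means 2124 in 2025, 2125 in 2026, and so on.
        isCalendarBasedMaxValue ? DateTime.Now.Year + maxValue : maxValue;

    static bool IsValidCalendarYear(int year) =>
        year >= 1900 && year <= ResolveMaxValue(99, isCalendarBasedMaxValue: true);

    static void Main()
    {
        Console.WriteLine(IsValidCalendarYear(1899)); // False
        Console.WriteLine(IsValidCalendarYear(2100)); // True, for the foreseeable future
    }
}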

As features go, it's not a terrible feature. But the implementation of the feature is incredibly counter-intuitive. At the end of the day, this is just bad naming: (ab)using min/max to do something that isn't really a min/max validation is the big issue here.

Alicia writes:

I couldn't come up with something more counterintuitive if I tried.

Oh, don't sell yourself short, Alicia. I'm sure you could write something far, far worse if you tried. The key thing here is that clearly, nobody tried- they just sorta let things happen and definitely didn't think too hard about it.

Error'd: Here Comes the Sun

We got an unusual rash of submissions at Error'd this week. Here are five reasonably good ones chosen not exactly at random. For those few (everyone) who didn't catch the off-by-one from last week's batch, there's the clue.

"Gotta CAPTCHA 'Em All," puns Alex G. "So do I select them all?" he wondered. I think the correct answer is null.

"What does a null eat?" wondered B.J.H , "and is one null invited or five?". The first question is easily answered. NaaN, of course. Probably garlic. I would expect B.J. to already know the eating habits of a long-standing companion, so I am guessing that the whole family is not meant to tag along. Stick with just the one.

Planespotter Rick R. caught this one at the airport. "Watching my daughter's flight from New York and got surprised by Boeing's new supersonic 737 having already arrived in DFW," he observed. I'm not quite sure what went wrong. It's not the most obvious time zone mistake I can imagine, but I'm pretty sure the cure is the same: all times displayed in any context that is not purely restricted to a single location (and short time frame) should explicitly include the relevant timezone.

Rob H. figures "From my day job's MECM Software Center. It appears that autocorrect has miscalculated, because the internet cannot be calculated." The internet is -1.

Ending this week on a note of hope, global warrior Stewart may have just saved the planet. "Climate change is solved. We just need to replicate the 19 March performance of my new solar panels." Or perhaps I miscalculated.

A Bracing Way to Start the Day

Barry rolled into work at 8:30AM to see the project manager waiting at the door, wringing her hands and sweating. She paced a bit while Barry badged in, and then immediately explained the issue:

Today was a major release of their new features. This wasn't just a mere software change; the new release was tied to major changes to a new product line- actual widgets rolling off an assembly line right now. And those changes didn't work.

"I thought we tested this," Barry said.

"We did! And Stu called in sick today!"

Stu was the senior developer on the project, who had written most of the new code.

"I talked to him for a few minutes, and he's convinced it's a data issue. Something in the metadata or something?"

"I'll take a look," Barry said.

He skipped grabbing a coffee from the carafe and dove straight in.

Prior to the recent project, the code had looked something like this:

if (IsProduct1(_productId))
	_programId = 1;
elseif (IsProduct2(_productId))
	_programId = 2;
elseif (IsProduct3(_productId))
	_programId = 3;

Part of the project, however, was about changing the workflow for "Product 3". So Stu had written this code:

if (IsProduct1(_productId))
	_programId = 1;
else if (IsProduct2(_productId))
	_programId = 2;
else if (IsProduct3(_productId))
	_programId = 3;
	DoSomethingProductId3Specific1();
	DoSomethingProductId3Specific2();
	DoSomethingProductId3Specific3();

Since this is C# and not Python, it took Barry all of 5 seconds to spot this and figure out what the problem was and fix it:

if (IsProduct1(_productId))
{
	_programId = 1;
}
else if (IsProduct2(_productId))
{
	_programId = 2;
}
else if (IsProduct3(_productId))
{
	_programId = 3;
	DoSomethingProductId3Specific1();
	DoSomethingProductId3Specific2();
	DoSomethingProductId3Specific3();
}

This brings us to about 8:32. Now, given the problems, Barry wasn't about to just push this change- in addition to running pipeline tests (and writing tests that Stu clearly hadn't), he pinged the head of QA to get a tester on this fix ASAP. Everyone worked quickly, and that meant by 9:30 the fix was considered good and ready to be merged in and pushed to production. Sometime in there, while waiting for a pipeline to complete, Barry managed to grab a cup of coffee to wake himself up.

While Barry was busy with that, Stu had decided that he wasn't feeling that sick after all, and had rolled into the office around 9:00. Which meant that just as Barry was about to push the button to run the release pipeline, an "URGENT" email came in from Stu.

"Hey, everybody, I fixed that bug. Can we get this released ASAP?"

Barry went ahead and released the version that he'd already tested, but out of morbid curiosity, went and checked Stu's fix.

if (IsProduct1(_productId))
	_programId = 1;
else if (IsProduct2(_productId))
	_programId = 2;
else if (IsProduct3(_productId))
{
	_programId = 3;
}

if (IsProduct3(_productId))
{
	DoSomethingProductId3Specific1();
	DoSomethingProductId3Specific2();
	DoSomethingProductId3Specific3();
}

At least this version would have worked, though I'm not sure Stu fully understands what "{}"s mean in C#. Or in most programming languages, if we're being honest.

With Barry's work, the launch went off just a few minutes later than the scheduled time. Since the launch was successful, at the next company "all hands", the leadership team made sure to congratulate the people instrumental in making it happen: that is to say, the lead developer of the project, Stu.

Representative Line: Time for Identification

If you need a unique ID, UUIDs provide a variety of options. It's worth noting that versions 1, 2, and 7 all incorporate a timestamp into the UUID. In the case of version 7, this has the benefit of making the UUID sortable, which can be convenient in many cases (v1/v2 incorporate a MAC address, which means they're sortable if generated with the same NIC).

I bring this up because Dave inherited some code written by a "guru". Said guru was working before UUIDv7 was a standard, but also didn't have any problems that required sortable UUIDs, and thus had no real reason to use timestamp based UUIDs. They just needed some random identifier and, despite using C#, didn't use the UUID functions built in to the framework. No, they instead did this:

string uniqueID = String.Format("{0:d9}", (DateTime.UtcNow.Ticks / 10) % 1000000000);

A Tick is 100 nanoseconds. We divide that by ten, mod by a billion, and then call that our unique identifier.

This is, as you might guess, not unique. First there's the possibility of timestamp collisions: generating two of these too close together in time would collide. Second, the math is just complete nonsense. We divide Ticks by ten (converting 100-nanosecond ticks into microseconds), then we mod by a billion. So every thousand seconds we loop and have a risk of collision again?

Maybe, maybe, these are short-lived IDs and a thousand seconds is plenty of time. But even if that's true, none of this is a good way to do that.

I suppose the saving grace is they use UtcNow and not Now, thus avoiding situations where collisions also happen because of time zones?
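For contrast, here is a minimal sketch (not Dave's codebase, just an illustration) putting the tick arithmetic next to the GUID generation the framework already ships with.

using System;

// A hedged illustration: the submitted "unique ID" next to .NET's built-in GUID.
class UniqueIdDemo
{
    static void Main()
    {
        // The guru's approach: the UTC tick count converted to microseconds, modulo
        // one billion. The counter wraps every 1,000,000,000 microseconds (1,000
        // seconds), and two calls within the same microsecond collide immediately.
        string notActuallyUnique =
            String.Format("{0:d9}", (DateTime.UtcNow.Ticks / 10) % 1000000000);

        // What the framework already provides: a 128-bit random GUID.
        string unique = Guid.NewGuid().ToString();

        Console.WriteLine(notActuallyUnique);
        Console.WriteLine(unique);
    }
}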

Elon Musk Gives Himself a Handshake

By: Nick Heer

Kurt Wagner and Katie Roof, Bloomberg:

Elon Musk said his xAI artificial intelligence startup has acquired the X platform, which he also controls, at a valuation of $33 billion, marking a surprise twist for the social network formerly known as Twitter.

This feels like it has to be part of some kind of financial crime, right? Like, I am sure it is not; I am sure this is just a normal thing businesses do that only feels criminal, like how they move money around the world to avoid taxes.

Wagner and Roof:

The deal gives the new combined entity, called XAI Holdings, a value of more than $100 billion, not including the debt, according to a person familiar with the arrangement, who asked not to be identified because the terms weren’t public. Morgan Stanley was the sole banker on the deal, representing both sides, other people said.

For perspective, that is around about the current value of Lockheed Martin, Rio Tinto — one of the world’s largest mining businesses — and Starbucks. All of those companies make real products with real demand — unfortunately so, in the case of the first. xAI has exactly one external customer today. And it is not like unpleasant social media seems to be a booming business.

Kate Conger and Lauren Hirsch, New York Times:

This month, X continued to struggle to hit its revenue targets, according to an internal email seen by The New York Times. As of March 3, X had served $91 million of ads this year, the message said, well below its first-quarter target of $153 million.

This is including the spending of several large advertisers. For comparison, in the same quarter in the pre-Musk era, Twitter generated over a billion dollars in advertising revenue.

I am begging for Matt Levine to explain this to me.

⌥ Permalink

Apple’s Missteps in A.I. Are Partly the Fault of A.I.

By: Nick Heer

Allison Morrow, CNN:

Tech columnists such as the New York Times’ Kevin Roose have suggested recently that Apple has failed AI, rather than the other way around.

“Apple is not meeting the moment in AI,” Roose said on his podcast, Hard Fork, earlier this month. “I just think that when you’re building products with generative AI built into it, you do just need to be more comfortable with error, with mistakes, with things that are a little rough around the edges.”

To which I would counter, respectfully: Absolutely not.

Via Dan Moren, of Six Colors:

The thesis of the piece is not about excusing Apple’s AI missteps, but zooming out to take a look at the bigger picture of why AI is everywhere, and make the argument that maybe Apple is well-served by not necessarily being on the cutting edge of these developments.

If that is what this piece is arguing, I do not think Apple makes a good case for it. When it launched Apple Intelligence, it could have said it was being more methodical, framing a modest but reliable feature set as a picture of responsibility. This would be a thin layer of marketing speak covering the truth, of course, but that would at least set expectations. Instead, what we got was a modest and often unreliable feature set with mediocre implementation, and the promise of a significantly more ambitious future that has been kicked down the road.

These things do not carry the Apple promise, as articulated by Morrow, of “design[ing] things that are accessible out of the box”, products for which “[y]ou will almost never need a user manual filled with tiny print”. It all feels flaky and not particularly nice to use. Even the toggle to turn it off is broken.

⌥ Permalink

Sponsor: Magic Lasso Adblock: Incredibly Private and Secure Safari Web Browsing

By: Nick Heer

Online privacy isn’t just something you should be hoping for – it’s something you should expect. You should ensure your browsing history stays private and is not harvested by ad networks.

Magic Lasso Adblock: No ads, no trackers, no annoyances, no worries

By blocking ad trackers, Magic Lasso Adblock stops you being followed by ads around the web.

As an efficient, high performance and native Safari ad blocker, Magic Lasso blocks all intrusive ads, trackers and annoyances on your iPhone, iPad, and Mac. And it’s been designed from the ground up to protect your privacy.

Users rely on Magic Lasso Adblock to:

  • Remove ad trackers, annoyances and background crypto-mining scripts

  • Browse common websites 2.0× faster

  • Block all YouTube ads, including pre-roll video ads

  • Double battery life during heavy web browsing

  • Lower data usage when on the go

With over 5,000 five-star reviews, it's simply the best ad blocker for your iPhone, iPad, and Mac.

And unlike some other ad blockers, Magic Lasso Adblock respects your privacy, doesn’t accept payment from advertisers and is 100% supported by its community of users.

So, join over 350,000 users and download Magic Lasso Adblock today.

⌥ Permalink

Meta Adds ‘Friends’ Tab to Facebook to Show Posts From Users’ Friends

By: Nick Heer

Meta:

Formerly a place to view friend requests and People You May Know, the Friends tab will now show your friends’ stories, reels, posts, birthdays and friend requests.

You know, I think this concept of showing people things they say they want to see might just work.

Meta says this is just one of “several ‘O.G.’ Facebook experiences [coming] throughout the year” — a truly embarrassing sentence. But Mark Zuckerberg said in an autumn earnings call that Facebook would “add a whole new category of content which is A.I. generated or A.I. summarized content, or existing content pulled together by A.I. in some way”. This plan is going just great. I think the way these things can be reconciled is exactly how Facebook is doing it: your friends go in a “Friends” tab, but you will see all the other stuff it wants to push on you by default. Just look how Meta has done effectively the same thing in Instagram and Threads.

⌥ Permalink

The Myth and Reality of Mac OS X Snow Leopard

By: Nick Heer

Jeff Johnson in November 2023:

When people wistfully proclaim that they wish for the next major macOS version to be a “Snow Leopard update”, they’re wishing for the wrong thing. No major update will solve Apple’s quality issues. Major updates are the cause of quality issues. The solution would be a long string of minor bug fix updates. What people should be wishing for are the two years of stability and bug fixes that occurred after the release of Snow Leopard. But I fear we’ll never see that again with Tim Cook in charge.

I read an article today from yet another person pining for a mythical Snow Leopard-style MacOS release. While I sympathize with the intent of their argument, it is largely fictional and, as Johnson writes, it took until about two years into Snow Leopard’s release cycle for it to be the release we want to remember:

It’s an iron law of software development that major updates always introduce more bugs than they fix. Mac OS X 10.6.0 was no exception, of course. The next major update, Mac OS X 10.7.0, was no exception either, and it was much buggier than 10.6.8 v1.1, even though both versions were released in the same week.

What I desperately miss is that period of stability after a few rounds of bug fixes. As I have previously complained about, my iMac cannot run any version of MacOS newer than Ventura, released in 2022. It is still getting bug and security fixes. In theory, this should mean I am running a solid operating system despite missing some features.

It is not. Apple’s engineering efforts quickly moved toward shipping MacOS Sonoma in 2023, and then Sequoia last year. It seems as though any bug fixes were folded into these new major versions and, even worse, new bugs were introduced late in the Ventura release cycle that have no hope of being fixed. My iMac seizes up when I try to view HDR media; because this Extended Dynamic Range is an undocumented enhancement, there is no preference to turn it off. Recent Safari releases have contained several bugs related to page rendering and scrolling. Weather sometimes does not display for my current location.

Ventura was by no means bug-free when it shipped, and I am disappointed even its final form remains a mess. My MacBook Pro is running the latest public release of MacOS Sequoia and it, too, has new problems late in its development cycle; I reported a Safari page crashing bug earlier this week. These are on top of existing problems, like how there is no way to change the size of search results’ thumbnails in Photos.

Alas, I am not expecting many bugs to be fixed. It is, after all, nearly April, which means there are just two months until WWDC and the first semi-public builds of another new MacOS version. I am hesitant every year to upgrade. But it does not appear much effort is being put into the maintenance of any previous version. We all get the choice of many familiar bugs, or a blend of hopefully fewer old bugs plus some new ones.

⌥ Permalink

The New Substack Universe

By: Nick Heer

Remember when Substack’s co-founders went to great lengths to explain what they had built was little more than infrastructure? It was something they repeated earlier this year:

You need to have your own corner of the internet, a place where you can build a home, on your own land, with assets you control.

Our system gives creators ownership. With Substack, you have your own property to build on: content you own, a URL of your choosing, a website for your work, and a mailing list of your subscribers that you can export and take with you at any time.

This is a message the company reinforces because it justifies a wildly permissive environment for posters that requires little oversight. But it is barely more true that Substack is “your own land, with assets you control” than, say, a YouTube channel. The main thing Substack has going for it is that you can export a list of subscribers’ email accounts. Otherwise, the availability of your material remains subject to Substack’s priorities and policies.

What Substack in fact offers, and what differentiates it from a true self-owned “land”, is a comprehensive set of media formats and opportunities for promotion.

Charlotte Klein, New York magazine:

Substack today has all of the functionalities of a social platform, allowing proprietors to engage with both subscribers (via the Chat feature) or the broader Substack universe in the Twitter-esque Notes feed. Writers I spoke to mentioned that for all of their reluctance to engage with the Notes feature, they see growth when they do. More than 50 percent of all subscriptions and 30 percent of paid subscriptions on the platform come directly from the Substack network. There’s been a broader shift toward multimedia content: Over half of the 250 highest-revenue creators were using audio and video in April 2024, a number that had surged to 82 percent by February 2025.

Substack is now a blogging platform with email capabilities, a text-based social platform, a podcasting platform, and a video host — all of which can be placed behind a paywall. This is a logical evolution for the company. But please do not confuse this with infrastructure. YouTube can moderate its platform as it chooses and so can Substack. The latter has decided to create a special category filled to the brim with vaccine denialism publications that have “tens of thousands of paid subscribers”, from which Substack takes ten percent of earnings.

⌥ Permalink

Public Figures Keep Leaving Their Venmo Accounts Public

By: Nick Heer

The high-test idiocy of a senior U.S. politician inviting a journalist to an off-the-record chat planning an attack on Yemen, killing over thirty people and continuing a decade of war, seems to have popularized a genre of journalism dedicated to the administration’s poor digital security hygiene. Some of these articles feel less substantial; others suggest greater crimes. One story feels like deja vu.

Dhruv Mehrotra and Tim Marchman, Wired:

The Venmo account under [Mike] Waltz’s name includes a 328-person friend list. Among them are accounts sharing the names of people closely associated with Waltz, such as [Walker] Barrett, formerly Waltz’s deputy chief of staff when Waltz was a member of the House of Representatives, and Micah Thomas Ketchel, former chief of staff to Waltz and currently a senior adviser to Waltz and President Donald Trump.

[…]

One of the most notable appears to belong to [Susie] Wiles, one of Trump’s most trusted political advisers. That account’s 182-person friend list includes accounts sharing the names of influential figures like Pam Bondi, the US attorney general, and Hope Hicks, Trump’s former White House communications director.

In 2021, reporters for Buzzfeed News found Joe Biden’s Venmo account and his contacts. Last summer, the same Wired reporters plus Andrew Couts found J.D. Vance’s and, in February, reporters for the American Prospect found Pete Hegseth’s. It remains a mystery to me why one of the most popular U.S. payment apps is this public.

⌥ Permalink

The War on Encryption Is Dangerous

By: Nick Heer

Meredith Whittaker, president of Signal — which has recently been in the news — in an op-ed for the Financial Times:

The UK is part and parcel of a dangerous trend that threatens the cyber security of our global infrastructures. Legislators in Sweden recently proposed a law that would force communication providers to build back door vulnerabilities. France is poised to make the same mistake when it votes on the inclusion of “ghost participants” in secure conversations via back doors. “Chat control” legislation haunts Brussels.

There is some good news: French legislators ultimately rejected this provision.

⌥ Permalink

WWDC 2025 Announced

By: Nick Heer

Like those since 2020, WWDC 2025 appears to be an entirely online event with a one-day in-person component. While it is possible there will be live demos — I certainly hope that is the case — I bet it is a two-hour infomercial again.

If you are planning on travelling there and live outside the United States, there are some things you should know and precautions you should take, particularly if you are someone who is transgender or nonbinary. It is a good thing travel is not required, and hopefully Apple will once again run labs worldwide.

⌥ Permalink

You Are Just a Guest on Meta’s A.I.-Filled Platforms

By: Nick Heer

Jason Koebler, 404 Media:

The best way to think of the slop and spam that generative AI enables is as a brute force attack on the algorithms that control the internet and which govern how a large segment of the public interprets the nature of reality. It is not just that people making AI slop are spamming the internet, it’s that the intended “audience” of AI slop is social media and search algorithms, not human beings.

[…]

“Brute force” is not just what I have noticed while reporting on the spammers who flood Facebook, Instagram, TikTok, YouTube, and Google with AI-generated spam. It is the stated strategy of the people getting rich off of AI slop.

Regardless of whether you have been following Koebler's A.I. slop beat, you owe it to yourself to read this article at least. The goal, Koebler surmises, is for Meta to target slop and ads at users in more-or-less the same way and, because this slop is cheap and fast to produce, it is a bottomless cup of engagement metrics.

Koebler, in a follow-up article:

As I wrote last week, the strategy with these types of posts is to make a human linger on them long enough to say to themselves “what the fuck,” or to be so horrified as to comment “what the fuck,” or send it to a friend saying “what the fuck,” all of which are signals to the algorithm that it should boost this type of content but are decidedly not signals that the average person actually wants to see this type of thing. The type of content that I am seeing right now makes “Elsagate,” the YouTube scandal in which disturbing videos were targeted to kids and resulted in various YouTube reforms, look quaint.

Matt Growcoot, PetaPixel:

Meta is testing an Instagram feature that suggests AI-generated comments for users to post beneath other users’ photos and videos.

Meta is going to make so much money before it completely disintegrates on account of nobody wanting to spend this much time around a thin veneer over robots.

⌥ Permalink

Facebook to Stop Targeting Ads at U.K. Woman After Legal Fight

By: Nick Heer

Grace Dean, BBC News:

Ms O’Carroll’s lawsuit argued that Facebook’s targeted advertising system was covered by the UK’s definition of direct marketing, giving individuals the right to object.

Meta said that adverts on its platform could only be targeted to groups of a minimum size of 100 people, rather than individuals, so did not count as direct marketing. But the Information Commissioner’s Office (ICO) disagreed.

“Organisations must respect people’s choices about how their data is used,” a spokesperson for the ICO said. “This means giving users a clear way to opt out of their data being used in this way.”

Meta, in response, says “no business can be mandated to give away its services for free”, a completely dishonest way to interpret the ICO’s decision. There is an obvious difference between advertising and personalized advertising. To pretend otherwise is nonsense. Sure, personalized advertising makes Meta more money than non-personalized advertising, but that is an entirely different problem. Meta can figure it out. Or it can be a big soggy whiner about it.

⌥ Permalink

Apple Adds Lossless Audio Support Via Cable to USB-C AirPods Max

By: Nick Heer

John Voorhees, MacStories:

The update [next month] will enable 24-bit, 48 kHz lossless audio, which Apple says is supported by over 100 million songs on Apple Music. Using the headphones’ USB-C cable, musicians will enjoy ultra-low latency and lossless audio in their Logic Pro workflows. The USB-C cable will allow them to produce Personalized Spatial Audio, too.

Allow me to recap the absurd timeline of lossless support for AirPods models.

In December 2020, Apple launched the first AirPods Max models promising “high-fidelity sound” and “the ultimate personal listening experience”. These headphones are mostly designed for wireless listening, but a 3.5mm-to-Lightning cable allows you to connect them to analog sources. Five months later, Apple announces lossless audio in Apple Music. These tracks are not delivered in full fidelity to any AirPods model, including the AirPods Max, because of Bluetooth bandwidth limits, nor when AirPods Max are used in wired mode.

In September 2023, Apple updates the AirPods Pro 2 with a USB-C charging case and adds lossless audio playback over “a groundbreaking wireless audio protocol”, but only when using the Vision Pro — a capability also added to the AirPods 4 line. These headphones all have the H2 chip; the pre-USB-C AirPods Pro 2 also had the H2, but do not support lossless audio.

In September 2024, Apple announces a seemingly minor AirPods Max update with new colours and a USB-C port where a Lightning one used to be. Crucially, it still contains the same H1 chip as the Lightning version.

In March 2025, Apple says lossless audio will now be supported by the AirPods Max, but only in a wired configuration, and only for the USB-C model. I feel like there must be technical reasons for this mess, but it is a mess nonetheless.

⌥ Permalink

Google Lost User Data, Makes Its Recovery a Problem for Users

By: Nick Heer

Simon Sharwood, the Register:

Over the weekend, users noticed their Timelines went missing.

Google seems to have noticed, too, as The Register has seen multiple social media posts in which Timelines users share an email from the search and ads giant in which it admits “We briefly experienced a technical issue that caused the deletion of Timeline data for some people.”

The email goes on to explain that most users that availed themselves of a feature that enables encrypted backups will be able to restore their Maps Timelines data.

Once again, Google provides no explanation for why it is incapable of reliably storing user data, and no customer support. Users are on their own.

⌥ Permalink

Sponsor: Magic Lasso Adblock: 2.0× Faster Web Browsing in Safari

By: Nick Heer

Want to experience twice as fast load times in Safari on your iPhone, iPad, and Mac?

Then download Magic Lasso Adblock — the ad blocker designed for you.

Magic Lasso Adblock: browse 2.0x faster

As an efficient, high performance, and native Safari ad blocker, Magic Lasso blocks all intrusive ads, trackers, and annoyances – delivering a faster, cleaner, and more secure web browsing experience.

By cutting down on ads and trackers, common news websites load 2× faster and browsing uses less data while saving energy and battery life.

Rely on Magic Lasso Adblock to:

  • Improve your privacy and security by removing ad trackers

  • Block all YouTube ads, including pre-roll video ads

  • Block annoying cookie notices and privacy prompts

  • Double battery life during heavy web browsing

  • Lower data usage when on the go

With over 5,000 five-star reviews, it's simply the best ad blocker for your iPhone, iPad, and Mac.

And unlike some other ad blockers, Magic Lasso Adblock respects your privacy, doesn’t accept payment from advertisers, and is 100% supported by its community of users.

So, join over 350,000 users and download Magic Lasso Adblock today.

⌥ Permalink

‘Adolescence’

By: Nick Heer

Lucy Mangan, the Guardian:

There have been a few contenders for the crown [of “televisual perfection”] over the years, but none has come as close as Jack Thorne’s and Stephen Graham’s astonishing four-part series Adolescence, whose technical accomplishments – each episode is done in a single take – are matched by an array of award-worthy performances and a script that manages to be intensely naturalistic and hugely evocative at the same time. Adolescence is a deeply moving, deeply harrowing experience.

I did not intend on watching the whole four-part series today, maybe just the first and second episodes. But I could not turn away. The effectively unanimous praise for this is absolutely earned.

The oner format sounds like it could be a gimmick, the kind of thing that screams a bit too loud and overshadows what should be a tender and difficult narrative. Nothing could be further from the truth. The technical decisions force specific storytelling decisions, in the same way that a more maximalist production in the style of, say, David Fincher does. Fincher would shoot fifty versions of everything and then assemble the best performances into a tight machine — and I love that stuff. But I love this, too, little errors and all. It is better for these choices. The dialogue cannot get just a little bit tighter in the edit, or whatever. It is all just there.

I know nothing about reviewing television or movies but, so far as I can tell, everyone involved has pulled this off spectacularly. You can quibble with things like the rainbow party-like explanation of different emoji — something for which I cannot find any evidence — that has now become its own moral panic. I get that. Even so, this is one of the greatest storytelling achievements I have seen in years.

Update: Watch it on Netflix. See? The ability to edit means I can get away with not fully thinking this post through.

⌥ Permalink

Trapping Misbehaving Bots in an A.I. Labyrinth

By: Nick Heer

Reid Tatoris, Harsh Saxena, and Luis Miglietti, of Cloudflare:

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

Two thoughts:

  1. This is amusing. Nothing funnier than using someone’s own words or, in this case, technology against them.

  2. This is surely going to lead to the same arms race as exists now between privacy protections and hostile adtech firms. Right?

⌥ Permalink

⌥ Apple Could Build Great Platforms for Third-Party A.I. If It Wanted To

By: Nick Heer

There is a long line of articles questioning Apple’s ability to deliver on artificial intelligence because of its position on data privacy. Today, we got another in the form of a newsletter.

Reed Albergotti, Semafor:

Meanwhile, Apple was focused on vertically integrating, designing its own chips, modems, and other components to improve iPhone margins. It was using machine learning on small-scale projects, like improving its camera algorithms.

[…]

Without their ads businesses, companies like Google and Meta wouldn’t have built the ecosystems and cultures required to make them AI powerhouses, and that environment changed the way their CEOs saw the world.

Again, I will emphasize this is a newsletter. It may seem like an article from a prestige publisher that prides itself on “separat[ing] the facts from our views”, but you might notice how, aside from citing some quotes and linking to ads, none of Albergotti’s substantive claims are sourced. This is just riffing.

I remain skeptical. Albergotti frames this as both a mindset shift and a necessity for advertising companies like Google and Meta. But the company synonymous with the A.I. boom, OpenAI, does not have the same business model. Besides, Apple behaves like other A.I. firms by scraping the web and training models on massive amounts of data. The evidence for this theory seems pretty thin to me.

But perhaps a reluctance to be invasive and creepy is one reason why personalized Siri features have been delayed. I hope Apple does not begin to mimic its peers in this regard; privacy should not be sacrificed. I think it is silly to be dependent on corporate choices rather than legislation to determine this, but that is the world some of us live in.

Let us concede the point anyhow, since it suggests a role Apple could fill by providing an architecture for third-party A.I. on its products. It does not need to deliver everything to end users; it can focus on building a great platform. Albergotti might sneeze at “designing its own chips […] to improve iPhone margins”, which I am sure was one goal, but it has paid off in ridiculously powerful Macs perfect for A.I. workflows. And, besides, it has already built some kind of plugin architecture into Apple Intelligence because it has integrated ChatGPT. There is no way for other providers to add their own extension — not yet, anyhow — but the system is there.

Gus Mueller:

The crux of the issue in my mind is this: Apple has a lot of good ideas, but they don’t have a monopoly on them. I would like some other folks to come in and try their ideas out. I would like things to advance at the pace of the industry, and not Apple’s. Maybe with a blessed system in place, Apple could watch and see how people use LLMs and other generative models (instead of giving us Genmoji that look like something Fisher-Price would make). And maybe open up the existing Apple-only models to developers. There are locally installed image processing models that I would love to take advantage of in my apps.

Via Federico Viticci, MacStories:

Which brings me to my second point. The other feature that I could see Apple market for a “ChatGPT/Claude via Apple Intelligence” developer package is privacy and data retention policies. I hear from so many developers these days who, beyond pricing alone, are hesitant toward integrating third-party AI providers into their apps because they don’t trust their data and privacy policies, or perhaps are not at ease with U.S.-based servers powering the popular AI companies these days. It’s a legitimate concern that results in lots of potentially good app ideas being left on the table.

One of Apple’s specialties is in improving the experience of using many of the same technologies as everyone else. I would like to see that in A.I., too, but I have been disappointed by its lacklustre efforts so far. Even long-running projects where it has had time to learn and grow have not paid off, as anyone can see in Siri’s legacy.

What if you could replace these features? What if Apple’s operating systems were great platforms by which users could try third-party A.I. services and find the ones that fit them best? What if Apple could provide certain privacy promises, too? I bet users would want to try alternatives in a heartbeat. Apple ought to welcome the challenge.

Technofossils

By: Nick Heer

Damian Carrington, the Guardian:

Their exploration of future fossils has led [Prof. Sarah] Gabbott and [Prof. Jan] Zalasiewicz to draw some conclusions. One is that understanding how human detritus could become fossils points towards how best to stop waste piling up in the environment.

“In the making of fossils, it’s the first few years, decades, centuries and millennia which are really crucial,” says Zalasiewicz. “This overlaps with the time in which we have the capacity to do something about it.”

Gabbott says: “The big message here is that the amount of stuff that we are now making is eye-watering – it’s off the scale.” All of the stuff made by humans by 1950 was a small fraction of the mass of all the living matter on Earth. But today it outweighs all plants, animals and microbes and is set to triple by 2040.

It is disconcerting that our evidence of civilization accumulated over the span of many tens of thousands of years, yet we have matched it within just a few decades. We are converting so much of the matter on this planet into things we care about for only a few minutes to a few years, but their mark will last forever.

Gabbott and Zalasiewicz’s book “Discarded” is out now. I hope my local library stocks it soon.

⌥ Permalink

Apple Head Computer, Apple Intelligence, and Apple Computer Heads

By: Nick Heer

Benedict Evans:

That takes us to xR, and to AI. These are fields where the tech is fundamental, and where there are real, important Apple kinds of questions, where Apple really should be able to do something different. And yet, with the Vision Pro Apple stumbled, and then with AI it’s fallen flat on its face. This is a concern.

The Vision Pro shipped as promised and works as advertised. But it’s also both too heavy and bulky and far too expensive to be a viable mass-market consumer product. Hugo Barra called it an over-engineered developer kit — you could also call it an experiment, or a preview or a concept. […]

The main problem, I think, with the reception of the Vision Pro is that it was passed through the same marketing lens as Apple uses to frame all its products. I have no idea if Apple considers the sales of this experiment acceptable, the tepid developer adoption predictable, or the skeptical press understandable. However, if you believe the math on display production and estimated sales figures, they more-or-less match.

Of course, as Evans points out, Apple does not ship experiments:

The new Siri that’s been delayed this week is the mirror image of this. […]

However, it clearly is a problem that the Apple execution machine broke badly enough for Apple to spend an hour at WWDC and a bunch of TV commercials talking about vapourware that it didn’t appear to understand was vapourware. The decision to launch the Vision Pro looks like a related failure. It’s a big problem that this is late, but it’s an equally big problem that Apple thought it was almost ready.

Unlike the Siri feature delay, I do not think the Vision Pro’s launch affects the company’s credibility at all. It can keep pushing that thing and trying to turn it into something more mass-market. This Siri stuff is going to make me look at WWDC in a whole different light this year.

Mark Gurman, Bloomberg:

Chief Executive Officer Tim Cook has lost confidence in the ability of AI head John Giannandrea to execute on product development, so he’s moving over another top executive to help: Vision Pro creator Mike Rockwell. In a new role, Rockwell will be in charge of the Siri virtual assistant, according to the people, who asked not to be identified because the moves haven’t been announced.

[…]

Rockwell is known as the brains behind the Vision Pro, which is considered a technical marvel but not a commercial hit. Getting the headset to market required a number of technical breakthroughs, some of which leveraged forms of artificial intelligence. He is now moving away from the Vision Pro at a time when that unit is struggling to plot a future for the product.

If you had no context for this decision, it looks like Rockwell is being moved off Apple’s hot new product and onto a piece of software that perennially disappoints. It looks like a demotion. That is how badly Siri needs a shakeup.

Giannandrea will remain at the company, even with Rockwell taking over Siri. An abrupt departure would signal publicly that the AI efforts have been tumultuous — something Apple is reluctant to acknowledge. Giannandrea’s other responsibilities include oversight of research, testing and technologies related to AI. The company also has a team reporting to Giannandrea investigating robotics.

I figured as much. Gurman does not clarify in this article how much of Apple Intelligence falls under Giannandrea’s rubric, and how much is part of the “Siri” stuff that is being transferred to Rockwell. It does not sound as though Giannandrea will have no further Apple Intelligence responsibilities — yet — but the high-profile public-facing stuff is now overseen by Rockwell and, ultimately, Craig Federighi.

Apple’s Restrictions on Third-Party Hardware Interoperability

By: Nick Heer

There is a free market argument that can be made about how Apple gets to design its own ecosystem and, if it is so restrictive, people will be more hesitant to buy an iPhone since they can get more choices with an Android phone. I get that. But I think it is unfortunate so much of our life coalesces around devices which are so restrictive compared to those which came before.

Recall Apple’s “digital hub” strategy. The Mac would not only connect to hardware like digital cameras and music players; the software Apple made for it would empower people to do something great with those photos and videos and their music.

The iPhone repositioned that in two ways. First, the introduction of iCloud was a way to “demote” the Mac to a device at an equivalent level to everything else. Second, and just as importantly, is how it converged all that third-party hardware into a single device: it is the digital camera, the camcorder, and the music player. As a result, its hub-iness comes mostly in the form of software. If a developer can assume the existence of particular hardware components, they have extraordinary latitude to build on top of that. However, because Apple exercises control over this software ecosystem, it limits its breadth.

Like the Mac of 2001, it is also a hub for accessories — these days, things like headphones and smartwatches. Apple happens to make examples of both. You can still connect third-party devices — but they are limited.

Eric Migicovsky, of Pebble:

I want to set expectations accordingly. We will build a good app for iOS, but be prepared – there is no way for us to support all the functionality that Apple Watch has access to. It’s impossible for a 3rd party smartwatch to send text messages, or perform actions on notifications (like dismissing, muting, replying) and many, many other things.

Even if you believe Apple is doing this not out of anticompetitive verve, but instead for reasons of privacy, security, API support, and any number of other qualities, it still sucks. What it means is that Apple is mostly competing against itself, particularly in smartwatches. (Third-party Bluetooth headphones, like the ones I have, mostly work fine.)

The European Commission announced guidance today for improving third-party connectivity with iOS. Apple is, of course, miserable about this. I am curious to see the real-world results, particularly as the more dire predictions of permitting third-party app distribution have — shockingly — not materialized.

Imagine how much more interesting this ecosystem could be if there were substantial support across “host” platforms.

In universities, sometimes simple questions aren't simple

By: cks

Over on the Fediverse I shared a recent learning experience:

Me, an innocent: "So, how many professors are there in our university department?"
Admin person with a thousand yard stare: "Well, it depends on what you mean by 'professor', 'in', and 'department'." <unfolds large and complicated chart>

In many companies and other organizations, the status of people is usually straightforward. In a university, things are quite often not so clear, and in my department all three words in my joke are in fact not a joke (although you could argue that two overlap).

For 'professor', there are a whole collection of potential statuses beyond 'tenured or tenure stream'. Professors may be officially retired but still dropping by to some degree ('emeritus'), appointed only for a limited period (but doing research, not just teaching), hired as sessional instructors for teaching, given a 'status-only' appointment, and other possible situations.

(In my university, there's such a thing as teaching stream faculty, who are entirely distinct from sessional instructors. In other universities, all professors are what we here would call 'research stream' professors and do research work as well as teaching.)

For 'in', even once you have a regular full time tenure stream professor, there's a wide range of possibilities for a professor to be cross-appointed (also) between departments (or sometimes 'partially appointed' by two departments). These sorts of multi-department appointments are done for many reasons, including to enable a professor in one department to supervise graduate students in another one. How much of the professor's salary each department pays varies, as does where the professor actually does their research and what facilities they use in each department.

(Sometimes a multi-department professor will be quite active in both departments because their core research is cross-disciplinary, for example.)

For 'department', this is a local peculiarity in my university. We have three campuses, and professors are normally associated with a specific campus. Depending on how you define 'the department', you might or might not consider Computer Science professors at the satellite campuses to be part of the (main campus) department. Sometimes it depends on what the professors opt to do, for example whether or not they will use our main research computing facilities, or whether they'll be supervising graduate students located at our main campus.

Which answers you want for all of these depends on what you're going to use the resulting number (or numbers) for. There is no singular and correct answer for 'how many professors are there in the department'. The corollary to this is that any time we're asked how many professors are in our department, we have to quiz the people asking about what parts matter to them (or guess, or give complicated and conditional answers, or all of the above).

(Asking 'how many professor FTEs do we have' isn't any better.)

PS: If you think this complicates the life of any computer IAM system that's trying to be a comprehensive source of answers, you would be correct. Locally, my group doesn't even attempt to track these complexities and instead has a much simpler view of things that works well enough for our purposes (mostly managing Unix accounts).

US sanctions and your VPN (and certain big US-based cloud providers)

By: cks

As you may have heard (also) and to simplify, the US government requires US-based organizations to not 'do business with' certain countries and regions (what this means in practice depends in part on which lawyer you ask, or more to the point, which lawyer the US-based organization asked). As a Canadian university, we have people from various places around the world, including sanctioned areas, and sometimes they go back home. Also, we have a VPN, and sometimes when people go back home, they use our VPN for various reasons (including that they're continuing to do various academic work while they're back at home). Like many VPNs, ours normally routes all of your traffic out of our VPN public exit IPs (because people want this, for good reasons).

Getting around geographical restrictions by using a VPN is a time honored Internet tradition. As a result of it being a time honored Internet tradition, a certain large cloud provider with a lot of expertise in browsers doesn't just determine what your country is based on your public IP; instead, as far as we can tell, it will try to sniff all sorts of attributes of your browser and your behavior and so on to tell if you're actually located in a sanctioned place despite what your public IP is. If this large cloud provider decides that you (the person operating through the VPN) actually are in a sanctioned region, it then seems to mark your VPN's public exit IP as 'actually this is in a sanctioned area' and apply the result to other people who are also working through the VPN.

(Well, I simplify. In real life the public IP involved may only be one part of a signature that causes the large cloud provider to decide that a particular connection or request is from a sanctioned area.)

Based on what we observed, this large cloud provider appears to deal with connections and HTTP requests from sanctioned regions by refusing to talk to you. Naturally this includes refusing to talk to your VPN's public exit IP when it has decided that your VPN's IP is really in a sanctioned country. When this sequence of events happened to us, this behavior provided us with an interesting and exciting opportunity to discover how many companies hosted some part of their (web) infrastructure and assets (static or otherwise) on the large cloud provider, and also how hard the resulting failures were to diagnose. Some pages didn't load at all; some pages loaded only partially, or had stuff that was supposed to work but didn't (because fetching JavaScript had failed); with some places you could load their main landing page (on one website) but then not move to the pages (on another website at a subdomain) that you needed to use to get things done.

The partial good news (for us) was that this large cloud provider would reconsider its view of where your VPN's public exit IP 'was' after a day or two, at which point everything would go back to working for a while. This was also sort of the bad news, because it made figuring out what was going on somewhat more complicated and hit or miss.

If this is relevant to your work and your VPNs, all I can suggest is to get people to use different VPNs with different public exit IPs depending on where they are (or force them to, if you have some mechanism for that).

PS: This can presumably also happen if some of your people are merely traveling to and in the sanctioned region, either for work (including attending academic conferences) or for a vacation (or both).

(This is a sysadmin war story from a couple of years ago, but I have no reason to believe the situation is any different today. We learned some troubleshooting lessons from it.)

Three ways I know of to authenticate SSH connections with OIDC tokens

By: cks

Suppose, not hypothetically, that you have an MFA equipped OIDC identity provider (an 'OP' in the jargon), and you would like to use it to authenticate SSH connections. Specifically, like with IMAP, you might want to do this through OIDC/OAuth2 tokens that are issued by your OP to client programs, which the client programs can then use to prove your identity to the SSH server(s). One reason you might want to do this is because it's hard to find non-annoying, MFA-enabled ways of authenticating SSH, and your OIDC OP is right there and probably already supports sessions and so on. So far I've found three different projects that will do this directly, each with their own clever approach and various tradeoffs.

(The bad news is that all of them require various amounts of additional software, including on client machines. This leaves SSH apps on phones and tablets somewhat out in the cold.)

The first is ssh-oidc, which is a joint effort of various European academic parties, although I believe it's also used elsewhere (cf). Based on reading the documentation, ssh-oidc works by directly passing the OIDC token to the server, I believe through a SSH 'challenge' as part of challenge/response authentication, and then verifying it on the server through a PAM module and associated tools. This is clever, but I'm not sure if you can continue to do plain password authentication (at least not without PAM tricks to selectively apply their PAM module depending on, eg, the network area the connection is coming from).

Second is Smallstep's DIY Single-Sign-On for SSH (also). This works by setting up a SSH certificate authority and having the CA software issue signed, short-lived SSH client certificates in exchange for OIDC authentication from your OP. With client side software, these client certificates will be automatically set up for use by ssh, and on servers all you need is to trust your SSH CA. I believe you could even set this up for personal use on servers you SSH to, since you set up a personally trusted SSH CA. On the positive side, this requires minimal server changes and no extra server software, and preserves your ability to directly authenticate with passwords (and perhaps some MFA challenge). On the negative side, you now have a SSH CA you have to trust.

(One reason to care about still supporting passwords plus another MFA challenge is that it means that people without the client software can still log in with MFA, although perhaps somewhat painfully.)

The third option, which I've only recently become aware of, is Cloudflare's recently open-sourced 'opkssh' (via, Github). OPKSSH builds on something called OpenPubkey, which uses a clever trick to embed a public key you provide in (signed) OIDC tokens from your OP (for details see here). OPKSSH uses this to put a basically regular SSH public key into such an augmented OIDC token, then smuggles it from the client to the server by embedding the entire token in a SSH (client) certificate; on the server, it uses an AuthorizedKeysCommand to verify the token, extract the public key, and tell the SSH server to use the public key for verification (see How it works for more details). If you want, as far as I can see OPKSSH still supports using regular SSH public keys and also passwords (possibly plus an MFA challenge).

(Right now OPKSSH is not ready for use with third party OIDC OPs. Like so many things it's started out by only supporting the big, established OIDC places.)

It's quite possible that there are other options for direct (ie, non-VPN) OIDC based SSH authentication. If there are, I'd love to hear about them.

(OpenBao may be another 'SSH CA that authenticates you via OIDC' option; see eg Signed SSH certificates and also here and here. In general the OpenBao documentation gives me the feeling that using it merely to bridge between OIDC and SSH servers would be swatting a fly with an awkwardly large hammer.)

How we handle debconf questions during our Ubuntu installs

By: cks

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or when installing packages by hand. However, I consider it not so much safe as safe enough during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than what we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.
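
Here's a minimal Go sketch of the 'fsync after open' pattern for a journal or log file; the file name and the data written are just placeholders for illustration:

// Minimal sketch of the 'fsync after open' pattern; the file name and
// payload are placeholders, not anything from a real system.
package main

import (
    "log"
    "os"
)

func main() {
    f, err := os.OpenFile("journal.log", os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // fsync() right after open(): in practice this flushes any dirty data
    // for the file, including data written earlier by other processes.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }

    if _, err := f.Write([]byte("journal entry\n")); err != nil {
        log.Fatal(err)
    }
    // fsync() again so this new entry is durable before we carry on.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }
}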

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)
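
As a toy illustration (in Go, and definitely not kernel code), here is a sketch of a cache keyed by (filesystem, inode); two 'file descriptors' that refer to the same inode automatically share the same cached data:

// Toy sketch of a filesystem cache indexed by (filesystem, inode).
// All names here are made up for illustration.
package main

import "fmt"

type inodeKey struct {
    fsID  uint64
    inode uint64
}

// cache maps (filesystem, inode) -> cached file contents.
var cache = map[inodeKey][]byte{}

type fileDesc struct {
    key inodeKey // every descriptor ultimately refers to an inode
}

func (fd fileDesc) write(data []byte) { cache[fd.key] = data }
func (fd fileDesc) read() []byte      { return cache[fd.key] }

func main() {
    // Two independent opens of the same file share the inode key, so a
    // write through one is immediately visible through the other.
    a := fileDesc{key: inodeKey{fsID: 1, inode: 42}}
    b := fileDesc{key: inodeKey{fsID: 1, inode: 42}}
    a.write([]byte("hello"))
    fmt.Printf("%s\n", b.read())
}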

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

How we automate installing extra packages during Ubuntu installs

By: cks

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to more carefully track which package came from which Ubuntu collection (since only some of the collections are enabled during the server install process), it would be harder to update the lists, and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also, which doesn't talk about KDE fonts because I never figured that out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there don't seem to be any KDE-specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

Qt (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these; I think I'm unlikely to run into that since I don't want to use KDE's desktop components).

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Go's choice of multiple return values was the simpler option

By: cks

Yesterday I wrote about Go's use of multiple return values and Go types, in reaction to Mond's Were multiple return values Go's biggest mistake?. One of the things that I forgot to mention in that entry is that I think Go's choice to have multiple values for function returns and a few other things was the simpler and more conservative approach in its overall language design.

In a statically typed language that expects to routinely use multiple return values, as Go was designed to with the 'result, error' pattern, returning multiple values as a typed tuple means that tuple-based types are pervasive. This creates pressures on both the language design and the API of the standard library, especially if you start out (as Go did) being a fairly strongly nominally typed language, where different names for the same concrete type can't be casually interchanged. Or to put it another way, having a frequently used tuple container (meta-)type significantly interacts with and affects the rest of the language.

(For example, if Go had handled multiple values through tuples as explicit typed entities, it might have had to start out with something like type aliases (added only in Go 1.9) and it might have been pushed toward some degree of structural typing, because that probably makes it easier to interact with all of the return value tuples flying around.)

Having multiple values as a special case for function returns, range, and so on doesn't create anywhere near this additional influence and pressure on the rest of the language. There are a whole bunch of questions and issues you don't face because multiple values aren't types and can't be stored or manipulated as single entities. Of course you have to be careful in the language specification and it's not trivial, but it's simpler and more contained than going the tuple type route. I also feel it's the more conservative approach, since it doesn't affect the rest of the language as much as a widely used tuple container type would.

(As Mond criticizes, it does create special cases. But Go is a pragmatic language that's willing to live with special cases.)
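
As a small Go sketch of that special-case nature (the function here is made up for illustration), a multiple-value return never exists as a single typed value you could store or pass around:

// Sketch: multiple return values are a special form, not a first-class type.
// The function name and values are made up for illustration.
package main

import "fmt"

func lookup() (string, error) {
    return "value", nil
}

func main() {
    // You must destructure the result into separate variables; there is
    // no tuple value that could be stored or passed around as one entity.
    res, err := lookup()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(res)

    // Something like 'pair := lookup()' does not compile, because the two
    // results never exist together as a single typed value.
}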

Go's multiple return values and (Go) types

By: cks

Recently I read Were multiple return values Go's biggest mistake? (via), which wishes that Go had full blown tuple types (to put my spin on it). One of the things that struck me about Go's situation when I read the article is exactly the inverse of what the article is complaining about, which is that because Go allows multiple values for function return types (and in a few other places), it doesn't have to have tuple types.

One problem with tuple types in a statically typed language is that they must exist as types, whether declared explicitly or implicitly. In a language like Go, where type definitions create new distinct types even if the structure is the same, it isn't particularly difficult to wind up with an ergonomics problem. Suppose that you want to return a tuple that is a net.Conn and an error, a common pair of return values in the net package today. If that tuple is given a named type, everyone must use that type in various places; merely returning or storing an implicitly declared type that's structurally the same is not acceptable under Go's current type rules. Conversely, if that tuple is not given a type name in the net package, everyone is forced to stick to an anonymous tuple type. In addition, this up front choice is now an API; it's not API compatible to give your previously anonymous tuple type a name or vice versa, even if the types are structurally compatible.

(Since returning something and error is so common an idiom in Go, we're also looking at either a lot of anonymous types or a lot more named types. Consider how many different combinations of multiple return values you find in the net package alone.)
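
For comparison, the familiar shape of this today is just the 'result, error' form, with no tuple type named anywhere (a sketch, not the net package's exact API surface):

// Sketch of the familiar 'result, error' shape used by the net package;
// no tuple type has to be named or exported anywhere for this to work.
package main

import (
    "log"
    "net"
)

func main() {
    conn, err := net.Dial("tcp", "example.org:80")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    log.Println("connected to", conn.RemoteAddr())
}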

One advantage of multiple return values (and the other forms of tuple assignment, and for range clauses) is that they don't require actual formal types. Functions have a 'result type', which doesn't exist as an actual type, but you already need to handle the same sort of 'not an actual type' thing for their 'parameter type'. My guess is that this let Go's designers skip a certain amount of complexity in Go's type system, because they didn't have to define an actual tuple (meta-)type or alternately expand how structs worked to cover the tuple usage case.

(Looked at from the right angle, structs are tuples with named fields, although then you get into questions of how nested structs act in tuple-like contexts.)

A dynamically typed language like Python doesn't have this problem because there are no explicit types, so there's no need to have different types for different combinations of (return) values. There's simply a general tuple container type that can be any shape you want or need, and can be created and destructured on demand.

(I assume that some statically typed languages have worked out how to handle tuples as a data type within their type system. Rust has tuples, for example; I haven't looked into how they work in Rust's type system, for reasons.)

How ZFS knows and tracks the space usage of datasets

By: cks

Anyone who's ever had to spend much time with 'zfs list -t all -o space' knows the basics of ZFS space usage accounting, with space used by the datasets, data unique to a particular snapshot (the 'USED' value for a snapshot), data used by snapshots in total, and so on. But today I discovered that I didn't really know how it all worked under the hood, so I went digging in the source code. The answer is that ZFS tracks all of these types of space usage directly as numbers, and updates them as blocks are logically freed.

(Although all of these are accessed from user space as ZFS properties, they're not conventional dataset properties; instead, ZFS materializes the property version any time you ask, from fields in its internal data structures. Some of these fields are different and accessed differently for snapshots and regular datasets, for example what 'zfs list' presents as 'USED'.)

All changes to a ZFS dataset happen in ZFS transaction groups, which are assigned ever-increasing numbers, the 'transaction group number' (txg). This includes allocating blocks, which remember their 'birth txg', and making snapshots, which carry the txg they were made in and necessarily don't contain any blocks that were born after that txg. When ZFS wants to free a block in the live filesystem (either because you deleted the object or because you're writing new data and ZFS is doing its copy on write thing), it looks at the block's birth txg and the txg of the most recent snapshot; if the block is old enough that it has to be in that snapshot, then the block is not actually freed and the space for the block is transferred from 'USED' (by the filesystem) to 'USEDSNAP' (used only in snapshots). ZFS will then further check the block's txg against the txgs of snapshots to see if the block is unique to a particular snapshot, in which case its space will be added to that snapshot's 'USED'.
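
A highly simplified sketch of that decision, written as illustrative Go rather than anything resembling the actual OpenZFS code (all names are made up), might look like this:

// Highly simplified sketch of the space accounting decision when the live
// filesystem frees a block. Illustrative only; not OpenZFS code.
package main

import "fmt"

type block struct {
    birthTxg uint64
    size     uint64
}

type dataset struct {
    used     uint64 // space used by the live filesystem ('USED')
    usedSnap uint64 // space used only in snapshots ('USEDSNAP')
}

// freeBlock is called when the live filesystem logically frees a block.
// latestSnapTxg is the txg of the most recent snapshot (0 if there are none).
func (d *dataset) freeBlock(b block, latestSnapTxg uint64) {
    d.used -= b.size
    if b.birthTxg > latestSnapTxg {
        // Born after the last snapshot, so no snapshot can hold it:
        // the block is really freed and its space simply goes away.
        return
    }
    // The block must be in the most recent snapshot, so its space moves
    // from 'USED' to 'USEDSNAP' instead of being freed. (Working out
    // whether it's unique to one snapshot takes more txg comparisons.)
    d.usedSnap += b.size
}

func main() {
    d := &dataset{used: 3000}
    d.freeBlock(block{birthTxg: 100, size: 1000}, 150) // retained by a snapshot
    d.freeBlock(block{birthTxg: 200, size: 1000}, 150) // truly freed
    fmt.Printf("USED=%d USEDSNAP=%d\n", d.used, d.usedSnap)
}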

ZFS goes through a similar process when you delete a snapshot. As it runs around trying to free up the snapshot's space, it may discover that a block it's trying to free is now used only by one other snapshot, based on the relevant txgs. If so, the block's space is added to that snapshot's 'USED'. If the block is freed entirely, ZFS will decrease the 'USEDSNAP' number for the entire dataset. If the block is still used by several snapshots, no usage numbers need to be adjusted.

(Determining if a block is unique in the previous snapshot is fairly easy, since you can look at the birth txgs of the two previous snapshots. Determining if a block is now unique in the next snapshot (or for that matter is still in use in the dataset) is more complex and I don't understand the code involved; presumably it involves somehow looking at what blocks were freed and when. Interested parties can look into the OpenZFS code themselves, where there are some surprises.)

PS: One consequence of this is that there's no way after the fact to find out when space shifted from being used by the filesystem to used by snapshots (for example, when something large gets deleted in the filesystem and is now present only in snapshots). All you can do is capture the various numbers over time and then look at your historical data to see when they changed. The removal of snapshots is captured by ZFS pool history, but as far as I know this doesn't capture how the deletion affected the various space usage numbers.

I don't think error handling is a solved problem in language design

By: cks

There are certain things about programming language design that are more or less solved problems, where we generally know what the good and bad approaches are. For example, over time we've wound up agreeing on various common control structures like for and while loops, if statements, and multi-option switch/case/etc statements. The syntax may vary (sometimes very much, as for example in Lisp), but the approach is more or less the same because we've come up with good approaches.

I don't believe this is the case with handling errors. One way to see this is to look at the wide variety of approaches and patterns that languages today take to error handling. There is at least 'errors as exceptions' (for example, Python), 'errors as values' (Go and C), and 'errors instead of results and you have to check' combined with 'if errors happen, panic' (both Rust). Even in Rust there are multiple idioms for dealing with errors; some Rust code will explicitly check its Result types, while other Rust code sprinkles '?' around and accepts that if the program sails off the happy path, it simply dies.

If you were creating a new programming language from scratch, there's no clear agreed answer to what error handling approach you should pick, not the way we have more or less agreed on how for, while, and so on should work. You'd be left to evaluate trade offs in language design and language ergonomics and to make (and justify) your choices, and there probably would always be people who think you should have chosen differently. The same is true of changing or evolving existing languages, where there's no generally agreed on 'good error handling' to move toward.

(The obvious corollary of this is that there's no generally agreed on keywords or other syntax for error handling, the way 'for' and 'while' are widely accepted as keywords as well as concepts. The closest we've come is that some forms of error handling have generally accepted keywords, such as try/catch for exception handling.)

I like to think that this will change at some point in the future. Surely there actually is a good pattern for error handling out there and at some point we will find it (if it hasn't already been found) and then converge on it, as we've converged on programming language things before. But I feel it's clear that we're not there yet today.

OIDC claim scopes and their interactions with OIDC token authentication

By: cks

When I wrote about how SAML and OIDC differed in sharing information, where SAML shares every SAML 'attribute' by default and OIDC has 'scopes' for its 'claims', I said that the SAML approach was probably easier within an organization, where you already have trust in the clients. It turns out that there's an important exception to this I didn't realize at the time, and that's when programs (like mail clients) are using tokens to authenticate to servers (like IMAP servers).

In OIDC/OAuth2 (and probably in SAML as well), programs that obtain tokens can open them up and see all of the information that they contain, either inspecting them directly or using a public OIDC endpoint that allows them to 'introspect' the token for additional information (this is the same endpoint that will be used by your IMAP server or whatever). Unless you enjoy making a bespoke collection of (for example) IMAP clients, the information that programs need to obtain tokens is going to be more or less public within your organization and will probably (or even necessarily) leak outside of it.

(For example, you can readily discover all of the OIDC client IDs used by Thunderbird for the various large providers it supports. There's nothing stopping you from using those client IDs and client secrets yourself, although large providers may require your target to have specifically approved using Thunderbird with your target's accounts.)

This means that anyone who can persuade your people to authenticate through a program's usual flow can probably extract all of the information available in the token. They can do this either on the person's computer (capturing the token locally) or by persuading people that they need to 'authenticate to this service with IMAP OAuth2' or the like and then extracting the information from the token.

In the SAML world, this will by default be all of the information contained in the token. In the OIDC world, you can restrict the information made available through tokens issued through programs by restricting the scopes that you allow programs to ask for (and possibly different scopes for different programs, although this is a bit fragile; attackers may get to choose which program's client ID and so on they use).
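
As an illustration of how little stands in the way, here is a small Go sketch that decodes the claims payload of a JWT-format token without verifying anything, which is all a program (or an attacker holding a captured token) needs to do to read every claim in it; the token value here is a placeholder:

// Sketch: decoding the claims of a JWT-format token without verifying it.
// The token value is a placeholder; real tokens have three base64url
// sections separated by dots, and the middle one is the claims payload.
package main

import (
    "encoding/base64"
    "fmt"
    "strings"
)

func main() {
    token := "<header>.<payload>.<signature>" // placeholder, not a real token

    parts := strings.Split(token, ".")
    if len(parts) != 3 {
        fmt.Println("not a JWT-format token")
        return
    }
    payload, err := base64.RawURLEncoding.DecodeString(parts[1])
    if err != nil {
        fmt.Println("cannot decode payload:", err)
        return
    }
    // The payload is a JSON object containing every claim in the token.
    fmt.Printf("%s\n", payload)
}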

(Realizing this is going to change what scopes we allow in our OIDC IdP for program client registrations. So far I had reflexively been giving them access to everything, just like our internal websites; now I think I'm going to narrow it down to almost nothing.)

Sidebar: How your token-consuming server knows what created them

When your server verifies OAuth2/OIDC tokens presented to it, the minimum thing you want to know is that they come from the expected OIDC identity provider, which is normally achieved automatically because you'll ask that OIDC IdP to verify that the token is good. However, you may also want to know that the token was specifically issued for use with your server, or through a program that's expected to be used for your server. The normal way to do this is through the 'aud' OIDC claim, which has at least the client ID (and in theory your OIDC IdP could add additional entries). If your OIDC IdP can issue tokens through multiple identities (perhaps to multiple parties, such as the major IdPs of, for example, Google and Microsoft), you may also want to verify the 'iss' (issuer) field instead of or in addition to 'aud'.

Some notes on the OpenID Connect (OIDC) 'redirect uri'

By: cks

The normal authentication process for OIDC is web-based and involves a series of HTTP redirects, interspersed with web pages that you interact with. Something that wants to authenticate you will redirect you to the OIDC identity server's website, which will ask you for your login and password and maybe MFA authentication, check them, and then HTTP redirect you back to a 'callback' or 'redirect' URL that will transfer a magic code from the OIDC server to the OIDC client (generally as a URL query parameter). All of this happens in your browser, which means that the OIDC client and server don't need to be able to directly talk to each other, allowing you to use an external cloud/SaaS OIDC IdP to authenticate to a high-security internal website that isn't reachable from the outside world and maybe isn't allowed to make random outgoing HTTP connections.

(The magic code transferred in the final HTTP redirect is apparently often not the authentication token itself but instead something the client can use for a short time to obtain the real authentication token. This does require the client to be able to make an outgoing HTTP connection, which is usually okay.)

When the OIDC client initiates the HTTP redirection to the OIDC IdP server, one of the parameters it passes along is the 'redirect uri' it wants the OIDC server to use to pass the magic code back to it. A malicious client (or something that's gotten a client's ID and secret) could do some mischief by manipulating this redirect URL, so the standard specifically requires that OIDC IdP have a list of allowed redirect uris for each registered client. The standard also says that in theory, the client's provided redirect uri and the configured redirect uris are compared as literal string values. So, for example, 'https://example.org/callback' doesn't match 'https://example.org/callback/'.

This is straightforward when it comes to websites as OIDC clients, since they should have well defined callback urls that you can configure directly into your OIDC IdP when you set up each of them. It gets more hairy when what you're dealing with is programs as OIDC clients, where they are (for example) trying to get an OIDC token so they can authenticate to your IMAP server with OAuth2, since these programs don't normally have a website. Historically, there are several approaches that people have taken for programs (or seem to have, based on my reading so far).

Very early on in OAuth2's history, people apparently defined the special redirect uri value 'urn:ietf:wg:oauth:2.0:oob' (which is now hard to find or identify documentation on). An OAuth2 IdP that saw this redirect uri (and maybe had it allowed for the client) was supposed to not redirect you but instead show you a HTML page with the magic OIDC code displayed on it, so you could copy and paste the code into your local program. This value is now obsolete but it may still be accepted by some IdPs (you can find it listed for Google in mutt_oauth2.py, and I spotted an OIDC IdP server that handles it).

Another option is that the IdP can provide an actual website that does the same thing; if you get HTTP redirected to it with a valid code, it will show you the code on a HTML page and you can copy and paste it. Based on mutt_oauth2.py again, it appears that Microsoft may have at one point done this, using https://login.microsoftonline.com/common/oauth2/nativeclient as the page. You can do this too with your own IdP (or your own website in general), although it's not recommended for all sorts of reasons.

The final broad approach is to use 'localhost' as the target host for the redirect. There are several ways to make this work, and one of them runs into complications with the IdP's redirect uri handling.

The obvious general approach is for your program to run a little HTTP server that listens on some port on localhost, and capture the code when the (local) browser gets the HTTP redirect to localhost and visits the server. The problem here is that you can't necessarily listen on port 80, so your redirect uri needs to include the port you're listening on (eg 'http://localhost:7000'), and if your OIDC IdP is following the standard it must be configured not just with 'http://localhost' as the allowed redirect uri but the specific port you'll use. Also, because of string matching, if the OIDC IdP lists 'http://localhost:7000', you can't send 'http://localhost:7000/' despite them being the same URL.

(And your program has to use 'localhost', not '127.0.0.1' or the IPv6 loopback address; although the two have the same effect, they're obviously not string-identical.)
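
A minimal Go sketch of such a local listener might look like the following; the port (7000) and the complete lack of state or PKCE checking are simplifications for illustration:

// Minimal sketch of a local 'redirect uri' listener that captures the OIDC
// code. The port and the lack of any state/PKCE checking are simplifications.
package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    codeCh := make(chan string, 1)

    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // The IdP redirects the browser here with ?code=... attached.
        code := r.URL.Query().Get("code")
        fmt.Fprintln(w, "You can close this window now.")
        codeCh <- code
    })

    srv := &http.Server{Addr: "localhost:7000", Handler: mux}
    go func() {
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // The redirect uri registered with the IdP must be the exact string
    // 'http://localhost:7000' (or whatever value you actually listen on).
    code := <-codeCh
    log.Println("got OIDC code of length", len(code))
    // ... exchange the code for tokens at the IdP's token endpoint ...
}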

Based on experimental evidence from OIDC/OAuth2 client configurations, I strongly suspect that some large IdP providers have non-standard, relaxed handling of 'localhost' redirect uris such that their client configuration lists 'http://localhost' and the IdP will accept some random port glued on in the actual redirect uri (or maybe this behavior has been standardized now). I suspect that the IdPs may also accept the trailing slash case. Honestly, it's hard to see how you get out of this if you want to handle real client programs out in the wild.

(Some OIDC IdP software definitely does the standard compliant string comparison. The one I know of for sure is SimpleSAMLphp's OIDC module. Meanwhile, based on reading the source code, Dex uses a relaxed matching for localhost in its matching function, provided that there are no redirect uris registered for the client. Dex also still accepts the urn:ietf:wg:oauth:2.0:oob redirect uri, so I suspect that there are still uses out there in the field.)

If the program has its own embedded web browser that it's in full control of, it can do what Thunderbird appears to do (based on reading its source code). As far as I can tell, Thunderbird doesn't run a local listening server; instead it intercepts the HTTP redirection to 'http://localhost' itself. When the IdP sends the final HTTP redirect to localhost with the code embedded in the URL, Thunderbird effectively just grabs the code from the redirect URL in the HTTP reply and never actually issues a HTTP request to the redirect target.

The final option is to not run a localhost HTTP server and to tell people running your program that when their browser gives them an 'unable to connect' error at the end of the OIDC authentication process, they need to go to the URL bar and copy the 'code' query parameter into the program (or if you're being friendly, let them copy and paste the entire URL and you extract the code parameter). This allows your program to use a fixed redirect uri, including just 'http://localhost', because it doesn't have to be able to listen on it or on any fixed port.

(This is effectively a more secure but less user friendly version of the old 'copy a code that the website displayed' OAuth2 approach, and that approach wasn't all that user friendly to start with.)

PS: An OIDC redirect uri apparently allows things other than http:// and https:// URLs; there is, for example, the 'openid-credential-offer' scheme. I believe that the OIDC IdP doesn't particularly do anything with those redirect uris other than accept them and issue a HTTP redirect to them with the appropriate code attached. It's up to your local program or system to intercept HTTP requests for those schemes and react appropriately, much like Thunderbird does, but perhaps easier because you can probably register the program as handling all 'whatever-special://' URLs so the redirect is automatically handed off to it.

(I suspect that there are more complexities in the whole OIDC and OAuth2 redirect uri area, since I'm new to the whole thing.)

Some notes on configuring Dovecot to authenticate via OIDC/OAuth2

By: cks

Suppose, not hypothetically, that you have a relatively modern Dovecot server and a shiny new OIDC identity provider server ('OP' in OIDC jargon, 'IdP' in common usage), and you would like to get Dovecot to authenticate people's logins via OIDC. Ignoring certain practical problems, the way this is done is for your mail clients to obtain an OIDC token from your IdP, provide it to Dovecot via SASL OAUTHBEARER, and then for Dovecot to do the critical step of actually validating that the token it received is good, still active, and contains all the information you need. Dovecot supports this through OAuth v2.0 authentication as a passdb (password database), but in the usual Dovecot fashion, the documentation on how to configure the parameters for validating tokens with your IdP is a little bit lacking in explanations. So here are some notes.

If you have a modern OIDC IdP, it will support OpenID Connect Discovery, including the provider configuration request on the path /.well-known/openid-configuration. Once you know this, if you're not that familiar with OIDC things you can request this URL from your OIDC IdP, feed the result through 'jq .', and then use it to pick out the specific IdP URLs you want to set up in things like the Dovecot file with all of the OAuth2 settings you need. If you do this, the only URL you want for Dovecot is the userinfo_endpoint URL. You will put this into Dovecot's introspection_url, and you'll leave introspection_mode set to the default of 'auth'.
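If you'd rather poke at the discovery document programmatically instead of with jq, here's a minimal sketch in Go of fetching it and picking out the endpoint you care about (the IdP hostname is a placeholder):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Fetch the IdP's OIDC discovery document; the hostname is made up.
    resp, err := http.Get("https://idp.example.org/.well-known/openid-configuration")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // We only care about a couple of the standard discovery fields here.
    var cfg struct {
        Issuer           string `json:"issuer"`
        UserinfoEndpoint string `json:"userinfo_endpoint"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
        panic(err)
    }
    fmt.Println("issuer:", cfg.Issuer)
    fmt.Println("userinfo_endpoint (what goes into Dovecot's introspection_url):", cfg.UserinfoEndpoint)
}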

You don't want to set tokeninfo_url to anything. This setting is (or was) used for validating tokens with OAuth2 servers before the introduction of RFC 7662. Back then, the de facto standard approach was to make a HTTP GET request to some URL with the token pasted on the end (cf), and it's this URL that is being specified. This approach was replaced with RFC 7662 token introspection, and then replaced again with OpenID Connect UserInfo. If both tokeninfo_url and introspection_url are set, as in Dovecot's example for Google, the former takes priority.

(Since I've just peered deep into the Dovecot source code, it appears that setting 'introspection_mode = post' actually performs an (unauthenticated) token introspection request. The 'get' mode seems to be the same as setting tokeninfo_url. I think that if you set the 'post' mode, you also want to set active_attribute and perhaps active_value, but I don't know what to set them to, because otherwise you aren't necessarily fully validating that the token is still active. Does my head hurt? Yes. The moral here is that you should use an OIDC IdP that supports OpenID Connect UserInfo.)

If your IdP serves different groups and provides different 'issuer' ('iss') values to them, you may want to set the Dovecot 'issuers =' to the specific issuer that applies to you. You'll also want to set 'username_attribute' to whatever OIDC claim is where your IdP puts what you consider the Dovecot username, which might be the email address or something else.

It would be nice if Dovecot could discover all of this for itself when you set openid_configuration_url, but in the current Dovecot, all this does is put that URL in the JSON of the error response that's sent to IMAP clients when they fail OAUTHBEARER authentication. IMAP clients may or may not do anything useful with it.

As far as I can tell from the Dovecot source code, setting 'scope =' primarily requires that the token contains those scopes. I believe that this is almost entirely a guard against the IMAP client requesting a token without OIDC scopes that contain claims you need elsewhere in Dovecot. However, this only verifies OIDC scopes; it doesn't verify the presence of specific OIDC claims.

So what you want to do is check your OIDC IdP's /.well-known/openid-configuration URL to find out its collection of endpoints, then set:

# Modern OIDC IdP/OP settings
introspection_url = <userinfo_endpoint>
username_attribute = <some claim, eg 'email'>

# not sure but seems common in Dovecot configs?
pass_attrs = pass=%{oauth2:access_token}

# optionally:
openid_configuration_url = <stick in the URL>

# you may need:
tls_ca_cert_file = /etc/ssl/certs/ca-certificates.crt

The OIDC scopes that IMAP clients should request when getting tokens should include a scope that gives the username_attribute claim, which is 'email' if the claim is 'email', and also apparently the requested scopes should include the offline_access scope.

If you want a test client to see if you've set up Dovecot correctly, one option is to appropriately modify a contributed Python program for Mutt (also the README), which has the useful property that it has an option to check all of IMAP, POP3, and authenticated SMTP once you've obtained a token. If you're just using it for testing purposes, you can change the 'gpg' stuff to 'cat' to just store the token with no fuss (and no security). Another option, which can be used for real IMAP clients too if you really want to, is an IMAP/etc OAuth2 proxy.

(If you want to use Mutt with OAuth2 with your IMAP server, see this article on it also, also, also. These days I would try quite hard to use age instead of GPG.)

Doing multi-tag matching through URLs on the modern web

By: cks

So what happened is that Mike Hoye had a question about a perfectly reasonable idea:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just as valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking if the tags existed or anything, just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.

(And if search engines still matter, the search engines will see only your canonical version.)
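A minimal sketch of this canonicalization in Go, with a made up '/wiki/' URL layout, might look like:

package main

import (
    "net/http"
    "sort"
    "strings"
)

func tagHandler(w http.ResponseWriter, r *http.Request) {
    // For a URL like /wiki/foo/bar, the tags are "foo" and "bar".
    raw := strings.Trim(strings.TrimPrefix(r.URL.Path, "/wiki/"), "/")
    if raw == "" {
        http.NotFound(w, r)
        return
    }
    tags := strings.Split(raw, "/")
    canon := append([]string(nil), tags...)
    sort.Strings(canon)
    // Redirect any non-canonical ordering to the canonical (sorted) one.
    if strings.Join(tags, "/") != strings.Join(canon, "/") {
        http.Redirect(w, r, "/wiki/"+strings.Join(canon, "/"), http.StatusMovedPermanently)
        return
    }
    // ... here you'd actually look up and list the pages with all of these tags.
    w.Write([]byte("articles tagged: " + strings.Join(canon, ", ") + "\n"))
}

func main() {
    http.HandleFunc("/wiki/", tagHandler)
    http.ListenAndServe(":8080", nil)
}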

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

The commodification of desktop GUI behavior

By: cks

Over on the Fediverse, I tried out a thesis:

Thesis: most desktop GUIs are not opinionated about how you interact with things, and this is why there are so many GUI toolkits and they make so little difference to programs, and also why the browser is a perfectly good cross-platform GUI (and why cross-platform GUIs in general).

Some GUIs are quite opinionated (eg Plan 9's Acme) but most are basically the same. Which isn't necessarily a bad thing but it creates a sameness.

(Custom GUIs are good for frequent users, bad for occasional ones.)

Desktop GUIs differ in how they look and to some extent in how you do certain things and how you expect 'native' programs to behave; I'm sure the fans of any particular platform can tell you all about little behaviors that they expect from native applications that imported ones lack. But I think we've pretty much converged on a set of fundamental behaviors for how to interact with GUI programs, or at least how to deal with basic ones, so in a lot of cases the question about GUIs is how things look, not how you do things at all.

(Complex programs have for some time been coming up with their own bespoke alternatives to, for example, huge cascades of menus. If these are successful they tend to get more broadly adopted by programs facing the same problems; consider the 'ribbon', which got what could be called a somewhat mixed reaction on its modern introduction.)

On the desktop, changing the GUI toolkit that a program uses (either on the same platform or on a different one) may require changing the structure of your code (in addition to ordinary code changes), but it probably won't change how your program operates. Things will look a bit different, maybe some standard platform features will appear or disappear, but it's not a completely different experience. This often includes moving your application from the desktop into the browser (a popular and useful 'cross-platform' environment in itself).

This is less true on mobile platforms, where my sense is that the two dominant platforms have evolved somewhat different idioms for how you interact with applications. A proper 'native' application behaves differently on the two platforms even if it's using mostly the same code base.

GUIs such as Plan 9's Acme show that this doesn't have to be the case; for that matter, so does GNU Emacs. GNU Emacs has a vague shell of a standard looking GUI but it's a thin layer over a much different and stranger vastness, and I believe that experienced Emacs people do very little interaction with it.

Some views on the common Apache modules for SAML or OIDC authentication

By: cks

Suppose that you want to restrict access to parts of your Apache based website but you want something more sophisticated and modern than Apache Basic HTTP authentication. The traditional reason for this was to support 'single sign on' across all your (internal) websites; the modern reason is that a central authentication server is the easiest place to add full multi-factor authentication. The two dominant protocols for this are SAML and OIDC. There are commonly available Apache authentication modules for both protocols, in the form of Mellon (also) for SAML and OpenIDC for OIDC.

I've now used or at least tested the Ubuntu 24.04 version of both modules against the same SAML/OIDC identity provider, primarily because when you're setting up a SAML/OIDC IdP you need to be able to test it with something. Both modules work fine, but after my experiences I'm more likely to use OpenIDC than Mellon in most situations.

Mellon has two drawbacks and two potential advantages. The first drawback is that setting up a Mellon client ('SP') is more involved. Most of the annoying stuff is automated for you with the mellon_create_metadata script (which you can get from the Mellon repository if it's not in your Mellon package), but you still have to give your IdP your XML blob and get their XML blob. The other drawback is that Mellon isn't integrated into the Apache 'Require' framework for authorization decisions; instead you have to make do with Mellon-specific directives.

The first potential advantage is that Mellon has a straightforward story for protecting two different areas of your website with two different IdPs, if you need to do that for some reason; you can just configure them in separate <Location> or <Directory> blocks and everything works out. If anything, it's a bit non-obvious how to protect various disconnected bits of your URL space with the same IdP without having to configure multiple SPs, one for each protected section of URL space. The second potential advantage is that in general SAML has an easier story for your IdP giving you random information, and Mellon will happily export every SAML attribute it gets into the environment your CGI or web application gets.

The first advantage of OpenIDC is that it's straightforward to configure when you have a single IdP, with no XML and generally low complexity. It's also straightforward to protect multiple disconnected URL areas with the same IdP but possibly different access restrictions. A third advantage is that OpenIDC is integrated into Apache's 'Require' system, although you have to use OpenIDC specific syntax like 'Require claim groups:agroup' (see the OpenIDC wiki on authorization).

In exchange for this, it seems to be quite involved to use OpenIDC if you need to use multiple OIDC identity providers to protect different bits of your website. It's apparently possible to do this in the same virtual host but it seems quite complex and requires a lot of parts, so if I was confronted with this problem I would try very hard to confine each web thing that needed a different IdP into a different virtual host. And OpenIDC has the general OIDC problem that it's harder to expose random information.

(All of the important OpenIDC Apache directives about picking an IdP can't be put in <Location> or <Directory> blocks, only in a virtual host as a whole. If you care about this, see the wiki on Multiple Providers and also access to different URL paths on a per-provider basis.)

We're very likely to only ever be working with a single IdP, so for us OpenIDC is likely to be easier, although not hugely so.

Sidebar: The easy approach for group based access control with either

Both Mellon and OpenIDC work fine together with the traditional Apache AuthGroupFile directive, provided (of course) that you have or build an Apache format group file using what you've told Mellon or OpenIDC to use as the 'user' for Apache authentication. If your IdP is using the same user (and group) information as your regular system is, then you may well already have this information around.

(This is especially likely if you're migrating from Apache Basic HTTP authentication, where you already needed to build this sort of stuff.)

Building your own Apache group file has the additional benefit that you can augment and manipulate group information in ways that might not fit well into your IdP. Your IdP has the drawback that it has to be general; your generated Apache group file can be narrowly specific for the needs of a particular web area.

The web browser as an enabler of minority platforms

By: cks

Recently, I got involved in a discussion on the Fediverse over what I will simplify to the desirability (or lack of it) of cross platform toolkits, including the browser, and how they erase platform personality and opinions. This caused me to have a realization about what web browser based applications are doing for me, which is that being browser based is what lets me use them at all.

My environment is pretty far from being a significant platform; I think Unix desktop share is in the low single digit percentages under the best of circumstances. If people had to develop platform specific versions of things like Grafana (which is a great application), they'd probably exist for Windows, maybe macOS, and at the outside, tablets (some applications would definitely exist on phones, but Grafana is a bit of a stretch). They probably wouldn't exist on Linux, especially not for free.

That the web browser is a cross platform environment means that I get these applications (including the Fediverse itself) essentially 'for free' (which is to say, it's because of the efforts of web browsers to support my platform and then give me their work for free). Developers of web applications don't have to do anything to make them work for me, not even so far as making it possible to build their software on Linux; it just happens for them without them even having to think about it.

Although I don't work in the browser as much as some people do, looking back the existence of implicitly cross platform web applications has been a reasonably important thing in letting me stick with Linux.

This applies to any minority platform, not just Linux. All you need is a sufficiently capable browser and you have access to a huge range of (web) applications.

(Getting that sufficiently capable browser can be a challenge on a sufficiently minority platform, especially if you're not on a major architecture. I'm lucky in that x86 Linux is a majority minority platform; people on FreeBSD or people on architectures other than x86 and 64-bit ARM may be less happy with the situation.)

PS: I don't know if what we have used the web for really counts as 'applications', since they're mostly HTML form based things once you peel a few covers off. But if they do count, the web has been critical in letting us provide them to people. We definitely couldn't have built local application versions of them for all of the platforms that people here use.

(I'm sure this isn't a novel thought, but the realization struck (or re-struck) me recently so I'm writing it down.)

How I got my nose rubbed in my screens having 'bad' areas for me

By: cks

I wrote a while back about how my desktop screens now had areas that were 'good' and 'bad' for me, and mentioned that I had recently noticed this, calling it a story for another time. That time is now. What made me really notice this issue with my screens and where I had put some things on them was our central mail server (temporarily) stopping handling email because its load was absurdly high.

In theory I should have noticed this issue before a co-worker rebooted the mail server, because for a long time I've had an xload window from the mail server (among other machines, I have four xloads). Partly I did this so I could keep an eye on these machines and partly it's to help keep alive the shared SSH connection I also use for keeping an xrun on the mail server.

(In the past I had problems with my xrun SSH connections seeming to spontaneously close if they just sat there idle because, for example, my screen was locked. Keeping an xload running seemed to work around that; I assumed it was because xload keeps updating things even with the screen locked and so forced a certain amount of X-level traffic over the shared SSH connection.)

When the mail server's load went through the roof, I should have noticed that the xload for it had turned solid green (which is how xload looks under high load). However, I had placed the mail server's xload way off on the right side of my office dual screens, which put it outside my normal field of attention. As a result, I never noticed the solid green xload that would have warned me of the problem.

(This isn't where the xload was back on my 2011 era desktop, but at some point since then I moved it and some other xloads over to the right.)

In the aftermath of the incident, I relocated all of those xloads to a more central location, and also made my new Prometheus alert status monitor appear more or less centrally, where I'll definitely notice it.

(Some day I may do a major rethink about my entire screen layout, but most of the time that feels like yak shaving that I'd rather not touch until I have to, for example because I've been forced to switch to Wayland and an entirely different window manager.)

Sidebar: Why xload turns green under high load

Xload draws a horizontal tick line for every integer load average step it needs in order to display the maximum load average in its moving histogram. If the highest load average is 1.5, there will be one tick; if the highest load average is 10.2, there will be ten. Ticks are normally drawn in green. This means that as the load average climbs, xload draws more and more ticks, and after a certain point the entire xload display is just solid green from all of the tick lines.

This has the drawback that you don't know the shape of the load average (all you know is that at some point it got quite high), but the advantage that it's quite visually distinctive and you know you have a problem.

How SAML and OIDC differ in sharing information, and perhaps why

By: cks

In practice, SAML and OIDC are two ways of doing third party web-based authentication (and thus a Single Sign On (SSO) system); the web site you want to use sends you off to a SAML or OIDC server to authenticate, and then the server sends authentication information back to the 'client' web site. Both protocols send additional information about you along with the bare fact of an authentication, but they differ in how they do this.

In SAML, the SAML server sends a collection of 'attributes' back to the SAML client. There are some standard SAML attributes that client websites will expect, but the server is free to throw in any other attributes it feels like, and I believe that servers do things like turn every LDAP attribute they get from a LDAP user lookup into a SAML attribute (certainly SimpleSAMLphp does this). As far as I know, any filtering of what SAML attributes are provided by the server to any particular client is a server side feature, and SAML clients don't necessarily have any way of telling the SAML server what attributes they want or don't want.

In OIDC, the equivalent way of returning information is 'claims', which are grouped into 'scopes', along with basic claims that you get without asking for a scope. The expectation in OIDC is that clients that want more than the basic claims will request specific scopes and then get back (only) the claims for those scopes. There are standard scopes with standard claims (not all of which are necessarily returned by any given OIDC server). If you want to add additional information in the form of more claims, I believe that it's generally expected that you'll create one or more custom scopes for those claims and then have your OIDC clients request them (although not all OIDC clients are willing and able to handle custom scopes).

(I think in theory an OIDC server may be free to shove whatever claims it wants to into information for clients regardless of what scopes the client requested, but an OIDC client may ignore any information it didn't request and doesn't understand rather than pass it through to other software.)

The SAML approach is more convenient for server and client administrators who are working within the same organization. The server administrator can add whatever information to SAML responses that's useful and convenient, and SAML clients will generally automatically pick it up and often make it available to other software. The OIDC approach is less convenient, since you need to create one or more additional scopes on the server and define what claims go in them, and then get your OIDC clients to request the new scopes; if an OIDC client doesn't update, it doesn't get the new information. However, the OIDC approach makes it easier for both clients and servers to be more selective and thus potentially for people to control how much information they give to who. An OIDC client can ask for only minimal information by only asking for a basic scope (such as 'email') and then the OIDC server can tell the person exactly what information they're approving being passed to the client, without the OIDC server administrators having to get involved to add client-specific attribute filtering.

(In practice, OIDC probably also encourages giving less information to even trusted clients in general since you have to go through these extra steps, so you're less likely to do things like expose all LDAP information as OIDC claims in some new 'our-ldap' scope or the like.)

My guess is that OIDC was deliberately designed this way partly in order to make it better for use with third party clients. Within an organization, SAML's broad sharing of information may make sense, but it makes much less sense in a cross-organization context, where you may be using OIDC-based 'sign in with <large provider>' on some unrelated website. In that sort of case, you certainly don't want that website to get every scrap of information that the large provider has on you, but instead only ask for (and get) what it needs, and for it to not get much by default.

The OpenID Connect (OIDC) 'sub' claim is surprisingly load-bearing

By: cks

OIDC (OpenID Connect) is today's better or best regarded standard for (web-based) authentication. When a website (or something) authenticates you through an OpenID (identity) Provider (OP), one of the things it gets back is a bunch of 'claims', which is to say information about the authenticated person. One of the core claims is 'sub', which is vaguely described as a string that is 'subject - identifier for the end-user at the issuer'. As I discovered today, this claim is what I could call 'load bearing' in a surprising way or two.

In theory, 'sub' has no meaning beyond identifying the user in some opaque way. The first way it's load bearing is that some OIDC client software (a 'Relying Party (RP)') will assume that the 'sub' claim has a human useful meaning. For example, the Apache OpenIDC module defaults to putting the 'sub' claim into Apache's REMOTE_USER environment variable. This is fine if your OIDC IdP software puts, say, a login name into it; it is less fine if your OIDC IdP software wants to create 'sub' claims that look like 'YXVzZXIxMi5zb21laWRw'. These claims mean something to your server software but not necessarily to you and the software you want to use on (or behind) OIDC RPs.

The second and more surprising way that the 'sub' claim is load bearing involves how external consumers of your OIDC IdP keep track of your people. In common situations your people will be identified and authorized by their email address (using some additional protocols), which they enter into the outside OIDC RP that's authenticating against your OIDC IdP, and this looks like the identifier that RP uses to keep track of them. However, at least one such OIDC RP assumes that the 'sub' claim for a given email address will never change, and I suspect that there are more people who either quietly use the 'sub' claim as the master key for accounts or who require 'sub' and the email address to be locked together this way.

This second issue makes the details of how your OIDC IdP software generates its 'sub' claim values quite important. You want it to be able to generate those 'sub' values in a clear and documented way that other OIDC IdP software can readily duplicate to create the same 'sub' values, and that won't change if you change some aspect of the OIDC IdP configuration for your current software. Otherwise you're at least stuck with your current OIDC IdP software, and perhaps with its exact current configuration (for authentication sources, internal names of things, and so on).

(If you have to change 'sub' values, for example because you have to migrate to different OIDC IdP software, this could go as far as the outside OIDC RP basically deleting all of their local account data for your people and requiring all of it to be entered back from scratch. But hopefully those outside parties have a better procedure than this.)

The problem facing MFA-enabled IMAP at the moment (in early 2025)

By: cks

Suppose that you have an IMAP server and you would like to add MFA (Multi-Factor Authentication) protection to it. I believe that in theory the IMAP protocol supports multi-step 'challenge and response' style authentication, so again in theory you could implement MFA this way, but in practice this is unworkable because people would be constantly facing challenges. Modern IMAP clients (and servers) expect to be able to open and close connections more or less on demand, rather than opening one connection, holding it open, and doing everything over it. To make IMAP MFA practical, you need to do it with some kind of 'Single Sign On' (SSO) system. The current approach for this uses an OIDC identity provider for the SSO part and SASL OAUTHBEARER authentication between the IMAP client and the IMAP server, using information from the OIDC IdP.

So in theory, your IMAP client talks to your OIDC IdP to get a magic bearer token, provides this token to the IMAP server, the IMAP server verifies that it comes from a configured and trusted IdP, and everything is good. You only have to go through authenticating to your OIDC IdP SSO system every so often (based on whatever timeout it's configured with); the rest of the time the aggregate system does any necessary token refreshes behind the scenes. And because OIDC has a discovery process that can more or less start from your email address (as I found out), it looks like IMAP clients like Thunderbird could let you more or less automatically use any OIDC IdP if people had set up the right web server information.

If you actually try this right now, you'll find that Thunderbird, apparently along with basically all significant IMAP client programs, will only let you use a few large identity providers; here is Thunderbird's list (via). If you read through that Thunderbird source file, you'll find one reason for this limitation, which is that each provider has one or two magic values (the 'client ID' and usually the 'client secret', which is obviously not so secret here), in addition to URLs that Thunderbird could theoretically autodiscover if everyone supported the current OIDC autodiscovery protocols (my understanding is that not everyone does). In most current OIDC identity provider software, these magic values are either given to the IdP software or generated by it when you set up a given OIDC client program (a 'Relying Party (RP)' in the OIDC jargon).

This means that in order for Thunderbird (or any other IMAP client) to work with your own local OIDC IdP, there would have to be some process where people could load this information into Thunderbird. Alternately, Thunderbird could publish default values for these and anyone who wanted their OIDC IdP to work with Thunderbird would have to add these values to it. To date, creators of IMAP client software have mostly not supported either option and instead hard code a list of big providers who they've arranged more or less explicit OIDC support with.

(Honestly it's not hard to see why IMAP client authors have chosen this approach. Unless you're targeting a very technically inclined audience, walking people through the process of either setting this up in the IMAP client or verifying if a given OIDC IdP supports the client is daunting. I believe some IMAP clients can be configured for OIDC IdPs through 'enterprise policy' systems, but there the people provisioning the policies are supposed to be fairly technical.)

PS: Potential additional references on this mess include David North's article and this FOSDEM 2024 presentation (which I haven't yet watched, I only just stumbled into this mess).

A Prometheus gotcha with alerts based on counting things

By: cks

Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:

count(
  node_filesystem_size_bytes{
         host = "backupserv",
         mountpoint =~ "/dumps/tapes/slot.*" }
) != <some number>

This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.

Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:

count( ... ) by (host)
  or vector(0, "host", "bserv1")
  or vector(0, "host", "bserv2")
  ....

(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
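For the record, the single-host vector() version mentioned above would look something like this (I haven't battle-tested this exact rule, but the idea is that 'or vector(0)' supplies a 0 when the count() result is empty):

(
  count(
    node_filesystem_size_bytes{
           host = "backupserv",
           mountpoint =~ "/dumps/tapes/slot.*" }
  ) or vector(0)
) != <some number>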

In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:

count(
  node_filesystem_size_bytes{
         host =~ "bserv1|bserv2|...",
         fstype =~ "ext.*" }
) by (host) != <other number>

In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.

(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)

PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.

What SimpleSAMLphp's core:AttributeAlter does with creating new attributes

By: cks

SimpleSAMLphp is a SAML identity provider (and other stuff). It's of deep interest to us because it's about the only SAML or OIDC IdP I can find that will authenticate users and passwords against LDAP and has a plugin that will do additional full MFA authentication against the university's chosen MFA provider (although you need to use a feature branch). In the process of doing this MFA authentication, we need to extract the university identifier to use for MFA authentication from our local LDAP data. Conveniently, SimpleSAMLphp has a module called core:AttributeAlter (a part of authentication processing filters) that is intended to do this sort of thing. You can give it a source, a pattern, a replacement that includes regular expression group matches, and a target attribute. In the syntax of its examples, this looks like the following:

 // the 65 is where this is ordered
 65 => [
    'class' => 'core:AttributeAlter',
    'subject' => 'gecos',
    'pattern' => '/^[^,]*,[^,]*,[^,]*,[^,]*,([^,]+)(?:,.*)?$/',
    'target' => 'mfaid',
    'replacement' => '\\1',
 ],

If you're an innocent person, you expect that your new 'mfaid' attribute will be undefined (or untouched) if the pattern does not match because the required GECOS field isn't set. This is not in fact what happens, and interested parties can follow along the rest of this in the source.

(All of this is as of SimpleSAMLphp version 2.3.6, the current release as I write this.)

The short version of what happens is that when the target is a different attribute and the pattern doesn't match, the target will wind up set but empty. Any previous value is lost. How this happens (and what happens) starts with that 'attributes' here are actually arrays of values under the covers (this is '$attributes'). When core:AttributeAlter has a different target attribute than the source attribute, it takes all of the source attribute's values, passes each of them through a regular expression search and replace (using your replacement), and then gathers up anything that changed and sets the target attribute to this gathered collection. If the pattern doesn't match any values of the attribute (in the normal case, a single value), the array of changed things is empty and your target attribute is set to an empty PHP array.

(This is implemented with an array_diff() between the results of preg_replace() and the original attribute value array.)

My personal view is that this is somewhere around a bug; if the pattern doesn't match, I expect nothing to happen. However, the existing documentation is ambiguous (and incomplete, as the use of capture groups isn't particularly documented), so it might not be considered a bug by SimpleSAMLphp. Even if it is considered a bug I suspect it's not going to be particularly urgent to fix, since this particular case is unusual (or people would have found it already).

For my situation, perhaps what I want to do is to write some PHP code to do this extraction operation by hand, through core:PHP. It would be straightforward to extract the necessary GECOS field (or otherwise obtain the ID we need) in PHP, without fooling around with weird pattern matching and module behavior.

(Since I just looked it up, I believe that in the PHP code that core:PHP runs for you, you can use a PHP 'return' to stop without errors but without changing anything. This is relevant in my case since not all GECOS entries have the necessary information.)

If you get the chance, always run more extra network fiber cabling

By: cks

Some day, you may be in an organization that's about to add some more fiber cabling between two rooms in the same building, or maybe two close by buildings, and someone may ask you for your opinion about how many fiber pairs should be run. My personal advice is simple: run more fiber than you think you need, ideally a bunch more (this generalizes to network cabling in general, but copper cabling is a lot more bulky and so harder to run (much) more of). There is such a thing as an unreasonable amount of fiber to run, but mostly it comes up when you'd have to put in giant fiber patch panels.

The obvious reason to run more fiber is that you may well expand your need for fiber in the future. Someone will want to run a dedicated, private network connection between two locations; someone will want to trunk things to get more bandwidth; someone will want to run a weird protocol that requires its own network segment (did you know you can run HDMI over Ethernet?); and so on. It's relatively inexpensive to add some more fiber pairs when you're already running fiber but much more expensive to have to run additional fiber later, so you might as well give yourself room for growth.

The less obvious reason to run extra fiber is that every so often fiber pairs stop working, just like network cables go bad, and when this happens you'll need to replace them with spare fiber pairs, which means you need those spare fiber pairs. Some of the time this fiber failure is (probably) because a raccoon got into your machine room, but some of the time it just happens for reasons that no one is likely to ever explain to you. And when this happens, you don't necessarily lose only a single pair. Today, for example, we lost three fiber pairs that ran between two adjacent buildings and evidence suggests that other people at the university lost at least one more pair.

(There are a variety of possible causes for sudden loss of multiple pairs, probably all running through a common path, which I will leave to your imagination. These fiber runs are probably not important enough to cause anyone to do a detailed investigation of where the fault is and what happened.)

Fiber comes in two varieties, single mode and multi-mode. I don't know enough to know if you should make a point of running both (over distances where either can be used) as part of the whole 'run more fiber' thing. Locally we have both SM and MM fiber and have switched back and forth between them at times (and may have to do so as a result of the current failures).

PS: Possibly you work in an organization where broken inside-building fiber runs are regularly fixed or replaced. That is not our local experience; someone has to pay for fixing or replacing, and when you have spare fiber pairs left it's easier to switch over to them rather than try to come up with the money and so on.

(Repairing or replacing broken fiber pairs will reduce your long term need for additional fiber, but obviously not the short term need. If you lose N pairs of fiber, you need N spare pairs to get back into operation.)

Updating local commits with more changes in Git (the harder way)

By: cks

One of the things I do with Git is maintain personal changes locally on top of the upstream version, with my changes updated via rebasing every time I pull upstream to update it. In the simple case, I have only a single local change and commit, but in more complex cases I split my changes into multiple local commits; my local version of Firefox currently carries 12 separate personal commits. Every so often, upstream changes something that causes one of those personal changes to need an update, without actually breaking the rebase of that change. When this happens I need to update my local commit with more changes, and often it's not the 'top' local commit (which can be updated simply).

In theory, the third party tool git-absorb should be ideal for this, and I believe I've used it successfully for this purpose in the past. In my most recent instance, though, git-absorb frustratingly refused to do anything in a situation where it felt it should work fine. I had an additional change to a file that was changed in exactly one of my local commits, which feels like an easy case.

(Reading the git-absorb readme carefully suggests that I may be running into a situation where my new change doesn't clash with any existing change. This makes git-absorb more limited than I'd like, but so it goes.)

In Git, what I want is called a 'fixup commit', and how to use it is covered in this Stackoverflow answer. The sequence of commands is basically:

# modify some/file with new changes, then
git add some/file

# Use this to find your existing commit ID
git log some/file

# with the existing commit ID
git commit --fixup=<commit ID>
git rebase --interactive --autosquash <commit ID>^

This will open an editor buffer with what 'git rebase' is about to do, which I can immediately exit out of because the defaults are exactly what I want (assuming I don't want to shuffle around the order of my local commits, which I probably don't, especially as part of a fixup).

I can probably also use 'origin/main' instead of '<commit ID>^', but that will rebase more things than is strictly necessary. And I need the commit ID for the 'git commit --fixup' invocation anyway.

(Sufficiently experienced Git people can probably put together a script that would do this automatically. It would get all of the files staged in the index, find the most recent commit that modified each of them, abort if they're not all the same commit, make a fixup commit to that most recent commit, and then potentially run the 'git rebase' for you.)
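A rough, untested sketch of that script idea, written here in Go and shelling out to git (treat the details as illustrative only):

package main

import (
    "fmt"
    "os"
    "os/exec"
    "strings"
)

// git runs a git command and returns its trimmed output, aborting on error.
func git(args ...string) string {
    out, err := exec.Command("git", args...).Output()
    if err != nil {
        fmt.Fprintln(os.Stderr, "git", strings.Join(args, " "), "failed:", err)
        os.Exit(1)
    }
    return strings.TrimSpace(string(out))
}

func main() {
    // Find the single most recent commit that touched every staged file.
    staged := strings.Split(git("diff", "--cached", "--name-only"), "\n")
    target := ""
    for _, f := range staged {
        if f == "" {
            continue
        }
        c := git("log", "-n", "1", "--format=%H", "--", f)
        if target == "" {
            target = c
        } else if c != target {
            fmt.Fprintln(os.Stderr, "staged files were last changed in different commits, aborting")
            os.Exit(1)
        }
    }
    if target == "" {
        fmt.Fprintln(os.Stderr, "nothing staged")
        os.Exit(1)
    }
    // Make the fixup commit; the autosquash rebase is left for you to run by hand.
    git("commit", "--fixup="+target)
    fmt.Println("now run: git rebase --interactive --autosquash " + target + "^")
}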

Using PyPy (or thinking about it) exposed a bug in closing files

By: cks

Over on the Fediverse, I said:

A fun Python error some code can make and not notice until you run it under PyPy is a function that has 'f.close' at the end instead of 'f.close()' where f is an open()'d file.

(Normal CPython will immediately close the file when the function returns due to refcounted GC. PyPy uses non-refcounted GC so the file remains open until GC happens, and so you can get too many files open at once. Not explicitly closing files is a classic PyPy-only Python bug.)

When a Python file object is garbage collected, Python arranges to close the underlying C level file descriptor if you didn't already call .close(). In CPython, garbage collection is deterministic and generally prompt; for example, when a function returns, all of its otherwise unreferenced local variables will be garbage collected as their reference counts drop to zero. However, PyPy doesn't use reference counting for its garbage collection; instead, like Go, it only collects garbage periodically, and so will only close files as a side effect some time later. This can make it easy to build up a lot of open files that aren't doing anything, and possibly run your program out of available file descriptors, something I've run into in the past.

I recently wanted to run a hacked up version of a NFS monitoring program written in Python under PyPy instead of CPython, so it would run faster and use less CPU on the systems I was interested in. Since I remembered this PyPy issue, I found myself wondering if it properly handled closing the file(s) it had to open, or if it left it to CPython garbage collection. When I looked at the code, what I found can be summarized as 'yes and no':

def parse_stats_file(filename):
  [...]
  f = open(filename)
  [...]
  f.close

  return ms_dict

Because I was specifically looking for uses of .close(), the lack of the '()' immediately jumped out at me (and got fixed in my hacked version).

It's easy to see how this typo could linger undetected in CPython. The line 'f.close' itself does nothing but isn't an error, and then 'f' is implicitly closed in the next line, as part of the 'return', so even if you're looking at this program's file descriptor usage while it's running you won't see any leaks.

(I'm not entirely a fan of nondeterministic garbage collection, at least in the context of Python, where deterministic GC was a long standing feature of the language in practice.)

Always sync your log or journal files when you open them

By: cks

Today I learned of a new way to accidentally lose data 'written' to disk, courtesy of this Fediverse post summarizing a longer article about CouchDB and this issue. Because this was so nifty and startling when I encountered it, yet so simple, I'm going to re-explain the issue in my own words and explain how it leads to the title of this entry.

Suppose that you have a program that makes data it writes to disk durable through some form of journal, write ahead log (WAL), or the like. As we all know, data that you simply write() to the operating system isn't yet on disk; the operating system is likely buffering the data in memory before writing it out at the OS's own convenience. To make the data durable, you must explicitly flush it to disk (well, ask the OS to), for example with fsync(). Your program is a good program, so of course it does this; when it updates the WAL, it write()s then fsync()s.

Now suppose that your program is terminated after the write but before the fsync. At this point you have a theoretically incomplete and improperly written journal or WAL, since it hasn't been fsync'd. However, when your program restarts and goes through its crash recovery process, it has no way to discover this. Since the data was written (into the OS's disk cache), the OS will happily give the data back to you even though it's not yet on disk. Now assume that your program takes further actions (such as updating its main files) based on the belief that the WAL is fully intact, and then the system crashes, losing that buffered and not yet written WAL data. Oops. You (potentially) have a problem.

(These days, programs can get terminated for all sorts of reasons other than a program bug that causes a crash. If you're operating in a modern containerized environment, your management system can decide that your program or its entire container ought to shut down abruptly right now. Or something else might have run the entire system out of memory and now some OOM handler is killing your program.)

To avoid the possibility of this problem, you need to always force a disk flush when you open your journal, WAL, or whatever; on Unix, you'd immediately fsync() it. If there's no unwritten data, this will generally be more or less instant. If there is unwritten data because you're restarting after the program was terminated by surprise, this might take a bit of time but ensures that the on-disk state matches the state that you're about to observe through the OS.
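As a minimal sketch of this in Go (the file name is just an example):

package main

import "os"

// openJournal opens (or creates) a journal/WAL file and immediately syncs it,
// so that anything a previous run write()'d but never fsync()'d is forced to
// disk before we trust what we read back during recovery.
func openJournal(path string) (*os.File, error) {
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o644)
    if err != nil {
        return nil, err
    }
    if err := f.Sync(); err != nil {
        f.Close()
        return nil, err
    }
    return f, nil
}

func main() {
    f, err := openJournal("journal.wal")
    if err != nil {
        panic(err)
    }
    f.Close()
}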

(CouchDB's article points to another article, Justin Jaffray’s NULL BITMAP Builds a Database #2: Enter the Memtable, which has a somewhat different way for this failure to bite you. I'm not going to try to summarize it here but you might find the article interesting reading.)

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

MFA's "push notification" authentication method can be easier to integrate

By: cks

For reasons outside the scope of this entry, I'm looking for an OIDC or SAML identity provider that supports primary user and password authentication against our own data and then MFA authentication through the university's SaaS vendor. As you'd expect, the university's MFA SaaS vendor supports all of the common MFA approaches today, covering push notifications through phones, one time codes from hardware tokens, and some other stuff. However, pretty much all of the MFA integrations I've been able to find only support MFA push notifications (eg, also). When I thought about it, this made a lot of sense, because it's often going to be much easier to add push notification MFA than any other form of it.

A while back I wrote about exploiting password fields for multi-factor authentication, where various bits of software hijacked password fields to let people enter things like MFA one time codes into systems (like OpenVPN) that were never set up for MFA in the first place. With most provider APIs, authentication through push notification can usually be inserted in a similar way, because from the perspective of the overall system it can be a synchronous operation. The overall system calls a 'check' function of some sort, the check function calls out to the provider's API and then possibly polls for a result for a while, and then it returns a success or a failure. There's no need to change the user interface of authentication or add additional high level steps.

(The exception is if the MFA provider's push authentication API only returns results to you by making a HTTP query to you. But I think that this would be a relatively weird API; a synchronous reply or at least a polled endpoint is generally much easier to deal with and is more or less required to integrate push authentication with non-web applications.)
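To sketch what this synchronous shape looks like (this isn't any real vendor's API; the URLs, endpoints, and JSON fields here are invented purely for illustration):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "time"
)

// checkPushMFA starts a push for the user and polls until the (hypothetical)
// provider reports a decision or we give up. It returns true only on approval.
func checkPushMFA(user string) (bool, error) {
    resp, err := http.PostForm("https://mfa.example.org/api/push", url.Values{"user": {user}})
    if err != nil {
        return false, err
    }
    var start struct {
        TxID string `json:"txid"`
    }
    err = json.NewDecoder(resp.Body).Decode(&start)
    resp.Body.Close()
    if err != nil {
        return false, err
    }

    // Poll the (made up) status endpoint until we get an answer or time out.
    deadline := time.Now().Add(60 * time.Second)
    for time.Now().Before(deadline) {
        r, err := http.Get("https://mfa.example.org/api/push/" + start.TxID)
        if err != nil {
            return false, err
        }
        var st struct {
            Result string `json:"result"` // "waiting", "approved", or "denied"
        }
        err = json.NewDecoder(r.Body).Decode(&st)
        r.Body.Close()
        if err != nil {
            return false, err
        }
        switch st.Result {
        case "approved":
            return true, nil
        case "denied":
            return false, nil
        }
        time.Sleep(2 * time.Second)
    }
    return false, fmt.Errorf("push MFA timed out for %s", user)
}

func main() {
    ok, err := checkPushMFA("someuser")
    fmt.Println(ok, err)
}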

By contrast, if you need to get a one time code from the person, you have to do things at a higher level and it may not fit well in the overall system's design (or at least the easily exposed points for plugins and similar things). Instead of immediately returning a successful or failed authentication, you now need to display an additional prompt (in many cases, a HTML page), collect the data, and only then can you say yes or no. In a web context (such as a SAML or OIDC IdP), the provider may want you to redirect the user to their website and then somehow call you back with a reply, which you'll have to re-associate with context and validate. All of this assumes that you can even interpose an additional prompt and reply, which isn't the case in some contexts unless you do extreme things.

(Sadly this means that if you have a system that only supports MFA push authentication and you need to also accept codes and so on, you may be in for some work with your chainsaw.)

Go's behavior for zero value channels and maps is partly a choice

By: cks

How Go behaves if you have a zero value channel or map (a 'nil' channel or map) is somewhat confusing (cf, via). When we talk about it, it's worth remembering that this behavior is a somewhat arbitrary choice on Go's part, not a fundamental set of requirements that stems from, for example, other language semantics. Go has reasons to have channels and maps behave as they do, but some of those reasons have to do with how channel and map values are implemented and some are about what's convenient for programming.

As hinted at by how their zero value is called a 'nil' value, channel and map values are both implemented as pointers to runtime data structures. A nil channel or map has no such runtime data structure allocated for it (and the pointer value is nil); these structures are allocated by make(). However, this doesn't entirely allow us to predict what happens when you use nil values of either type. It's not unreasonable for an attempt to assign an element to a nil map to panic, since the nil map has no runtime data structure allocated to hold anything we try to put in it. But you don't have to say that a nil map is empty and that looking up elements in it gives you a zero value; I think you could have this panic instead, just as assigning an element does. However, this would probably result in less safe code that panicked more (and probably had more checks for nil maps, too).
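
As a quick runnable illustration of the map half of this (standard Go behavior, nothing hypothetical): reading from a nil map quietly returns the zero value, while assigning to one panics.

package main

import "fmt"

func main() {
    var m map[string]int // a nil map: no runtime structure allocated

    // Looking up a key in a nil map is defined to return the zero value.
    fmt.Println(m["missing"], len(m)) // prints: 0 0

    // Assigning to a nil map panics; there's nowhere to store the element.
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r) // assignment to entry in nil map
        }
    }()
    m["key"] = 1
}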

Then there's nil channels, which don't behave like nil maps. It would make sense for receiving from a nil channel to yield the zero value, much like looking up an element in a nil map, and for sending to a nil channel to panic, again like assigning to an element in a nil map (although in the channel case it would be because there's no runtime data structure where your goroutine could metaphorically hang its hat waiting for a receiver). Instead Go chooses to make both operations (permanently) block your goroutine, with panicking on send reserved for sending to a non-nil but closed channel.
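
The channel half of this is also easy to demonstrate (again, this is just standard Go behavior): a nil channel's cases are simply never ready in a select, and the panic is reserved for sending on a closed channel.

package main

import (
    "fmt"
    "time"
)

func main() {
    var ch chan int // a nil channel

    // Sends and receives on a nil channel block forever, so in a select
    // these cases are never ready. (A bare '<-ch' with no other goroutines
    // would be reported as a deadlock by the runtime.)
    select {
    case v := <-ch:
        fmt.Println("received", v) // never happens
    case ch <- 1:
        fmt.Println("sent") // never happens either
    case <-time.After(100 * time.Millisecond):
        fmt.Println("the nil channel was never ready")
    }

    // By contrast, sending on a closed (non-nil) channel is what panics.
    c := make(chan int)
    close(c)
    defer func() { fmt.Println("recovered:", recover()) }()
    c <- 1 // panic: send on closed channel
}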

The current semantics of sending on a closed channel combined with select statements (and to a lesser extent receiving from a closed channel) means that Go needs a channel zero value that is never ready to send or receive. However, I believe that Go could readily make actual sends or receives on nil channels panic without any language problems. As a practical matter, sending or receiving on a nil channel is a bug that will leak your goroutine even if your program doesn't deadlock.

Similarly, Go could choose to allocate an empty map runtime data structure for zero value maps, and then let you assign to elements in the resulting map rather than panicking. If desired, I think you could preserve a distinction between empty maps and nil maps. There would be some drawbacks to this that cut against Go's general philosophy of being relatively explicit about (heap) allocations, and you'd want a clever compiler that didn't bother creating those zero value runtime map data structures when they'd just be overwritten by 'make()' or a return value from a function call or the like.

(I can certainly imagine a quite Go like language where maps don't have to be explicitly set up any more than slices do, although you might still use 'make()' if you wanted to provide size hints to the runtime.)

Sidebar: why you need something like nil channels

We all know that sometimes you want to stop sending or receiving on a channel in a select statement. On first impression it looks like closing a channel (instead of setting the channel to nil) could be made to work for this (it doesn't currently). The problem is that closing a channel is a global thing, while you may only want a local effect; you want to remove the channel from your select, but not close down other uses of it by other goroutines.

This need for a local effect pretty much requires a special, distinct channel value that is never ready for sending or receiving, so you can overwrite the old channel value with this special value, which we might as well call a 'nil channel'. Without a channel value that serves this purpose you'd have to complicate select statements with some other way to disable specific channels.

(I had to work this out in my head as part of writing this entry so I might as well write it down for my future self.)
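
Here's a small runnable Go example of the pattern this sidebar describes: a merge function that sets its local channel variable to nil once that channel is closed, which removes the corresponding select cases without affecting any other users of the channel.

package main

import "fmt"

// merge forwards values from a and b to out until both are closed. When one
// input is closed, the local variable for it is set to nil so its select
// cases can never fire again; the channel itself isn't touched.
func merge(a, b <-chan int, out chan<- int) {
    for a != nil || b != nil {
        select {
        case v, ok := <-a:
            if !ok {
                a = nil
                continue
            }
            out <- v
        case v, ok := <-b:
            if !ok {
                b = nil
                continue
            }
            out <- v
        }
    }
    close(out)
}

func main() {
    a, b, out := make(chan int), make(chan int), make(chan int)
    go func() { a <- 1; close(a) }()
    go func() { b <- 2; close(b) }()
    go merge(a, b, out)
    for v := range out {
        fmt.Println(v)
    }
}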

JSON has become today's machine-readable output format (on Unix)

By: cks

Recently, I needed to delete about 1,200 email messages to a particular destination from the mail queue on one of our systems. This turned out to be trivial, because this system was using Postfix and modern versions of Postfix can output mail queue status information in JSON format. So I could dump the mail queue status, select the relevant messages and print the queue IDs with jq, and feed this to Postfix to delete the messages. This experience has left me with the definite view that everything should have the option to output JSON for 'machine-readable' output, rather than some bespoke format. For new programs, I think that you should only bother producing JSON as your machine readable output format.

(If you strongly object to JSON, sure, create another machine readable output format too. But if you don't care one way or another, outputting only JSON is probably the easiest approach for programs that don't already have such a format of their own.)
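
Going back to the Postfix example, here's an illustrative Go version of the sort of filtering I did with jq. It assumes the rough shape of 'postqueue -j' output (one JSON object per queued message, with a 'queue_id' field and a 'recipients' array of objects that have an 'address' field); check that against your own Postfix version, and the '@example.com' destination is a placeholder.

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
    "strings"
)

type queueEntry struct {
    QueueID    string `json:"queue_id"`
    Recipients []struct {
        Address string `json:"address"`
    } `json:"recipients"`
}

func main() {
    const domain = "@example.com" // placeholder destination
    sc := bufio.NewScanner(os.Stdin)
    sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
    for sc.Scan() {
        var e queueEntry
        if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
            continue // skip lines that aren't JSON objects
        }
        for _, r := range e.Recipients {
            if strings.HasSuffix(r.Address, domain) {
                fmt.Println(e.QueueID)
                break
            }
        }
    }
}

The queue IDs it prints can then be fed to 'postsuper -d -', which reads them from standard input, just as with the jq pipeline.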

This isn't because JSON is the world's best format (JSON is at best the least bad format). Instead it's because JSON has a bunch of pragmatic virtues on a modern Unix system. In general, JSON provides a clear and basically unambiguous way to represent text data and much numeric data, even if it has relatively strange characters in it (ie, JSON has escaping rules that everyone knows and all tools can deal with); it's also generally extensible to add additional data without causing heartburn in tools that are dealing with older versions of a program's output. And on Unix there's an increasingly rich collection of tools to deal with and process JSON, starting with jq itself (and hopefully soon GNU Awk in common configurations). Plus, JSON can generally be transformed to various other formats if you need them.

(JSON can also be presented and consumed in either multi-line or single line formats. Multi-line output is often much more awkward to process in other possible formats.)

There's nothing unique about JSON in all of this; it could have been any other format with similar virtues where everything lined up this way for the format. It just happens to be JSON at the moment (and probably well into the future), instead of (say) XML. For individual programs there are simpler 'machine readable' output formats, but they either have restrictions on what data they can represent (for example, no spaces or tabs in text), or require custom processing that goes well beyond basic grep and awk and other widely available Unix tools, or both. But JSON has become a "narrow waist" for Unix programs talking to each other, a common coordination point that means people don't have to invent another format.

(JSON is also partially self-documenting; you can probably look at a program's JSON output and figure out what various parts of it mean and how it's structured.)

PS: Using JSON also means that people writing programs don't have to design their own machine-readable output format. Designing a machine readable output format is somewhat more complicated than it looks, so I feel that the less of it people need to do, the better.

(I say this as a system administrator who's had to deal with a certain amount of output formats that have warts that make them unnecessarily hard to deal with.)

Institutions care about their security threats, not your security threats

By: cks

Recently I was part of a conversation on the Fediverse that sparked an obvious-in-retrospect realization about computer security and how we look at and talk about security measures. To put it succinctly, your institution cares about threats to it, not about threats to you. It cares about threats to you only so far as they're threats to it through you. Some of the security threats and sensible responses to them overlap between you and your institution, but some of them don't.

One of the areas where I think this especially shows up is in issues around MFA (Multi-Factor Authentication). For example, it's a not infrequently observed thing that if all of your factors live on a single device, such as your phone, then you actually have single factor authentication (this can happen with many of the different ways to do MFA). But for many organizations, this is relatively fine (for them). Their largest risk is that Internet attackers are constantly trying to (remotely) phish their people, often in moderately sophisticated ways that involve some prior research (which is worth it for the attackers because they can target many people with the same research). Ignoring MFA alert fatigue for a moment, even MFA that all lives on a single physical device will cut off all of this, because Internet attackers don't have people's smartphones.

For individual people, of course, this is potentially a problem. If someone can gain access to your phone, they get everything, and probably across all of the online services you use. If you care about security as an individual person, you want attackers to need more than one thing to get all of your accounts. Conversely, for organizations, compromising all of their systems at once is sort of a given, because that's what it means to have a Single Sign On system and global authentication. Only a few organizational systems will be separated from the general SSO (and organizations have to hope that their people cooperate by using different access passwords).

Organizations also have obvious solutions to things like MFA account recovery. They can establish and confirm the identities of people associated with them, and a process to establish MFA in the first place, so if you lose whatever lets you do MFA (perhaps your work phone's battery has gotten spicy), they can just run you through the enrollment process again. Maybe there will be a delay, but if so, the organization has broadly decided to tolerate it.

(And I just recently wrote about the difference between 'internal' accounts and 'external' accounts, where people generally know who is in an organization and so who has an account, so allowing this information to leak through your authentication isn't usually a serious problem.)

Another area where I think this difference in the view of threats shows up is in the tradeoffs involved in disk encryption on laptops and desktops used by people. For an organization, choosing non-disclosure over availability on employee devices makes a lot of sense. The biggest threat as the organization sees it isn't data loss on a laptop or desktop (especially if they write policies about backups and where data is supposed to be stored), it's an attacker making off with one and having the data disclosed, which is at least bad publicity and makes the executives unhappy. You may feel differently about your own data, depending on how good your backups are.

HTTP connections are part of the web's long tail

By: cks

I recently read an article that, among other things, apparently seriously urged browser vendors to deprecate and disable plain text HTTP connections by the end of October of this year (via, and I'm deliberately not linking directly to the article). While I am a strong fan of HTTPS in general, I have some feelings about a rapid deprecation of HTTP. One of my views is that plain text HTTP is part of the web's long tail.

As I'm using the term here, the web's long tail (also) is the huge mass of less popular things that are individually less frequently visited but which in aggregate amount to a substantial part of the web. The web's popular, busy sites are frequently updated and can handle transitions without problems. They can readily switch to using modern HTML, modern CSS, modern JavaScript, and so on (although they don't necessarily do so), and along with that update all of their content to HTTPS. In fact they mostly or entirely have done so over the last ten to fifteen years. The web's long tail doesn't work like that. Parts of it use old JavaScript, old CSS, old HTML, and these days, plain HTTP (in addition to the people who have objections to HTTPS and deliberately stick to HTTP).

The aggregate size and value of the long tail is part of why browsers have maintained painstaking compatibility back to old HTML so far, including things like HTML Image Maps. There's plenty of parts of the long tail that will never be updated to have HTTPS or work properly with it. For browsers to discard HTTP anyway would be to discard that part of the long tail, which would be a striking break with browser tradition. I don't think this is very likely and I certainly hope that it never comes to pass, because that long tail is part of what gives the web its value.

(It would be an especially striking break since a visible percentage of page loads still happen with HTTP instead of HTTPS. For example, Google's stats say that globally 5% of Windows Chrome page loads apparently still use HTTP. That's roughly one in twenty page loads, and the absolute number is going to be very large given how many page loads happen with Chrome on Windows. This large number is one reason I don't think this is at all a serious proposal; as usual with this sort of thing, it ignores that social problems are the ones that matter.)

PS: Of course, not all of the HTTP connections are part of the web's long tail as such. Some of them are to, for example, manage local devices via little built in web servers that simply don't have HTTPS. The people with these devices aren't in any rush to replace them just because some people don't like HTTP, and the vendors who made them aren't going to update their software to support (modern) HTTPS even for the devices which support firmware updates and where the vendor is still in business.

(You can view them as part of the long tail of 'the web' as a broad idea and interface, even though they're not exposed to the world the way that the (public) web is.)

It's good to have offline contact information for your upstream networking

By: cks

So I said something on the Fediverse:

Current status: it's all fun and games until the building's backbone router disappears.

A modest suggestion: obtain problem reporting/emergency contact numbers for your upstream in advance and post them on the wall somewhere. But you're on your own if you use VOIP desk phones.

(It's back now or I wouldn't be posting this, I'm in the office today. But it was an exciting 20 minutes.)

(I was somewhat modeling the modest suggestion after nuintari's Fediverse series of "rules of networking", eg, also.)

The disappearance of the building's backbone router took out all local networking in the particular building that this happened in (which is the building with our machine room), including the university wireless in the building. The disappearance of the wireless was especially surprising, because the wireless SSID disappeared entirely.

(My assumption is that the university's enterprise wireless access points stopped advertising the SSID when they lost some sort of management connection to their control plane.)

In a lot of organizations you might have been able to relatively easily find the necessary information even with this happening. For example, people might have smartphones with data plans and laptops that they could tether to the smartphones, and then use this to get access to things like the university directory, the university's problem reporting system, and so on. For various reasons, we didn't really have any of this available, which left us somewhat at a loss when the external networking evaporated. Ironically we'd just managed to finally find some phone numbers and get in touch with people when things came back.

(One bit of good news is that our large scale alert system worked great to avoid flooding us with internal alert emails. My personal alert monitoring (also) did get rather noisy, but that also let me see right away how bad it was.)

Of course there's always things you could do to prepare, much like there are often too many obvious problems to keep track of them all. But in the spirit of not stubbing our toes on the same problem a second time, I suspect we'll do something to keep some problem reporting and contact numbers around and available.

Shared (Unix) hosting and the problem of managing resource limits

By: cks

Yesterday I wrote about how one problem with shared Unix hosting was the lack of good support for resource limits in the Unixes of the time. But even once you have decent resource limits, you still have an interlinked set of what we could call 'business' problems. These are the twin problems of what resource limits you set on people and how you sell different levels of these resources limits to your customers.

(You may have the first problem even for purely internal resource allocation on shared hosts within your organization, and it's never a purely technical decision.)

The first problem is whether you overcommit what you sell and in general how you decide on the resource limits. Back in the big days of the shared hosting business, I believe that overcommitting was extremely common; servers were expensive and most people didn't use many resources on average. If you didn't overcommit your servers, you had to charge more and most people weren't interested in paying that. Some resources, such as CPU time, are 'flow' resources that can be rebalanced on the fly, restricting everyone to a fair share when the system is busy (even if that share is below what they're nominally entitled to), but it's quite difficult to take memory back (or disk space). If you overcommit memory, your systems might blow up under enough load. If you don't overcommit memory, either everyone has to pay more or everyone gets unpopularly low limits.

(You can also do fancy accounting for 'flow' resources, such as allowing bursts of high CPU but not sustained high CPU. This is harder to do gracefully for things like memory, although you can always do it ungracefully by terminating things.)

The other problem entwined with setting resource limits is how (and if) you sell different levels of resource limits to your customers. A single resource limit is simple but probably not what all of your customers want; some will want more and some will only need less. But if you sell different limits, you have to tell customers what they're getting, let them assess their needs (which isn't always clear in a shared hosting situation), deal with them being potentially unhappy if they think they're not getting what they paid for, and so on. Shared hosting is always likely to have complicated resource limits, which raises the complexity of selling them (and of understanding them, for the customers who have to pick one to buy).

Viewed from the right angle, virtual private servers (VPSes) are a great abstraction to sell different sets of resource limits to people in a way that's straightforward for them to understand (and which at least somewhat hides whether or not you're overcommitting resources). You get 'a computer' with these characteristics, and most of the time it's straightforward to figure out whether things fit (the usual exception is IO rates). So are more abstracted, 'cloud-y' ways of selling computation, database access, and so on (at least in areas where you can quantify what you're doing into some useful unit of work, like 'simultaneous HTTP requests').

It's my personal suspicion that even if the resource limitation problems had been fully solved much earlier, shared hosting would have still fallen out of fashion in favour of simpler to understand VPS-like solutions, where what you were getting and what you were using (and probably what you needed) were a lot clearer.

One problem with "shared Unix hosting" was the lack of resource limits

By: cks

I recently read Comments on Shared Unix Hosting vs. the Cloud (via), which I will summarize as being sad about how old fashioned shared hosting on a (shared) Unix system has basically died out, and along with it web server technology like CGI. As it happens, I have a system administrator's view of why shared Unix hosting always had problems and was a down-market thing with various limitations, and why even today people aren't very happy with providing it. In my view, a big part of the issue was the lack of resource limits.

The problem with sharing a Unix machine with other people is that by default, those other people can starve you out. They can take up all of the available CPU time, memory, process slots, disk IO, and so on. On an unprotected shared web server, all you need is one person's runaway 'CGI' code (which might be PHP code or something else) or even an unusually popular dynamic site and all of the other people wind up having a bad time. Life gets worse if you allow people to log in, run things in the background, run things from cron, and so on, because all of these can add extra load. In order to make shared hosting be reliable and good, you need some way of forcing a fair sharing of resources and limiting how much resources a given customer can use.

Unfortunately, for much of the practical life of shared Unix hosting, Unixes did not have that. Some Unixes could create various sorts of security boundaries, but generally not resource usage limits that applied to an entire group of processes. Even once this became possible to some degree in Linux through cgroup(s), the kernel features took some time to mature and then it took even longer for common software to support running things in isolated and resource controlled cgroups. Even today it's still not necessarily entirely there for things like running CGIs from your web server, never mind a potential shared database server to support everyone's database backed blog.

(A shared database server needs to implement its own internal resource limits for each customer, otherwise you have to worry about a customer gumming it up with expensive queries, a flood of queries, and so on. If they need separate database servers for isolation and resource control, now they need more server resources.)

My impression is that the lack of kernel supported resource limits forced shared hosting providers to roll their own ad-hoc ways of limiting how much resources their customers could use. In turn this created the array of restrictions that you used to see on such providers, with things like 'no background processes', 'your CGI can only run for so long before being terminated', 'your shell session is closed after N minutes', and so on. If shared hosting had been able to put real limits on each of their customers, this wouldn't have been as necessary; you could go more toward letting each customer blow itself up if it over-used resources.

(How much resources to give each customer is also a problem, but that's another entry.)

More potential problems for people with older browsers

By: cks

I've written before that keeping your site accessible to very old browsers is non-trivial because of issues like them not necessarily supporting modern TLS. However, there's another problem that people with older browsers are likely to be facing, unless circumstances on the modern web change. I said on the Fediverse:

Today in unfortunate web browser developments: I think people using older versions of browsers, especially Chrome, are going to have increasing problems accessing websites. There are a lot of (bad) crawlers out there forging old Chrome versions, perhaps due to everyone accumulating AI training data, and I think websites are going to be less and less tolerant of them.

(Mine sure is currently, as an experiment.)

(By 'AI' I actually mean LLM.)

I covered some request volume information yesterday and it (and things I've seen today) strongly suggest that there is a lot of undercover scraping activity going on. Much of that scraping activity uses older browser User-Agents, often very old, which means that people who don't like it are probably increasingly going to put roadblocks in the way of anything presenting those old User-Agent values (there are already open source projects designed to frustrate LLM scraping and there will probably be more in the future).

(Apparently some LLM scrapers start out with honest User-Agents but then switch to faking them if you block their honest versions.)

There's no particular reason why scraping software can't use current User-Agent values, but it probably has to be updated every so often when new browser versions come out and people haven't done that so far. Much like email anti-spam efforts changing email spammer behavior, this may change if enough websites start reacting to old User-Agents, but I suspect that it will take a while for that to come to pass. Instead I expect it to be a smaller scale, distributed effort from 'unimportant' websites that are getting overwhelmed, like LWN (see the mention of this in their 'what we haven't added' section).

Major websites probably won't outright reject old browsers, but I suspect that they'll start throwing an increased amount of blocks in the way of 'suspicious' browser sessions with those User-Agents. This is likely to include CAPTCHAs and other such measures that they already use some of the time. CAPTCHAs aren't particularly effective at stopping bad actors in practice but they're the hammer that websites already have, so I'm sure they'll be used on this nail.

Another thing that I suspect will start happening is that more sites will start insisting that you run some JavaScript to pass a test in order to access them (whether this is an explicit CAPTCHA or just passive JavaScript that has to execute). This will stop LLM scrapers that don't run JavaScript, which is not all of them, and force the others to spend a certain amount of CPU and memory, driving up the aggregate cost of scraping your site dry. This will of course adversely affect people without JavaScript in their browser and those of us who choose to disable it for most sites, but that will be seen as the lesser evil by people who do this. As with anti-scraper efforts, there are already open source projects for this.

(This is especially likely to happen if LLM scrapers modernize their claimed User-Agent values to be exactly like current browser versions. People are going to find some defense.)

PS: I've belatedly made the Wandering Thoughts blocks for old browsers now redirect people to a page about the situation. I've also added a similar page for my current block of most HTTP/1.0 requests.

The HTTP status codes of responses from about 21 hours of traffic to here

By: cks

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got over roughly the past 21 hours. I think this is about typical, although there may be more blocked things than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, repeat high volume request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got a HTTP 304 response) and 8.5% actually fetched feed data.)

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

  5499 403
   510 200
    39 301
     3 304

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the breakdown of traffic is 21183 normal requests (42%), 45% feed requests, and the remainder for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence, and it's possible that part of what is going on is that I have mis-firing blocking rules (alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source). However, some of the ASNs that show up highly are definitely ones I recognize from other contexts, such as attempted comment spam.

Update: Well that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions that were temporarily unable to read the blog because of my overly-aggressive blocks.

Why I have a little C program to filter a $PATH (more or less)

By: cks

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)
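
For illustration, here's a rough Go equivalent of the idea (this isn't the original C program): keep only the arguments that exist as directories, and drop duplicates that turn out to be the same underlying directory (for example via symlinks), using device and inode identity. It's Unix-specific because of the syscall.Stat_t use.

package main

import (
    "fmt"
    "os"
    "strings"
    "syscall"
)

func main() {
    type devino struct{ dev, ino uint64 }
    seen := make(map[devino]bool)
    var kept []string

    for _, dir := range os.Args[1:] {
        fi, err := os.Stat(dir)
        if err != nil || !fi.IsDir() {
            continue // not present on this system, or not a directory
        }
        if st, ok := fi.Sys().(*syscall.Stat_t); ok {
            id := devino{uint64(st.Dev), uint64(st.Ino)}
            if seen[id] {
                continue // the same directory under another name
            }
            seen[id] = true
        }
        kept = append(kept, dir)
    }
    fmt.Println(strings.Join(kept, " "))
}

You'd use it much like the original, handing it the entire giant candidate $PATH as arguments and using whatever it prints as the real one.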

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only builtin that will tell you if a directory exists is to try to cd to it. That we change directories is harmless because everything is running inside the equivalent of a Bourne shell $(...).

Keen eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.

The profusion of things that could be in your $PATH on old Unixes

By: cks

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad things for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new to you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

'Internal' accounts and their difference from 'external' accounts

By: cks

In the comments on my entry on how you should respond to authentication failures depends on the circumstances, sapphirepaw said something that triggered a belated realization in my mind:

Probably less of a concern for IMAP, but in a web app, one must take care to hide the information completely. I was recently at a site that wouldn't say whether the provided email was valid for password reset, but would reveal it was in use when trying to create a new account.

The realization this sparked is that we can divide accounts and systems into two sorts, which I will call internal and external, and how you want to treat things around these accounts is possibly quite different.

An internal account is one that's held by people within your organization, and generally is pretty universal. If you know that someone is a member of the organization you can predict that they have an account on the system, and not infrequently what the account name is. For example, if you know that someone is a graduate student here it's a fairly good bet that they have an account with us and you may even be able to find and work out their login name. The existence of these accounts and even specifics about who has what login name (mostly) isn't particularly secret or sensitive.

(Internal accounts don't have to be on systems that the organization runs; they could be, for example, 'enterprise' accounts on someone else's SaaS service. Once you know that the organization uses a particular SaaS offering or whatever, you're usually a lot of the way to identifying all of their accounts.)

An external account is one that's potentially held by people from all over, far outside the bounds of a single organization (including the one running the systems the account is used with). A lot of online accounts with websites are like this, because most websites are used by lots of people from all over. Who has such an account may be potentially sensitive information, depending on the website and the feelings of the people involved, and the account identity may be even more sensitive (it's one thing to know that a particular email address has a Fediverse account on mastodon.social, but it may be quite different to know which account that is, depending on various factors).

There's a spectrum of potential secrecy between these two categories. For example, the organization might not want to openly reveal which external SaaS products they use, what entity name the organization uses on them, and the specific names people use for authentication, all in the name of making it harder to break into their environment at the SaaS product. And some purely internal systems might have a very restricted access list that is kept at least somewhat secret so attackers don't know who to target. But I think the broad division between internal and external is useful because it does a lot to point out where any secrecy is.

When I wrote my entry, I was primarily thinking about internal accounts, because internal accounts are what we deal with (and what many internal system administration groups handle). As sapphirepaw noted, the concerns and thus the rules are quite different for external accounts.

(There may be better labels for these two sorts of accounts; I'm not great with naming.)

How you should respond to authentication failures isn't universal

By: cks

A discussion broke out in the comments on my entry on how everything should be able to ratelimit authentication failures, and one thing that came up was the standard advice that when authentication fails, the service shouldn't give you any indication of why. You shouldn't react any differently if it's a bad password for an existing account, an account that doesn't exist any more (perhaps with the correct password for the account when it existed), an account that never existed, and so on. This is common and long standing advice, but like a lot of security advice I think that the real answer is that what you should do depends on your circumstances, priorities, and goals.

The overall purpose of the standard view is to not tell attackers what they got wrong, and especially not to tell them if the account doesn't even exist. What this potentially achieves is slowing down authentication guessing and making the attacker use up more resources with no chance of success, so that if you have real accounts with vulnerable passwords the attacker is less likely to succeed against them. However, you shouldn't have weak passwords any more and on the modern Internet, attackers aren't short of resources or likely to suffer any consequences for trying and trying against you (and lots of other people). In practice, much like delays on failed authentications, it's been a long time since refusing to say why something failed meaningfully impeded attackers who are probing standard setups for SSH, IMAP, authenticated SMTP, and other common things.

(Attackers are probing for default accounts and default passwords, but the fix there is not to have any, not to slow attackers down a bit. Attackers will find common default account setups, probably much sooner than you would like. Well informed attackers can also generally get a good idea of your valid accounts, and they certainly exist.)

If what you care about is your server resources and not getting locked out through side effects, it's to your benefit for attackers to stop early. In addition, attackers aren't the only people who will fail your authentication. Your own people (or ex-people) will also be doing a certain amount of it, and some amount of the time they won't immediately realize what's wrong and why their authentication attempt failed (in part because people are sadly used to systems simply being flaky, so retrying may make things work). It's strictly better for your people if you can tell them what was wrong with their authentication attempt, at least to a certain extent. Did they use a non-existent account name? Did they format the account name wrong? Are they trying to use an account that has now been disabled (or removed)? And so on.

(Some of this may require ingenious custom communication methods (and custom software). In the comments on my entry, BP suggested 'accepting' IMAP authentication for now-closed accounts and then providing them with only a read-only INBOX that had one new message that said 'your account no longer exists, please take it out of this IMAP client'.)

There's no universally correct trade-off between denying attackers information and helping your people. A lot of where your particular trade-offs fall will depend on your usage patterns, for example how many of your people make mistakes of various sorts (including 'leaving their account configured in clients after you've closed it'). Some of it will also depend on how much resources you have available to do a really good job of recognizing serious attacks and impeding attackers with measures like accurately recognizing 'suspicious' authentication patterns and blocking them.

(Typically you'll have no resources for this and will be using more or less out of the box rate-limiting and other measures in whatever software you use. Of course this is likely to limit your options for giving people special messages about why they failed authentication, but one of my hopes is that over time, software adds options to be more informative if you turn them on.)

A surprise with rspamd's spam scoring and a workaround

By: cks

Over on the Fediverse, I shared a discovery:

This is my face when rspamd will apparently pattern-match a mention of 'test@test' in the body of an email, extract 'test', try that against the multi.surbl.org DNS blocklist (which includes it), and decide that incoming email is spam as a result.

Although I didn't mention it in the post, I assume that rspamd's goal is to extract the domain from email addresses and see if the domain is 'bad'. This handles a not uncommon pattern of spammer behavior where they send email from a throwaway setup but direct your further email to their long term address. One sees similar things with URLs, and I believe that rspamd will extract domains from URLs in messages as well.

(Rspamd is what we currently use for scoring email for spam, for various reasons beyond the scope of this entry.)

The sign of this problem happening was message summary lines in the rspamd log that included annotations like (with a line split and spacing for clarity):

[...] MW_SURBL_MULTI(7.50){test:email;},
PH_SURBL_MULTI(5.00){test:email;} [...]

As I understand it, the 'test:email' bit means that the thing being looked up in multi.surbl.org was 'test' and it came from the email message (I don't know if it's specifically the body of the email message or this could also have been in the headers). The SURBL reasonably lists 'test' for, presumably, testing purposes, much like many IP based DNSBLs list various 127.0.0.* IPs. Extracting a dot-less 'domain' from a plain text email message is a bit aggressive, but we get the rspamd that we get.

(You might wonder where 'test@test' comes from; the answer is that in Toronto it's a special DSL realm that's potentially useful for troubleshooting your DSL (also).)

Fortunately rspamd allows exceptions. If your rspamd configuration directory is /etc/rspamd as normal, you can put a 'map' file of SURBL exceptions at /etc/rspamd/local.d/map.d/surbl-whitelist.inc.local. You can discover this location by reading modules.d/rbl.conf, which you can find by grep'ing the entire /etc/rspamd tree for 'surbl' (yes, sometimes I use brute force). The best documentation on what you put into maps that I could find is "Maps content" in the multimap module documentation; the simple version is that you appear to put one domain per line and comment lines are allowed, starting with '#'.

(As far as I could tell from our experience, rspamd noticed the existence of our new surbl-whitelist.inc.local file all on its own, with no restart or reload necessary.)

Everything should be able to ratelimit sources of authentication failures

By: cks

One of the things that I've come to believe in is that everything, basically without exception, should be able to rate-limit authentication failures, at least when you're authenticating people. Things don't have to make this rate-limiting mandatory, but it should be possible. I'm okay with basic per-IP or so rate limiting, although it would be great if systems could do better and be able to limit differently based on different criteria, such as whether the target login exists or not, or is different from the last attempt, or both.

(You can interpret 'sources' broadly here, if you want to; perhaps you should be able to ratelimit authentication by target login, not just by source IP. Or ratelimit authentication attempts to nonexistent logins. Exim has an interesting idea of a ratelimit 'key', which is normally the source IP in string form but which you can make be almost anything, which is quite flexible.)
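
As a sketch of what rate limiting by key can look like, here's a deliberately simple fixed-window failure counter in Go. It's illustrative only (a real implementation wants expiry of old entries and probably a smarter algorithm), and the key can be whatever you want to limit on: a source IP, a target login, 'IP plus login', and so on.

package ratelimit

import (
    "sync"
    "time"
)

type bucket struct {
    start time.Time
    count int
}

// Limiter counts failures per key within a fixed time window.
type Limiter struct {
    mu     sync.Mutex
    window time.Duration
    max    int
    seen   map[string]*bucket
}

func New(max int, window time.Duration) *Limiter {
    return &Limiter{window: window, max: max, seen: make(map[string]*bucket)}
}

// Failure records an authentication failure for key and reports whether
// further attempts under this key should now be rejected outright.
func (l *Limiter) Failure(key string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    now := time.Now()
    b := l.seen[key]
    if b == nil || now.Sub(b.start) > l.window {
        b = &bucket{start: now}
        l.seen[key] = b
    }
    b.count++
    return b.count > l.max
}

On each failed authentication you'd call something like lim.Failure(clientIP + "/" + login) and reject (or heavily delay) further attempts once it returns true.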

I have come to feel that there are two reasons for this. The first reason, the obvious one, is that the Internet is full of brute force bulk attackers and if you don't put in rate-limits, you're donating CPU cycles and RAM to them (even if they have no chance of success and will always fail, for example because you require MFA after basic password authentication succeeds). This is one of the useful things that moving your services to non-standard ports helps with; you're not necessarily any more secure against a dedicated attacker, but you've stopped donating CPU cycles to the attackers that only poke the default port.

The second reason is that there are some number of people out there who will put a user name and a password (or the equivalent in the form of some kind of bearer token) into the configuration of some client program and then forget about it. Some of the programs these people are using will retry failed authentications incessantly, often as fast as you'll allow them. Even if the people check the results of the authentication initially (for example, because they want to get their IMAP mail), they may not keep doing so and so their program may keep trying incessantly even after events like their password changing or their account being closed (something that we've seen fairly vividly with IMAP clients). Without rate-limits, these programs have very little limits on their blind behavior; with rate limits, you can either slow them down (perhaps drastically) or maybe even provoke error messages that get the person's attention.

Unless you like potentially seeing your authentication attempts per second trending up endlessly, you want to have some way to cut these bad sources off, or more exactly make their incessant attempts inexpensive for you. The simple, broad answer is rate limiting.

(Actually getting rate limiting implemented is somewhat tricky, which in my view is one reason it's uncommon (at least as an integrated feature, instead of eg fail2ban). But that's another entry.)

PS: Having rate limits on failed authentications is also reassuring, at least for me.

Providing pseudo-tags in DWiki through a simple hack

By: cks

DWiki is the general filesystem based wiki engine that underlies this blog, and for various reasons having to do with how old it is, it lacks a number of features. One of the features that I've wanted for more than a decade has been some kind of support for attaching tags to entries and then navigating around using them (although doing this well isn't entirely easy). However, it was always a big feature, both in implementing external files of tags and in tagging entries, and so I never did anything about it.

Astute observers of Wandering Thoughts may have noticed that some years ago, it acquired some topic indexes. You might wonder how this was implemented if DWiki still doesn't have tags (and the answer isn't that I manually curate the lists of entries for each topic, because I'm not that energetic). What happened is that when the issue was raised in a comment on an entry, I realized that I sort of already had tags for some topics because of how I formed the 'URL slugs' of entries (which are their file names). When I wrote about some topics, such as Prometheus, ZFS, or Go, I'd almost always put that word in the wikiword that became the entry's file name. This meant that I could implement a low rent version of tags simply by searching the (file) names of entries for words that matched certain patterns. This was made easier because I already had code to obtain the general list of file names of entries since that's used for all sorts of things in a blog (syndication feeds, the front page, and so on).

That this works as well as it does is a result of multiple quirks coming together. DWiki is a wiki so I try to make entry file names be wikiwords, and because I have an alphabetical listing of all entries that I look at regularly, I try to put relevant things in the file name of entries so I can find them again and all of the entries about a given topic sort together. Even in a file based blog engine, people don't necessarily form their file names to put a topic in them; you might make the file name be a slug-ized version of the title, for example.

(The actual implementation allows for both positive and negative exceptions. Not all of my entries about Go have 'Go' as a word, and some entries with 'Go' in their file name aren't about Go the language, eg.)
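
A minimal Go sketch of this 'low rent tags' idea, with made-up entry names: an entry counts as being about a topic if its file name matches a pattern, subject to explicit include and exclude exception lists.

package main

import (
    "fmt"
    "regexp"
)

// topicEntries returns the entry names that count as being about a topic:
// either they match the topic pattern (and aren't explicitly excluded), or
// they're explicitly included despite not matching.
func topicEntries(names []string, topic *regexp.Regexp, include, exclude map[string]bool) []string {
    var out []string
    for _, n := range names {
        if exclude[n] {
            continue // has the word but isn't about the topic
        }
        if include[n] || topic.MatchString(n) {
            out = append(out, n)
        }
    }
    return out
}

func main() {
    entries := []string{"GoNilChannels", "LetsGoShopping", "ZFSSnapshots", "UnderstandingChannels"}
    goTopic := regexp.MustCompile(`Go[A-Z]`) // 'Go' as a wikiword component
    include := map[string]bool{"UnderstandingChannels": true}
    exclude := map[string]bool{"LetsGoShopping": true}
    fmt.Println(topicEntries(entries, goTopic, include, exclude))
    // prints: [GoNilChannels UnderstandingChannels]
}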

Since the implementation is a hack that doesn't sit cleanly within DWiki's general model of the world, it has some unfortunate limitations (so far, although fixing them would require more hacks). One big one is that as far as the rest of DWiki is concerned, these 'topic' indexes are plain pages with opaque text that's materialized through internal DWikiText rendering. As such, they don't (and can't) have Atom syndication feeds, the way proper fully supported tags would (and you can't ask for 'the most recent N Go entries', and so on; basically there are no blog-like features, because they all require directories).

One of the lessons I took from the experience of hacking pseudo-tag support together was that as usual, sometimes the perfect (my image of nice, generalized tags) is the enemy of the good enough. My solution for Prometheus, ZFS, and Go as topics isn't at all general, but it works for these specific needs and it was easy to put together once I had the idea. Another lesson is that sometimes you have more data than you think, and you can do a surprising amount with it once you realize this. I could have implemented these simple tags years before I did, but until the comment gave me the necessary push I just hadn't thought about using the information that was already in entry names (and that I myself used when scanning the list).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats, I can give a better answer about what's missing. All of this applies to node_exporter as of version 1.8.2 (the current version as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in its nfs/parse.go. For the server case, if we cross-compare this with the kernel's include/linux/nfs4.h, what's missing from the server stats is all of the NFS v4.1, v4.2, and RFC 8276 xattr operations: everything from operation 40 through operation 75 (as I write this).
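To get a concrete feel for the gap, here's a small sketch that reads the kernel's server-side counters directly (in Python rather than node_exporter's Go, purely for illustration) and compares the number of per-operation slots the kernel exports on the 'proc4ops' line with the 40 operations the current collector knows about. The script and its names are mine, not part of node_exporter.

    # Count how many NFS v4 server operation counters the kernel exposes,
    # versus the 40 that node_exporter 1.8.2's nfsd collector reports on.
    KNOWN_SERVER_OPS = 40  # hard-coded list in collector/nfsd_linux.go

    def nfsd_v4_op_counters(path="/proc/net/rpc/nfsd"):
        """Return the per-operation counters from the 'proc4ops' line."""
        try:
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if fields and fields[0] == "proc4ops":
                        # fields[1] is the number of counters that follow.
                        return [int(x) for x in fields[2:]]
        except FileNotFoundError:
            pass  # nfsd isn't loaded (or this isn't Linux)
        return []

    counters = nfsd_v4_op_counters()
    print("kernel exports", len(counters), "NFS v4 operation slots")
    print("unreported by the collector:", max(len(counters) - KNOWN_SERVER_OPS, 0))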

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing everything from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4.0, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from the client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.
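As a loose illustration of that design difference (node_exporter itself is Go; this Python sketch is only an analogy, and the class and metric names are invented), a collector that discovers fields through introspection automatically picks up any stats the parser grows, while one with a hard-coded list has to be edited for every new field.

    from dataclasses import dataclass, fields

    # Imagine the parsed client stats growing a new 'copy' field over time.
    @dataclass
    class ClientV4Stats:
        read: int = 0
        write: int = 0
        clone: int = 0
        copy: int = 0   # newly added field

    def reflective_metrics(stats):
        """Emit one metric per field, whatever fields currently exist."""
        return {f"nfs_client_{f.name}_total": getattr(stats, f.name)
                for f in fields(stats)}

    def hardcoded_metrics(stats):
        """Emit only an explicit list of metrics; new fields never show up."""
        known = ["read", "write", "clone"]
        return {f"nfs_client_{name}_total": getattr(stats, name) for name in known}

    s = ClientV4Stats(read=10, write=4, clone=1, copy=7)
    print(reflective_metrics(s))   # includes nfs_client_copy_total
    print(hardcoded_metrics(s))    # does not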

If you want numbers for scale, at the moment node_exporter reports on 59 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 40 out of 76). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, which client operations the server stats are missing.

(I haven't made a bug report about this (yet), and may not do so, because it would require making my GitHub account operable again, something I'm sort of annoyed by. GitHub's choice to require MFA in order to file bug reports is not the incentive they think it is.)
