
The Ascent to Sandia Crest

28 September 2025 at 00:00

The Rotary Club will take immediate action on the Ellis ranch loop project. The Rotarians reached this decision at their weekly luncheon, held yesterday at the Albarado hotel.

The club's plan is not merely to give Albuquerque a good, short road to the Ellis ranch... They embrace the building of a seventy-mile scenic loop. 1

Many Western cities are defined, in part, by their mountains. Those moving from town to town often comment on the disorientation, the disruption, caused by a change in the city's relation to the peaks. If you have ever lived in a place with the mountains on the west, and then a place with the mountains on the east, you will know what I mean. We get used to living in the shadows of mountains.

One of the appeals of mountains, perhaps the source of their mysticism, is inaccessibility. Despite their proximity to Albuquerque, steep slopes and difficult terrain kept the Sandias a world apart from the city, even to this day. Yet we have always been driven to climb, to ascend to the summit.

Humans climb mountains not only as a matter of individual achievement, but also as a matter of infrastructure. Whether the inaccessibility of mountain peaks is a good thing or a bad thing depends on the observer; and even the most challenging mountain ascents are sometimes developed to the scale of an industrial tourism operation.

And somewhere, in between, are the Sandias. Not technically part of the Rocky Mountains but roughly aligned with them, the Sandias lack a clear peak. Instead, the range is an elongated ridge, making up the entire eastern boundary of the city of Albuquerque and extending some distance further north into the Sandia Pueblo. The highest point is at 10,679', relatively modest for the mountain states---but still one of the most prominent in New Mexico, more so even than the higher-elevation Mt. Taylor and Wheeler Peak.

Today, the Sandias are a major site for recreation. Tourists reach the upper parts of the mountain by one of two means: Sandia Crest Scenic Highway, which climbs the gentler eastern slope of the mountain to the crest itself; or the Sandia Peak Aerial Tramway, which makes a daring ascent from the edge of the city up the western side. Either provides access to an extensive network of trails, as do numerous points in the foothills and along the scenic highway.

Like bagging any peak, these access routes were hard-won. Their present contours---engineering achievements, twists and turns, scars on the landscape---reflect over one hundred years of ambition and conflict. Some of Albuquerque's most prominent figures left their marks on the mountain, as did everyday Burqueños, President Richard M. Nixon, and the slow grind of state and federal bureaucracy. The lay of the land today tells a story about the changing relationship of the city with the mountains, and more broadly of the American public with the wilderness.

It also explains why so many older documents and maps refer to the highway to the crest as a "loop," when it does no such thing. Here's the clickbait headline: Where does the Sandia Loop Highway go? The answer will surprise you.


Exploration of the Sandias by Spanish expeditions was motivated in large part by rumors of great mineral wealth. Certainly there are minerals in the mountains, and the Pueblo people had told the Spanish expeditions coming up the valley of gold and other precious metals, both in the mountains and in the plains beyond them. With the full perspective of history, it seems that these reports were less grounded in fact and more in a well-founded desire to get the Spanish to move on elsewhere. Supposedly several mines were established, but they left little mark on the landscape and their locations are now forgotten.

The Sandias, while rich in scenery, were not rich in gold. For centuries, they remained mostly undeveloped, besides Pueblo settlements in areas such as Tijeras canyon which were later abandoned in large part due to the increasing encroachment of settlers. Just a small number of people, mostly hunters, called the mountains their home. A few mining operations made their way into the mountains during the 19th century, perhaps repeating the mistakes of the Spanish several hundred years before.

It was these mining camps that brought the Ellis family to Albuquerque. They had run a ranch in the eastern plains of New Mexico, but were driven into the city by lack of water. Patriarch George Ellis found work for a fruit distributor, delivering produce to the mining camps in the mountains. Perhaps missing their days out in the country, George took his time as he traveled the canyons and ridges between camps. His wanderings were rewarded when he found Las Huertas Canyon, a narrow slot with a creek, protection from the wind, and what George Ellis himself called "extraordinary beauty."

The Ellis family moved into the canyon as early as 1893, but in 1905 George Ellis filed a land patent for what was, as an outcome of the Treaty of Guadalupe Hidalgo, frontier up for the taking. Ellis Ranch, as it was called, was indeed extraordinary: one of the most pleasant places in the mountains. But, then, the Ellis family were rather extraordinary as well. Life on the mountainside at 7,600 feet wasn't easy, but they kept cattle, chickens, and a modest farm.

Charlotte Ellis, George's daughter, had not yet completed a year of education at UNM when they moved out of town and up the slopes on the east side of the ridge. Still, she built a notable career as an amateur botanist. It is because of her efforts that Ellis ranch is often described in terms of its plant life---Charlotte collected specimens of hundreds of plant species, some of them new discoveries, and the ranch is said to have contained examples of every flowering plant known to exist in the mountains.

Plants were not Charlotte's only interest, though. She was apparently a skier, in a time well before skiing was known as a recreational pastime. A photo taken in 1896 shows her in a snow-covered landscape, facing away from the camera. She's looking down the slope, wearing a full dress and a pair of wooden skis. Her pose suggests a certain resolve, the kind you would need to spend the winter high above the nearest towns. It is often captioned as the first photograph of a person skiing in New Mexico. It may be the first photo of a woman skiing at all.

It also foreshadows the tremendous impact that skiing would have on the mountains and on Albuquerque. We don't know exactly what she was thinking, but what we do know about her suggests that she viewed the mountains much as we do today: for their beauty, their diversity, and for their thrill.

George Ellis had perhaps filed his land patent just in time. The Forest Service had existed for some years but was reorganized under its current name, and as part of the Department of Agriculture, in 1905. The forests of the Sandia Mountains have a complicated legal history owing to the interaction of Spanish and Mexican land grants with the Treaty of Guadalupe Hidalgo at the conclusion of the Mexican-American war. You could say that the matter of the forest's legal ownership was not entirely settled until 2004.

None of that stopped President Roosevelt from declaring the Sandias part of the Manzano Forest Reserves in 1906. Shortly after, the Forest Reserves became a National Forest and a ranger station was under construction near Tijeras. Paul Ellis, the son of the family, was soon hired on as one of the national forest's first staff; the Ellis Ranch was one of the few private parcels within the federal boundaries.


The Forest Service has always served multiple purposes: "Land of Many Uses," as Forest Service signs still proclaim. Principal concerns of the Forest Service included lumber, cattle and sheep grazing, mining, and hunting. These uses were already widespread at the dawn of the modern Forest Service, so in practice, much of the role of early Forest Rangers was to bring forest users under control. Extensive regulations and rules were backed by a small set of rangers of the "Stetson era," rugged types not unlike the Ellises.

What we don't always picture as a part of that era was recreation. The "crunchy granola" brand of wilderness recreation most widely promoted today is indeed a more modern phenomenon, but forests have been viewed as recreational assets since the dawn of forest management---the form of that recreation has just changed, largely in response to new concepts of environmental stewardship that emphasize preservation over "improvement."

In 1913, George Ellis passed away. After a small ceremony, he was buried in a cemetery established next to his ranch. With George gone, the rest of the Ellis family began to scatter, moving into towns, closer to work. This left the question of the ranch.

By the time of George's death, Ellis Ranch had already become renowned for hospitality. The newspapers of the day recounted some of Albuquerque's finest taking vacations at the Ellis Ranch, a trip that was something of an adventure in the 1900s. It took a few hours, in fine conditions, to make the trip up a rough trail to the Ellis homestead. One local businessman, known to newspaper readers as "the bee man," took some pride in having gotten a Ford automobile to within a mile of the site. Even so, the journey was well-agreed to be worthwhile. Las Huertas canyon was beautiful, and the Ellises were known to offer fresh produce, spring water, and mountain air---a commodity in increasingly high demand as tuberculosis swept the country.

The reputation of the Ellis ranch for taking visitors explains the Albuquerque Tribune's matter-of-fact statement that the Ellis ranch would likely become a tourist resort, despite the lack of known buyers. At that time, the Forest Service was already in an infrastructure-building mood. La Madera, then one of the larger settlements in the east mountains, had telephone service and the Forest Service was extending the line to the ranch.

Shortly after, the buyers emerged: Raymond Stamm (uncle of the Stamm of Bradbury Stamm Construction) and Jack Sheehan of Albuquerque.

These young men propose to fit up the ranch as a mountain resort, the house being splendidly adapted to a mountain club house, with fine fishing, hunting, and camping facilities all about. 2

One challenge for this ambitious pair was, of course, the road: for the ranch to be the resort they imagined, it needed an easier means of access. It was estimated that this effort would only require a few hundred dollars, reworking about three miles of trail between the ranch and a better road to Tijeras.

Even by 1913, an improved road to Ellis Ranch was a recurring theme in the newspapers. George Ellis, before his death, had estimated $500 to improve the road between the ranch and Placitas to passable condition. After his death, the estimate on this longer segment reached the thousands.

Some of the problem was, of course, hubris: the Albuquerque boosters of the day could look at three miles of faint road climbing up and down canyons and call it a few days of work. When these projects got underway, they always took longer and cost more than expected. To be fair, though, they were also fighting a living foe. Thunderstorm downpours and winter snow repeatedly washed out the road in the steeper sections. Reading of road improvement efforts, you get repeated déjà vu. They are, in fact, announcing plans to fix the same section of road over and over again---it having always washed out the previous year.

The Stamm-Sheehan resort seems to have been popular, even as the road stayed rough. In September of 1914, the faculty of UNM held what we might now call an "offsite" there. Indeed, during 1914 and 1915 it seemed that practically everybody who was anybody was vacationing at Ellis Ranch. All the while, though, they were still struggling to get there... a situation that the Rotarians resolved to correct.


The road to Ellis Ranch has been known by many names and promoted by many people. The Rotary Club was a powerful force in the Albuquerque of 1916, though, and their proposal seems to have set the pattern for those that followed. The Ellis Ranch Loop Road, as they called it, would span 70 miles from Albuquerque to Albuquerque. The route went through the Tijeras pass, turned north to San Antonito 3 and passed Ellis Ranch on the way to La Madera and then Placitas. From there, it went nearly to Bernalillo, meeting the Santa Fe-Albuquerque road that would soon become part of Route 66---until 1937, at least.

A big argument for the loop, besides the fact that it makes the Ellis summer resort more easily accessible, is its scenery.

At the time of the Rotary Club proposal, the region's road map was in some ways familiar, and in some ways rather different. Many of today's routes were already known but their condition was widely variable. The trip from Albuquerque to Tijeras, which would become the ultimate Route 66 alignment, was already in place. It was shorter in 1913, though, not extending much further east, and far from a highway. There was a rough road north to San Antonito, and there had once been a road further north from there to La Madera, but by 1916 it had been closed where it entered a private land grant.

This is probably the biggest difference from our modern road network: the route through Tijeras Canyon was not used for much other than access to Tijeras, and for much of the era covered in this article the preferred way to negotiate the Sandias and Manzanos was to go around them entirely. This is a good part of the reason for Route 66's original north-south alignment through the Rio Grande Valley, one that added significant mileage but avoided the mountain pass that was slow going in good weather and could be outright impassable in winter.

Similarly, from the north, there was an existing road from Bernalillo through Placitas and into the East Mountains. It was, in fact, the main route used by many of the residents there. It was not an easy route, though, long and prone to washouts.

And while these roads came close to forming a loop, they didn't. A section of about four miles was missing, right in the middle. Ellis Ranch was, of course, somewhere in that in-between.

So, the Rotary Club project was mostly the improvement of existing roads, along with the construction of new road in a couple of areas required to complete the loop around the Sandia range. The Rotarians estimated the cost at $5,000, and the Forest Service offered to pitch in.

Now, we must consider that $5,000 was a significant amount of money in 1916, something around $150,000 today. This inflation calculation probably underestimates the ask, because extensive investment in highway construction was not yet the pillar of American culture that it is today. Still, public enthusiasm for the project was high, and by the end of 1916 Bernalillo County had enacted a tax assessment to support the road. The state legislature joined on, and with the Forest Service's share the project was fully funded. Construction seems to have begun in earnest in 1917, on improvements to the road through Tijeras Pass.

Efforts on the loop road coincided, of course, with work to develop Ellis Ranch as a summer resort. Starting in 1917, regular newspaper ads appear for vacation cabin rentals at Ellis Ranch, with instructions to telephone Hugh Cooper. This Hugh Cooper was Hugh Cooper Jr., son of the Hugh Cooper who was pastor of the First Presbyterian Church. Much like George Ellis, Rev. Cooper was a well-known figure around Albuquerque. Given that he officiated the wedding of at least one of Ellis's children, he was presumably friendly with the Ellises as well. The newspapers are not exactly clear on how it came to happen, but Rev. Cooper Sr. would soon become much more involved in the ranch: in 1923, he and his sister went in together to buy it.

"We hope to keep this beauty spot of nature, so near our city, from falling into the hands of sports, who will advertise dance pavilions and entertainments of a sordid nature. If our plans mature we hope to make it a place where parents can go with their children and find recreation of an exalted character." 4

Rev. Cooper's motivations were, he assured the newspaper, godly rather than pecuniary. The phrase "exalted recreation" was singled out for the headline and is certainly one to remember as you explore the Sandias. Cooper was not, though, immune to the sort of boosterism that so animated the Rotary Club.

"We hope, with the help of the good people of Albuquerque, to make the famous Ellis Ranch a second Estes park."


The 1910s and '20s were boom years in Albuquerque, with a confluence of the railroad shops, the sawmill, and tuberculosis bringing new residents at a fast clip. The population more than doubled from 1900 to 1920, and with the newcomers came urban development. Subdivisions went under construction west of New Town, a skyscraper rose, and car ownership proliferated. The same period, for many of the same reasons, saw the genesis of American tourism. Trains brought visitors from out of town, and cars brought visitors and residents alike out into the surroundings. Boosters, from the city council to the chamber of commerce to a half dozen clubs, all had plans to make the best of it.

There was, to be sure, a certain aspect of keeping up with the Joneses. Colorado was building a strong reputation as a tourist destination, and Denver as a beautiful and prosperous city. Arizona and Nevada had not yet gained populous cities, and west Texas was no less sparse than it is now. For the businessmen of Albuquerque, Colorado was the competition... and Colorado was building summer resorts.

The concept of the "summer resort" is mostly forgotten today, but in the 1920s there was little air conditioning and cities were gaining a reputation as crowded and polluted. The spread of tuberculosis had brought particular attention to "clean, fresh air," and nothing was quite as clean and fresh as a summer day in the mountains. Estes Park, a town in the northern Colorado Rockies, had multiple fine hotels by 1910 and attracted the tourism to match.

Further, in 1910, Denver's own boosters (groups like the Real Estate Exchange and the Denver Motor Club) began to acquire land for the Denver Mountain Parks. The beautiful Genesee Park was under construction in 1920, with a lodge and campground. Daniels Park had opened as a scenic drive (or "Auto View"). Plans for Mount Blue Sky Scenic Byway, the highest paved road in North America, had been finalized and announced to great fanfare.

To the modern conservationist it seems odd that these plans were so focused on roads, but the automobile was new and exciting and in the West had almost singlehandedly popularized the idea of "wilderness recreation." The mountains were no longer a destination only for the hardiest adventurers; they were also for city dwellers, out for a drive and a picnic. The ideal of natural preservation was a long, winding road with regular scenic pullouts. If the road could access a summer resort, all the better. Denver was doing it, and so Albuquerque would too.

Behind the scenes, the wheels of government had been turning. The Forest Service approved construction through the mountains in 1919, which would "make accessible some of the most beautiful mountain scenery in the state, in addition to bringing within easy reach sites for summer homes and temporary camping places." At this point in its history, the Forest Service routinely leased space for the construction of private homes, and a small number were established in the national forest as the road progressed inwards.

Bernalillo County had raised $12,500, the Forest Service committed $25,000, and the state was lobbied for another $12,500. A Forest Service engineer, speaking to the County Commission, noted that the road would provide access to 30 million board feet of lumber. The Forest Service said that construction could be finished by the end of the summer, and in September, the section from Placitas to Ellis Ranch opened to drivers. The road from Ellis Ranch to San Antonito had been surveyed, but work had not yet started.


We should consider again how this historic project relates to the modern roads, because there is a surprise. Today, the primary access from Albuquerque to the eastern slopes of the Sandias is via I-40 to Tijeras, and then NM-14 north through Cedar Crest and then to San Antonito. There, NM-536 branches to the west, becoming Sandia Crest Scenic Highway.

At a 90-degree turn by the Balsam Glade picnic area, NM-536 turns into Sandia Crest Scenic Highway, which is actually not a numbered highway at all. From the picnic area, a much smaller, rougher dirt road descends to the north: NM-165. Today, NM-165 is a rugged and adventurous route to Placitas, impassable in winter and unfriendly to passenger vehicles.

And yet, in the summer of 1919, it was this route that was made a "first class road" to Ellis Ranch. The former site of Ellis Ranch now has a surprisingly well-paved parking lot off of the dirt NM-165, signed as "Las Huertas Canyon," and a trail continues up the creek through much of what was the summer resort. It took about an hour and a half, the Journal reported, to drive from Albuquerque to Ellis Ranch by this route around the north side. The travel time probably isn't all that much faster today, given the slow going on the dirt road.

In 1920, Bernalillo County ran into some trouble on the project funding. The county's plan to fund much of the project through bonds on the tax assessment was, it turns out, in violation of state law. For accounting reasons, the total sum of the bonds would have to be much lower. The county was effectively out of the project, at least as far as money was concerned.

The Forest Service worked according to its own plans. Construction started on a forest road, what is now NM-536, that would later be home to Tinkertown. In summer of 1922, it reached Sulphur Spring, the first of many recreation sites on its long trip up. Progress was brisk: the newspaper reported that 2.5 miles had been completed, then a few more, then eight. The forest road seems to have gone by different names at different times, and perhaps just depending on who was talking, but as 1922 came to a close the Forest Service named it for the canyon that it followed from San Antonito: Tejano Road.

Tejano Road ended about where the bottom of the ski area is found today, some four miles short of Ellis Ranch. Much of the Ellis Ranch Loop was complete, but it wasn't a loop. Plans had been completed for the final stretch, but it was unclear who would pay for the work---and when. The Journal printed appeals to finish the project, several times mentioning the importance of a "sky line road."

The sky line road was something of a fad in those days. California and Oregon both had them, and plans were underway in many other parts of the west. A skyline road was what we might now call a scenic byway, but with a particular emphasis on views. Many followed a ridge, right down the center, so that drivers could get a view down both sides. Extensive earth-moving and forestry were sometimes involved, modifying the environment to create these views.

The West had a lead on the idea: it wasn't until 1925 that planning started on the most famous, Skyline Drive of Shenandoah National Park. The 100-mile highway was viewed as a wilderness adventure, but was about as far from our modern conception of wilderness as possible. Extensive clearing, of both trees and residents, was required to manicure its vistas. It was seen as one of the finest examples of American recreation, a model to be followed.


In July of 1923, the Ellis Ranch Loop was considered complete except for the four-mile segment between Tejano Road and Ellis Ranch Road and, oddly enough, the reworking of the road from Placitas to Ellis Ranch---the very one that had been completed in 1922. Road construction in the mountains was a fight against both funding and nature. The road seems to have been minimally improved, to keep the fast schedule and tight budget. During winter, snowfall and then snowmelt would wash parts of the road out. Throughout the 1920s and 1930s, the road was constantly being repaired and reworked. This probably explains why NM-165, once the principal route up the Sandias, is now so minimally maintained: the state and the Forest Service gave up.

During 1924, the Forest Service closed two miles of the gap and had the other two underway. In a turn of luck for the Albuquerque boosters, the Forest Service also offered to pick up work on the road to Placitas. The entire Ellis Ranch Loop project had become a Forest Service initiative, and they did what they set out to do.

On September 24th of 1924, an Albuquerque man named Arthur Ford set out through Tijeras Canyon with a friend, Mrs. Karskadon. They left Albuquerque at 10 am, finding that the new section connecting Ellis Ranch was not, officially, open; Ford was little discouraged and simply moved the barriers aside. At 11:45 they reached Ellis Ranch. After lunch, they continued northwards, to Placitas, through Bernalillo, and back into town. Ford's odometer measured the trip at 68 miles. The Ellis Ranch Loop was complete.

Imagine a circle approximately 20 miles in diameter.

Then imagine that this circle encloses some of the most beautiful mountain, valley, and mesa scenery in the world.

The city has always existed in tension with the mountains. The Ellis Family, who most famously opened the mountain to visitors, also oversaw a sort of closure to nature. George Ellis was the last person to see a Grizzly Bear in the Sandias. He came across the bear in 1905; he shot it.

From 1916 to 1924, Albuquerque business leaders admired the mountains for their beauty but lamented their inaccessibility. The Sandias could attract visitors from around the nation, they argued, if only you could get to them. Charlotte Ellis would hoof it on foot and on skis, but, then, she was Charlotte.

Then imagine that this circle is bounded by a highway that is traversible every day in the year.

Imagine, within the 70-mile circumference of this circle, near its eastern rim, a cluster of summer houses, supplied with water, light, and other necessaries. Imagine, at various spots within the circle, fine picnic and camping grounds, where people from the hot city may go for a day's outing.

We have always been driven to climb. As soon as it was possible to drive the full loop, to reach Ellis Ranch on an easy road, to take in the "Million Dollar Playground" of the Sandias, the Kiwanis Club formed a committee of business leaders and Forest Service representatives to consider the future of Sandia development.

It was "not only practicable, but highly necessary," they said, to take the next logical step. Ellis Ranch, at 7,500 feet, was only halfway from the city to the crest. A road all the way to the top---from Ellis Ranch to the crest---would complete their vision of the Sandias. "An effort will be made to have work begun as soon as possible."

You may think you are dreaming. And perhaps you are. But some day you will wake up and find the dream come true. 5

The Forest Service cleared six miles of steep, tight switchbacks, from the Balsam Glade area just above Ellis Ranch to the crest itself, over 10,500 feet. The New Mexico Highway Department, bolstered by the completion of Route 66, laid a gravel roadbed. Automobiles had become more common, driving a more popular pastime. It didn't take the adventurous Arthur Ford and Mrs. Karskadon to inaugurate the crest spur. On October 10th, 1927, Highway Department officials at the crest counted 110 cars.

Albuquerque had summited the mountain---but the greatest controversy was still to come.

  1. Albuquerque Journal, 1916-08-04 p 6↩

  2. Albuquerque Tribune, 1916-03-23 p 3↩

  3. The names of the towns at this location are a bit confusing historically, so I am simply using San Antonito to refer to the spot that is currently occupied by Sandia Park and San Antonito. Sandia Park appears to be a renaming of a town that was originally called San Antonio (i.e. the big one), likely due to the presence of another San Antonio, New Mexico in Socorro County.↩

  4. Albuquerque Journal, 1923-08-11 p 10↩

  5. Albuquerque Journal, 1923-07-22 p 3↩

T-carrier

20 September 2025 at 00:00

Few aspects of commercial telecommunications have quite the allure of the T-carrier. Well, to me, at least, but then I have very specific interests.

T-carrier has this odd, enduring influence on discussion of internet connections. I remember that for years, some video game installers (perhaps those using Gamespy?) used to ask what kind of internet service you had, with T1 as the "highest" option. The Steam Hardware Survey included T1 among the options for a long time. This was odd, in a lot of ways. It set T1 as sort of the "gold standard" in the minds of gamers, but residential internet service over T1 would have been very rare. Besides, even by the standards of the 2000s T1 service was actually pretty slow.

Still, T1 enjoyed a healthy life as an important "common denominator" in internet connectivity. As a regulated telephone service, it was expensive, but available pretty much anywhere. It also provided a very high standard for reliability and latency, beyond many of the last-mile media we use today.

Telephone Carriers

We think of telephone calls as being carried over a pair of wires. In the early days of the telephone system, it wasn't all that far off to imagine a phone call as a single long circuit of two wires that extended all the way from your telephone to the phone you had called. This was the most naive and straightforward version of circuit switching: connections were established by creating a circuit.

The era of this extremely literal form of circuit switching did not last as long as you might think. First, we have to remember that two-wire telephone circuits don't actually work that well. Low install cost and convenience mean that they are the norm between a telephone exchange and its local callers, but for long-distance carriage over the phone network, you get far better results by splitting the "talk" and "listen" sides into two separate pairs. This is called a four-wire telephone circuit, and while you will rarely see four-wire service at a customer's premises, almost all connectivity between telephone exchanges (and even in the internals of the telephone exchange itself) has been four-wire since the dawn of long-distance service.

Four-wire circuits only exacerbated an obvious scaling problem: in the long distance network, you have connections called toll leads between exchanges. In a very simple case, two towns might have a toll lead between them. For simple four-wire telephone lines, that toll lead needs four wires for each channel. If it has four wires, only one phone call can take place between the towns at a time. If it has eight wires, two telephone calls. This got very expensive very fast, considering that even heavily-built four-crossarm open wire routes might only have a capacity for eight simultaneous calls.

For obvious reasons, research in the telephone industry focused heavily on ways to combine more calls onto fewer wires. Some simple electrical techniques could be used, like phantoms that combined two underlying pairs into a single additional "phantom" pair for a 50% increase in capacity. You could extend this technique to create more channels, with a noticeable loss in quality.

By the 1920s, the Bell System relied on a technique that we would later call frequency division multiplexing (FDM). By modulating a phone call onto a higher-frequency carrier, you can put it over the same wires as other phone calls modulated onto different frequencies. The devices that actually did this combined multiple channels onto a single line, so they were known as channel banks. The actual formats they used over the wire, since they originally consisted of modulation onto a carrier, came to be known themselves as carriers. AT&T identified the carriers they developed with simple sequential letters. In the 1940s, the state of the art was up to J- and K-carrier, which allowed 16 channels on a four-wire circuit (over open wire and multipair cable, respectively). A four-crossarm open-wire circuit, with sixteen pairs, could support 256 unidirectional circuits for 128 channels---or simultaneous phone calls.

FDM carriers reached their apex with the coaxial-cable based L-carrier and microwave radio TH and TD carriers 1, which combined channels into groups, groups into supergroups, and supergroups into mastergroups for total capacities that reached into thousands of channels. Such huge FDM groups required very large bandwidths, though, which could not be achieved over copper pairs.

In the 1950s, rapidly developing digital electronics led to the development of digital carriers. These carriers relied on a technique called Pulse-Code Modulation or PCM. PCM has sort of an odd name due to the history; it dates so far back into the history of digital communications that it was named before the terminology was well-established. "Pulse-code modulation" really just means "quantizing an analog signal to numbers and sending the numbers," which is now intuitive and obvious, but was an exciting new development of the 1940s.

PCM had a lot of potential for carrying phone calls, because the relaxed needs of voice transmission meant that calls could be digitized to fairly low-rate streams (8kHz by 8 bits) and engineers could see that there was a huge variety of possible techniques for combining and transporting digital signals. The best thing, though, was that digital signals could reliably be transmitted as exact copies rather than analog recreations. That meant that PCM telephone calls could pass through a complex network, with many mux/demux steps, without the reduction in quality that analog FDM carriers caused.
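To make that concrete, here's a minimal sketch of the arithmetic, not anything resembling Bell System hardware: quantize an audio signal 8,000 times a second to 8 bits each, and you get the 64kbps stream the rest of this article treats as a fundamental unit. The companding curve shown is a simplified μ-law, the scheme the North American network settled on; the real G.711 encoding fiddles with the bits more than this.

```python
import math

SAMPLE_RATE = 8000   # samples per second
MU = 255             # mu-law companding constant

def mu_law_encode(x):
    """Quantize one sample in [-1.0, 1.0] to an 8-bit code.

    Companding compresses the dynamic range so that quiet speech
    survives 8-bit quantization better than linear coding would."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Map [-1.0, 1.0] onto the 256 available codes (0..255).
    return int(round((compressed + 1.0) / 2.0 * 255))

# One second of a 1 kHz test tone becomes exactly 8,000 one-byte samples:
samples = [mu_law_encode(math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE))
           for n in range(SAMPLE_RATE)]

print(len(samples) * 8)  # 64000 bits per second: one digital voice channel
```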

Even better, analog channel banks were large systems with a lot of sensitive analog components. They were expensive to build, required controlled environments, and were still subject to drift that required regular maintenance by technicians. Digital technology involved far fewer sensitive analog components, promising cheaper equipment with less maintenance. Digital was clearly the future.

The question, though, was how to best carry these digital signals over the wiring that made up much of the telephone system: copper pairs. In the late 1950s, Bell Laboratories developed T-carrier as the answer.

T-Carrier

T-carrier is a specification for transporting a stream of bits over copper pairs. The plural here is important: because T-carrier supported multiple channels, it was developed as a trunk technology, used for communication between telephone exchanges rather than between a telephone exchange and a customer. So, like other carriers used for trunks, T-carrier was four-wire, requiring two pairs to carry bidirectional signals.

T-carrier operates at 1.544Mbps, and that's about all you can say about it. The logical protocol used over T-carrier, the actual application of those bits, is determined by a separate protocol called the digital signal or DS. You can roughly think of this as a layer model, with DS running on top of T.

Here, we need to address the elephant in the room: the numbers. The numbers follow a convention used throughout AT&T-developed digital standards that is most clearly illustrated with DS. A DS0, by analogy to DS raised to the zeroeth power, is one single telephone channel expressed as a PCM digital signal. Since a telephone call is conveyed as 8-bit samples at 8kHz, a DS0 is 64kbps of data.

A DS1 is a combination of 24 DS0s, for 1.544Mbps.

A DS2 is a combination of 4 DS1s, for 96 channels or 6.312Mbps.

A DS3 is a combination of 7 DS2s, for 672 channels or 44.736Mbps.

Each level of the DS hierarchy is a TDM combination of several instances of the level below. The numbers are kind of odd, though, right? 24, 4, 7, it has the upsetting feeling of a gallon being four quarts each of which is two pints.

The DS system was developed in close parallel with the carriers actually used to convey the signal, so the short explanation for this odd scheme is that a DS1 is the number of DS0s that fit onto a T1 line, and a DS2 is the number of DS1s that fit onto a T2 line. The numbers are thus parallel: DS1 over T1, DS2 over T2, DS3 over T3. The distinction between T and DS is thus not always that important, and the terms do get used interchangeably.
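The hierarchy is easier to see as plain arithmetic. Here's a sketch of my own, not anything out of a standards document; the overhead figures are simply whatever is needed to land on the published line rates, which is more or less how the standards actually came together.

```python
# Nominal rates of the North American digital signal hierarchy, in bits/s.
DS0 = 64_000                   # one PCM voice channel: 8 bits x 8 kHz

DS1 = 24 * DS0 + 8_000         # 24 channels + 8 kbps framing   = 1.544 Mbps
DS2 = 4 * DS1 + 136_000        # 4 DS1s + stuffing and framing  = 6.312 Mbps
DS3 = 7 * DS2 + 552_000        # 7 DS2s + stuffing and framing  = 44.736 Mbps

for name, rate in [("DS0", DS0), ("DS1", DS1), ("DS2", DS2), ("DS3", DS3)]:
    print(f"{name}: {rate / 1e6:.3f} Mbps")
```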

But still, why 24?

Well, it's said that the number of channels on a T1 was just determined empirically. The 64kbps rate of a DS0 was fixed by the 8b x 8kHz digital format. A test T1 installation was built using a typical in-place copper telephone cable, and the number of channels was increased until it no longer functioned reliably. 24 channels was the magic number, the most achieved without unacceptable errors.

T-Carrier Infrastructure

T1 was designed for operation over a "typical" telephone cable trunk. In the 1950s, this meant a twisted-pair telephone cable installed in 6,600 foot sections with a loading coil at the junction of each section. A loading coil was essentially a big inductor hooked up to a telephone line at regular intervals to compensate for the capacitance of the line---long telephone cables, even four-wire, needed loading coils at regular intervals or the higher frequencies of speech would be lost. Loading coils also had disadvantages, though, in that they imposed a pretty sharp maximum frequency cutoff on the line. High-speed digital signaling needed to operate at those high frequencies, so T1 was designed to fit into existing long cables by replacing the loading coils with repeaters.

That means that T1 required a repeater every 6,600 feet. These repeaters were fairly small devices, often enclosed in pressurized cans to keep water out. 6,600 feet might sound pretty frequent, but because of the loading coil (and splice box) requirements trunk lines usually had underground vaults or equipment cabinets at that interval anyway.

Over time, the 6,600 foot interval became increasingly inconvenient. This was especially true as end-users started to order T1 service, requiring that T1 be supported on local loops that were often longer than 6,600 feet. Rather than installing new repeaters out in the field, it became a widespread practice to deliver T1 over a version of DSL called HDSL. HDSL is older and slower than the newer ADSL and VDSL protocols, and requires four wires, but it was fast enough to carry a DS1 signal and could cover a much longer span than traditional T-carrier. HDSL used the voice frequency band and thus could not coexist with voice calls the way ADSL or VDSL can, but this had the upside that it "fully controlled" the telephone line and could use measures like continuous monitoring (using a mark signal when there was no traffic) to maintain high reliability.

For the era of internet-over-T1, then, it was far more likely that a given customer actually had an HDSL connection that was converted to T1 at the customer premises by a device called a "smart jack." This pattern of the telco providing a regulated T1 service over a different medium of their choice, and converting it at the customer premises, is still common today. T1s ordered later on may have actually been delivered via fiber with a similar on-premises media converter, depending on what was most convenient to the telco.

T1 is typically carried over telephone cable with 8P8C modular connectors, much like Ethernet. It requires two pairs, much like the two-line telephone wiring commonly installed in buildings. However, like most digital carriers, T-carrier is more particular about the wiring than conventional telephone service. T1 wiring must be twisted-pair, and it is polarity sensitive.

DS1 Protocol

The DS1 protocol defines the carriage of 24 64kbps channels over a T1 interface. This basically amounts to TDM-muxing the 24 channels by looping over them sending one byte at a time, but there are a surprising number of nuances and details.
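As a rough sketch of that interleaving (hypothetical code, glossing over the signaling and framing-pattern details the next paragraphs get into): each frame takes one 8-bit sample from each of the 24 channels plus a single framing bit, for 193 bits per frame at 8,000 frames per second.

```python
CHANNELS = 24
FRAMES_PER_SECOND = 8000

def build_frame(samples, framing_bit):
    """Interleave one 8-bit sample per channel into a single DS1 frame.

    samples: a list of 24 integers in 0..255, one per channel.
    Returns 193 bits: one framing bit plus 24 x 8 channel bits."""
    assert len(samples) == CHANNELS
    bits = [framing_bit]
    for sample in samples:                       # round-robin over the channels
        bits.extend((sample >> i) & 1 for i in range(7, -1, -1))
    return bits

frame = build_frame(list(range(CHANNELS)), framing_bit=1)
print(len(frame))                                # 193 bits per frame
print(len(frame) * FRAMES_PER_SECOND)            # 1,544,000 bits/s: the T1 rate
```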

Early versions of DS1 only actually carried 7 bits of voice in each sample, which was sufficient for a telephone call when companding was used to recover some of the dynamic range; the eighth bit carried per-channel signaling, and a single extra bit added to each frame provided framing. T-carrier is a completely synchronous system, requiring that all of the equipment on the line have perfectly synchronized "frame clocks" to understand what bits belong to which logical channels. The framing bit provided a synchronization pattern to achieve this perfect coordination. Later improvements in the framing and signaling protocols allowed for the use of all eight bits in some or even all of the samples. This gets to be a complicated and tangled story with many caveats, so I am going to leave it out here or this article would get a lot longer and probably contain a bunch more mistakes.

The various combinations of technologies and conventions used at different points get confusing. If you are curious, look into "robbed bit signaling," an artifact of changes in where framing and control signals in T1 were placed that was, for some reason, a pet topic of one of my college professors. I think we spent more time on robbed-bit signaling than we did on all of MPLS, which is way cooler. Anyway, the point of this is to understand the protocol overhead involved: T1 operates at 1.544Mbps, but the framing bit added to each 193-bit frame accounts for 8kbps of overhead, leaving 1.536Mbps of actual payload. The payload may be further reduced by other framing/signaling overhead, depending on exactly how the channel bank is configured. Most of these issues are specific to carrying telephone calls (and their related signaling); "internet service" T1 lines typically used a maximally-efficient configuration.

The Internet

So far we have pretty much only talked about telephone calls, because that's what T-carrier was developed for. By the 1980s, though, the computer industry was producing a lot of new applications for high-speed digital connections. T1 was widely available, and in many cases a tariffed/regulated service, so it was one of the most widely available high-speed data connections. Especially in the very early days of the internet, it was often the only option.

Into the 1990s, T1 was one of the dominant options for commercial internet access. It was rarely seen in homes, though, as it was quite expensive. Keep in mind that, from the perspective of the phone company, a T1 line was basically 24 phone lines. They charged accordingly.

To obtain internet service, you would order it either from the telco itself or from an independent provider that then ordered a connection from your premises to their point of presence on an open access basis. In this case you were effectively paying two bills, one to the telco for the T1 line and the other to the independent provider for internet connectivity... but the total was still often more affordable than the telco's high rates for internet services.

Because of the popularity of T1 for internet access, routers with T1 interfaces were readily available. Well, really, the wide variety of media used for data connections before Ethernet became such a common standard means that many routers of the era took interchangeable modules, and T1 was one of the modules you could get for them.

In implementation, a T1 line was basically a fast serial line from your equipment to your ISP's equipment. What actually ran over that serial line was up to the ISP, and there were multiple options. The most classical would be frame relay, an X.25-derived protocol mostly associated with ISDN. PPP was also a fairly common option, as with consumer DSL, and more exotic protocols existed for specialized purposes or ISPs that were just a little weird.

When the internet was new, 1.5Mbps T1 was really very fast---the NSFNET backbone was lauded for its speed when it went online as an all-T1 network in 1991. Of course, today, a 1.5Mbps "backbone" internet connection is pretty laughable. Even as the '90s progressed, 1.5Mbps started to feel tight.

One of the things I find odd about the role of T1 in the history of internet access is that the era when a T1 was "blazing fast" was really pretty short. By 2000, when online gaming for example was taking off, both DSL and cable offered significantly better downstream speeds than T1. However, the nature of T1 as a circuit-switched, telephone-engineered TDM protocol made it very reliable and low-latency, properties that most faster internet media performed poorly on (early DSL notoriously so). Multiplayer gaming would likely have been a better experience on T1 than on a DSL connection offering multiples of the theoretical bandwidth.

A faster T1

Of course, there were options to speed up T-carrier. The most obvious is to combine multiple T1 connections, which was indeed a common practice. Later T1 interfaces were often supplied in multiples for that reason. MLPPP is a variant of the PPP protocol intended for combining the bandwidth of multiple links, referred to in the telephone industry as bonding.

But there were also higher levels in the hierarchy. Remember DS2 and DS3? Well, in practice, T2 wasn't really used. It was far more common to bond multiple T1 connections to reach the equivalent speed of a T2. 44.736Mbps T3 did find use, though. The catch is that T3 required specialized cabling (coaxial pairs!) and had a fairly short range, so it was usually not practical between a telephone exchange and a business.

Fortunately, by the time these larger bandwidths were in demand, fiber optic technology had become well-established. The telephone industry primarily used SONET, a synchronous optical transport standard closely related to the international Synchronous Digital Hierarchy (SDH). SONET comes in formats identified by OC (Optical Carrier) numbers in a way very similar to T-carrier numbers. An OC-1 is 51.840Mbps, already faster than T3/DS3. So, in practice, DS3 service was pretty much always provided by a media converter from an OC-1 SONET ring. As bandwidth demands further increased, businesses were much more likely to directly order SONET service rather than T-carrier. SONET was available into the multiple Gbps and enjoyed a long life as a popular internet carrier.
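For a sense of scale, OC rates scale linearly with the OC number (each OC-1 carries its own overhead, so there is no extra stuffing between levels the way there is in the DS hierarchy). A quick back-of-the-envelope comparison:

```python
OC1 = 51.840  # Mbps, the base SONET rate; already faster than a 44.736 Mbps DS3

# Commonly sold SONET rates, each a simple multiple of OC-1.
for n in (1, 3, 12, 48, 192):
    print(f"OC-{n}: {n * OC1:8.2f} Mbps")
# OC-1: 51.84, OC-3: 155.52, OC-12: 622.08, OC-48: 2488.32, OC-192: 9953.28
```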

Of course, as the internet proliferated, so too did the stack of network media designed specifically for opportunistic packet-switching computer networks. Chief among them was Ethernet. These protocols have now overtaken traditional telephony protocols in most internet applications, so SONET as well is now mostly out of the picture. On the upside, Ethernet networks are generally more cost-effective on a bandwidth basis and allow you to obtain much faster service than you would be able to afford over the telephone network. On the downside, Ethernet networks have no circuit switching or traffic engineering, and so they do not provide the quality of service that the telephone network did. This means more jitter and less predictability of the available bandwidth, an enduring challenge for real-time media applications.

Miscellaneous

One of the confusing things about T1 to modern readers is its interaction with ISDN. ISDN was a later development that introduced a lot more standardization to digital communications over the telephone network, but it incorporated and reused existing digital technology including T-carrier. In the world of ISDN, the consumer-scale offering of two 64kbps bearer channels plus a 16kbps control channel is called a basic rate interface (BRI), while the 24-channel T1 bandwidth is called a primary rate interface (PRI). Many T1 connections in the 1990s were actually ISDN PRIs.

The difference between the two is in the fine details: many of the details related to framing and control signals that were, shall we say, "loosey-goosey" with T-carrier are completely standardized under ISDN. An ISDN PRI always consists of 23 bearer channels ("payload" channels) and one control channel, and the framing is always the same. Since there is a dedicated control channel, there's no need to do weird things with the bits in the bearer channels, and so the overhead is standard across all ISDN PRIs.

In practice, the difference between T1 and ISDN PRI was usually just a matter of configuring your router's interface appropriately. Because of the curious details of the regulation and tariff processes, one was sometimes cheaper than the other, and in general the choice of whether to use a "T1" or a "PRI" was often somewhat arbitrary. It's even possible to use some T1 channels in the traditional way and others as ISDN.

While T1 is now mostly forgotten, some parts of its design live on. DSL and T1 have always had a relationship, DSL having originally been developed as basically a "better T-carrier" for ISDN use. In the modern world, DSL pretty much always refers to either ADSL or VDSL, newer protocols designed for consumer service that can coexist with a voice line and provide very high speeds. Many aspects of how DSL works have their roots in T-carrier, including the common use of protocols like ATM (now rare) or PPP (fading out) to encapsulate IP over DSL.

Okay, I realize this article has been sort of dry and factual, but I thought it'd be interesting to share a bit about T-carrier. I think it's something that people my age vaguely remember as important but have never thought about that much. I, personally, am probably just a bit too old to have had much real interaction with T-carrier. When I worked for an MSP for a bit in high school I saw a few customers that still had HDSL-based T1 service, and when I later interned at General Electric we had an OC-3 SONET connection that was in the process of being turned down. Just really catching the tail end... and yet for years later the Steam Hardware Survey was still asking if I had a T1.

Why did T1 get stuck so long in the very specific context of video games? I assume because video game developers frequently had T-carrier connections to their offices and knew that its guaranteed bandwidth provided very good performance for video games. The poor latency of ADSL meant that, despite a theoretical bandwidth several times larger, it was not really a better choice for the specific application of multiplayer games. So the "T1 is god tier" thing hung around for longer than you would have otherwise expected.

  1. I actually don't know why the microwave carriers have multi-letter names starting in T. Something to look into. This convention is older than T-carrier and presumably started with TDX, the experimental microwave carrier that went into use in the late 1940s. I think the naming convention for carriers changed around the mid-century, as T1 is often said to stand for "transmission system one" which is consistent with later AT&T naming conventions but inconsistent with A-carrier through L-carrier, where the letters didn't stand for anything in particular. On the other hand, it is entirely possible that "T" was just the next letter in the sequence, and it standing for "transmission" was a later invention. You will also see people assert that the "T" stands for "trunk," perhaps evidence that the meaning is made up.↩

the video lunchbox

13 September 2025 at 00:00

An opening note: would you believe that I have been at this for five years, now? If I planned ahead better, I would have done this on the five-year anniversary, but I missed it. Computers Are Bad is now five years and four months old.

When I originally launched CAB, it was my second attempt at keeping up a blog. The first, which I had called 12 Bit Word, went nowhere and I stopped keeping it up. One of the reasons, I figured, is that I had put too much effort into it. CAB was a very low-effort affair, which was perhaps best exemplified by the website itself. It was monospace and 80 characters wide, a decision that I found funny (in a shitposty way) and that generated constant complaints. To be fair, if you didn't like the font, it was "user error:" I only ever specified "monospace" and I can't be blamed that certain platforms default to Courier. But there were problems beyond the appearance; the tool that generated the website was extremely rough and made new features frustrating to implement.

Over the years, I have not invested much (or really any) effort in promoting CAB or even making it presentable. I figured my readership, interested in vintage computing, would probably put up with it anyway. That is at least partially true, and I am not going to put any more effort into promotion, but some things have changed. Over time I have broadened my topics quite a bit, and I now regularly write about things that I would have dropped as "off topic" three or four years ago. Similarly, my readership has broadened, and probably to a set of people that find 80 characters of monospace text less charming.

I think I've also changed my mind in some ways about what is "special" about CAB. One of the things that I really value about it, that I don't think comes across to readers well, is the extent to which it is what I call artisanal internet. It's like something you'd get at the farmer's market. What I mean by this is that CAB is a website generated by a static site generator that I wrote, and a newsletter sent by a mailing list system that I wrote, and you access them by connecting directly to a VM that I administer, on a VM cluster that I administer, on hardware that I own, in a rack that I lease in a data center in downtown Albuquerque, New Mexico. This is a very old-fashioned way of doing things, now, and one of the ironies is that it is a very expensive way of doing things. It would be radically cheaper and easier to use wordpress.com, and it would probably go down less often and definitely go down for reasons that are my fault less often. But I figure people listen to me in part because I don't use wordpress.com, because I have weird and often impractical opinions about how to best contribute to internet culture.

I spent a week on a cruise ship just recently, and took advantage of the great deal of time I had to look at the sea to also get some work done. Strategically, I decided, I want to keep the things that are important to me (doing everything myself) and move on from the things that are not so important (the website looking, objectively, bad). So this is all a long-winded announcement that I am launching, with this post, a complete rewrite of the site generator and partial rewrite of the mailing list manager.

This comes with several benefits to you. First, computer.rip is now much more readable and, arguably, better looking. Second, it should be generally less buggy (although to be fair I had eliminated most of the problems with the old generator through sheer brute force over the years). Perhaps most importantly, the emails sent to the mailing list are no longer the unrendered Markdown files.

I originally didn't use markup of any kind, so it was natural to just email out the plaintext files. But then I wanted links, and then I wanted pictures, leading me to implement Markdown rendering for the webpages... but I just kept emailing out the plaintext files. I strongly considered switching to HTML emails as a solution and mostly finished the effort, but in the end I didn't like it. HTML email is a massive pain in the ass and, I think, distasteful. Instead, I modified a Markdown renderer to create human-readable plaintext output. Things like links and images will still be a little weird in the plaintext emails, but vastly better than they were before.
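For the curious, the idea is roughly this, though this toy regex version is a stand-in rather than the renderer I actually use: links and images get flattened into readable text with the URL alongside, which is why they come out "a little weird" rather than broken.

```python
import re

def markdown_to_plaintext(md):
    """Flatten a couple of Markdown constructs into readable plain text.

    Images become a bracketed note; links keep their text with the URL
    in parentheses. Everything else passes through untouched."""
    md = re.sub(r'!\[([^\]]*)\]\(([^)]+)\)', r'[image: \1 (\2)]', md)
    md = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1 (\2)', md)
    return md

print(markdown_to_plaintext("See [the archive](https://computer.rip) for more."))
# See the archive (https://computer.rip) for more.
```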

I expect some problems to surface when I put this all live. It is quite possible that RSS readers will consider the most recent ten posts to all be new again due to a change in how the article IDs are generated. I tried to avoid that happening but, look, I'm only going to put so much time into testing and I've found RSS readers to be surprisingly inconsistent. If anything else goes weird, please let me know.


There has long been a certain connection between the computer industry and the art of animation. The computer, with a frame-oriented raster video output, is intrinsically an animation machine. Animation itself is an exacting, time-consuming process that has always relied on technology to expand the frontier of the possible. Walt Disney, before he was a business magnate, was a technical innovator in animation. He made great advances in cel animation techniques during the 1930s, propelling the Disney Company to fame not only by artistic achievement but also by reducing the cost and time involved in creating feature-length animated films.

Most readers will be familiar with the case of Pixar, a technical division of Lucasfilm that operated primarily as a computer company before its 1986 spinoff under computer executive Steve Jobs---who led the company through a series of creative successes that overshadowed the company's technical work until it was known to most only as a film studio.

Animation is hard. There are several techniques, but most ultimately come down to an animator using experience, judgment, and trial and error to get a series of individually composed frames to combine into fluid motion. Disney worked primarily in cel animation: each element of each frame was hand-drawn, but on independent transparent sheets. Each frame was created by overlaying the sheets like layers in a modern image editor. The use of separate cels made composition and corrections easier, by allowing the animator to move and redraw single elements of the final image, but it still took a great deal of experience to produce a reasonable result.

The biggest challenge was in anticipating how motion would appear. From the era of Disney's first work, problems like registration (consistent positioning of non-moving objects) had been greatly simplified by the use of clear cels and alignment pegs on the animator's desk that held cels in exact registration for tracing. But some things in an animation are supposed to move; I would say that's what makes it animation. There was no simple jig for ensuring that motion would come out smoothly, especially for complex movements like a walking or gesturing character. The animator could flip two cels back and forth, but that was about as good as they could get without committing the animation to film.

For much of the mid-century, a typical animation workflow looked like this: a key animator would draw out the key frames in final or near-final quality, establishing the most important moments in the animation, the positions and poses of the characters. The key animator or an assistant would then complete a series of rough pencil sketches for the frames that would need to go in between. These sketches were sent to the photography department for a "pencil test."

In the photography department, a rostrum camera was used: a cinema camera, often 16mm, permanently mounted on an adjustable stand that pointed it down at a flat desk. The rostrum camera looked a bit like a photographic enlarger and worked much the same way, but backwards: the photographer laid out the cels or sketches on the desk, adjusted the position and focus of the camera for the desired framing, and then exposed one frame. This process was repeated, over and over and over, a simple economy that explains the common use of a low 12 FPS frame rate in animation.

Once the pencil test had been photographed, the film went to the lab where it was developed, and then returned to the animation studio where the production team could watch it played on a cinema projector in a viewing room. Ideally, any problems would be identified during this first viewing before the key frames and pencil sketches were sent to the small army of assistant animators. These workers would refine the cels and redraw the pencil sketches in part by tracing, creating the "in between" frames of the final animation. Any needed changes were costly, even when caught at the earliest stage, as it usually took a full day for the photography department to return a new pencil test (making the pencil test very much analogous to the dailies used in film). What separated the most skilled animators from amateurs, then, was often their ability to visualize the movement of their individual frames by imagination. They wanted to get it right the first time.


Graphics posed a challenge to computers for similar reasons. Even a very basic drawing involves a huge number of line segments, which a computer will need to process individually during rendering. Add properties such as color, consider the practicalities of rasterizing, and then make it all move: just the number of simple arithmetic problems involved in computer graphics becomes enormous. It is not a coincidence that we picture all early computer systems as text-only, although it is a bit unfair. Graphical output is older than many realize, originating with vector-mode CRT displays in the 1950s. Still, early computer graphics were very slow. Vector-mode displays were often paired with high-end scientific computers and you could still watch them draw in real time. Early graphics-intensive computer applications like CAD used specialized ASICs for drawing and yet provided nothing like the interactivity we expect from computers today.

The complexity of computer graphics ran head-first against an intense desire for more capable graphical computers, driven most prominently by the CAD industry. Aerospace and other advanced engineering fields were undergoing huge advancements during the second half of the 20th century. World War II had seen the adoption of the jet engine, for example: a machine that was extremely powerful but involved complex mathematics and a multitude of 3D parts, making it difficult for a human to reason about. The new field of computer-aided design promised a revolutionary leap in engineering capability, but ironically, the computers were quite bad at drawing. In the first decades, CAD output was still being sent to traditional draftsmen for final drawings. The computers were not only slow, but unskilled at the art of drafting: limits on the number and complexity of the shapes that computers could render confined them to only very basic drawings, without the extensive annotations that would be needed for manufacturing.

During the 1980s, the "workstation" began to replace the mainframe in engineering applications. Today, "workstation" mostly just identifies PCs that are usually extra big and always extra expensive. Historically, workstations were a different class of machines from PCs that often employed fundamentally different architectures. Many workstations were RISC, an architecture selected for better mathematical performance. They frequently ran UNIX or a derivative, and featured the first examples of what we now call a GPU. Some things don't change: they were also very big, and very expensive.

It was the heady days of the space program and the Concorde, then, that brought us modern computer graphics. The intertwined requirements for scientific computing, numerical simulation, and computer graphics that emerged from Cold War aerospace and weapons programs forged a strong bond between high-end computing and graphics. One could perhaps say that the nexus between AI and GPUs today is an extension of this era, although I think it's a bit of a stretch given the text-heavy applications. The echoes of the dawn of computer graphics are much quieter today, but still around. They persist, for example, in the heavy emphasis on computer visualization seen throughout scientific computing but especially in defense-related fields. They persist also in the names of the companies born in that era, names like Silicon Graphics and Mentor Graphics.


The development of video technology, basically the combination of preexisting television technology with new video tape recorders, led to a lot of optimizations in film. Video was simply not of good enough quality to displace film for editing and distribution, but it was fast and inexpensive. For example, beginning in the 1960s filmmakers began to adopt a system called "video assist." A video camera was coupled to the film camera, either side-by-side with matched lenses or even sharing the same lens via a beam splitter. By running a video tape recorder during filming, the crew could generate something like an "instant daily" and play the tape back on an on-set TV. For the first time, a director could film a scene and then immediately rewatch it. Video assist was a huge step forward, especially in the television industry where it furthered the marriage of film techniques and television techniques for the production of television dramas.

It certainly seems that there should be a similar optimization for animation. It's not easy, though. Video technology was all designed around sequences of frames in a continuous analog signal, not individual images stored discretely. With the practicalities of video cameras and video recorders, it was surprisingly difficult to capture single frames and then play them back to back.

In the 1970s, animators Bruce Lyon and John Lamb developed the Lyon-Lamb Video Animation System (VAS). The original version of the VAS was a large workstation that replaced a rostrum camera with a video camera, monitor, and a custom video tape recorder. Much like the film rostrum camera, the VAS allowed an operator to capture a single frame at a time by composing it on the desk. Unlike the traditional method, the resulting animation could be played back immediately on the included monitor.

The VAS was a major innovation in cel animation, and netted both an Academy Award and an Emmy for technical achievement. While it's difficult to say for sure, it seems like a large portion of the cel-animated features of the '80s had used the VAS for pencil tests. The system was particularly well-suited to rotoscoping, overlaying animation on live-action images. Through a combination of analog mixing techniques and keying, the VAS could directly overlay an animator's work on the video, radically accelerating the process. To demonstrate the capability, John Lamb created a rotoscoped music video for the Tom Waits song "The One That Got Away." The resulting video, titled "Tom Waits for No One," was probably the first rotoscoped music video as well as the first production created with the video rotoscope process. As these landmarks often do, it languished in obscurity until it was quietly uploaded to YouTube in 2006.

The VAS was not without its limitations. It was large, and it was expensive. Even later generations of the system, greatly miniaturized through the use of computerized controls and more modern tape recorders, came in at over $30,000 for a complete system. And the VAS was designed around the traditional rostrum camera workflow, intended for a dedicated operator working at a desk. For many smaller studios the system was out of reach, and for forms of animation that were not amenable to top-down photography on a desk, the VAS wasn't feasible.


There are some forms of animation that are 3D---truly 3D. Disney had produced pseudo-3D scenes by mounting cels under a camera on multiple glass planes, for example, but it was obviously possible to do so in a more complete form by the use of animated sculptures or puppets. Practical challenges seem to have left this kind of animation mostly unexplored until the rise of its greatest producer, Will Vinton.

Vinton grew up in McMinnville, Oregon, but left to study at UC Berkeley. His time in Berkeley left him not only with an architecture degree (although he had studied filmmaking as well), but also a friendship with Bob Gardiner. Gardiner had a prolific and unfortunately short artistic career, in which he embraced many novel media including the hologram. Among his inventions, though, was a novel form of animation using clay: Gardiner was fascinated with sculpting and posing clay figures, and demonstrated the animation potential to Vinton. Vinton, in turn, developed a method of using his student film camera to photograph the clay scenes frame by frame.

Their first full project together, Closed Mondays, took the Academy Award for Best Animated Short Film in 1975. It was notable not only for the moving clay sculptures, but for its camerawork. Vinton had realized that in clay animation, where scenes are composed in real 3D space, the camera can be moved from frame to frame just like the figures. Not long after this project, Vinton and Gardiner split up. Gardiner seems to have been a prolific artist in that way where he could never stick to one thing for very long, and Vinton had a mind towards making a business out of this new animation technology. It was Vinton who christened it Claymation, then a trademark of his new studio.

Vinton returned to his home state and opened Will Vinton Studios in Portland. Vinton Studios released a series of successful animated shorts in the '70s, and picked up work on numerous other projects, contributing for example to the "Wizard of Oz" film sequel "Return to Oz" and the Disney film "Captain EO." By far Vinton Studios' most famous contributions to our culture, though, are their advertising projects. Will Vinton Studios brought us the California Raisins, the Noid, and walking, talking M&M's.

Will Vinton Studios struggled with producing claymation at commercial scale. Shooting with film cameras, it took hours to see the result. Claymation scenes were more difficult to rework than cel animation, setting an even larger penalty for reshoots. Most radically, claymation scenes had to be shot on sets, with camera and light rigging. Reshooting sections without continuity errors was as challenging as animating those sections in the first place.

To reduce rework, they used pencil tests: quicker, lower-effort versions of scenes shot to test the lighting, motion, and sound synchronization before photography with a film camera. Their pencil tests were apparently captured on a crude system of customized VCRs, allowing the animator to see the previous frame on a monitor as they composed the next, and then to play back the whole sequence. It was better than working from film, but it was still slow going.


The area from Beaverton to Hillsboro, in Oregon near Portland, is sometimes called "the silicon forest," largely on the strength of Intel and Tektronix. As in the better-known silicon valley, these two keystone companies were important not only on their own, but also as the progenitors of dozens of new companies. Tektronix, in particular, had a steady stream of employees leaving to start their own businesses. Among the companies those alumni founded was Mentor Graphics.

Mentor Graphics was an early player in electronic design automation (EDA), sort of like a field of CAD specialized to electronics. Mentor products assisted not just in the physical design of circuit boards and ICs, but also simulation and validation of their functionality. Among the challenges of EDA are its fundamentally graphical nature: the final outputs of EDA are often images, masks for photolithographic manufacturing processes, and engineers want to see both manufacturing drawings and logical diagrams as they work on complex designs.

When Mentor started out in 1981, EDA was in its infancy and relied mostly on custom hardware. Mentor went a different route, building a suite of software products that ran on Motorola 68000-based workstations from Apollo. The all-software architecture had cost and agility advantages, and Mentor outpaced their competition to become the field's leader.

Corporations hunger for growth, and by the 1990s Mentor had a commanding position in EDA and went looking for other industries to which their graphics-intensive software could be applied. One route they considered was, apparently, animation: computer animation was starting to take off, and there were very few vendors for not just the animation software but the computer platforms capable of rendering the product. In the end, Mentor shied away: companies like Silicon Graphics and Pixar already had a substantial lead, and animation was an industry that Mentor knew little about. As best I can tell, though, it was this brief investigation of a new market that exposed Mentor engineering managers Howard Mozeico and Arthur Babitz to the animation industry.

I don't know much about their career trajectories in the years shortly after, only that they both decided to leave Mentor for their own reasons. Arthur Babitz went into independent consulting, and found a client reminiscent of his work at Mentor, an established animation studio that was expanding into computer graphics: Will Vinton Studios. Babitz's work at Will Vinton Studios seems to have been largely unrelated to claymation, but it exposed him to the process, and he watched the way they used jury-rigged VCRs and consumer video cameras to preview animations.

Just a couple of years later, Mozeico and Babitz talked about their experience with animation at Mentor, a field in which they were both still interested. Babitz explained the process he had seen at Will Vinton Studios, and his ideas for improving it. Both agreed that they wanted to figure out a sort of retirement enterprise, what we might now call a "lifestyle business": they each wanted to found a company that would keep them busy, but not too busy. The pair incorporated Animation Toolworks, headquartered in Mozeico's Sherwood, Oregon home.


In 1998 Animation Toolworks hit trade shows with the Video Lunchbox. The engineering was mostly by Babitz, the design and marketing by Mozeico, and the manufacturing done on contract by a third party. The device took its name from its form factor, a black crinkle paint box with a handle on top of its barn-roof-shaped lid. It was something like the Lyon-Lamb VAS, if it was portable, digital, and relatively inexpensive.

The Lunchbox was essentially a framegrabber, a compact and simplified version of the computer framegrabbers that were coming into use in the animation industry. You plugged a video camera into the input, and a television monitor into the output. You could see the output of the camera, live, on the monitor while you composed a scene. Then, one press of a button captured a single frame and stored it. With a press of another button, you could swap between the stored frame and the live image, helping to compose the next. You could even enable an automatic "flip-flop" mode that alternated the two rapidly, for hands-free adjustment.

Each successive press of the capture button stored another frame to the Lunchbox's memory, and buttons allowed you to play the entire set of stored frames as a loop, or manually step forward or backward through the frames. And that was basically it: there were a couple of other convenience features like an intervalometer (for time lapse) and the ability to record short sections of real-time video, but complete operation of the device was really very simple. That seems to have been one of its great assets. The Lunchbox was much easier to sell after Mozeico gave a brief demonstration and said that that was all there was to it.
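
To make the interaction concrete, here is a toy sketch of the workflow the Lunchbox offered: a frame store you capture into, flip back and forth against the live image, and play as a loop. This is purely illustrative Java, nothing like the device's actual firmware, and every name in it is made up.

import java.util.ArrayList;
import java.util.List;

// A toy model of the Lunchbox workflow, not real hardware code. "Frames"
// here are just strings standing in for whatever the camera saw.
class LunchboxSketch {
    private final List<String> frames = new ArrayList<>();
    private String liveImage = "(empty desk)";

    void pointCameraAt(String scene) { liveImage = scene; }     // the live camera feed
    void capture()                   { frames.add(liveImage); } // one button press, one stored frame

    // The flip-flop feature: compare the live image against the last stored frame.
    String preview(boolean showStored) {
        return (showStored && !frames.isEmpty()) ? frames.get(frames.size() - 1) : liveImage;
    }

    // Instant playback of everything captured so far, no film lab required.
    void playLoop(int repeats) {
        for (int r = 0; r < repeats; r++) {
            frames.forEach(frame -> System.out.println("showing: " + frame));
        }
    }

    public static void main(String[] args) {
        LunchboxSketch box = new LunchboxSketch();
        box.pointCameraAt("clay figure, left foot forward");
        box.capture();
        box.pointCameraAt("clay figure, mid-stride");
        box.capture();
        box.playLoop(2);
    }
}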

To professionals, the Lunchbox was a more convenient, more reliable, and more portable version of the video tape recorder or computer framegrabber systems they were already using for pencil tests. Early customers of Animation Toolworks included Will Vinton Studios alongside other animation giants like Disney, MTV, and Academy Award-winning animator Mark Osborne. Animation Toolworks press releases quoted animators from these firms commenting on the simplicity and ease of use, saying that it had greatly sped up the animation test process.

In a review for Animation World Magazine, Kellie-Bea Rainey wrote:

In most cases, computers as framegrabbers offer more complications than solutions. Many frustrations stem from the complexity of learning the computer, the software and it's constant upgrades. But one of the things Gary Schwartz likes most about the LunchBox is that the system requires no techno-geeks. "Computers are too complex and the technology upgrades are so frequent that the learning curve keeps you from mastering the tools. It seems that computers are taking the focus off the art. The Video LunchBox has a minimum learning curve with no upgrade manuals. Everything is in the box, just plug it in."

Indeed, the Lunchbox was so simple that it caught on well beyond the context of professional studios. It is remembered most as an educational tool. Disney used the Lunchbox for teaching cel animation in a summer program, but closer to home, the Lunchbox made its way to animation enthusiast and second-grade teacher Carrie Caramella. At Redmond, Oregon's John Tuck Elementary School, Caramella acted as director of a student production team that brought their short film "The Polka Dot Day" to the Northwest Film Center's Young People's Film and Video Festival. During the early 2000s, after-school and summer animation programs proliferated, many using claymation, and almost all using the Video Lunchbox.

At $3,500, the Video Lunchbox was not exactly cheap. It cost more than some of the more affordable computer-based options, but it was so much easier to use, and so much more durable, that it was very much at home in a classroom. Caramella:

"By using the lunchbox, we receive instant feedback because the camera acts > as an eye. It is also child-friendly, and you can manipulate the film a lot more."

Caramella championed animation at John Tuck, finding its uses in other topics. A math teacher worked with students to make a short animation of a chicken. In a unit on compound words, Caramella led students in animating their two words together: a sun and a flower dance; the word is "sunflower." Butter and milk, base and ball.

In Lake Oswego, an independent summer program called Earthlight Studios took up the system.

With the lunchbox, Corey's black-and-white drawings spring to life, two catlike animé characters circling each other with broad-edged swords. It's the opening seconds of what he envisions will be an action-adventure film.

We can imagine how cringeworthy these student animations must be to their creators today, but early-'00s education was fascinated with multimedia and it seems rare that technology served the instructional role so well.

It was in this context that I crossed paths with the Lunchbox. As a kid, I went to a summer animation program at OMSI---a claymation program, which I hazily remember was sponsored by a Will Vinton Studios employee. In an old industrial building beside the museum, we made crude clay figures and then made them crudely walk around. The museum's inventory of Lunchboxes already showed their age, but they worked, in a way that was so straightforward that I think hardly any time was spent teaching operation of the equipment. It was a far cry from an elementary school film project in which, as I recall, nearly an entire day of class time was burned trying to get video off of a DV camcorder and into iMovie.


Mozeico and Babitz aimed for modest success, and that was exactly what they found. Animation Toolworks got started on so little capital that it turned a profit the first year, and by the second year the two made a comfortable salary---and that was all the company would ever really do. Mozeico and Babitz continued to improve on the concept. In 2000, they launched the Lunchbox Sync, which added an audio recorder and the ability to cue audio clips at specific frame numbers. In 2006, the Lunchbox DV added digital video.

By the mid-2000s, computer multimedia technology had improved by leaps and bounds. Framegrabbers and real-time video capture devices were affordable, and animation software on commodity PCs overtook the Lunchbox on price and features. Still, the ease of use and portability of the Lunchbox was a huge appeal to educators. By 2005 Animation Toolworks was basically an educational technology company, and in the following years computers overtook them in that market as well.

The era of the Lunchbox is over, in more ways than one. A contentious business maneuver by Phil Knight saw Will Vinton pushed out of Will Vinton Studios. He was replaced by Phil Knight's son, Travis Knight, and the studio rebranded to Laika. The company has struggled under its new management, and Laika has not achieved the renaissance of stop-motion that some thought Coraline might bring about. Educational technology has shifted its focus, as a business, to a sort of lightweight version of corporate productivity platforms that is firmly dominated by Google.

Animation Toolworks was still selling the Lunchbox DV as late as 2014, but by 2016 Mozeico and Babitz had fully retired and offered support on existing units only. Mozeico died in 2017, crushed under a tractor on his own vineyard. There are worse ways to go. Arthur Babitz is a Hood River County Commissioner.

Kellie-Bea Rainey:

I took the two-minute tutorial and taped it to the wall. I cleaned off a work table and set up a stage and a character. Then I put my Sharp Slimcam on a tripod... Once the camera was plugged into the LunchBox, I focused it on my animation set-up. Next, I plugged in my monitor.

All the machines were on and all the lights were green, standing by. It's time to hit the red button on the LunchBox and animate!

Yippee! Look Houston, we have an image! That was quick, easy and most of all, painless. I want to do more, and more, and even more.

The next time you hear from me I'll be having fun, teaching my own animation classes and making my own characters come to life. I think Gary Schwartz says it best, "The LunchBox brings the student back to what animation is all about: art, self-esteem, results and creativity."

I think we're all a little nostalgic for the way technology used to be. I know I am. But there is something to be said for a simple device, from a small company, that does a specific thing well. I'm not sure that I have ever, in my life, used a piece of technology that was as immediately compelling as the Video Lunchbox. There are numerous modern alternatives, replete with USB and Bluetooth and iPad apps. Somehow I am confident that none of them are quite as good.

CodeSOD: Contracting Space

29 September 2025 at 06:30

A ticket came in marked urgent. When users were entering data in the header field, the spaces they were putting in kept getting mangled. This was in production, and had been in production for some time.

Mike P picked up the ticket, and was able to track down the problem to a file called Strings.java. Yes, at some point, someone wrote a bunch of string helper functions and jammed them into a package. Of course, many of the functions were re-implementations of existing functions: reinvented wheels, now available in square.

For example, the trim function.

    /**
     * @param str
     * @return The trimmed string, or null if the string is null or an empty string.
     */
    public static String trim(String str) {
        if (str == null) {
            return null;
        }

        String ret = str.trim();

        int len = ret.length();
        char last = '\u0021';    // choose a character that will not be interpreted as whitespace
        char c;
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < len; i++) {
            c = ret.charAt(i);
            if (c > '\u0020') {
                if (last <= '\u0020') {
                    sb.append(' ');
                }
                sb.append(c);
            }
            last = c;
        }
        ret = sb.toString();

        if ("".equals(ret)) {
            return null;
        } else {
            return ret;
        }
    }

Now, Mike's complaint is that this function could have been replaced with a regular expression. While that would likely be much smaller, regexes are expensive- in performance and frequently in cognitive overhead- and I actually have no objections to people scanning strings.

But let's dig into what we're doing here.

They start with a null check, which sure. Then they trim the string; never a good sign when your homemade trim method calls the built-in.

Then, they iterate across the string, copying characters into a StringBuffer. If the current character is above Unicode character 0x20- the realm of printable characters- and the last character was a whitespace character, they copy a single space into the output, and then the printable character after it.

What this function does is simply replace runs of whitespace with single whitespace characters.

"This        string"
becomes
"This string"

Badly, I should add. Because there are plenty of whitespace characters which appear above \u0020- like the non-breaking space (\u00A0), and a number of other Unicode space and separator characters. While you might be willing to believe your users will never figure out how to type those, you can't guarantee that they'll never copy/paste them.
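
For what it's worth, a version that collapses runs of whitespace and catches those Unicode stragglers doesn't need to be long. Here's a minimal sketch, keeping the original's questionable null-for-empty contract; the (?U) flag is the load-bearing part:

public static String collapseWhitespace(String str) {
    if (str == null) {
        return null;
    }
    // (?U) enables UNICODE_CHARACTER_CLASS, so \s matches characters like
    // the non-breaking space (\u00A0), not just the ASCII whitespace set.
    String collapsed = str.replaceAll("(?U)\\s+", " ").trim();
    // Preserve the original method's (dubious) habit of returning null
    // for a string that contained nothing but whitespace.
    return collapsed.isEmpty() ? null : collapsed;
}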

For me, however, this function does something far worse than being bad at removing extraneous whitespace. Because it has that check at the end- if I handed it a perfectly good string that is only whitespace, it hands me back a null.

I can see the argument- it's a bad input, so just give me back an objectively bad result. No IsNullOrEmpty check, just a simple null check. But I still hate it- turning an actual value into a null just bothers me, and seems like an easy way to cause problems.

In any case, the root problem with this bug was simply developer-invented requirements: the users never wanted stray spaces to be automatically removed in the middle of the string. Trimmed yes, gutted no.

No one tried to use multiple spaces for most of the history of the application, thus no one noticed the problem. No one expected it to not work. Hence the ticket and the panic by users who didn't understand what was going on.


Error'd: Pickup Sticklers

26 September 2025 at 06:30

An Anonymous quality analyst and audiophile accounted "As a returning customer at napalmrecords.com I was forced to update my Billing Address. Fine. Sure. But what if my *House number* is a very big number? More than 10 "symbols"? Fortunately, 0xDEADBEEF for House number and J****** for First Name both passed validation."


And then he proved it, by screenshot:


Richard P. found a flubstitution failure mocking "I'm always on the lookout for new and interesting Lego sets. I definitely don't have {{product.name}} in my collection!"


"I guess short-named siblings aren't allowed for this security question," pointed out Mark T.


Finally, my favorite category of Error'd -- the security snafu. Tim R. reported this one, saying "Sainsbury/Argos in the UK doesn't want just anybody picking up the item I've ordered online and paid for, so they require not one, not two, but 3 pieces of information when I come to collect it. There's surely no way any interloper could possibly find out all 3, unless they were all sent in the same email obviously." Personally, my threat model for my grocery pickups is pretty permissive, but Tim cares.



Coded Smorgasbord: High Strung

25 September 2025 at 06:30

Most languages these days have some variation of "is string null or empty" as a convenience function. Certainly C#, the language we're looking at today, does. Let's look at a few examples of how this can go wrong, from different developers.

We start with an example from Jason, which is useless, but not a true WTF:

/// <summary>
/// Does the given string contain any characters?
/// </summary>
/// <param name="strToCheck">String to check</param>
/// <returns>
/// true - String contains some characters.
/// false - String is null or empty.
/// </returns>
public static bool StringValid(string strToCheck)
{
        if ((strToCheck == null) ||
                (strToCheck == string.Empty))
                return false;

        return true;
}

Obviously, a better solution here would be to simply return the boolean expression instead of using a conditional, but equally obvious, the even better solution would be to use the built-in. But as implementations go, this doesn't completely lose the plot. It's bad, it shouldn't exist, but it's barely a WTF. How can we make this worse?

Well, Derek sends us an example line, which is scattered through the codebase.

if (Port==null || "".Equals(Port)) { /* do stuff */}

Yes, it's frequently done as a one-liner, like this, with the do stuff jammed all together. And yes, the variable is frequently different- it's likely the developer responsible saved this bit of code as a snippet so they could easily drop it in anywhere. And they dropped it in everywhere. Any place a string got touched in the code, this pattern reared its head.

I especially like the "".Equals call, which is certainly valid, but inverted from how most people would think about doing the check. It echoes Python's string join function, which is invoked on the join character (and not the strings being joined), which makes me wonder if that's where this developer started out.

I'll never know.

Finally, let's poke at one from Malfist. We jump over to Java for this one. Malfist saw a function called checkNull and foolishly assumed that it returned a boolean indicating whether a string was null.

public static final String checkNull(String str, String defaultStr)
{
    if (str == null)
        return defaultStr ;
    else
        return str.trim() ;
}

No, it's not actually a check. It's a coalesce function. Okay, misleading names aside, what is wrong with it? Well, for my money, the fact that the non-null input string gets trimmed, but the default string does not. With the bonus points that this does nothing to verify that the default string isn't null, which means this could easily still propagate null reference exceptions in unexpected places.

I've said it before, and I'll say it again: strings were a mistake. We should just abolish them. No more text, everybody, we're done.


CodeSOD: Across the 4th Dimension

24 September 2025 at 06:30

We're going to start with the code, and then talk about it. You've seen it before, you know the chorus: bad date handling:

C_DATE($1)
C_STRING(7;$0)
C_STRING(3;$currentMonth)
C_STRING(2;$currentDay;$currentYear)
C_INTEGER($month)

$currentDay:=String(Day of($1))
$currentDay:=Change string("00";$currentDay;3-Length($currentDay))
$month:=Month of($1)
Case of

: ($month=1)
$currentMonth:="JAN"

: ($month=2)
$currentMonth:="FEB"

: ($month=3)
$currentMonth:="MAR"

: ($month=4)
$currentMonth:="APR"

: ($month=5)
$currentMonth:="MAY"

: ($month=6)
$currentMonth:="JUN"

: ($month=7)
$currentMonth:="JUL"

: ($month=8)
$currentMonth:="AUG"

: ($month=9)
$currentMonth:="SEP"

: ($month=10)
$currentMonth:="OCT"

: ($month=11)
$currentMonth:="NOV"

: ($month=12)
$currentMonth:="DEC"

End case

$currentYear:=Substring(String(Year of($1));3;2)

$0:=$currentDay+$currentMonth+$currentYear

At this point, most of you are asking "what the hell is that?" Well, that's Brewster's contribution to the site, and be ready to be shocked: the code you're looking at isn't the WTF in this story.

Let's rewind to 1984. Every public space was covered with a thin layer of tobacco tar. The Ground Round restaurant chain would sell children's meals based on the weight of the child and have magicians going from table to table during the meal. And nobody quite figured out exactly how relational databases were going to factor into the future, especially because in 1984, the future was on the desktop, not the big iron "server side".

Thus was born "Silver Surfer", which changed its name to "4th Dimension", or 4D. 4D was an RDBMS, an IDE, and a custom programming language. That language is what you see above. Originally, they developed on Apple hardware, and were almost published directly by Apple, but "other vendors" (like FileMaker) were concerned that Apple having a "brand" database would hurt their businesses, and pressured Apple- who at the time was very dependent on its software vendors to keep its ecosystem viable. In 1993, 4D added a server/client deployment. In 1995, it went cross platform and started working on Windows. By 1997 it supported building web applications.

All in all, 4D seems to always have been a step or two behind. It released a few years after FileMaker, which served a similar niche. It moved to Windows a few years after Access was released. It added web support a few years after tools like Cold Fusion (yes, I know) and PHP (I absolutely know) started to make building data-driven web apps more accessible. It started supporting Service Oriented Architectures in 2004, which is probably as close to "on time" as it ever got for shipping a feature based on market demands.

4D still sees infrequent releases. It supports SQL (as of 2008), and PHP (as of 2010). The company behind it still exists. It still ships, and people- like Brewster- still ship applications using it. Which brings us all the way back around to the terrible date handling code.

4D does have a "date display" function, which formats dates. But it only supports a handful of output formats, at least in the version Brewster is using. Which means if you want DD-MMM-YYYY (24-SEP-2025) you have to build it yourself.

Which is what we see above. The rare case where bad date handling isn't inherently the WTF; the absence of good date handling in the available tooling is.
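
For contrast, the same format is a couple of lines in a language with a modern date library. This is purely an illustration of what the 4D routine has to rebuild by hand, not something that was available to Brewster:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DateFormatSketch {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2025, 9, 24);
        // "ddMMMyy" mirrors what the 4D code assembles by hand: two-digit day,
        // short month name, two-digit year. Uppercasing gives 24SEP25.
        String formatted = date.format(DateTimeFormatter.ofPattern("ddMMMyy", Locale.ENGLISH))
                .toUpperCase(Locale.ENGLISH);
        System.out.println(formatted); // 24SEP25
    }
}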


CodeSOD: One Last ID

23 September 2025 at 06:30

Chris's company has an unusual deployment. They had a MySQL database hosted on Cloud Provider A. They hired a web development company, which wanted to host their website on Cloud Provider B. Someone said, "Yeah, this makes sense," and wrote the web dev company a sizable check. The app was built, tested, and released, and everyone was happy.

Everyone was happy until the first bills came in. They expected the data load for the entire month to be in the gigabytes range, based on their userbase and expected workloads. But for some reason, the data transfer was many terabytes, blowing up their operational budget for the year in a single month.

Chris fired up a traffic monitor and saw that, yes, huge piles of data were getting shipped around with every request. Well, not every request. Every insert operation ended up retrieving a huge pile of data. A little more research was able to find the culprit:

SELECT last_insert_id() FROM some_table_name

The last_insert_id function is a useful one- it returns the last autogenerated ID on your connection. So you can INSERT, and then check what ID was assigned to the inserted record. Great. But the way it's meant to be used is like so: SELECT last_insert_id(). Note the lack of a FROM clause.

By adding the FROM, what the developers were actually saying was "grab all rows from this table, and select the last_insert_id once for each one of them". The value of last_insert_id() just got repeated once for each row, and there were a lot of rows. Many millions. So every time a user inserted a row into most tables, the database sent back a single number, repeated millions and millions of times. Each INSERT operation caused a 30MB reply. And when you have high enough traffic, that adds up quickly.
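
For what it's worth, application code usually doesn't need to issue that SELECT at all; the JDBC driver can hand the generated key back directly. A minimal sketch, with a made-up table:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class InsertWithGeneratedKey {
    // Inserts one row and returns its auto-generated ID, no second query needed.
    static long insertWidget(Connection conn, String name) throws SQLException {
        String sql = "INSERT INTO widgets (name) VALUES (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next(); // one row inserted, one key returned
                return keys.getLong(1);
            }
        }
    }
}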

On a technical level, it was an easy fix. On a practical one, it took six weeks to coordinate with the web dev company and their hosting setup to make the change, test the change, and deploy the change. Two of those weeks were simply spent convincing the company that yes, this was in fact happening, and yes, it was in fact their fault.


CodeSOD: Identify a Nap

22 September 2025 at 06:30

Guy picked up a bug ticket. There was a Heisenbug; sometimes, saving a new entry in the application resulted in a duplicate primary key error, which should never happen.

The error was in the message-bus implementation someone else at the company had inner-platformed together, and it didn't take long to understand why it failed.

/**
 * This generator is used to generate message ids.
 * This implementation merely returns the current timestamp as long.
 *
 * We are, thus, limited to insert 1000 new messages per second.
 * That throughput seems reasonable in regard with the overall
 * processing of a ticket.
 *
 * Might have to re-consider that if needed.
 *
 */
public class IdGenerator implements IdentifierGenerator
{

        long previousId;
       
        @Override
        public synchronized Long generate (SessionImplementor session, Object parent) throws HibernateException {
                long newId = new Date().getTime();
                if (newId == previousId) {
                        try { Thread.sleep(1); } catch (InterruptedException ignore) {}
                        newId = new Date().getTime();
                }
                return newId;
        }
}

This generates IDs based off of the current timestamp. If too many requests come in and we start seeing repeating IDs, we sleep for a millisecond and then try again.

This… this is just an autoincrementing counter with extra steps. Which most, but I suppose not all databases supply natively. It does save you the trouble of storing the current counter value outside of a running program, I guess, but at the cost of having your application take a break when it's under heavier than average load.

One thing you might note is absent here: generate doesn't update previousId. Which does, at least, mean we won't ever actually hit that sleep. But it also means we're not doing anything to avoid collisions here. But that, as it turns out, isn't really that much of a problem. Why?

Because this application doesn't just run on a single server. It's distributed across a handful of nodes, both for load balancing and resiliency. Which means even if the code properly updated previousId, this still wouldn't prevent collisions across multiple nodes, unless they suddenly start syncing previousId amongst each other.

I guess the fix might be to combine a timestamp with something unique to each machine, like… I don't know… hmmm… maybe the MAC address on one of their network interfaces? Oh! Or maybe you could use a sufficiently large random number, like really large. 128-bits or something. Or, if you're getting really fancy, combine the timestamp with some randomness. I dunno, something like that really sounds like it could get you to some kind of universally unique value.
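
If you do want to generate IDs in application code, the standard library already covers the "sufficiently large random number" option. A minimal sketch, leaving out the Hibernate wiring since that interface varies by version:

import java.util.UUID;

public class RandomIdGenerator {
    // A version 4 UUID: 122 random bits, no clock, no coordination
    // between nodes, and no naps.
    public static String nextId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(nextId());
    }
}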

Then again, since the throughput is well under 1,000 messages per second, you could probably also just let your database handle it, and maybe not generate the IDs in code.


Error'd: You Talkin' to Me?

19 September 2025 at 06:30

The Beast In Black is back with a simple but silly factual error on the part of the gateway to all (most) human knowledge.


B.J.H. "The old saying is "if you don't like the weather wait five minutes". Weather.com found a time saver." The trick here is to notice that the "now" temperature is not the same as the headline temperature, also presumably now.


"That's some funny math you got there. Be a shame if it was right," says Jason . "The S3 bucket has 10 files in it. Picking any two (or more) causes the Download button to go disabled with this message when moused over. All I could think of is that this S3 bucket must be in the same universe as https://thedailywtf.com/articles/free-birds " Alas, we are all in the same universe as https://thedailywtf.com/articles/free-birds .


"For others, the markets go up and down, but me, I get real dividends!" gloats my new best friend Mr. TA .


David B. is waiting patiently. "Somewhere in the USPS a package awaits delivery. Either rain, nor snow, nor gloom of night shall prevent the carrier on their appointed rounds. When these rounds will occur are not the USPS's problem." We may not know the day, but we know the hour!



CodeSOD: An Echo In Here in here

18 September 2025 at 06:30

Tobbi sends us a true confession: they wrote this code.

The code we're about to look at is the kind of code that mixes JavaScript and PHP together, using PHP to generate JavaScript code. That's already a terrible anti-pattern, but Tobbi adds another layer to the whole thing.


if (AJAX)
{
    <?php
        echo "AJAX.open(\"POST\", '/timesheets/v2/rapports/FactBCDetail/getDateDebutPeriode.php', true);";
            
    ?>
    
    AJAX.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    AJAX.onreadystatechange = callback_getDateDebutPeriode;
    AJAX.send(strPostRequest);
}

if (AJAX2)
{
    <?php
        echo "AJAX2.open(\"POST\", '/timesheets/v2/rapports/FactBCDetail/getDateFinPeriode.php', true);";
    ?>
    AJAX2.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    AJAX2.onreadystatechange = callback_getDateFinPeriode;
    AJAX2.send(strPostRequest);
}

So, this uses server side code to… output string literals which could have just been written directly into the JavaScript without the PHP step.

"What was I thinking when I wrote that?" Tobbi wonders. Likely, you weren't thinking, Tobbi. Have another cup of coffee, I think you need it.

All in all, this code is pretty harmless, but is a malodorous brain-fart. As for absolution: this is why we have code reviews. Either your org doesn't do them, or it doesn't do them well. Anyone can make this kind of mistake, but only organizational failures get this code merged.


Representative Line: Brace Yourself

17 September 2025 at 06:30

Today's representative line is almost too short to be a full line. But I haven't got a category for representative characters, so we'll roll with it. First, though, we need the setup.

Brody inherited a massive project for a government organization. It was the kind of code base that had thousands of lines per file, and frequently thousands of lines per function. Almost none of those lines were comments. Almost.

In the middle of one of the shorter functions (closer to 500 lines), Brody found this:

//    }

This was the only comment in the entire file. And it's a beautiful one, because it tells us so much. Specifically, it tells us the developer responsible messed up the brace count (because clearly a long function has loads of braces in it), and discovered their code didn't compile. So they went around commenting out extra braces until they found the offender. Code compiled, and voila- on to the next bug, leaving the comment behind.

Now, I don't know for certain that's why a single closing brace is commented out. But also, I know for certain that's what happened, because I've seen developers do exactly that.


Representative Line: Reduced to a Union

16 September 2025 at 06:30

The code Clemens M supported worked just fine for ages. And then one day, it broke. It didn't break after a deployment, which implied some other sort of bug. So Clemens dug in, playing the game of "what specific data rows are breaking the UI, and why?"

One of the organizational elements of their system was the idea of "zones". I don't know the specifics of the application as a whole, but we can broadly describe it thus:

The application oversaw the making of widgets. Widgets could be assigned to one or more zones. A finished product requires a set of widgets. Thus, the finished product has a number of zones that's the union of all of the zones of its component widgets.

Which someone decided to handle this way:

zones.reduce((accumulator, currentValue) => accumulator = _.union(currentValue))

So, we reduce across zones (which is an array of arrays, where the innermost arrays contain zone names, like zone-0, zone-1). In each step we union it with… nothing. The Lodash union function expects one or more arrays as arguments and returns an array of their unique combined values. Passing it a single array, and ignoring the accumulator entirely, isn't how any of this is meant to be used; the result was that accumulator ended up holding just the (deduplicated) last element of zones. Which actually worked until recently, because until recently no one was splitting products across zones. When all the inputs were in the same zone, grabbing the last one was just fine.

The code had been like this for years. It was only just recently, as the company expanded, that it became problematic. The fix, at least, was easy- drop the reduce and just union correctly.


CodeSOD: Functionally, a Date

15 September 2025 at 06:30

Dates are messy things, full of complicated edge cases and surprising ways for our assumptions to fail. They lack the pure mathematical beauty of other data types, like integers. But that absence doesn't mean we can't apply the beautiful, concise, and simple tools of functional programming to handling dates.

I mean, you or I could. J Banana's co-worker seems to struggle a bit with it.

/**
 * compare two dates, rounding them to the day
 */
private static int compareDates( LocalDateTime date1, LocalDateTime date2 ) {
    List<BiFunction<LocalDateTime,LocalDateTime,Integer>> criterias = Arrays.asList(
            (d1,d2) -> d1.getYear() - d2.getYear(),
            (d1,d2) -> d1.getMonthValue() - d2.getMonthValue(),
            (d1,d2) -> d1.getDayOfMonth() - d2.getDayOfMonth()
        );
    return criterias.stream()
        .map( f -> f.apply(date1, date2) )
        .filter( r -> r != 0 )
        .findFirst()
        .orElse( 0 );
}

This Java code creates a list containing three Java functions. Each function will take two dates and returns an integer. It then streams that list, applying each function in turn to a pair of dates. It then filters through the list of resulting integers for the first non-zero value, and failing that, returns just zero.

Why three functions? Well, because we have to check the year, the month, and the day. Obviously. The goal here is to return a negative value if date1 precedes date2, zero if they're equal, and positive if date1 is later. And on this metric… it does work. That it works is what makes me hate it, honestly. This not only shouldn't work, but it should make the compiler so angry that the computer gets up and walks away until you've thought about what you've done.

Our submitter replaced all of this with a simple:

return date1.toLocalDate().compareTo( date2.toLocalDate() );

Error'd: Free Birds

12 September 2025 at 06:30

"These results are incomprensible," Brian wrote testily. "The developers at SkillCertPro must use math derived from an entirely different universe than ours. I can boast a world record number of answered questions in one hour and fifteen minutes somewhere."


"How I Reached Inbox -1," Maia titled her Tickity Tock. "Apparently I've read my messages so thoroughly that my email client (Mailspring) time traveled into the future and read a message before it was even sent."


... which taught Jason how to use Mailspring to order timely tunes. "Apparently, someone invented a time machine and is able to send us vinyls from the future..."


"Yes, we have no bananas," sang out Peter G. , rapping "... or email addresses or phone numbers, but we're going to block your post just the same (and this is better than the previous result of "Whoops something went wrong", because you'd never be able to tell something had gone wrong without that helpful message)."


Finally, our favorite cardsharp Adam R. might have unsharp eyes but sharp browser skills. "While reading an online bridge magazine, I tried to zoom out a bit but was dismayed to find I couldn't zoom out. Once it zooms in to NaN%, you're stuck there."



CodeSOD: The Getter Setter Getter

11 September 2025 at 06:30

Today's Java snippet comes from Capybara James.

The first sign something was wrong was this:

private Map<String, String> getExtractedDataMap(PayloadDto payload) {
    return setExtractedDataToMap(payload);
}

Java conventions tell us that a get method retrieves a value, and a set method mutates the value. So a getter that calls a setter is… confusing. But neither of these is truly a getter or a setter.

setExtractedDataToMap converts the PayloadDto to a Map<String, String>. getExtractedDataMap just calls that, adding one extra layer of indirection that nobody needed, but whatever. At its core, this is just two badly named methods where there should be one.

But that distracts from the true WTF in here. Why on Earth are we converting an actual Java object to a Map<String,String>? That is a definite code smell, a sign that someone isn't entirely comfortable with object-oriented programming. You can't even say, "Well, maybe for serialization to JSON or something?" because Java has serializers that just do this transparently. And that's just the purpose of a DTO in the first place- to be a bucket that holds data for easy serialization.
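
And if some legacy consumer really did need the map form, a serialization library can produce it in one call. A sketch assuming Jackson is on the classpath, which is an assumption and not something the original code shows:

import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DtoToMap {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Converts any bean-style DTO into a map of its properties.
    static Map<String, Object> toMap(Object payload) {
        return MAPPER.convertValue(payload, new TypeReference<Map<String, Object>>() {});
    }
}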

We're left wondering what the point of all of this code is, and we're not alone. James writes:

I found this gem of a code snippet while trying to understand a workflow for data flow documentation purpose. I was not quite sure what the original developer was trying to achieve and at this point I just gave up


CodeSOD: Upsert Yours

10 September 2025 at 06:30

Henrik H sends us a short snippet, for a relative value of short.

We've all seen this method before, but this is a particularly good version of it:

public class CustomerController
{
    public void MyAction(Customer customer)
    {
        // snip 125 lines

        if (customer.someProperty)
            _customerService.UpsertSomething(customer.Id, 
            customer.Code, customer.Name, customer.Address1, 
            customer.Address2, customer.Zip, customer.City, 
            customer.Country, null, null, null, null, null, 
            null, null, null, null, null, null, null, null, 
            null, false, false, null, null, null, null, null, 
            null, null, null, null, null, null, null, false, 
            false, false, false, true, false, null, null, null,
            false, true, false, true, true, 0, false, false, 
            false, false, customer.TemplateId, false, false, false, 
            false, false, string.Empty, true, false, false, false, 
            false, false, false, false, false, true, false, false, 
            true, false, false, MiscEnum.Standard, false, false, 
            false, true, null, null, null);
        else
            _customerService.UpsertSomething(customer.Id, 
            customer.Code, customer.Name, customer.Address1, 
            customer.Address2, customer.Zip, customer.City, 
            customer.Country, null, null, null, null, null, 
            null, null, null, null, null, null, null, null, 
            null, false, false, null, null, null, null, null, 
            null, null, null, null, null, null, null, false, 
            false, false, false, true, false, null, null, null, 
            false, false, false, true, true, 0, false, false, 
            false, false, customer.TemplateId, false, false, false, 
            false, false, string.Empty, true, false, false, false, 
            false, false, false, false, true, true, false, false, 
            true, false, false, MiscEnum.Standard, false, false, 
            false, true, null, null, null);

        // snip 52 lines
    }
}

Welcome to the world's most annoying "spot the difference" puzzle. I've added line breaks (as each UpsertSomething was all on one line in the original) to help you find it. Here's a hint: it's one of the boolean values. I'm sure that narrows it down for you. It means the original developer didn't need the if/else and instead could have simply passed customer.someProperty as a parameter.

Henrick writes:

While on a simple assignment to help a customer migrate from .NET Framework to .NET core, I encountered this code. The 3 lines are unfortunately pretty representative for the codebase


Myopic Focus

9 September 2025 at 06:30

Chops was a developer for Initrode. Early on a Monday, they were summoned to their manager Gary's office before the caffeine had even hit their brain.

Gary glowered up from his office chair as Chops entered. This wasn't looking good. "We need to talk about the latest commit for Taskmaster."

Taskmaster was a large application that'd been around for decades, far longer than Chops had been an employee. Thousands of internal and external customers relied upon it. Refinements over time had led to remarkable stability, its typical uptime now measured in years. However, just last week, their local installation had unexpectedly suffered a significant crash. Chops had been assigned to troubleshooting and repair.


"What's wrong?" Chops asked.

"Your latest commit decreased the number of unit tests!" Gary replied as if Chops had slashed the tires on his BMW.

Within Taskmaster, some objects that were periodically generated were given a unique ID from a pool. The pool was of limited size and required scanning to find a spare ID. Each time a value was needed, a search began where the last search ended. IDs returned to the pool as objects were destroyed would only be reused when the search wrapped back around to the start.

Chops had discovered a bug in the wrap-around logic that would inevitably produce a crash if Taskmaster ran long enough. They also found that if the number of objects created exceeded the size of the pool, this would trigger an infinite loop.

Rather than attempt to patch any of this, Chops had nuked the whole thing and replaced it with code that assigned each object a universally unique identifier (UUID), generated by a trusted library, within its constructor. Gone was the bad code, along with its associated unit tests.
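(The story never says what Taskmaster is written in; purely to illustrate the shape of Chops' fix, here is a hypothetical Go-flavored sketch, with github.com/google/uuid standing in for whatever trusted library was actually used.)

package taskmaster

import "github.com/google/uuid"

// Task stands in for the periodically generated objects in the story.
type Task struct {
    ID string
    // other fields omitted
}

// NewTask assigns a universally unique ID at construction time, removing any
// need for a shared, limited ID pool and its wrap-around scanning logic.
func NewTask() *Task {
    return &Task{ID: uuid.NewString()}
}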

Knowing they would probably only get in a handful of words, Chops wondered how on earth to explain all this in a way that would appease their manager. "Well—"

"That number must NEVER go down!" Gary snapped.

"Butβ€”"

"This is non-negotiable! Roll it back and come up with something better!"

And so Chops had no choice but to remove their solution, put all the janky code back in place, and patch over it with kludge. Every comment left to future engineers contained a tone of apology.

Taskmaster became less stable. Time and expensive developer hours were wasted. Risk to internal and external customers increased. But Gary could rest assured, knowing that his favored metric never faltered on his watch.


FreeBSD vs. SmartOS: Who's Faster for Jails, Zones, and bhyve VMs?

19 September 2025 at 08:50
Which virtualization host performs better? I put FreeBSD and SmartOS in a head-to-head showdown. The performance of Jails, Zones, and bhyve VMs surprised me, forcing a second round of tests on different hardware to find the real winner.

That Secret Service SIM Farm Story Is Bogus

By: Nick Heer
28 September 2025 at 05:01

Robert Graham, clarifying the bad reporting of the big SIM farm bust in New York:

The Secret Service is lying to the press. They know it’s just a normal criminal SIM farm and are hyping it into some sort of national security or espionage threat. We know this because they are using the correct technical terms that demonstrate their understanding of typical SIM farm crimes. The claim that they will likely find other such SIM farms in other cities likewise shows they understand this is a normal criminal activity and not any special national security threat.

One of the things we must always keep in mind is that press releases are written to persuade. That is as true for businesses as it is for various government agencies. In this case, the Secret Service wanted attention, so they exaggerated the threat. And one wonders why public trust in institutions is falling.

⌥ Permalink

Google Provides Feedback on the Digital Markets Act

By: Nick Heer
27 September 2025 at 05:27

Something I missed in posting about Apple’s critical appraisal of the Digital Markets Act is its timing. Why now? Well, it turns out the European Commission sought feedback beginning in July, and with a deadline of just before midnight on 24 September. That is why it published that statement, and why Google did the same.

Oliver Bethell, Google’s “senior director, competition”, a job title which implies a day spent chuckling to oneself:

Consider the DMA’s impact on Europe’s tourism industry. The DMA requires Google Search to stop showing useful travel results that link directly to airline and hotel sites, and instead show links to intermediary websites that charge for inclusion. This raises prices for consumers, reduces traffic to businesses, and makes it harder for people to quickly find reliable, direct booking information.

Key parts of the European tourism industry have already seen free, direct booking traffic from Google Search plummet by up to 30%. A recent study on the economic impact of the DMA estimates that European businesses across sectors could face revenue losses of up to €114 billion.

The study in question, though published by Copenhagen Business School, was funded by the Computer & Communications Industry Association, a tech industry lobbying firm funded in part by Google. I do not have the background to assess if the paper’s conclusions are well-founded, but it should be noted the low-end of the paper’s estimates was a loss of €8.5 billion, or just 0.05% of total industry revenue (page 45). The same lobbyists also funded a survey (PDF) conducted online by Nextrade Group.

Like Apple, Google clearly wants this law to go away. It might say it “remain[s] committed to complying with the DMA” and that it “appreciate[s] the Commission’s consistent openness to regulatory dialogue”, but nobody is fooled. To its credit, Google posted the full response (PDF) it sent the Commission which, though clearly defensive, has less of a public relations sheen than either of the company’s press releases.

⌥ Permalink

U.S. Federal Trade Commission Settles With Amazon Just Two Days Into Trial

By: Nick Heer
26 September 2025 at 04:53

In 2023 Lina Khan, then-chair of the U.S. Federal Trade Commission, sued Amazon over using (PDF) “manipulative, coercive, or deceptive user-interface designs known as ‘dark patterns’ to trick consumers into enrolling in automatically-renewing Prime subscriptions” and “knowingly complicat[ing] the cancellation process”. Some people thought this case was a long-shot, or attempted to use Khan’s scholarship against her.

Earlier this week, the trial began to adjudicate the government’s claims which, in addition to accusing Amazon itself, also involved charges against company executives. It was looking promising for the FTC.

Annie Palmer, CNBC:

The FTC notched an early win in the case last week when U.S. District Court Judge John Chun ruled Amazon and two senior executives violated the Restore Online Shoppers’ Confidence Act by gathering Prime members’ billing information before disclosing the terms of the service.

Chun also said that the two senior Amazon executives would be individually liable if a jury sides with the FTC due to the level of oversight they maintained over the Prime enrollment and cancellation process.

Then, just two days into the trial, the FTC announced it had reached a settlement:

The Federal Trade Commission has secured a historic order with Amazon.com, Inc., as well as Senior Vice President Neil Lindsay and Vice President Jamil Ghani, settling allegations that Amazon enrolled millions of consumers in Prime subscriptions without their consent, and knowingly made it difficult for consumers to cancel. Amazon will be required to pay a $1 billion civil penalty, provide $1.5 billion in refunds back to consumers harmed by their deceptive Prime enrollment practices, and cease unlawful enrollment and cancellation practices for Prime.

As usual for settlements like these, Amazon will admit no wrongdoing. The executives will not face liability, something Adam Kovacevich, head of the Chamber of Progress, a tech industry lobbying group, said today was a “wild … theory” driven by “Khan’s ego”. Nonsense. The judge in the case, after saying Amazon broke the law, gave credence to the concept these executives were personally liable for the harm they were alleged to have caused.

Former FTC commissioner Alvaro Bedoya on X:

Based on my initial read, do the executives need to do anything separate from that? Do they pay any fines? Are they being demoted? Are they subject to extra monitoring? Do they need to admit any guilt whatsoever? The answers, as far as I can tell are no, no, no, no, and no. What’s worse, the order applies to the executives for only three years — seven years less than the company.

Two-and-a-half billion is a lot of dollars in the abstract. CIRP estimates there are 197 million U.S. subscribers to Amazon Prime, which costs anywhere from $7 to $15 per month. For the sake of argument, assume everyone is — on average — on the annual plan of $11.58 per month. That works out to roughly $2.3 billion in Prime revenue a month, so it will take barely more than one billing cycle for Amazon to recoup that FTC settlement. The executives previously charged will bear little responsibility for this outcome.

Those million-dollar inauguration “investments”, as CBS News put it, sure are paying off.

⌥ Permalink

Privacy Commissioner of Canada Releases Findings of Investigation Into TikTok

By: Nick Heer
26 September 2025 at 04:10

Catharine Tunney, CBC News:

The immensely popular social media app TikTok has been collecting sensitive information from hundreds of thousands of Canadians under 13 years old, a joint investigation by privacy authorities found.

[…]

The privacy commissioners said TikTok agreed to enhance its age verification and provide up-front notices about its wide-ranging collection of data.

Off the top, the Privacy Commissioner’s report was limited in scope and did not examine “perceived risks to national security” since they were not related to “privacy in the context of commercial activity” and have been adjudicated elsewhere. The results of national security reviews by other agencies have not been published. However, the Commissioner’s review of the company’s privacy practices is still comprehensive for what was in scope.

TikTok detects and removes about 500,000 accounts of Canadian children under 13 annually. Yet even though the company has dedicated significant engineering efforts to estimating users’ ages for advertising and to produce recommendations, it has not developed similar capabilities for restricting minors’ access.

Despite my skepticism of the Commissioner’s efficacy in cases like these, this investigation produced a number of results. TikTok made several changes as the investigation progressed, including restricting ad targeting to minors:

As an additional measure, in its response to the Offices’ Preliminary Report of Investigation, TikTok committed to limit ad targeting for users under 18 in Canada. TikTok informed the Offices that it implemented this change on April 1st, 2025. As a result, advertisers can no longer deliver targeted ads to users under 18, other than according to generic data (such as language and approximate location).

This is a restriction TikTok has in place for some regions, but not everywhere. It is not unique to TikTok, either; Meta and Google targeted minors, and Meta reportedly guessed teens’ emotional state for ad targeting purposes. This industry cannot police itself. All of these companies say they have rules against ad targeting to children and have done so for years, yet all of them have been found to ignore those rules when they are inconvenient.

⌥ Permalink

Apple Attempts to Rally Users Against E.U. Digital Markets Act

By: Nick Heer
25 September 2025 at 22:14

Apple issued a press release criticizing the E.U.’s Digital Markets Act in a curious mix of countries. It published it on its European sites — of course — and in Australia, Canada, New Zealand, and the United States, all English-speaking. It also issued the same press release in Brazil, China, Guinea-Bissau, Indonesia, and Thailand — and a handful of other places — but not in Argentina, India, Japan, Mexico, or Singapore. Why this mix? Why did Apple bother to translate it into Thai but not Japanese? It is a fine mystery. Read into it what you will.

Anyway, you will be amazed to know how Apple now views the DMA:

It’s been more than a year since the Digital Markets Act was implemented. Over that time, it’s become clear that the DMA is leading to a worse experience for Apple users in the EU. It’s exposing them to new risks, and disrupting the simple, seamless way their Apple products work together. And as new technologies come out, our European users’ Apple products will only fall further behind.

[…]

That’s why we’re urging regulators to take a closer look at how the law is affecting the EU citizens who use Apple products every day. We believe our users in Europe deserve the best experience on our technology, at the same standard we provide in the rest of the world — and that’s what we’ll keep fighting to deliver.

It thinks the DMA should disappear.

Its reasoning is not great; Michael Tsai read the company’s feature delays more closely and is not convinced. One of the delayed features is Live Translation, about which I wrote:

This is kind of a funny limitation because fully half the languages Live Translation works with — French, German, and Spanish — are the versions spoken in their respective E.U. countries and not, for example, Canadian French or Chilean Spanish. […]

Because of its launch languages, I think Apple expects this holdup will not last for long.

I did not account for a cynical option: Apple is launching with these languages as leverage.

The way I read Apple’s press release is as a fundamental disagreement about the role each party believes it should play, particularly when it comes to user privacy. Apple seems to believe it is its responsibility to implement technical controls to fulfill its definition of privacy and, if that impacts competition and compatibility, too bad. E.U. regulators seem to believe the E.U. has policy protections for user privacy, and that users should get to decide how their private data is shared.

Adam Engst, TidBITS:

Apple’s claim of “the same standard we provide in the rest of the world” rings somewhat hollow, given that it often adjusts its technology and services to comply with local laws. The company has made significant concessions to operate in China, doesn’t offer FaceTime in the United Arab Emirates, and removes apps from the still-functional Russian App Store at the Russian government’s request. Apple likely pushed back in less public ways in those countries, but in the EU, this public statement appears aimed at rallying its users and influencing the regulatory conversation.

I know what Engst is saying here, and I agree with the sentiment, but this is a bad group of countries for the E.U. to be lumped in with. That comparison does not mean the DMA is equal to the kinds of policies that restrict services in these other countries. It remains noteworthy how strict Apple is about restricting DMA-mandated features to only the countries where they are required, but you can just change your region to work around the UAE FaceTime block.

⌥ Permalink

Australian Opposition Parties Encourage Migration to Forthcoming U.S. Version of TikTok

By: Nick Heer
25 September 2025 at 15:21

Oscar Godsell, Sky News:

The opposition’s shadow finance minister James Paterson has since urged the Australian Labor government to follow suit.

Mr Paterson told Sky News if the US was able to create a “safer version” of TikTok, then Australia should liaise with the Trump administration to become part of that solution.

“It would be an unfortunate thing if there was a safe version of TikTok in the United States, but a version of TikTok in Australia which was still controlled by a foreign authoritarian government,” he said.

I am not sure people in Australia are asking for yet more of the country’s media to be under the thumb of Rupert Murdoch. Then again, I also do not think the world needs more social media platforms controlled by the United States, though that is very clearly the wedge the U.S. government is creating: countries can accept the existing version of TikTok, adopt the new U.S.-approved one, or ban them both. The U.S. spinoff does not resolve user privacy problems and it raises new concerns about the goals of its government-friendly ownership and management.

⌥ Permalink

Patreon Will Automatically Enable Audience Irritation Features Next Week

By: Nick Heer
25 September 2025 at 03:53

Do you manage a Patreon page as a “creator”? I do; it is where you can give me five dollars per month to add to my guilt over not finishing my thoughts about Liquid Glass.1 You do not have to give me five dollars. I feel guilty enough as it is.

Anyway, you might have missed an email Patreon sent today advising you that Autopilot will be switched on beginning October 1 unless you manually turn it off. According to Patreon’s email:

Autopilot is a growth engine that automatically sends your members and potential members strategic, timely offers which encourage them to join, upgrade, or retain your membership — without you having to lift a finger.

As an extremely casual user, I do not love this; I think it is basically spam. I am sympathetic toward those who make their living with Patreon. I turned this off. If you have a Patreon creator page and missed this email, now you know.

And if you are a subscriber to anyone on Patreon and begin receiving begging emails next week, please be gracious. They might not be aware this feature was switched on.


  1. I am most looking forward to reading others’ reviews when I am done, which I have so far avoided so my own piece is not tainted. ↥︎

⌥ Permalink

Setting Up a New Apple TV Is Still Not Good

By: Nick Heer
25 September 2025 at 03:35

Tonight, I set up a new Apple TV — well, as “new” as a refurbished 2022-though-still-current-generation model can be — and it was not a good time. I know Apple might be releasing a new model later this year, but any upgrades are probably irrelevant for how I have used my existing ten-year-old model. I do not even have a 4K television.

My older model has some drawbacks. It is pretty slow, and the storage space is pitiful — I think it is the 32 GB model — so it keeps offloading apps. What I wanted to do was get a new one and bump the old Apple TV to my kitchen, where I have a receiver and a set of speakers I have used with Bluetooth, and then I would be able to AirPlay music in all my entertaining spaces. Real simple stuff.

Jason Snell, in a sadly still-relevant Six Colors article:

The setup starts promisingly: You can bring your iPhone near the Apple TV, and it will automatically log your Apple ID in. If you’ve got the One Home Screen feature turned on, all your apps will load and appear in all the right places. It will feel like you’ve done a data transfer.

But it’s all a mirage.

One Home Screen is a nice feature, but it’s not an iCloud backup of your Apple TV, nor is it the Apple TV equivalent of Migration Assistant. It is exactly what its name suggests — a home-screen-syncing feature and nothing more.

I went into this upgrade realizing my wife and I would need to set up all our streaming apps again. (She was cool with it.) That is not great, but at least I had that expectation.

But even the “promising” parts of the setup experience did not work for me. When I brought my iPhone near the new Apple TV, it spun before throwing a mysterious error. After setting it up manually, it thought it was not connected to Wi-Fi — even though it was — and then it tried syncing the home screen. Some of the apps are right, but it has not synced all of them, and none of them are in the correct position.

Then I opened Music on my phone to try and AirPlay to both Apple TVs, only to find it was not listed. It turns out that is a separate step. I had to add it to my Home, which again involved me bringing my iPhone into close proximity and tapping a button. This failed the three times I tried it. So I restarted my Apple TV and my phone, and then Settings told me I needed to complete my Home setup. I guess it worked but somehow did not move to the next step. At last, AirPlay worked — and, frankly, it is pretty great.

I know bugs happen about as often as blog posts complaining about bugs. This thing is basically an appliance, though. I am glad Apple ultimately did not make a car.

⌥ Permalink

U.S. Secret Service Busts Giant SIM Farm in New York

By: Nick Heer
23 September 2025 at 23:53

The U.S. Secret Service:

The U.S. Secret Service dismantled a network of electronic devices located throughout the New York tristate area that were used to conduct multiple telecommunications-related threats directed towards senior U.S. government officials, which represented an imminent threat to the agency’s protective operations.

This protective intelligence investigation led to the discovery of more than 300 co-located SIM servers and 100,000 SIM cards across multiple sites.

That sure is a lot of SIM cards, and a scary-sounding mix of words in the press release:

  • “[…] telecommunications-related threats directed towards senior U.S. government officials […]”

  • “[…] these devices could be used to conduct a wide range of telecommunications attacks […]”

  • “These devices were concentrated within 35 miles of the global meeting of the United Nations General Assembly […]”

Reporters pounced. The New York Times, NBC News, CBS News, and even security publications like the Record seized on dramatic statements like those, and another said by the special agent in a video the Service released: “this network had the potential to […] essentially shut down the cellular network in New York City”. Scary stuff.

When I read the early reports, it sure looked to me like some reporters were getting a little over their skis.

For a start, emphasizing the apparent proximity to the U.N. in New York seems to me like a stretch. A thirty-five mile area around the U.N. looks like this — and that is diameter, not radius. If you cannot see that or this third-party website goes away at some point, that is a circle encompassing just about the entire island of Manhattan, going deep into Brooklyn and Queens, stretching all the way up to Chappaqua, and out into Connecticut and New Jersey. That is a massive area. One could just as easily say it was within thirty-five miles of any number of New York-based landmarks and be just as accurate.

Second, the ability to “facilitat[e] anonymous, encrypted communication between potential threat actors and criminal enterprises” is common to basically any internet-connected device. The scale of this one is notable, but you do not need a hundred-thousand SIM cards to make criminal plans. And the apparent possibility of “shut[ting] down the cellular network in New York” is similarly common to any large-scale installation. This is undeniably peculiar, huge, and it seems to be nefarious, but a lot of this seems to be a red herring.

Andy Greenberg, Lily Hay Newman, and Matt Burgess, Wired:

Despite speculation in some reporting about SIM farm operation that suggests it was created by a foreign state such as Russia or China and used for espionage, it’s far more likely that the operation’s central focus was scams and other profit-motivated forms of cybercrime, says Ben Coon, who leads intelligence at the cybersecurity firm Unit 221b and has carried out multiple investigations into SIM farms. “The disruption of cell services is possible, flooding the network to the degree that it couldn’t take any more traffic,” Coon says. “My gut is telling me there was some type of fraud involved here.”

These reporters point to a CNN article by John Miller and Celina Tebor elaborating on the threat to “senior U.S. government officials”: they were swatting calls targeting various lawmakers. Not nothing and certainly dangerous, but this is not looking anything like how many reporters have described it, nor what the U.S. Secret Service is suggesting through its word choices.

⌥ Permalink

‘How A.I. Helped Locate a Viral Video’s True Origin’ Was by Ignoring A.I.

By: Nick Heer
23 September 2025 at 23:23

This story of how Full Fact geolocated a viral video claiming to be shot in London is intriguing because it disproves its own headline’s claim that “A.I. helped”.

Charlotte Green, Full Fact:

But in this case, directly reverse image searching through Google took me to a TikTok video with a location marker for ‘Pondok Pesantren Al Fatah Temboro’, in Indonesia.

This is enough information to give the Full Fact team a great start: translated, it is a school in Temboro.

Green:

We found a slightly different compilation of similar videos on Facebook, seemingly from the same area, also with women in Islamic dress, but with more geographical features visible, such as a sign and clearer views of buildings.

Using stills from this video as references, we asked the AI chatbot ChatGPT if it could provide coordinates to the location, using the possible location of the Al Fatah school in Indonesia.

Up to the point where ChatGPT was invoked, there is no indication any A.I. tools were used. After that — and I do not intend to be mean — it is unclear to me why anyone would ask ChatGPT for coordinates to a known, named location when you can just search Google Maps. It is the third one down in my searches; the first two would quickly be eliminated when comparing to either video.

Green:

But this did not match the location of the original video we were trying to fact check—or anywhere in the near vicinity. While we were very confident the video had been filmed in Temboro, we needed to investigate further to prove this.

After this, no A.I. tools were used. ChatGPT was only able to do as much as a basic Google Maps search. After that, Full Fact had to do some old-fashioned comparative geolocation, and were ultimately successful.

I found this via Charles Arthur, who writes:

And thus we see the positive uses of geolocation by chatbots.

On the contrary, this proved little about the advantages of A.I. geolocation. These tools can certainly be beneficial; Green links to an experiment in Bellingcat in comparison to Google’s reverse image search tools.

I think Full Fact did great work in geolocating this video and deflating its hateful context in that tweet. But a closer reading of the actual steps taken shows any credit to ChatGPT or A.I. is overblown.

⌥ Permalink

Amazon to End Inventory Commingling

By: Nick Heer
22 September 2025 at 23:38

Allison Smith, Modern Retail (via Michael Tsai):

Amazon revealed at its annual Accelerate seller conference in Seattle that it is shutting down its long-running “commingling” program — a move that drew louder applause from sellers than any other update of the morning.

The decision marks the end of a controversial practice in which Amazon pooled identical items from different sellers under one barcode. The system, intended to speed deliveries and save warehouse space, had also allowed counterfeit or expired goods to be mixed in with authentic ones, according to The Wall Street Journal. For years, brands complained that commingling made it difficult to trace problems back to specific sellers and left their reputations vulnerable when customers received knockoffs. In 2013, Johnson & Johnson temporarily pulled many of its consumer products from Amazon, arguing the retailer wasn’t doing enough to curb third-party sales of damaged or expired goods.

I had no idea Amazon did this until I complained on Mastodon about how terrible its shopping experience is, and Ben replied referencing this practice, nor did I know it has been doing so for at least twelve years. I am certain I have received counterfeit products more than once from Amazon, and I think this is how it happened.

⌥ Permalink

Meta’s Steak Sauce Demo Should Have Been Dumber

By: Nick Heer
20 September 2025 at 18:50

John Walker, Kotaku:

Rather than because of wifi, the reason this happened is because these so-called AIs are just regurgitating information that has been parsed from scanning the internet. It will have been trained on recipes written by professional chefs, home cooks and cookery sites, then combined this information to create something that sounds a lot like a recipe for a Korean sauce. But it, not being an intelligence, doesn’t know what Korean sauce is, nor what recipes are, because it doesn’t know anything. So it can only make noises that sound like the way real humans have described things. Hence it having no way of knowing that ingredients haven’t already been mixed — just the ability to mimic recipe-like noises. The recipes it will have been trained on will say “after you’ve combined the ingredients…” so it does too.

I would love to know how this demo was supposed to go. In an ideal world, is it supposed to walk you through the preparation ingredient-by-ingredient? If Jack Mancuso had picked up the soy sauce, would it have guided him to the recipe-suggested amount? That would be impressive, if it had worked. The New York Times’ tech reporters got to try the glasses for about thirty minutes and, while they shared no details, said it was “as spotty as Mr. Zuckerberg’s demonstration”.

I think Walker is too hard on the faux off-the-cuff remarks, though they are mock-worthy in the context of the failed demo. But I think the diagnosis of this is entirely correct: what we think of as “A.I.” is kind of overkill for this situation. I can see some utility. For example, I could not find a written recipe that exactly matched the ingredients on Mancuso’s bench, but perhaps Meta’s A.I. software can identify the ingredients, and assume the lemons are substituting for rice vinegar. Sure. After that, what would actually be useful is a straightforward recitation of a specific recipe: measure out a quarter-cup of soy sauce and pour it into a bowl; next, stir in one tablespoon of honey — that kind of thing. This is pretty basic text-to-speech stuff, though it would be cool if it could respond to questions like “how much ginger?”, and “did I already add the honey?”, too.

Also, I would want to know which recipe it was following. A.I. has a terrible problem with not crediting its sources of information in general, and it is no different here.

Also — and this probably goes without saying — even if these glasses worked as well as Meta suggests they should, there is no way I would buy a pair. You mean to tell me that I should strap a legacy of twenty years of privacy violations and user hostility to my face? Oh, please.

⌥ Permalink

U.S. Federal Trade Commission Sues Live Nation and Ticketmaster

By: Nick Heer
19 September 2025 at 23:47

In 2018, the Toronto Star and CBC News jointly published an investigation into Ticketmaster’s sales practices:

Data journalists monitored Ticketmaster’s website for seven months leading up to this weekend’s show at Scotiabank Arena, closely tracking seats and prices to find out exactly how the box-office system works.

Here are the key findings:

  • Ticketmaster doesn’t list every seat when a sale begins.

  • Hikes prices mid-sale.

  • Collects fees twice on tickets scalped on its site.

Dave Seglins, Rachel Houlihan, Laura Clementson, CBC News:

Posing as scalpers and equipped with hidden cameras, the journalists were pitched on Ticketmaster’s professional reseller program.

[…]

TradeDesk allows scalpers to upload large quantities of tickets purchased from Ticketmaster’s site and quickly list them again for resale. With the click of a button, scalpers can hike or drop prices on reams of tickets on Ticketmaster’s site based on their assessment of fan demand.

Ticketmaster, of course, disputed these journalists’ findings. But the very existence of TradeDesk — owned by Ticketmaster — seems to be in direct opposition to Ticketmaster’s obligations to purchasers. One part of the company is ostensibly in the business of making sure legitimate buyers acquire no more than their fair share of tickets to a popular show, while another part facilitates easy reselling at massive scale. The TradeDesk platform is not something accessible by just anyone; you cannot create an account on demand. Someone from Ticketmaster has to set up your TradeDesk account for you.

These stories have now become a key piece of evidence in a lawsuit filed by the U.S. Federal Trade Commission against Live Nation, the owner of Ticketmaster:

The FTC alleges that in public, Ticketmaster maintains that its business model is at odds with brokers that routinely exceed ticket limits. But in private, Ticketmaster acknowledged that its business model and bottom line benefit from brokers preventing ordinary Americans from purchasing tickets to the shows they want to see at the prices artists set.

The complaint’s description (PDF) of the relationship between Ticketmaster and TradeDesk, beginning at paragraph 84 and continuing through paragraph 101, is damning. If true, Ticketmaster must be aware of the scalper economy it is effectively facilitating through TradeDesk.

⌥ Permalink

Sponsor: Magic Lasso Adblock: Incredibly Private and Secure Safari Web Browsing

By: Nick Heer
19 September 2025 at 18:00

My thanks to Magic Lasso Adblock for sponsoring Pixel Envy this week.

With over 5,000 five star reviews, Magic Lasso Adblock is simply the best ad blocker for your iPhone, iPad, and Mac.

Magic Lasso Adblock: No ads, no trackers, no annoyances, no worries

Designed from the ground up to protect your privacy, Magic Lasso blocks all intrusive ads, trackers, and annoyances. It stops you from being followed by ads around the web and with App Ad Blocking it stops your app usage being harvested by ad networks.

So, join over 350,000 users and download Magic Lasso Adblock today.

⌥ Permalink

Meta’s Whiffed Its Live Demos at Connect

By: Nick Heer
18 September 2025 at 20:12

Rani Molla, Sherwood News:

While the prerecorded videos of the products in use were slick and highly produced, some of the live demos simply failed.

“Glasses are the ideal form factor for personal superintelligence because they let you stay present in the moment while getting access to all of these AI capabilities to make you smarter, help you communicate better, improve your memory, improve your senses,” CEO Mark Zuckerberg reiterated at the start of the event, but the ensuing bloopers certainly didn’t make it feel that way.

I like that Meta took a chance with live demos but, in addition to the bloopers, Connect felt like another showcase of an inspiration-bereft business. The opening was a more grounded — figuratively and literally — version of the Google Glass skydive from 2012. Then, beginning at about 52 minutes, Zuckerberg introduced the wrist-based control system, saying “every new computing platform has a new way to interact with it”, summarizing a piece of the Macworld 2007 iPhone introduction. It is not that I am offended by Meta cribbing others’ marketing. What I find amusing, more than anything, is Zuckerberg’s clear desire to be thought of as an inventor and futurist, despite having seemingly few original ideas.

⌥ Permalink

Reviewing the iPhone 17 Models as Cameras

By: Nick Heer
18 September 2025 at 05:07

If you want reviews of the iPhone 17 — mostly the Pro — from the perspective of photography, two of the best come from Chris Niccolls and Jordan Drake of PetaPixel and Tyler Stalman. Coincidentally, both from right here in Calgary. I am not in the market for an upgrade, but I think these are two of the most comprehensive and interesting reviews I have seen specifically about the photo and video features. Alas, both are video-based reviews, so if that is not your bag, sorry.

Niccolls and Drake walk you through the typical PetaPixel review, just as you want it. The Portrait Mode upgrades they show are obvious to me. Stalman’s test of Action Mode plus the 8× zoom feature is wild. He also took a bunch of spectacular photos at the Olds Rodeo last week. Each of these reviews focuses on something different, with notably divergent opinions on some video features.

⌥ Permalink

WSJ Says TikTok Divestiture of U.S. Operations Nears Completion, I Say It Will Make Everyone Mad

By: Nick Heer
17 September 2025 at 03:49

Raffaele Huang, Lingling Wei, and Alex Leary, Wall Street Journal:

The arrangement, discussed by U.S. and Chinese negotiators in Madrid this week, would create a new U.S. entity to operate the app, with U.S. investors holding a roughly 80% stake and Chinese shareholders owning the rest, the people said.

It must be at least an 80% stake. That is the letter of the law this administration has been failing to enforce.

This new company would also have an American-dominated board with one member designated by the U.S. government.

A golden share, perhaps?

Existing users in the U.S. would be asked to shift to a new app, which TikTok has built and is testing, people familiar with the matter said. […]

“Asked”?

[…] TikTok engineers will re-create a set of content-recommendation algorithms for the app, using technology licensed from TikTok’s parent ByteDance, the people said. U.S. software giant Oracle, a longtime TikTok partner, would handle user data at its facilities in Texas, they said. […]

And I am sure this will satisfy everyone who has found TikTok’s success alarming. Oracle already has access to TikTok’s source code and — at best — will allow TikTok employees to rewrite it to get a “Made in the USA” stamp. It is possible the recommendations system will be unchanged.

Of course, Chinese investors will still have a stake in the U.S. company and, unless the U.S. company is entirely siloed from TikTok everywhere else, users will still be recommended videos the U.S. government framed as a national security threat. But now the U.S. app will seem suspicious to anyone who has been skeptical of the country’s increasing state involvement in the tech industry.

Some TikTok users are going to be furious about this. Some people who viewed its Chinese ownership as inherently problematic are not going to be satisfied by this. It is going to make everyone a little bit upset. It is unclear if it will solve any of the pressing concerns, either. From a distance and in summary, what it looks like is the U.S. government panicked over the only massively successful social media app not based in the U.S., then wrested control of the app and gave it to people friendlier to this government. That is too simplified but, also, not inaccurate.

⌥ Permalink

Liquid Glass Pours Out to Apple Devices Today

By: Nick Heer
15 September 2025 at 18:31

Craig Grannell, Wired:

Apple revealed Liquid Glass as part of its WWDC announcement this June, with all the pomp usually reserved for shiny new gear. The press release promised a “delightful and elegant new software design” that “reflects and refracts its surroundings while dynamically transforming to bring greater focus to content.” Today it launches globally onto compatible Apple devices.

If you haven’t encountered it yet, brace yourself. Inspired by visionOS — the software powering the Apple Vision Pro mixed reality headset — Liquid Glass infuses every Apple platform with a layered glass aesthetic. This is paired with gloopy animations and a fixation on hiding interface components when possible—and showing content through them when it isn’t.

Grannell interviewed several developers for this piece, which is ultimately quite critical of Liquid Glass.

I, too, have thoughts, but life got in the way of completing anything by today’s release. Luckily, there is no shortage of people with opinions about this new material and the broader redesign across Apple’s family of operating systems. I trust you will find their commentary adequate, and I hope you will still be interested in mine whenever I can finish.

In the meantime, I think a chunk of Dan Moren’s iOS 26 review, for Six Colors, is quite good:

Apple has designed extensive rules to try and minimize some of the most distracting impacts of Liquid Glass. For example, if you’re viewing black-on-white content and suddenly scroll past a darker image, the UI widgets will only flip from light to dark mode based on the speed of your scrolling: scroll past it quickly and they won’t flip; it’s only if you slow down or stop with the widgets over the image that they’ll shift into dark mode.

While clever, this also feels remarkably over-engineered to work around the fundamental nature of these devices. It’s a little reminiscent of the old apocryphal story about how the American space industry spent years and millions of dollars designing a pen that could write in space while the Soviets used a pencil. Perhaps they should have used a design that doesn’t require adjusting its look on the fly.

Also, Federico Viticci has published his extraordinary annual review. In addition to the section on design, I am also looking forward to his thoughts in particular on iPadOS 26. Lots to read and lots to discover.

⌥ Permalink

Embargoed Reviews for New Apple Stuff Begin With the AirPods Pro 3

By: Nick Heer
15 September 2025 at 18:17

Nicole Nguyen, Wall Street Journal:

My husband, who grew up in Switzerland, helped me test: He spoke French, which turned into English audio in my ears. I responded in English, and he read the French translation on-screen.

There was a delay between his speech and my in-ear translation, which made the conversation stilted. This is par for the course for real-time translators, including the Google Meet and Google Pixel versions I’ve tried. But the AirPods delay was long and it didn’t always transcribe speech correctly, leading to nonsensical translations. (“Down” became “done,” “smoothie” became “movie,” etc.)

Live Translation is still in beta, so I’ll try it again down the line.

Kate Kozuch, Tom’s Guide:

The AirPods Pro 3 are the first AirPods to include a dedicated heart rate sensor.

You can start about 50 different workouts from the iOS 26 fitness app on your iPhone, and your AirPods Pro 3 become the heart rate source, no Apple Watch required. They even sync with Workout Buddy for Apple Intelligence-based workout guidance and Apple Music to launch a workout playlist automatically.

I do not use an Apple Watch, so this feature is compelling for tracking my cycling trips more comprehensively. A similar sensor is in the Beats Powerbeats Pro 2; I wonder if the workout tracking features will work with those, too.

Apple’s AirPods remain, for me, the most difficult product not to buy. I enjoyed my AirPods 2 while they lasted, and using a set of wired headphones afterwards does not feel quite right. But these new models still do not have replaceable batteries. It is hard to write this without sounding preachy, so just assume this is my problem, not yours. I continue to be perplexed by treating perfectly good speaker drivers, microphones, and chips as disposable simply because they are packaged with a known consumable part. The engineering for swappable batteries would be, I assume, diabolical, but I still cannot get to a point where I am okay with spending over three hundred Canadian dollars every few years because of this predictable limitation.

It is difficult to resist, though.

⌥ Permalink

Sponsor: Magic Lasso Adblock: Block Ads in iPhone, iPad, and Mac Apps

By: Nick Heer
15 September 2025 at 13:30

Do you want to block ads and trackers across all apps on your iPhone, iPad, or Mac — not just in Safari?

Then download Magic Lasso Adblock — the ad blocker designed for you.

Magic Lasso: No ads, No trackers, No annoyances, No worries

The new App Ad Blocking feature in Magic Lasso Adblock v5.0 builds upon our powerful Safari and YouTube ad blocking, extending protection to:

  • News apps

  • Social media

  • Games

  • Other browsers like Chrome and Firefox

All ad blocking is done directly on your device, using a fast, efficient Swift-based architecture that follows our strict zero data collection policy.

With over 5,000 five star reviews, it’s simply the best ad blocker for your iPhone, iPad, and Mac.

And unlike some other ad blockers, Magic Lasso Adblock respects your privacy, doesn’t accept payment from advertisers, and is 100% supported by its community of users.

So, join over 350,000 users and download Magic Lasso Adblock today.

⌥ Permalink

Syndication feed fetchers, HTTP redirects, and conditional GET

By: cks
29 September 2025 at 03:49

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification-compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
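As a concrete illustration, here is a minimal Go sketch of per-URL validator tracking; it is my own sketch rather than code from any real feed reader, and names like feedState and fetchOnce are invented. It disables the HTTP client's automatic redirect handling so that If-None-Match and If-Modified-Since are only ever sent with values saved for the exact URL being requested:

package feedfetch

import (
    "fmt"
    "net/http"
)

// condInfo holds the validators saved for one specific final URL.
type condInfo struct {
    etag         string
    lastModified string
}

// feedState maps each final (non-redirected) URL to its own validators.
var feedState = map[string]condInfo{}

// fetchOnce issues a single GET, attaching only validators saved for this URL.
func fetchOnce(client *http.Client, url string) (*http.Response, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }
    if ci, ok := feedState[url]; ok {
        if ci.etag != "" {
            req.Header.Set("If-None-Match", ci.etag)
        }
        if ci.lastModified != "" {
            req.Header.Set("If-Modified-Since", ci.lastModified)
        }
    }
    return client.Do(req)
}

// fetchFeed follows temporary redirects by hand so conditional headers are
// never carried over from URL A to URL B.
func fetchFeed(rawURL string) (*http.Response, error) {
    client := &http.Client{
        // Stop the client from transparently following redirects.
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
    url := rawURL
    for i := 0; i < 5; i++ {
        resp, err := fetchOnce(client, url)
        if err != nil {
            return nil, err
        }
        switch resp.StatusCode {
        case http.StatusFound, http.StatusSeeOther, http.StatusTemporaryRedirect:
            loc, err := resp.Location()
            resp.Body.Close()
            if err != nil {
                return nil, err
            }
            url = loc.String() // validators for this URL, if any, are looked up separately
        case http.StatusOK:
            // Save validators under the URL that actually served the feed.
            feedState[url] = condInfo{
                etag:         resp.Header.Get("ETag"),
                lastModified: resp.Header.Get("Last-Modified"),
            }
            return resp, nil
        default:
            // 304 Not Modified, permanent redirects, and errors go back to the caller.
            return resp, nil
        }
    }
    return nil, fmt.Errorf("too many redirects fetching %s", rawURL)
}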

Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

  • You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.

  • You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.

  • You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed but the target of the HTTP redirection may be 'more recent' than it for various reasons (even if it's a valid feed; if it's not a valid feed then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

By: cks
28 September 2025 at 01:08

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. This is very much an issue if you make the mistake of doing the same thing with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document, in a way that ETag values can often be.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that aren't URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, potentially unless the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, through repeating it while following a HTTP redirection). It's still technically wrong, though, and it may cause problems someday.
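To make the contrast concrete, here is a small sketch of my own (not DWiki's actual code) of the two common generation strategies just described: a content-hash ETag only repeats when the served bytes are identical, while a 'newest input timestamp' Last-Modified can easily be shared across many different pages.

package validators

import (
    "crypto/sha256"
    "fmt"
    "net/http"
    "time"
)

// contentETag only matches when the response bytes are identical.
func contentETag(body []byte) string {
    return fmt.Sprintf(`"%x"`, sha256.Sum256(body))
}

// pageLastModified is the newest timestamp of everything that went into the
// page; change one shared element and every page reports the same value.
func pageLastModified(inputs ...time.Time) string {
    var newest time.Time
    for _, t := range inputs {
        if t.After(newest) {
            newest = t
        }
    }
    return newest.UTC().Format(http.TimeFormat)
}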

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

Go's builtin 'new()' function will take an expression in Go 1.26

By: cks
27 September 2025 at 03:20

An interesting little change recently landed in the development version of Go, and so will likely appear in Go 1.26 when it's released. The change is that the builtin new() function will be able to take an expression, not just a type. This change stems from the proposal in issue 45624, which dates back to 2021 (and earlier for earlier proposals). The new specification language is covered in, for example, this comment on the issue. An example is in the current development documentation for the release notes, but it may not sound very compelling.

A variety of uses came up in the issue discussion, some of which were a surprise to me. One case that's apparently surprisingly common is to start with a pointer and want to make another pointer to a (shallow) copy of its value. With the change to 'new()', this is:

np = new(*p)

Today you can write this as a generic function (apparently often called 'ref()'), or do it with a temporary variable, but in Go 1.26 this will (probably) be a built in feature, and perhaps the Go compiler will be able to optimize it in various ways. This sort of thing is apparently more common than you might expect.
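For reference, the generic helper mentioned above is tiny; this is a sketch of its usual shape (the name 'ref' is just the convention people use, not anything in the standard library), with a comment showing what Go 1.26's new() is expected to let you write instead:

package refsketch

// ref returns a pointer to a copy of its argument; in Go 1.26, new(expr) is
// expected to cover the same cases as a builtin.
func ref[T any](v T) *T {
    return &v
}

func examples() {
    p := ref(42)  // pointer to a fresh int holding 42
    np := ref(*p) // pointer to a shallow copy of *p, i.e. today's spelling of new(*p)
    _ = np
}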

Another obvious use for the new capability is if you're computing a new value and then creating a pointer to it. Right now, this has to be written using a temporary variable:

t := <some expression>
p := &t

With 'new(expr)' this can be written as one line, without a temporary variable (although as before a 'ref()' generic function can do this today).

The usage example from the current documentation is a little bit peculiar, at least as far as providing a motivation for this change. In a slightly modified form, the example is:

type Person struct {
    Name string `json:"name"`
    Age  *int   `json:"age"` // age if known; nil otherwise
}

func newPerson(name string, age int) *Person {
    return &Person{
        Name: name,
        Age:  new(age),
    }
}

The reason this is a bit peculiar is that today you can write 'Age: &age' and it works the same way. Well, at a semantic level it works the same way. The theoretical but perhaps not practical complication is inlining combined with escape analysis. If newPerson() is inlined into a caller, then the caller's variable for the 'age' parameter may be unused after the (inlined) call to newPerson, and so could get mapped to 'Age: &callervar', which in turn could force escape analysis to put that variable in the heap, which might be less efficient than keeping the variable in the stack (or registers) until right at the end.

A broad language reason is that allowing new() to take an expression removes the special privilege that structs and certain other compound data structures have had, where you could construct pointers to initialized versions of them. Consider:

type ints struct { i int }
[...]
t := 10
ip := &t
isp := &ints{i: 10}

You can create a pointer to the int wrapped in a struct on a single line with no temporary variable, but a pointer to a plain int requires you to materialize a temporary variable. This is a bit annoying.

A pragmatic part of adding this is that people appear to write and use equivalents of new(value) a fair bit. The popularity of an expression is not necessarily the best reason to add a built-in equivalent to the language, but it does suggest that this feature will get used (or will eventually get used, since the existing uses won't exactly get converted instantly for all sorts of reasons).

This strikes me as a perfectly fine change for Go to make. The one thing that's a little bit non-ideal is that 'new()' of constant numbers has less type flexibility than the constant numbers themselves. Consider:

var ui uint
var uip *uint

ui = 10       // okay
uip = new(10) // type mismatch error

The current error that the compiler reports is 'cannot use new(10) (value of type *int) as *uint value in assignment', which is at least relatively straightforward.

(You fix it by casting ('converting') the untyped constant number to whatever type you need. The 'default type' of a constant, which is now more relevant than before, is covered in the specification section on Constants.)

The broad state of ZFS on Illumos, Linux, and FreeBSD (as I understand it)

By: cks
26 September 2025 at 02:45

Once upon a time, Sun developed ZFS and put it in Solaris, which was good for us. Then Sun open-sourced Solaris as 'OpenSolaris', including ZFS, although not under the GPL (a move that made people sad and Scott McNealy is on record as regretting). ZFS development continued in Solaris and thus in OpenSolaris until Oracle bought Sun and soon afterward closed Solaris source again (in 2010); while Oracle continued ZFS development in Oracle Solaris, we can ignore that. OpenSolaris was transmogrified into Illumos, and various Illumos distributions formed, such as OmniOS (which we used for our second generation of ZFS fileservers).

Well before Oracle closed Solaris, separate groups of people ported ZFS into FreeBSD and onto Linux, where the effort was known as "ZFS on Linux". Since the Linux kernel community felt that ZFS's license wasn't compatible with the kernel's license, ZoL was an entirely out of (kernel) tree effort, while FreeBSD was able to accept ZFS into their kernel tree (I believe all the way back in 2008). Both ZFS on Linux and FreeBSD took changes from OpenSolaris into their versions up until Oracle closed Solaris in 2010. After that, open source ZFS development split into three mostly separate strands.

(In theory OpenZFS was created in 2013. In practice I think OpenZFS at the time was not doing much beyond coordination of the three strands.)

Over time, a lot more people wanted to build machines using ZFS on top of FreeBSD or Linux (including us) than wanted to keep using Illumos distributions. Not only was Illumos a different environment, but Illumos and its distributions didn't see the level of developer activity that FreeBSD and Linux did, which resulted in driver support issues and other problems (cf). For ZFS, the consequence of this was that many more improvements to ZFS itself started happening in ZFS on Linux and in FreeBSD (I believe to a lesser extent) than were happening in Illumos or OpenZFS, the nominal upstream. Over time the split of effort between Linux and FreeBSD became an obvious problem and eventually people from both sides got together. This resulted in ZFS on Linux v2.0.0 becoming 'OpenZFS 2.0.0' in 2020 (see also the Wikipedia history) and also becoming portable to FreeBSD, where it became the FreeBSD kernel ZFS implementation in FreeBSD 13.0 (cf).

The current state of OpenZFS is that it's co-developed for both Linux and FreeBSD. The OpenZFS ZFS repository routinely has FreeBSD specific commits, and as far as I know OpenZFS's test suite is routinely run on a variety of FreeBSD machines as well as a variety of Linux ones. I'm not sure how OpenZFS work propagates into FreeBSD itself, but it does (some spelunking of the FreeBSD source repository suggests that there are periodic imports of the latest changes). On Linux, OpenZFS releases and development versions propagate to Linux distributions in various ways (some of them rather baroque), including people simply building their own packages from the OpenZFS repository.

Illumos continues to use and maintain its own version of ZFS, which it considers separate from OpenZFS. There is an incomplete Illumos project discussion on 'consuming' OpenZFS changes (via, also), but my impression is that very few changes move from OpenZFS to Illumos. My further impression is that there is basically no one on the OpenZFS side who is trying to push changes into Illumos; instead, OpenZFS people consider it up to Illumos to pull changes, and Illumos people aren't doing much of that for various reasons. At this point, if there's an attractive ZFS change in OpenZFS, the odds of it appearing in Illumos on a timely basis appear low (to put it one way).

(Some features have made it into Illumos, such as sequential scrubs and resilvers, which landed in issue 10405. This feature originated in what was then ZoL and was ported into Illumos.)

Even if Illumos increases the pace of importing features from OpenZFS, I don't ever expect it to be on the leading edge and I think that's fine. There have definitely been various OpenZFS features that needed some time before they became fully ready for stable production use (even after they appeared in releases). I think there's an ecological niche for a conservative ZFS that only takes solidly stable features, and that fits Illumos's general focus on stability.

PS: I'm out of touch with the Illumos world these days, so I may have mis-characterized the state of affairs there. If so, I welcome corrections and updates in the comments.

If-None-Match values must come from the actual URL itself

By: cks
24 September 2025 at 16:55

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.
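
Here's a hedged sketch of what that looks like in Python with the requests library (illustrative, not any particular feed reader's code): redirects are followed by hand so that If-None-Match is only sent to a URL we actually hold an ETag for, and any returned ETag is recorded under the final URL that produced it.

import urllib.parse
import requests

etags = {}  # maps a specific URL -> the ETag that URL itself returned

def fetch_feed(url, max_redirects=5):
    for _ in range(max_redirects):
        headers = {}
        if url in etags:
            # Only make a conditional request if the ETag came from this exact URL.
            headers["If-None-Match"] = etags[url]
        resp = requests.get(url, headers=headers, allow_redirects=False)
        if resp.is_redirect:
            # Follow the redirect ourselves and don't carry If-None-Match across it.
            url = urllib.parse.urljoin(url, resp.headers["Location"])
            continue
        if resp.status_code == 200 and "ETag" in resp.headers:
            etags[url] = resp.headers["ETag"]
        return url, resp
    raise RuntimeError("too many redirects")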

This probably clashes with many high level HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, such high level APIs are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.

A hack: outsourcing web browser/client checking to another web server

By: cks
24 September 2025 at 03:18

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of a service is geoblocking that's maintained by someone else who's paid to be on top of it and be accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

Unix mail programs have had two approaches to handling your mail

By: cks
23 September 2025 at 02:34

Historically, Unix mail programs (what we call 'mail clients' or 'mail user agents' today) have had two different approaches to handling your email, what I'll call the shared approach and the exclusive approach, with the shared approach being the dominant one. To explain the shared approach, I have to back up to talk about what Unix mail transfer agents (MTAs) traditionally did. When a Unix MTA delivered email to you, at first it delivered email into a single file in a specific location (such as '/usr/spool/mail/<login>') in a specific format, initially mbox; even then, this could be called your 'inbox'. Later, when the maildir mailbox format became popular, some MTAs gained the ability to deliver to maildir format inboxes.

(There have been a number of Unix mail spool formats over the years, which I'm not going to try to get into here.)

A 'shared' style mail program worked directly with your inbox in whatever format it was in and whatever location it was in. This is how the V7 'mail' program worked, for example. Naturally these programs didn't have to work on your inbox; you could generally point them at another mailbox in the same format. I call this style 'shared' because you could use any number of different mail programs (mail clients) on your mailboxes, provided that they all understood the format and also that all of them agreed on how to lock your mailbox against modifications, including against your system's MTA delivering new email right at the point where your mail program was, for example, trying to delete some.

(Locking issues are one of the things that maildir was designed to help with.)

An 'exclusive' style mail program (or system) was designed to own your email itself, rather than try to share your system mailbox. Of course it had to access your system mailbox a bit to get at your email, but broadly the only thing an exclusive mail program did with your inbox was pull all your new email out of it, write it into the program's own storage format and system, and then usually empty out your system inbox. I call this style 'exclusive' because you generally couldn't hop back and forth between mail programs (mail clients) and would be mostly stuck with your pick, since your main mail program was probably the only one that could really work with its particular storage format.

(Pragmatically, only locking your system mailbox for a short period of time and only doing simple things with it tended to make things relatively reliable. Shared style mail programs had much more room for mistakes and explosions, since they had to do more complex operations, at least on mbox format mailboxes. Being easy to modify is another advantage of the maildir format, since it outsources a lot of the work to your Unix filesystem.)

This shared versus exclusive design choice turned out to have some effects when mail moved to being on separate servers and accessed via POP and then later IMAP. My impression is that 'exclusive' systems coped fairly well with POP, because the natural operation with POP is to pull all of your new email out of the server and store it locally. By contrast, shared systems coped much better with IMAP than exclusive ones did, because IMAP is inherently a shared mail environment where your mail stays on the IMAP server and you manipulate it there.

(Since IMAP is the dominant way that mail clients/user agents get at email today, my impression is that the 'exclusive' approach is basically dead at this point as a general way of doing mail clients. Almost no one wants to use an IMAP client that immediately moves all of their email into a purely local data storage of some sort; they want their email to stay on the IMAP server and be accessible from and by multiple clients and even devices.)

Most classical Unix mail clients are 'shared' style programs, things like Alpine, Mutt, and the basic Mail program. One major 'exclusive' style program, really a system, is (N)MH (also). MH is somewhat notable because in its time it was popular enough that a number of other mail programs and mail systems supported its basic storage format to some degree (for example, procmail can deliver messages to MH-format directories, although it doesn't update all of the things that MH would do in the process).

Another major source of 'exclusive' style mail handling systems is GNU Emacs. I believe that both rmail and GNUS normally pull your email from your system inbox into their own storage formats, partly so that they can take exclusive ownership and don't have to worry about locking issues with other mail clients. GNU Emacs has a number of mail reading environments (cf, also) and I'm not sure what the others do (apart from MH-E, which is a frontend on (N)MH).

(There have probably been other 'exclusive' style systems. Also, it's a pity that as far as I know, MH never grew any support for keeping its messages in maildir format directories, which are relatively close to MH's native format.)

Maybe I should add new access control rules at the front of rule lists

By: cks
22 September 2025 at 03:14

Not infrequently I wind up maintaining slowly growing lists of filtering rules to either allow good things or weed out bad things. Not infrequently, traffic can potentially match more than one filtering rule, either because it has multiple bad (or good) characteristics or because some of the match rules overlap. My usual habit has been to add new rules to the end of my rule lists (or the relevant section of them), so the oldest rules are at the top and the newest ones are at the bottom.

After writing about how access control rules need some form of usage counters, it's occurred to me that maybe I want to reverse this, at least in typical systems where the first matching rule wins. The basic idea is that the rules I'm most likely to want to drop are the oldest rules, but by having them first I'm hindering my ability to see if they've been made obsolete by newer rules. If an old rule matches some bad traffic, a new rule matches all of the bad traffic, and the new rule is last, any usage counters will show a mix of the old rule and the new rule, making it look like the old rule is still necessary. If the order was reversed, the new rule would completely occlude the old rule and usage counters would show me that I could weed the old rule out.

(My view is that it's much less likely that I'll add a new rule at the bottom that's completely ineffectual because everything it matches is already matched by something earlier. If I'm adding a new rule, it's almost certainly because something isn't being handled by the collection of existing rules.)
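
To make the occlusion effect concrete, here's a tiny first-match-wins sketch in Python with per-rule hit counters (the rules are hypothetical, not my actual ones):

rules = [
    # (name, predicate); first match wins, with the newest rule at the front.
    ("new-broad-block", lambda req: req["ua"].startswith("BadBot")),
    ("old-narrow-block", lambda req: req["ip"].startswith("192.0.2.")),
]
hits = {name: 0 for name, _ in rules}

def check(req):
    for name, matches in rules:
        if matches(req):
            hits[name] += 1
            return name  # the rule that handled this request
    return None

# If 'old-narrow-block' only ever matched traffic that 'new-broad-block' also
# matches, its counter stays at zero and it becomes an obvious removal candidate.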

Another possible advantage to this is that it will keep new rules at the top of my attention, because when I look at the rule list (or the section of it) I'll probably start at the top. Currently, the top is full of old rules that I usually ignore, but if I put new rules first I'll naturally see them right away.

(I think that most things I deal with are 'first match wins' systems. A 'last match wins' system would naturally work right here, but it has other confusing aspects. I also have the impression that adding new rules at the end is a common thing, but maybe it's just in the cultural water here.)

Our Django model class fields should include private, internal names

By: cks
21 September 2025 at 01:30

Let me tell you about a database design mistake I made in our Django web application for handling requests for Unix accounts. Our current account request app evolved from a series of earlier systems, and one of the things that these earlier systems asked people for was their 'status' with the university; were they visitors, graduate students, undergraduate students, (new) staff, or so on. When I created the current system I copied this and so the database schema includes a 'Status' model class. The only thing I put in this model class was a text field that people picked from in our account request form, and I didn't really think of the text there as what you could call load bearing. It was just a piece of information we asked people for because we'd always asked people for it, and faithfully duplicating the old CGI was the easy way to implement the web app.

Before too long, it turned out that we wanted to do some special things if people were graduate students (for example, notifying the department's administrative people so they could update their records to include the graduate student's Unix login and email address here). The obvious simple way to implement this was to do a text match on the value of the 'status' field for a particular person; if their 'status' was "Graduate Student", we knew they were a graduate student and we could do various special things. Over time, this knowledge of what the people-visible "Graduate Student" status text was wormed its way into a whole collection of places around our account systems.

For reasons beyond the scope of this entry, we now (recently) want to change the people-visible text to be not exactly "Graduate Student" any more. Now we have a problem, because a bunch of places know that exact text (in fact I'm not sure I remember where all of those places are).

The mistake I made, way back when we first wanted things to know that an account or account request was a 'graduate student', was in not giving our 'Status' model an internal 'label' field that wasn't shown to people in addition to the text shown to people. You can practically guarantee that anything you show to people will want to change sooner or later, so just like you shouldn't make actual people-exposed fields into primary or foreign keys, none of your code should care about their value. The correct solution is an additional field that acts as the internal label of a Status (with values that make sense to us), and then using this internal label any time the code wants to match on or find the 'Graduate Student' status.

(In theory I could use Django's magic 'id' field for this, since we're having Django create automatic primary keys for everything, including the Status model. In practice, the database IDs are completely opaque and I'd rather have something less opaque in code instead of everything knowing that ID '14' is the Graduate Student status ID.)
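
As a sketch of what I mean (with hypothetical field names, not our actual schema), the model would carry both the people-visible text and a stable internal label, and code would match only on the label:

from django.db import models

class Status(models.Model):
    # What people see and pick in the account request form; free to change.
    text = models.CharField(max_length=100)
    # Stable internal label that code matches on; never shown to people.
    label = models.CharField(max_length=32, unique=True)

    def __str__(self):
        return self.text

# Elsewhere, code checks the label instead of the visible text, eg:
#   if acctreq.status.label == "grad-student": notify_grad_admins(acctreq)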

Fortunately, I've had a good experience with my one Django database migration so far, so this is a fixable problem. Threading the updates through all of the code (and finding all of the places that need updates, including in outside programs) will be a bit of work, but that's what I get for taking the quick hack approach when this first came up.

(I'm sure I'm not the only person to stub my toe this way, and there's probably a well known database design principle involved that would have told me better if I'd known about it and paid attention at the time.)

These days, systemd can be a cause of restrictions on daemons

By: cks
20 September 2025 at 02:59

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like accessing user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.
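
As an example of the first approach, you can often do the stripping with a drop-in override (via 'systemctl edit whatever.service') rather than editing the shipped .service file; this is a sketch, and which directives you need to neutralize depends on what the real unit sets:

[Service]
# Boolean restrictions are overridden by just setting them again.
ProtectHome=no
PrivateTmp=no
# List-valued settings such as IPAddressDeny are cleared by an empty assignment.
IPAddressDeny=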

Some stuff on how Linux consoles interact with the mouse

By: cks
19 September 2025 at 01:24

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

Free and open source software is incompatible with (security) guarantees

By: cks
18 September 2025 at 02:53

If you've been following the tech news, one of the recent things that's happened is that there has been another incident where a bunch of popular and widely used packages on a popular package repository for a popular language were compromised, this time with a self-replicating worm. This is very inconvenient to some people, especially to companies in Europe, for some reason, and so some people have been making the usual noises. On the Fediverse, I had a hot take:

Hot take: free and open source is fundamentally incompatible with strong security *guarantees*, because FOSS is incompatible with strong guarantees about anything. It says so right there on the tin: "without warranty of any kind, either expressed or implied". We guarantee nothing by default, you get the code, the project, everything, as-is, where-is, how-is.

Of course companies find this inconvenient, especially with the EU CRA looming, but that's not FOSS's problem. That's a you problem.

To be clear here: this is not about the security and general quality of FOSS (which is often very good), or the responsiveness of FOSS maintainers. This is about guarantees, firm (and perhaps legally binding) assurances of certain things (which people want for software in general). FOSS can provide strong security in practice but it's inimical to FOSS's very nature to provide a strong guarantee of that or anything else. The thing that makes most of FOSS possible is that you can put out software without that guarantee and without legal liability.

An individual project can solemnly say it guarantees its security, and if it does so it's an open legal question whether that writing trumps the writing in the license. But in general a core and absolutely necessary aspect of free and open source is that warranty disclaimer, and that warranty disclaimer cuts across any strong guarantees about anything, including security and lack of bugs.

Are the compromised packages inconvenient to a lot of companies? They certainly are. But neither the companies nor commentators can say that the compromise violated some general strong security guarantee about packages, because there is and never will be such a guarantee with FOSS (see, for example, Thomas Depierre's I am not a supplier, which puts into words a sentiment a lot of FOSS people have).

(But of course the companies and sympathetic commentators are framing it that way because they are interested in the second vision of "supply chain security", where using FOSS code is supposed to magically absolve companies of the responsibility that people want someone to take.)

The obvious corollary of this is that widespread usage of FOSS packages and software, especially with un-audited upgrades of package versions (however that happens), is incompatible with having any sort of strong security or quality guarantee about the result. The result may have strong security and high quality, but if so, those come without guarantees; you've just been lucky. If you want guarantees, you will have to arrange them yourself and it's very unlikely you can achieve strong guarantees while using the typical ever-changing pile of FOSS code.

(For example, if dependencies auto-update before you can audit them and their changes, or faster than you can keep up, you have nothing in practice.)

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks
17 September 2025 at 03:06

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)
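
A quick way to take inventory is to walk /usr/sbin, skip the symbolic links, and ask rpm who owns what's left; here's a small Python sketch (with the caveat above that packages still registering their files as /sbin/<whatever> will show up as unowned):

import os
import subprocess

# Print each real file still living in /usr/sbin along with the RPM
# that claims it (or rpm's "not owned" message).
for entry in sorted(os.scandir("/usr/sbin"), key=lambda e: e.name):
    if entry.is_symlink():
        continue
    owner = subprocess.run(["rpm", "-qf", entry.path],
                           capture_output=True, text=True).stdout.strip()
    print(entry.path, owner)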

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

Access control rules need some form of usage counters

By: cks
16 September 2025 at 03:15

Today, for reasons outside the scope of this entry, I decided to spend some time maintaining and pruning the access control rules for Wandering Thoughts, this blog. Due to the ongoing crawler plague (and past abuses), Wandering Thoughts has had to build up quite a collection of access control rules, which are mostly implemented as a bunch of things in an Apache .htaccess file (partly 'Deny from ...' for IP address ranges and partly as rewrite rules based on other characteristics). The experience has left me with a renewed view of something, which is that systems with access control rules need some way of letting you see which rules are still being used by your traffic.

It's in the nature of systems with access control rules to accumulate more and more rules over time. You hit another special situation, you add another rule, perhaps to match and block something or perhaps to exempt something from blocking. These rules often interact in various ways, and over time you'll almost certainly wind up with a tangled thicket of rules (because almost no one goes back to carefully check and revisit all existing rules when they add a new one or modify an existing one). The end result is a mess, and one of the ways to reduce the mess is to weed out rules that are now obsolete. One way a rule can be obsolete is that it's not used any more, and often these are the easiest rules to drop once you can recognize them.

(A rule that's still being matched by traffic may be obsolete for other reasons, and rules that aren't currently being matched may still be needed as a precaution. But it's a good starting point.)

If you have the necessary log data, you can sometimes establish if a rule was actually ever used by manually checking your logs. For example, if you have logs of rejected traffic (or logs of all traffic), you can search it for an IP address range to see if a particular IP address rule ever matched anything. But this requires tedious manual effort and that means that only determined people will go through it, especially regularly. The better way is to either have this information provided directly, such as by counters on firewall rules, or to have something in your logs that makes deriving it easy.

(An Apache example would be to augment any log line that was matched by some .htaccess rule with a name or a line number or the like. Then you could go readily through your logs to determine which lines were matched and how often.)

The next time I design an access control rule system, I'm hopefully going to remember this and put something in its logging to (optionally) explain its decisions.

(Periodically I write something that has an access control rule system of some sort. Unfortunately all of mine to date have been quiet on this, so I'm hardly without sin here.)

The idea of /usr/sbin has failed in practice

By: cks
15 September 2025 at 03:17

One of the changes in Fedora Linux 42 is unifying /usr/bin and /usr/sbin, by moving everything in /usr/sbin to /usr/bin. To some people, this probably smacks of anathema, and to be honest, my first reaction was to bristle at the idea. However, the more I thought about it, the more I had to concede that the idea of /usr/sbin has failed in practice.

We can tell /usr/sbin has failed in practice by asking how many people routinely operate without /usr/sbin in their $PATH. In a lot of environments, the answer is that very few people do, because sooner or later you run into a program that you want to run (as yourself) to obtain useful information or do useful things. Let's take FreeBSD 14.3 as an illustrative example (to make this not a Linux biased entry); looking at /usr/sbin, I recognize iostat, manctl (you might use it on your own manpages), ntpdate (which can be run by ordinary people to query the offsets of remote servers), pstat, swapinfo, and traceroute. There are probably others that I'm missing, especially if you use FreeBSD as a workstation and so care about things like sound volumes and keyboard control.

(And if you write scripts and want them to send email, you'll care about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin. There's also DTrace, but I don't know if you can DTrace your own binaries as a non-root user on FreeBSD.)

For a long time, there has been no strong organizing principle to /usr/sbin that would draw a hard line and create a situation where people could safely leave it out of their $PATH. We could have had a principle of, for example, "programs that don't work unless run by root", but no such principle was ever followed for very long (if at all). Instead programs were more or less shoved in /usr/sbin if developers thought they were relatively unlikely to be used by normal people. But 'relatively unlikely' is not 'never', and shortly after people got told to 'run traceroute' and got 'command not found' when they tried, /usr/sbin (probably) started appearing in $PATH.

(And then when you asked 'how does my script send me email about something', people told you about /usr/sbin/sendmail and another crack appeared in the wall.)

If /usr/sbin is more of a suggestion than a rule and it appears in everyone's $PATH because no one can predict which programs you want to use will be in /usr/sbin instead of /usr/bin, I believe this means /usr/sbin has failed in practice. What remains is an unpredictable and somewhat arbitrary division between two directories, where which directory something appears in operates mostly as a hint (a hint that's invisible to people who don't specifically look where a program is).

(This division isn't entirely pointless and one could try to reform the situation in a way short of Fedora 42's "burn the entire thing down" approach. If nothing else the split keeps the size of both directories somewhat down.)

PS: The /usr/sbin like idea that I think is still successful in practice is /usr/libexec. Possibly a bunch of things in /usr/sbin should be relocated to there (or appropriate subdirectories of it).

My machines versus the Fedora selinux-policy-targeted package

By: cks
14 September 2025 at 02:26

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, it generated a huge amount of repeated output from a script run by selinux-policy-targeted, repeatedly reporting "Old compiled fcontext format, skipping" for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collective installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and put them back later (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully making it so that selinux-policy-targeted has less or nothing to complain about. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is less certain a fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some notes on (Tony Finch's) exponential rate limiting in practice

By: cks
13 September 2025 at 03:43

After yesterday's entry where I discovered it, I went and implemented Tony Finch's exponential rate limiting for HTTP request rate limiting in DWiki, the engine underlying this blog, replacing the more brute force and limited version I had initially implemented. I chose exponential rate limiting over GCRA or leaky buckets because I found it much easier to understand how to set the limits (partly because I'm somewhat familiar with the whole thing from Exim). Exponential rate limiting needed me to pick a period of time and a number of (theoretical) requests that can be made in that time interval, which was easy enough; GCRA 'rate' and 'burst' numbers were less clear to me. However, exponential rate limiting has some slightly surprising things that I want to remember.

(Exponential ratelimits don't have a 'burst' rate as such but you can sort of achieve this by your choice of time intervals.)

In my original simple rate limiting, any rate limit record that had a time outside of my interval was irrelevant and could be dropped in order to reduce space usage (my current approach uses basically the same hack as my syndication feed ratelimits, so I definitely don't want to let its space use grow without bound). This is no longer necessarily true in exponential rate limiting, depending on how big of a rate the record (the source) had built up before it took a break. This old rate 'decays' at a rate I will helpfully put in a table for my own use:

Time since last seen    Old rate multiplied by
1x interval             0.37
2x interval             0.13
3x interval             0.05
4x interval             0.02

(This is, eg, 'exp(-1)' when we last saw the source one 'interval' of time ago.)

Where this becomes especially relevant is if you opt for 'strict' rate limiting instead of 'leaky', where every time the source makes a request you increase its recorded rate even if you reject the request for being rate limited. A high-speed source that insists on hammering you for a while can build up a very large current rate under a strict rate limit policy, and that means its old past behavior can affect it (ie, possibly cause it to be rate limited) well beyond your nominal rate limit interval. Especially with 'strict' rate limiting, you could opt to cap the maximum age a valid record could have and drop everything that you last saw over, say, 3x your interval ago; this would be generous to very high rate old sources, but not too generous (since their old rate would be reduced to 0.05 or less of what it was even if you counted it).

As far as I can see, the behavior with leaky rate limiting and a cost of 1 (for the simple case of all HTTP requests having the same cost) is that if the client keeps pounding away at you, one of its requests will get through on a semi-regular basis. The client will make a successful request, the request will push its rate just over your limit, it will get rate limited some number of times, then enough time will have passed since its last successful request that its new request will be just under the rate limit and succeed. In some environments, this is fine and desired. However, my current goal is to firmly cut off clients that are making requests too fast, so I don't want this; instead, I implemented the 'strict' behavior so you don't get through at all until your request rate and the interval since your last request drops low enough.
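
Here's a deliberately simplified Python sketch of the 'strict' behaviour I implemented, using an exponentially decayed request count; it reproduces the decay in the table above, but it's the general shape rather than a faithful copy of Tony Finch's formulation:

import math
import time

class StrictRateLimiter:
    def __init__(self, limit, period):
        self.limit = limit    # nominal requests allowed per period
        self.period = period  # the rate limit interval, in seconds
        self.records = {}     # source -> (rate, time last seen)

    def allow(self, source, cost=1.0, now=None):
        now = time.time() if now is None else now
        rate, last = self.records.get(source, (0.0, now))
        # Old activity decays by exp(-elapsed/period), as in the table above.
        rate *= math.exp(-(now - last) / self.period)
        # 'Strict': charge for the request even if we wind up rejecting it.
        rate += cost
        self.records[source] = (rate, now)
        return rate <= self.limit

The 'leaky' variant would only add the cost when the request is actually allowed, which is what produces the "one request gets through every so often" behaviour described above.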

Mathematically, a client that makes requests with little or no gap between them (to the precision of your timestamps) can wind up increasing its rate by slightly over its 'cost' per request. If I'm understanding the math correctly, how much over the cost is capped by Tony Finch's 'max(interval, 1.0e-10)' step, with 1.0e-10 being a small but non-zero number that you can move up or down depending on, eg, your language and its floating point precision. Having looked at it, in Python the resulting factor with 1.0e-10 is '1.000000082740371', so you and I probably don't need to worry about this. If the client doesn't make requests quite that fast, its rate will go up each time by slightly less than the 'cost' you've assigned. In Python, a client that makes a request every millisecond has a factor for this of '0.9995001666249781' of the cost; slower request rates make this factor smaller.

This is probably mostly relevant if you're dumping or reporting the calculated rates (for example, when a client hits the rate limit) and get puzzled by the odd numbers that may be getting reported.

I don't know how to implement proper ratelimiting (well, maybe I do now)

By: cks
12 September 2025 at 01:53

In theory I have a formal education as a programmer (although it was a long time ago). In practice my knowledge from it isn't comprehensive, and every so often I run into an area where I know there's relevant knowledge and algorithms but I don't know what they are and I'm not sure how to find them. Today's area is scalable rate-limiting with low storage requirements.

Suppose, not hypothetically, that you want to ratelimit a collection of unpredictable sources and not use all that much storage per source. One extremely simple and obvious approach is to store, for each source, a start time and a count. Every time the source makes a request, you check to see if the start time is within your rate limit interval; if it is, you increase the count (or ratelimit the source), and if it isn't, you reset the start time to now and the count to 1.

(Every so often you can clean out entries with start times before your interval.)
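
Concretely, the simple approach is about this much code (a Python sketch):

import time

class SimpleRateLimiter:
    def __init__(self, limit, interval):
        self.limit = limit        # maximum requests per interval
        self.interval = interval  # interval length in seconds
        self.records = {}         # source -> (start time, count)

    def allow(self, source, now=None):
        now = time.time() if now is None else now
        start, count = self.records.get(source, (now, 0))
        if now - start >= self.interval:
            # The start time is outside our interval, so forget the past
            # and start over from scratch.
            start, count = now, 0
        count += 1
        self.records[source] = (start, count)
        return count <= self.limit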

The disadvantage of this simple approach is that it completely forgets about the past history of each source periodically. If your rate limit intervals are 20 minutes, a prolific source gets to start over from scratch every 20 minutes and run up its count until it gets rate limited again. Typically you want rate limiting not to forget about sources so fast.

I know there are algorithms that maintain decaying averages or moving (rolling) averages. The Unix load average is maintained this way, as is Exim ratelimiting. The Unix load average has the advantage that it's updated on a regular basis, which makes the calculation relatively simple. Exim has to deal with erratic updates that are unpredictable intervals from the previous update, and the comment in the source is a bit opaque to me. I could probably duplicate the formula in my code but I'd have to do a bunch of work to convince myself the result was correct.

(And now I've found Tony Finch's exponential rate limiting (via), which I'm going to have to read carefully, along with the previous GCRA: leaky buckets without the buckets.)

Given that rate limiting is such a common thing these days, I suspect that there are a number of algorithms for this with various different choices about how the limits work. Ideally, it would be possible to readily find writeups of them with internet searches, but of course as you know internet search is fairly broken these days.

(For example you can find a lot of people giving high level overviews of rate limiting without discussing how to actually implement it.)

Now that I've found Tony Finch's work I'm probably going to rework my hacky rate limiting code to do things better, because my brute force approach is using the same space as leaky buckets (as covered in Tony Finch's article) with inferior results. This shows the usefulness of knowing algorithms instead of just coding away.

(Improving the algorithm in my code will probably make no practical difference, but sometimes programming is its own pleasure.)

ZFS snapshots aren't as immutable as I thought, due to snapshot metadata

By: cks
11 September 2025 at 03:29

If you know about ZFS snapshots, you know that one of their famous properties is that they're immutable; once a snapshot is made, its state is frozen. Or so you might casually describe it, but that description is misleading. What is frozen in a ZFS snapshot is the state of the filesystem (or zvol) that it captures, and only that. In particular, the metadata associated with the snapshot can and will change over time.

(When I say it this way it sounds obvious, but for a long time my intuition about how ZFS operated was misled by me thinking that all aspects of a snapshot had to be immutable once made and trying to figure out how ZFS worked around that.)

One visible place where ZFS updates the metadata of a snapshot is to maintain information about how much unique space the snapshot is using. Another is that when a ZFS snapshot is deleted, other ZFS snapshots may require updates to adjust the list of snapshots (every snapshot points to the previous one) and the ZFS deadlist of blocks that are waiting to be freed.

Mechanically, I believe that various things in a dsl_dataset_phys_t are mutable, with the exception of things like the creation time and the creation txg, and also the block pointer, which points to the actual filesystem data of the snapshot. Things like the previous snapshot information have to be mutable (you might delete the previous snapshot), and things like the deadlist and the unique bytes are mutated as part of operations like snapshot deletion. The other things I'm not sure of.

(See also my old entry on a broad overview of how ZFS is structured on disk. A snapshot is a 'DSL dataset' and it points to the object set for that snapshot. The root directory of a filesystem DSL dataset, snapshot or otherwise, is at a fixed number in the object set; it's always object 1. A snapshot freezes the object set as of that point in time.)

PS: Another mutable thing about snapshots is their name, since 'zfs rename' can change that. The manual page even gives an example of using (recursive) snapshot renaming to keep a rolling series of daily snapshots.

How I think OpenZFS's 'written' and 'written@<snap>' dataset properties work

By: cks
10 September 2025 at 03:25

Yesterday I wrote some notes about ZFS's 'written' dataset property, where the short summary is that 'written' reports the amount of space written in a snapshot (ie, that wasn't in the previous snapshot), and 'written@<snapshot>' reports the amount of space written since the specified snapshot (up to either another snapshot or the current state of the dataset). In that entry, I left un-researched the question of how ZFS actually gives us those numbers; for example, if there was a mechanism in place similar to the complicated one for 'used' space. I've now looked into this and as far as I can see the answer is that ZFS determines this information on the fly.

The guts of the determination are in dsl_dataset_space_written_impl(), which has a big comment that I'm going to quote wholesale:

Return [...] the amount of space referenced by "new" that was not referenced at the time the bookmark corresponds to. "New" may be a snapshot or a head. The bookmark must be before new, [...]

The written space is calculated by considering two components: First, we ignore any freed space, and calculate the written as new's used space minus old's used space. Next, we add in the amount of space that was freed between the two time points, thus reducing new's used space relative to old's. Specifically, this is the space that was born before zbm_creation_txg, and freed before new (ie. on new's deadlist or a previous deadlist).

(A 'bookmark' here is an internal ZFS thing.)

When this talks about 'used' space, this is not the "used" snapshot property; this is the amount of space the snapshot or dataset refers to, including space shared with other snapshots. If I'm understanding the code and the comment right, the reason we add back in freed space is because otherwise you could wind up with a negative number. Suppose you wrote a 2 GB file, made one snapshot, deleted the file, and then made a second snapshot. The difference in space referenced between the two snapshots is slightly less than negative 2 GB, but we can't report that as 'written', so we go through the old stuff that got deleted and add its size back in to make the number positive again.

To determine the amount of space that's been freed between the bookmark and "new", the ZFS code walks backward through all snapshots from "new" to the bookmark, calling another ZFS function to work out how much relevant space got deleted. This reuses the deadlists that ZFS already maintains in order to know when it can free blocks.
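
To make the arithmetic concrete, here's a small Python sketch of the calculation as I understand it from that comment. This isn't the actual OpenZFS code (which is in C and works with on-disk deadlist objects); all of the names here are invented for illustration.

    from dataclasses import dataclass

    # Illustrative stand-ins; the real structures are C and live on disk.
    @dataclass
    class Snap:
        creation_txg: int
        referenced_bytes: int

    @dataclass
    class DeadEntry:
        birth_txg: int      # the txg the block was born in
        size: int           # bytes

    def written_since(old, new, deadlists_between):
        # Naive difference in referenced space; deleting data can push
        # this negative, which is why we adjust below.
        written = new.referenced_bytes - old.referenced_bytes
        # Add back space that existed as of 'old' but had been freed by
        # the time of 'new' (ie, it's on new's deadlist or the deadlist
        # of some snapshot in between).
        for deadlist in deadlists_between:
            for entry in deadlist:
                if entry.birth_txg <= old.creation_txg:
                    written += entry.size
        return written

In the 'write a 2 GB file, snapshot, delete it, snapshot again' example, the first term is roughly minus 2 GB and the add-back is roughly plus 2 GB, which is why 'written' comes out small instead of negative.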

This code is used both for 'written@<snap>' and 'written'; the only difference between them is that when you ask for 'written', the ZFS kernel code automatically finds the previous snapshot for you.

Some notes on OpenZFS's 'written' dataset property

By: cks
9 September 2025 at 03:28

ZFS snapshots and filesystems have a 'written' property, and a related 'written@snapshot' one. These are documented as:

written
The amount of space referenced by this dataset, that was written since the previous snapshot (i.e. that is not referenced by the previous snapshot).

written@snapshot
The amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot. [...]

(Apparently I never noticed the 'written' property before recently, despite it being there from very long ago.)

The 'written' property is related to the 'used' property, and it's both more confusing and less confusing as it relates to snapshots. Famously (but not famously enough), for snapshots the used property ('USED' in the output of 'zfs list') only counts space that is exclusive to that snapshot. Space that's only used by snapshots but that is shared by more than one snapshot is in 'usedbysnapshots'.

To understand 'written' better, let's do an experiment: we'll make a snapshot, write a 2 GByte file, make a second snapshot, write another 2 GByte file, make a third snapshot, and then delete the first 2 GB file. Since I've done this, I can tell you the results.

If there are no other snapshots of the filesystem, the first snapshot's 'written' value is the full size of the filesystem at the time it was made, because everything was written before it was made. The second snapshot's 'written' is 2 GBytes, the data file we wrote between the first and the second snapshot. The third snapshot's 'written' is another 2 GB, for the second file we wrote. However, at the end, after we delete one of the data files, the filesystem's 'written' is small (certainly not 2 GB), and so would be the 'written' of a fourth snapshot if we made one.

The reason the filesystem's 'written' is so small is that ZFS is counting concrete on-disk (new) space. Deleting a 2 GB file frees up a bunch of space but it doesn't require writing very much to the filesystem, so the 'written' value is low.

If we look at the 'used' values for all three snapshots, they're all going to be really low. This is because neither 2 GByte data file is exclusive to a single snapshot: the first file is shared between the second and third snapshots, and the second file is shared between the third snapshot and the live filesystem. The first file's space therefore shows up in 'usedbysnapshots' rather than in any individual snapshot's 'used'.

(ZFS has a somewhat complicated mechanism to maintain all of this information.)

There is one interesting 'written' usage that appears to show you deleted space, but it is a bit tricky. The manual page implies that the normal usage of 'written@<snapshot>' is to ask for it for the filesystem itself; however, in experimentation you can ask for it for a snapshot too. So take the three snapshots above, and the filesystem after deleting the first data file. If you ask for 'written@first' for the filesystem, you will get 2 GB, but if you ask for 'written@first' for the third snapshot, you will get 4 GB. What the filesystem appears to be reporting is how much still-live data has been written between the first snapshot and now, which is only 2 GB because we deleted the other 2 GB. Meanwhile, all four GB are still alive in the third snapshot.

My conclusion from looking into this is that I can use 'written' as an indication of how much new data a snapshot has captured, but I can't use it as an indication of how much changed in a snapshot. As I've seen, deleting data is a potentially big change but a small 'written' value. If I'm understanding 'written' correctly, one useful thing about it is that it shows roughly how much data an incremental 'zfs send' of just that snapshot would send. Under some circumstances it will also give you an idea of how much data your backup system may need to back up; however, this works best if people are creating new files (and deleting old ones), instead of updating or appending to existing files (where ZFS only updates some blocks but a backup system probably needs to re-save the whole thing).

Why Firefox's media autoplay settings are complicated and imperfect

By: cks
8 September 2025 at 03:25

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling media playback simple and blocking autoplay reliable; the browser would ignore the autoplay attribute, and the person using the browser would directly trigger playback by interacting with controls that the browser itself provides, so the browser could know for sure that a person had directly clicked on them and that the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know somewhat of how you interacted with a web page, such as whether you clicked or scrolled or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

If Firefox only blocked the actual autoplay attribute and allowed JavaScript to trigger media playback any time it wanted, that would be a pretty obviously bad 'Block Autoplay' experience, so Firefox must try harder. Firefox's approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' are drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript event handler. However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

We can't expect people to pick 'good' software

By: cks
7 September 2025 at 02:35

One of the things I've come to believe in (although I'm not consistent about it) is that we can't expect people to pick software that is 'good' in a technical sense. People certainly can and do pick software that is good in that it works nicely, has a user interface that works for them, and so on, which is to say all of the parts of 'good' that they can see and assess, but we can't expect people to go beyond that, to dig deeply into the technical aspects to see how good their choice of software is. For example, how efficiently an IMAP client implements various operations at the protocol level is more or less invisible to most people. Even if you know enough to know about potential technical quality aspects, realistically you have to rely on any documentation the software provides (if it provides anything). Very few people are going to set up an IMAP server test environment and point IMAP clients at it to see how they behave, or try to read the source code of open source clients.

(Plus, you have to know a lot to set up a realistic test environment. A lot of modern software varies its behavior in subtle ways depending on the surrounding environment, such as the server (or client) at the other end, what your system is like, and so on. To extend my example, the same IMAP client may behave differently when talking to two different IMAP server implementations.)

Broadly, the best we can do is get software to describe important technical aspects of itself, to document them even if the software doesn't, and to explain to people why various aspects matter and thus what they should look for if they want to pick good software. I think this approach has seen some success in, for example, messaging apps, where 'end to end encrypted' and similar labels have become a technical quality measure that's relatively legible to people. Other technical quality measures in other software are much less legible to people in general, including in important software like web browsers.

(One useful way to make technical aspects legible is to create some sort of scorecard for them. Although I don't think it was built for this purpose, there's caniuse for browsers and their technical quality for various CSS and HTML5 features.)

To me, one corollary to this is that there's generally no point in yelling at people (in various ways) or otherwise punishing them because they picked software that isn't (technically) good. It's pretty hard for a non-specialist to know what is actually good or who to trust to tell them what's actually good, so it's not really someone's fault if they wind up with not-good software that does undesirable things. This doesn't mean that we should always accept the undesirable things, but it's probably best to either deal with them or reject them as gracefully as possible.

(This definitely doesn't mean that we should blindly follow Postel's Law, because a lot of harm has been done to various ecosystems by doing so. Sometimes you have to draw a line, even if it affects people who simply had bad luck in what software they picked. But ideally there's a difference between drawing a line and yelling at people about them running into the line.)

Our too many paths to 'quiet' Prometheus alerts

By: cks
6 September 2025 at 02:54

One of the things our Prometheus environment has is a notion of different sorts of alerts, and in particular of less important alerts that should go to a subset of people (ie, me). There are various reasons for this, including that the alert is in testing, or it concerns a subsystem that only I should have to care about, or that it fires too often for other people (for example, a reboot notification for a machine we routinely reboot).

For historical reasons, there are at least four different ways that this can be done in our Prometheus environment:

  • a special label can be attached to the Prometheus alert rule, which is appropriate if the alert rule itself is in testing or otherwise is low priority.

  • a special label can be attached to targets in a scrape configuration, although this has some side effects that can be less than ideal. This affects all alerts that trigger based on metrics from, for example, the Prometheus host agent (for that host).

  • our Prometheus configuration itself can apply alert relabeling to add the special label for everything from a specific host, as indicated by a "host" label that we add. This is useful if we have a lot of exporters being scraped from a particular host, or if I want to keep metric continuity (ie, the metrics not changing their label set) when a host moves into production.

  • our Alertmanager configuration can specifically route certain alerts about certain machines to the 'less important alerts' destination.

The drawback of these assorted approaches is that now there are at least three places to check and possibly to update when a host moves from being a testing host into being a production host. A further drawback is that some of these (the first two) are used a lot more often than others (the last two). When you have multiple things, some of which are infrequently used, and fallible humans have to remember to check them all, you can guess what can happen next.

And that is the simple version of why alerts about one of our fileservers wouldn't have gone to everyone here for about the past year.

How I discovered the problem was that I got an alert about one of the fileserver's Prometheus exporters restarting, and decided that I should update the alert configuration to make it so that alerts about this service restarting only went to me. As I was in the process of doing this, I realized that the alert already had only gone to me, despite there being no explicit configuration in the alert rule or the scrape configuration. This set me on an expedition into the depths of everything else, where I turned up an obsolete bit in our general Prometheus configuration.

On the positive side, now I've audited our Prometheus and Alertmanager configurations for any other things that shouldn't be there. On the negative side, I'm now not completely sure that there isn't a fifth place that's downgrading (some) alerts about (some) hosts.

Could NVMe disks become required for adequate performance?

By: cks
5 September 2025 at 03:34

It's not news that full speed NVMe disks are extremely fast, as well as extremely good at random IO and doing a lot of IO at once. In fact they have performance characteristics that upset general assumptions about how you might want to design systems, at least for reading data from disk (for example, you want to generate a lot of simultaneous outstanding requests, either explicitly in your program or implicitly through the operating system). I'm not sure how much write bandwidth normal NVMe drives can really deliver for sustained write IO, but I believe that they can absorb very high write rates for a short period as you flush out a few hundred megabytes or more. This is a fairly big sea change from even SATA SSDs (and I believe SAS SSDs), never mind HDDs.

About a decade ago, I speculated that everyone was going to be forced to migrate to SATA SSDs because developers would build programs that required SATA SSD performance. It's quite common for developers to build programs and systems that run well on their hardware (whether that's laptops, desktops, or servers, cloud or otherwise), and developers often use the latest and best. These days, that's going to have NVMe SSDs, and so it wouldn't be surprising if developers increasingly developed for full NVMe performance. Some of this may be inadvertent, in that the developer doesn't realize what the performance impact of their choices are on systems with less speedy storage. Some of this will likely be deliberate, as developers choose to optimize for NVMe performance or even develop systems that only work well with that level of performance.

This is a potential problem because there are a number of ways to not have that level of NVMe performance. Most obviously, you can simply not have NVMe drives; instead you may be using SATA SSDs (as we mostly are, including in our fileservers), or even HDDs (as we are in our Prometheus metrics server). Less obviously, you may have NVMe drives but be driving them in ways that don't give you the full NVMe bandwidth. For instance, you might have a bunch of NVMe drives behind a 'tri-mode' HBA, or have (some of) your NVMe drives hanging off the chipset with shared PCIe lanes to the CPU, or have to drive some of your NVMe drives with fewer than x4 PCIe because of limits on slots or lanes.

(Dedicated NVMe focused storage servers will be able to support lots of NVMe devices at full speed, but such storage servers are likely to be expensive. People will inevitably build systems with lower end setups, us included, and I believe that basic 1U servers are still mostly SATA/SAS based.)

One possible reason for optimism is that in today's operating systems, it can take careful system design and unusual programming patterns to really push NVMe disks to high performance levels. This may make it less likely that software accidentally winds up being written so it only performs well on NVMe disks; if it happens, it will be deliberate and the project will probably tell you about it. This is somewhat unlike the SSD/HDD situation a decade ago, where the difference in (random) IO operations per second was both massive and easily achieved.

(This entry was sparked in part by reading this article (via), which I'm not taking a position on.)

HTTP headers that tell syndication feed fetchers how soon to come back

By: cks
4 September 2025 at 03:17

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can alternatively set an Expires header, which gives an absolute expiry time, but you shouldn't set both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.
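
As a concrete sketch of this (in Python, and definitely not the code Wandering Thoughts actually uses), the decision can be as simple as the following. The intervals are just the ones I've mentioned in this entry, and a real implementation would also have to deal with conditional GETs that produce 304s, logging, and so on.

    import time

    # Sketch values: a 45 minute conditional GET rate limit with a one
    # hour advertised max-age, and four hours for unconditional GETs.
    COND_LIMIT, COND_ADVERTISE = 45 * 60, 3600
    UNCOND_LIMIT = UNCOND_ADVERTISE = 4 * 3600

    def feed_response(last_success, conditional, now=None):
        """Return (status, headers) for a syndication feed request."""
        if now is None:
            now = time.time()
        limit, advertise = ((COND_LIMIT, COND_ADVERTISE) if conditional
                            else (UNCOND_LIMIT, UNCOND_ADVERTISE))
        if now - last_success < limit:
            # Rate limited: a constant Retry-After, even though the
            # client might be able to succeed a bit sooner than this.
            return 429, {"Retry-After": str(advertise)}
        # On success, always advertise the conditional GET interval.
        return 200, {"Cache-Control": "max-age=%d" % COND_ADVERTISE}

The important part is simply that the 429 path hands back a fixed Retry-After value instead of computing the exact remaining time until the rate limit would allow another request.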

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or if only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) or succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.
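
Here's a minimal Python sketch of the difference (not Exim's code or mine); the only thing that changes between the two behaviors is which requests get to update the timestamp.

    class RateLimiter:
        """Allow at most one request per 'interval' seconds.

        With strict=True, rejected requests also reset the clock (what
        Exim calls a strict ratelimit); otherwise only successful
        requests count.
        """
        def __init__(self, interval, strict=False):
            self.interval = interval
            self.strict = strict
            self.last = None    # time of the last request that counted

        def allow(self, now):
            if self.last is not None and now - self.last < self.interval:
                if self.strict:
                    self.last = now    # a rejected attempt still pushes things back
                return False
            self.last = now
            return True

With a 30 minute interval and the fetch pattern above (T, T+20 minutes, T+33 minutes), the lenient version allows the third fetch while the strict version rejects it, and a feed reader that keeps retrying more often than the interval will never succeed against the strict version.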

Mapping from total requests per day to average request rates

By: cks
3 September 2025 at 03:43

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests Theoretical interval (rate)
6 Four hours
12 Two hours
24 One hour
32 45 minutes
48 30 minutes
96 15 minutes
144 10 minutes
288 5 minutes
360 4 minutes
480 3 minutes
720 2 minutes
1440 One minute
2880 30 seconds
5760 15 seconds
8640 10 seconds
17280 5 seconds
43200 2 seconds
86400 One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)
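
(If I ever want this table for different request counts, it's a few lines of Python to regenerate; the counts below are just the ones from the table above.)

    # Regenerate the table above for whatever request counts are interesting.
    counts = [6, 12, 24, 32, 48, 96, 144, 288, 360, 480, 720,
              1440, 2880, 5760, 8640, 17280, 43200, 86400]

    for n in counts:
        secs = 86400 / n
        if secs >= 3600:
            interval = f"{secs / 3600:g} hour(s)"
        elif secs >= 60:
            interval = f"{secs / 60:g} minute(s)"
        else:
            interval = f"{secs:g} second(s)"
        print(f"{n:6d}  {interval}")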

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

In GNU Emacs, I should remember that the basics still work

By: cks
2 September 2025 at 03:42

Over on the Fediverse, I said something that has a story attached:

It sounds obvious to say it, but I need to remember that I can always switch buffers in GNU Emacs by just switching buffers, not by using, eg, the MH-E commands to switch (back) to another folder. The MH-E commands quite sensibly do additional things, but sometimes I don't want them.

GNU Emacs has a spectrum of things that range from assisting your conventional editing (such as LSP clients) to what are essentially nearly full-blown applications that happen to be embedded in GNU Emacs, such as magit and MH-E and the other major modes for reading your email (or Usenet news, or etc). One of my personal dividing lines is to what extent the mode takes over from regular Emacs keybindings and regular Emacs behaviors. On this scale, MH-E is quite high on the 'application' side; in MH-E folder buffers, you mostly do things through custom keybindings.

(Well, sort of. This is actually overselling the case because I use regular Emacs buffer movement and buffer searching commands routinely, and MH-E uses Emacs marks to select ranges of messages, which you establish through normal Emacs commands. But actual MH-E operations, like switching to another folder, are done through custom keybindings that involve MH-E functions.)

My dominant use of GNU Emacs at the moment is as a platform for MH-E. When I'm so embedded in an MH-E mindset, it's easy to wind up with a form of tunnel vision, where I think of the MH-E commands as the only way to do something like 'switch to another (MH) folder'. Sometimes I do need or want to use the MH-E commands, and sometimes they're the easiest way, but part of the power of GNU Emacs as a general purpose environment is that ultimately, MH-E's displays of folders and messages, the email message I'm writing, and so on, are all just Emacs buffers being displayed in Emacs windows. I don't have to switch between these things through MH-E commands if I don't want to; I can just switch buffers with 'C-x b'.

(Provided that the buffer already exists. If the buffer doesn't exist, I need to use the MH-E command to create it.)

Sometimes the reason to use native Emacs buffer switching is that there's no MH-E binding for the functionality, for example to switch from a mail message I'm writing back to my inbox (either to look at some other message or to read new email that just came in). Sometimes it's because, for example, the MH-E command to switch to a folder wants to rescan the MH folder, which forces me to commit or discard any pending deletions and refilings of email.

One of the things that makes this work is that MH-E uses a bunch of different buffers for things. For example, each MH folder gets its own separately named buffer, instead of MH-E simply loading the current folder (whatever it is) into a generic 'show a folder' buffer. Magit does something similar with buffer naming, where its summary buffer isn't called just 'magit' but 'magit: <directory>' (I hadn't noticed that until I started writing this entry, but of course Magit would do it that way as a good Emacs citizen).

Now that I've written this, I've realized that a bit of my MH-E customization uses a fixed buffer name for a temporary buffer, instead of a buffer name based on the current folder. I'm in good company on this, since a number of MH-E status display commands also use fixed-name buffers, but perhaps I should do better. On the other hand, using a fixed buffer name does avoid having a bunch of these buffers linger around just because I used my command.

(This is using with-output-to-temp-buffer, and a lot of use of it in GNU Emacs' standard Lisp is using fixed names, so maybe my usage here is fine. The relevant Emacs Lisp documentation doesn't have style and usage notes that would tell me either way.)

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks
1 September 2025 at 02:46

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is all our responsibility. The fewer weird local things we have, the better, especially since we have so many as it is.)

I wish Firefox had some way to permanently mute a website

By: cks
31 August 2025 at 02:27

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.

Argparse will let you have multiple long (and short) options for one thing

By: cks
30 August 2025 at 03:19

Argparse is the standard Python module for handling (Unix style) command line options, in the expected way (which not all languages follow). Or at least more or less the expected way; people are periodically surprised that by default argparse allows you to abbreviate long options (although you can safely turn that off if you assume Python 3.8 or later and you remember this corner case).

What I think of as the typical language API for specifying short and long options allows you to specify (at most) one of each; this is the API of, for example, the Go package I use for option handling. When I've written Python programs using argparse, I've followed this usage without thinking very much about it. However, argparse doesn't actually require you to restrict yourself this way. The add_argument() method accepts a list of option strings, and although the documentation's example shows a single short option and a single long option, you can give it more than one of each and it will work.

So yes, you can perfectly reasonably create an argparse option that can be invoked as either '--ns' or '--no-something', so that on the one hand you have a clear canonical version and on the other hand you have something short for convenience. If I'm going to do this (and sometimes I am), the thing I want to remember is that argparse's help output will report these options in the order I gave them to add_argument(), so I probably want to list the long one first, as the canonical and clearest form. In other words:

parser.add_argument("--no-something", "--ns", ....)

so that the -h output I get says:

--no-something, --ns     Don't do something

(If you have multiple '--no-...' options, abbreviated options aren't as compact as this '--ns' style. Of course it's a little bit unusual to have several long options that mean the same thing, but my view is that long options are sort of a zoo anyway and you might as well be convenient.)

Having multiple short (single letter) options for the same thing is also possible but much less in the Unix style, so I'm not sure I'd ever use it. One plausible use is mapping old short options to your real ones for compatibility (or just options that people are accustomed to using for some particular purpose from other programs, and keep using with yours).
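
Putting this together, here's a minimal self-contained version of the sort of thing I mean (the option names here are invented for illustration):

    import argparse

    parser = argparse.ArgumentParser(prog="example")
    # One option with a canonical long name plus a short-for-convenience alias.
    parser.add_argument("--no-something", "--ns", action="store_true",
                        help="Don't do something")
    # Multiple short options for one thing also work, for example to keep
    # an old spelling around for compatibility.
    parser.add_argument("--verbose", "-v", "-d", action="store_true",
                        help="Be chatty")

    args = parser.parse_args()
    print(args.no_something, args.verbose)

Running this with -h lists the first option as '--no-something, --ns', in that order, which is what I want.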

(This is probably not news to anyone who's really used argparse. I'm partly writing this down so that I'll remember it in the future.)

You can only customize GNU Emacs so far due to primitives

By: cks
29 August 2025 at 03:52

GNU Emacs is famous as an editor written largely in itself, well, in Emacs Lisp, with a C core for some central high performance things and things that have to be done in C (called 'primitives' in Emacs jargon). It's perhaps popular to imagine that the overall structure of this is that the C parts of GNU Emacs expose a minimal and direct API that's mostly composed of primitive operations, so that as much of Emacs as possible can be implemented in Emacs Lisp. Unfortunately, this isn't really the case, or at least not necessarily as you'd like it, and one consequence of this is to limit the amount of customization you can feasibly do to GNU Emacs.

An illustration of this is in how GNU Emacs de-iconifies frames in X. In a minimal C API version of GNU Emacs, there might be various low level X primitives, including 'x-deiconify-frame', and the Emacs Lisp code for frame management would call these low level X primitives when running under X, and other primitives when running under Windows, and so on. In the actual GNU Emacs, deiconification of frames happens at multiple points and the exposed primitives are things like raise-frame and make-frame-visible. As their names suggest, these primitives aren't there to give Emacs Lisp code access to low level X operations, they're there to do certain higher level logical things.

This is a perfectly fair and logical decision by the GNU Emacs developers. To put it one way, GNU Emacs is opinionated. It and its developers have a certain model of how it works and how things should behave, what it means for the program to be 'GNU Emacs' as opposed to a hypothetical editor construction kit, and what the C code does is a reflection of that. To the Emacs developers, 'make a frame visible' is a sensible thing to do and is best done in C, so they did it that way.

(Buffers are another area where Emacs is quite opinionated on how it wants to work. This sometimes gets awkward, as anyone who's wrestled with temporarily displaying some information from Emacs Lisp may have experienced.)

The drawback of this is that sometimes you can only easily customize GNU Emacs in ways that line up with what the developers expected, since you can't change the inside of C level primitives. If the operation you want to hook, modify, block, or otherwise fiddle with matches how GNU Emacs sees things, all is probably good. But if your concept of 'an operation' doesn't match up with how GNU Emacs sees it, you may find that what you want to touch is down inside the C layer and isn't exposed as a separate primitive.

(Even if it is exposed as a primitive in its own right, you can have problems, because when you advise a primitive, this doesn't affect calls to the primitive from other C functions. If there was a separate 'x-deiconify-frame' primitive, I could hook it for calls from Lisp, but not a call from 'make-frame-visible' if that was still a primitive. So to really have effective hooking of a primitive, you need it to be only called from Lisp code (at least for cases you care about).)

PS: This can lead to awkward situations even when everything you want to modify is in Emacs Lisp code, because the specific bit you want to change may be in the middle of a large function. Of course with Emacs Lisp you can always redefine the function, copying its code and modifying it to taste, but there are still drawbacks. You can make this somewhat more reliable in the face of changes (via a comment on this entry), but it's still not great.

The Bash Readline bindings and settings that I want

By: cks
28 August 2025 at 02:49

Normally I use Bash (and Readline in general) in my own environment, where I have a standard .inputrc set up to configure things to my liking (although it turns out that one particular setting doesn't work now (and may never have), and I didn't notice). However, sometimes I wind up using Bash in foreign environments, for example if I'm su'd to root at the moment, and when that happens the differences can be things that I get annoyed by. I spent a bit of today running into this again and being irritated enough that this time I figured out how to fix it on the fly.

The general Bash command to do readline things is 'bind', and I believe it accepts all of the same syntax as readline init files do, both for keybindings and for turning off (mis-)features like bracketed paste (which we dislike enough that turning it off for root is a standard feature of our install framework). This makes it convenient if I forget the exact syntax, because I can just look at my standard .inputrc and copy lines from it.

What I want to do is the following:

  • Switch Readline to the Unix word erase behavior I want:

    set bind-tty-special-chars off
    Control-w: backward-kill-word

    Both of these are necessary because without the first, Bash will automatically bind Ctrl-w (my normal word-erase character) to 'unix-word-rubout' and not let you override that with your own binding.

    (This is the difference that I run into all the time, because I'm very used to being able to use Ctrl-W to delete only the most recent component of a path. I think this partly comes from habit and partly because you tab-complete multi-component paths a component at a time, so if I mis-completed the latest component I want to Ctrl-W away just that component. M-Del is a standard Readline binding for this, but it's less convenient to type and not something I remember.)

  • Make readline completion treat symbolic links to directories as if they were directories:

    set mark-symlinked-directories on

    When completing paths and so on, I mostly don't bother thinking about the difference between an actual directory (such as /usr/bin) and a symbolic link to a directory (such as /bin on modern Linuxes). If I type '/bi<TAB>' I want this to complete to '/bin/', not '/bin', because it's basically guaranteed that I will go on to tab-complete something in '/bin/'. If I actually want the symbolic link, I'll delete the trailing '/' (which does happen every so often, but much less frequently than I want to tab-complete through the symbolic link).

  • Make readline forget any random edits I did to past history lines when I hit Return to finally do something:

    set revert-all-at-newline on

    The behavior I want from readline is that past history is effectively immutable. If I edit some bit of it and then abandon the edit by moving to another command in the history (or just start a command from scratch), the edited command should revert to being what I actually typed back when I executed it, no later than when I hit Return on the current command and start a new one. It infuriates me when I cursor-up (on a fresh command) and don't see exactly the past commands that I typed.

    (My notes say I got this from Things You Didn't Know About GNU Readline.)

This is more or less in the order I'm likely to fix them. The different (and to me wrong) behavior of C-w is a relatively constant irritation, while the other two are less frequent.

(If this irritates me enough on a particular system, I can probably do something in root's .bashrc, if only to add an alias to use 'bind -f ...' on a prepared file. I can't set these in /root/.inputrc, because my co-workers don't particularly agree with my tastes on these and would probably be put out if standard readline behavior they're used to suddenly changed on them.)

(In other Readline things I want to remember, there's Readline's support for fishing out last or first or Nth arguments from earlier commands.)

Why Wandering Thoughts has fewer comment syndication feeds than yesterday

By: cks
27 August 2025 at 03:05

Over on the Fediverse I said:

My techblog used to offer Atom syndication feeds for the comments on individual entries. I just turned that off because it turns out to be a bad idea on the modern web when you have many years of entries. There are (were) any number of 'people' (feed things) that added the comment feeds for various entries years ago and then never took them out, despite those entries being years old and in some cases never having gotten comments in the first place.

DWiki, the engine behind Wandering Thoughts, is nothing if not general. Syndication feeds, for example, are a type of 'view' over a directory hierarchy, and are available for both pages and comments. A regular (page) syndication feed view can only be done over (on) a directory, because if it was applied to an individual page the feed would only ever contain that page. However, when I wrote DWiki it was obvious that a comment syndication feed for a particular page made sense; it would give you all of the comments 'under' that page (ie, on it). And so for almost all of the time that Wandering Thoughts has been in operation, you could have looked down to the bottom of an entry's page (on the web) and seen in small type 'Atom Syndication: Recent Comments' (with the 'recent comments' being a HTML link giving you the URL of that page's comment feed).

(The comment syndication feed for a directory is all comments on all pages underneath the directory.)

That's gone now, because I decided that it didn't make sense in what Wandering Thoughts has become and because I was slowly accumulating feed readers that were pulling the comment syndication feeds for more and more entries. This is exactly the behavior I should have expected from feed readers from the start; once someone puts a feed in, that feed is normally forever even if it's extremely inactive or has never had an entry. The feed reader will dutifully poll every feed for years to come (well, certainly every feed that responds with HTTP success and a valid syndication feed, which all of my comment feeds did).

(There weren't very many pages having their comment syndication feeds hit, but there were enough that I kept noticing them, especially when I added things like hacky rate limiting for feed fetching. I actually put in some extra hacks to deal with how requests for these feeds interacted with my rate limiting.)

There are undoubtedly places on the Internet where discussion (in the form of comments) continues on for years on certain pages, and so a comment feed for an individual page could make sense; you really might keep up (in your feed reader) with a slow moving conversation that lasts years. Other places on the Internet put definite cut-offs on further discussion (comments) on individual pages, which provides a natural deadline to turn off the page's comment syndication feed. But neither of those profiles describes Wandering Thoughts, where my entries remain open for comments more or less forever (and sometimes people do comment on quite old entries), but comments and discussions don't tend to go on for very long.

Of course, the other thing that this change prevents is that it stops (LLM) web crawlers from trying to crawl all of those URLs for comment syndication feeds. You can't crawl URLs that aren't advertised any more and no longer exist (well, sort of, they technically exist but the code for handling them arranges to return 404s if the new 'no comment feeds for actual pages' configuration option is turned on).

Giving up on Android devices using IPv6 on our general-access networks

By: cks
26 August 2025 at 03:42

We have a couple of general purpose, general access networks that anyone can use to connect their devices to; one is a wired network (locally, it's called our 'RED' network after the colour of the network cables used for it), and the other is a departmental wireless network that's distinct from the centrally run university-wide network. However, both of these networks have a requirement that we need to be able to more or less identify who is responsible for a machine on them. Currently, this is done through (IPv4) DHCP and registering the Ethernet address of your device. This is a problem for any IPv6 deployment, because the Android developers refuse to support DHCPv6.

We're starting to look more seriously at IPv6, including sort of planning out how our IPv6 subnets will probably work, so I came back to thinking about this issue recently. My conclusion and decision was to give up on letting Android devices use IPv6 on our networks. We can't use SLAAC (StateLess Address AutoConfiguration) because that doesn't require any sort of registration, and while Android devices apparently can use IPv6 Prefix Delegation, that would consume /64s at a prodigious rate under reasonable assumptions. We'd also have to build a system to do it. So there's no straightforward answer, and while I can think of potential hacks, I've decided that none of them are particularly good options compared to the simple choice to not support IPv6 for Android by way of only supporting DHCPv6.

(Our requirement for registering a fixed Ethernet address also means that any device that randomizes its wireless Ethernet address on every connection has to turn that off. Hopefully all such devices actually have such an option.)

I'm only a bit sad about this, because you can only hope that a rock rolls uphill for so long before you give up. IPv6 is still not a critical thing in my corner of the world (as shown by how no one is complaining to us about the lack of it), so some phones continuing to not have IPv6 is not likely to be a big deal to people here.

(Android devices that can be connected to wired networking will be able to get IPv6 on some research group networks. Some research groups ask for their network to be open and not require pre-registration of devices (which is okay if it only exists in access-controlled space), and for IPv6 I expect we'll do this by turning on SLAAC on the research group's network and calling it a day.)

Connecting M.2 drives to various things (and not doing so)

By: cks
25 August 2025 at 03:06

As a result of discovering that (M.2) NVMe SSDs seem to have become the dominant form of SSDs, I started looking into what you could connect M.2 NVMe SSDs to. In particular, I started looking to see if you could turn M.2 NVMe SSDs into SATA SSDs, so you could connect high capacity M.2 NVMe SSDs to, for example, your existing stock of ZFS fileservers (which use SATA SSDs). The short version is that as far as I can tell, there's nothing that does this, and once I started thinking about it I wasn't as surprised as I might have been.

What you can readily find is passive adapters from M.2 NVMe or M.2 SATA to various other forms of either NVMe or SATA, depending. For example, there are M.2 NVMe to U.2 cases, and M.2 SATA to SATA cases; these are passive because they're just wiring things through, with no protocol conversion. There are also some non-passive products that go the other way; they're a M.2 'NVMe' 2280 card that has four SATA ports on it (and presumably a PCIe SATA controller). However, the only active M.2 NVMe product (one with protocol conversion) that I can find is M.2 NVMe to USB, generally in the form of external enclosures.

(NVMe drives are PCIe devices, so an 'M.2 NVMe' connector is actually providing some PCIe lanes to the M.2 card. Normally these lanes are connected to an NVMe controller, but I don't believe there's any intrinsic reason that you can't connect them to other PCIe things. So you can have 'PCIe SATA controller on an M.2 PCB' and various other things.)

When I thought about it, I realized the problem with my hypothetical 'obvious' M.2 NVMe to SATA board (and case): since it involves protocol conversion (between NVMe and SATA), someone would have to make the controller chipset for it. You can't make a M.2 NVMe to SATA adapter until someone goes to the expense of designing and fabricating (and probably programming) the underlying chipset, and presumably no one has yet found it commercially worthwhile to do so. Since (M.2) NVMe to USB adapters exist, protocol conversion is certainly possible, and since such adapters are surprisingly inexpensive, presumably there's enough demand to drive down the price of the underlying controller chipsets.

(These chipsets are, for example, the Realtek RTL9210B-CG or the ASMedia ASM3242.)

Designing a chipset is not merely expensive, it's very expensive, which to me explains why there aren't any high-priced options for connecting a NVMe drive up via SATA, the way there are high-priced options for some uncommon things (like connecting multiple NVMe drives to a single PCIe slot without PCIe bifurcation, which can presumably be done with the right existing PCIe bridge chipset).

(Since I checked: there also don't currently seem to be any high capacity M.2 SATA SSDs (which in theory could just be a controller chipset swap from the M.2 NVMe version). If they existed, you could use a passive M.2 SATA to 2.5" SATA adapter to get them into the form factor you want.)

It seems like NVMe SSDs have overtaken SATA SSDs for high capacities

By: cks
24 August 2025 at 02:20

For a long time, NVMe SSDs were the high end option; as the high end option they cost more than SATA SSDs of the same capacity, and SATA SSDs were generally available in higher capacity than NVMe SSDs (at least at prices you wanted to pay). This is why my home desktop wound up with a storage setup with a mirrored pair of 2 TB NVMe SSDs (which felt pretty indulgent) and a mirrored pair of 4 TB SATA SSDs (which felt normal-ish). Today, for reasons outside the boundary of this entry, I wound up casually looking to see how available large SSDs were. What I expected to find was that large-capacity SATA SSDs would now be reasonably available and not too highly priced, while NVMe SSDs would top out at perhaps 4TB and high prices.

This is not what I found, at least at some large online retailers. Instead, SATA SSDs seem to have almost completely stagnated at 4 TB, with capacities larger than that only available from a few specialty vendors at eye-watering prices. By contrast, 8 TB NVMe SSDs seem readily available at somewhat reasonable prices from mainstream drive vendors like WD (they aren't inexpensive but they're not unreasonable given the prices of 4 TB NVMe, which is roughly the price I remember 4 TB SATA SSDs being at). This makes me personally sad, because my current home desktop has more SATA ports than M.2 slots or even PCIe x1 slots.

(You can get PCIe x1 cards that mount a single NVMe SSD, and I think I'd get somewhat better than SATA speeds out of them. I have one to try out in my office desktop, but I haven't gotten around to it yet.)

At one level this makes sense. Modern motherboards have a lot more M.2 slots than they used to, and I speculated several years ago that M.2 NVMe drives would eventually be cheaper to make than 2.5" SSDs. So in theory I'm not surprised that probable consumer (lack of) demand has basically extinguished SATA SSDs above 4 TB. In practice, I am surprised and it feels disconcerting for NVMe SSDs to now look like the 'mainstream' choice.

(This is also potentially inconvenient for work, where we have a bunch of ZFS fileservers that currently use 4 TB 2.5" SATA SSDs (an update from their original 2 TB SATA SSDs). If there are no reasonably priced SATA SSDs above 4 TB, then our options for future storage expansion become more limited. In the long run we may have to move to U.2 to get hotswappable 4+ TB SSDs. On the other hand, apparently there are inexpensive M.2 to U.2 adapters, and we've done worse sins with our fileservers.)

Websites and web developers mostly don't care about client-side problems

By: cks
23 August 2025 at 03:30

In response to my entry on the fragility of the web in the face of the crawler plague, Jukka said in a comment:

While I understand the server-side frustrations, I think the corresponding client-side frustrations have largely been lacking from the debates around the Web.

For instance, CloudFlare now imposes heavy-handed checks that take a few seconds to complete. [...]

This is absolutely true but it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, and including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops. If we can't get web developers to care about common or majority experiences for their UI, which in some sense has relatively little on the line, the odds of web site operators caring when their servers are actually experiencing problems (or at least annoyances) are basically nil.

Much like browsers have most of the power in various relationships with, for example, TLS certificate authorities, websites have most of the power in their relationship to clients (ie, us). If people don't like what a website is doing, their only option is generally a boycott. Based on the available evidence, any boycotts over things like CAPTCHA challenges have been ineffective so far. Github can afford to give people a UI with terrible performance because the switching costs are sufficiently high that they know most people won't switch.

(Another view is that the server side mostly doesn't notice or know that they're losing people; the lost people are usually invisible, with websites only having much visibility into the people who stick around. I suspect that relatively few websites do serious measurement of how many people bounce off or stop using them.)

Thus, in my view, it's not so much that client-side frustrations have been 'lacking' from debates around the web, which makes it sound like client side people haven't been speaking up, as that they've been actively ignored because, roughly speaking, no one on the server side cares about client-side frustrations. Maybe they vaguely sympathize, but they care a lot more about other things. And it's the web server side who decides how things operate.

(The fragility exposed by LLM crawler behavior demonstrates that clients matter in one sense, but it's not a sense that encourages website operators to cooperate or listen. Rather the reverse.)

I'm in no position to throw stones here, since I'm actively making editorial decisions that I know will probably hurt some real clients. Wandering Thoughts has never been hammered by crawler load the way some sites have been; I merely decided that I was irritated enough by the crawlers that I was willing to throw a certain amount of baby out with the bathwater.

Getting the Cinnamon desktop environment to support "AppIndicator"

By: cks
22 August 2025 at 02:34

The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon desktop appeared to support it, except they weren't working for me. Now I actually understand what's going on, more or less, and how to solve my problem of a program complaining that it needed AppIndicator.

Cinnamon directly implements the AppIndicator notification protocol in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as part of your (Cinnamon) session. However, it has a little feature, namely that it will exit if no one is asking it to do anything:

XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s

In a normally functioning Cinnamon environment, something will soon show up to be an active monitor and stop xapp-sn-watcher from exiting:

Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown

This something is a standard Cinnamon desktop applet. In System Settings β†’ Applets, it's way down at the bottom and is called "XApp Status Applet". If you've accidentally wound up with it not turned on, xapp-sn-watcher will (probably) not have a monitor active after 30 seconds, and then it will exit (and in the process of exiting, it will log alarming messages about failed GLib assertions). Not having this xapp-status applet turned on was my problem, and turning it on fixed things.

(I don't know how it got turned off. It's possible I went through the standard applets at some point and turned some of them off in an excess of ignorant enthusiasm.)

As I found out from leigh scott in my Fedora bug report, the way to get this debugging output from xapp-sn-watcher is to run 'gsettings set org.x.apps.statusicon sn-watcher-debug true'. This will cause xapp-sn-watcher to log various helpful and verbose things to your ~/.xsession-errors (although apparently not the fact that it's actually exiting; you have to deduce that from the timestamps stopping 30 seconds later and that being the timestamps on the GLib assertion failures).
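
Putting that together, a quick way to turn on the debug output and watch what xapp-sn-watcher is doing is something like this (the 'SnWatcher' log lines quoted earlier are the sort of thing you're looking for):

gsettings set org.x.apps.statusicon sn-watcher-debug true
tail -f ~/.xsession-errors | grep SnWatcher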

(I don't know why there's both a program and an applet involved in this and I've decided not to speculate.)

The current (2025) crawler plague and the fragility of the web

By: cks
21 August 2025 at 03:33

These days, more and more people are putting more and more obstacles in the way of the plague of crawlers (many of them apparently doing it for LLM 'AI' purposes), me included. Some of these obstacles involve attempting to fingerprint unusual aspects of crawler requests, such as using old browser User-Agents or refusing to accept compressed things in an attempt to avoid gzip bombs; other obstacles may involve forcing visitors to run JavaScript, using CAPTCHAs, or relying on companies like Cloudflare to block bots with various techniques.

On the one hand, I sort of agree that these 'bot' (crawler) defenses are harmful to the overall ecology of the web. On the other hand, people are going to do whatever works for them for now, and none of the current alternatives are particularly good. There's a future where much of the web simply isn't publicly available any more, at least not to anonymous people.

One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.

(This isn't new fragility; the fragility was always there.)

Unfortunately, I don't see a technical way out from this (and I'm not sure I see any realistic way in general). There's no magic wand that we can wave to make all of the existing websites, web apps, and so on not get impaired by LLM crawlers when the crawlers persist in visiting everything despite being told not to, and on top of that we're not going to make bandwidth free. Instead I think we're looking at a future where the web ossifies for and against some things, and more and more people see catgirls.

(I feel only slightly sad about my small part in ossifying some bits of the web stack. Another part of me feels that a lot of web client software has gotten away with being at best rather careless for far too long, and now the consequences are coming home to roost.)

What an "AppIndicator" is in Linux desktops and some notes on it

By: cks
20 August 2025 at 03:19

Suppose, not hypothetically, that you start up some program on your Fedora 42 Cinnamon desktop and it helpfully tells you "<X> requires AppIndicator to run. Please install the AppIndicator plugin for your desktop". You are likely confused, so here are some notes.

'AppIndicator' itself is the name of an application notification protocol, apparently originally from KDE, and some desktop environments may need a (third party) extension to support it, such as the Ubuntu one for GNOME Shell. Unfortunately for me, Cinnamon is not one of those desktops. It theoretically has native support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher, part of Cinnamon's xapps package.

The actual 'AppIndicator' protocol is done over D-Bus, because that's the modern way. Since this started as a KDE thing, the D-Bus name is 'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is found in /usr/share/dbus-1/services, but not all names are mentioned there and 'org.kde.StatusNotifierWatcher' is one of the missing ones. In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the D-Bus name in its 'Comment=', but that's probably not something you can count on to find what your desktop is (theoretically) using to provide a given D-Bus name. I found xapp-sn-watcher somewhat through luck.

There are probably a number of ways to see what D-Bus names are currently registered and active. The one that I used when looking at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as I know, there's no easy way to go from an error message about 'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher'; in my case I read the source of the thing complaining which was helpfully in Python.

(I used the error message to find the relevant section of code, which showed me what it wasn't finding.)
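
If you just want to check whether anything is currently providing the name, you can filter the output of that dbus-send command; the 'string' line shown here is what you hope to see if the watcher is actually registered:

dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i statusnotifierwatcher
      string "org.kde.StatusNotifierWatcher"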

I have no idea how to actually fix the problem, or if there is a program that implements org.kde.StatusNotifierWatcher as a generic, more or less desktop independent program the way that stalonetray does for system tray stuff (or one generation of system tray stuff, I think there have been several iterations of it, cf).

(Yes, I filed a Fedora bug, but I believe Cinnamon isn't particularly supported by Fedora so I don't expect much. I also built the latest upstream xapps tree and it also appears to fail in the same way. Possibly this means something in the rest of the system isn't working right.)

Some notes on DMARC policy inheritance and a gotcha

By: cks
19 August 2025 at 03:06

When you use DMARC, you get to specify a policy that people should apply to email that claims to be from your domain but doesn't pass DMARC checks (people are under no obligation to pay attention to this and they may opt to be stricter). These policies are set in DNS TXT records, and in casual use we can say that the policies of subdomains in your domain can be 'inherited'. This recently confused me and now I have some answers.

Your top level domain can specify a separate policy for itself (eg 'user@example.org') and subdomains (eg 'user@foo.example.org'); these are the 'p=' and 'sp=' bits in a DMARC DNS TXT record. Your domain's subdomain policy is used only for subdomains that don't set a policy themselves; an explicitly set subdomain policy overrides the domain policy, for better or worse. If your organization wants to force some minimum DMARC policy, you can't do it with a simple DNS record; you have to somehow forbid subdomains from publishing their own conflicting DMARC policies in your DNS.

The flipside of this is that it's not as bad as it could be to set a strict subdomain policy in your domain DMARC record, because subdomains that care can override it (and may already be doing so implicitly if they've published DMARC records themselves).

However, strictly speaking DMARC policies aren't inherited as we usually think about it. Instead, as I once knew but forgot since then, people using DMARC will check for an applicable policy in only two places: on the direct domain or host name that they care about, and on your organization's top level domain. What this means in concrete terms is that if example.org and foo.example.org both have DMARC records and someone sends email as 'user@bar.foo.example.org', the foo.example.org DMARC record won't be checked. Instead, people will look for DMARC only at 'bar.foo.example.org' (where any regular 'p=' policy will be used) and at 'example.org' (where the subdomain policy, 'sp=', will be used).

(As a corollary, a 'sp=' policy setting in the foo.example.org DMARC record will never be used.)
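
To make this concrete, suppose the published records looked like this (hypothetical policies, shown in zone file form):

_dmarc.example.org.      TXT  "v=DMARC1; p=quarantine; sp=reject"
_dmarc.foo.example.org.  TXT  "v=DMARC1; p=none; sp=quarantine"

Email claiming to be from 'user@bar.foo.example.org' will be checked only against '_dmarc.bar.foo.example.org' (which doesn't exist) and '_dmarc.example.org', so the 'sp=reject' applies; the foo.example.org record, including its 'sp=quarantine', is never consulted.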

One place this gets especially interesting is if people send email using the domain 'nonexistent.foo.example.org' in the From: header (either from inside or outside your organization). Since this host name isn't in DNS, it has no DMARC policy of its own, and so people will go straight to the 'example.org' subdomain policy without even looking at the policy of 'foo.example.org'.

(Since traditional DNS wildcard records can only wildcard the leftmost label and DMARC records are looked up on a special '_dmarc.' DNS sub-name, it's not simple to give arbitrary names under your subdomain a DMARC policy.)

How not to check or poll URLs, as illustrated by Fediverse software

By: cks
18 August 2025 at 02:43

Over on the Fediverse, I said some things:

[on April 27th:]
A bit of me would like to know why the Akkoma Fediverse software is insistently polling the same URL with HEAD then GET requests at five minute intervals for days on end. But I will probably be frustrated if I turn over that rock and applying HTTP blocks to individual offenders is easier.

(I haven't yet blocked Akkoma in general, but that may change.)

[the other day:]
My patience with the Akkoma Fediverse server software ran out so now all attempts by an Akkoma instance to pull things from my techblog will fail (with a HTTP redirect to a static page that explains that Akkoma mis-behaves by repeatedly fetching URLs with HEAD+GET every few minutes). Better luck in some future version, maybe, although I doubt the authors of Akkoma care about this.

(The HEAD and GET requests are literally back to back, with no delay between them that I've ever observed.)

Akkoma is derived from Pleroma and I've unsurprisingly seen Pleroma also do the HEAD then GET thing, but so far I haven't seen any Pleroma server showing up with the kind of speed and frequency that (some) Akkoma servers do.

These repeated HEADs and GETs are for Wandering Thoughts entries that haven't changed. DWiki is carefully written to supply valid HTTP Last-Modified and ETag, and these values are supplied in replies to both HEAD and GET requests. Despite all of this, Akkoma is not doing conditional GETs and is not using the information from the HEAD to avoid doing a GET if neither header has changed its value from the last time. Since Akkoma is apparently completely ignoring the result of its HEAD request, it might as well not make the HEAD request in the first place.

If you're going to repeatedly poll a URL, especially every five or ten minutes, and you want me to accept your software, you must do conditional GETs. I won't like you and may still arrange to give you HTTP 429s for polling so fast, but I most likely won't block you outright. Polling every five or ten minutes without conditional GET is completely unacceptable, at least to me (other people probably don't notice or care).
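
For illustration, a conditional GET exchange looks roughly like this (made-up URL and header values); the 304 response has no body, which is what makes frequent polling relatively harmless:

$ curl -sI https://example.org/blog/SomeEntry | grep -iE '^(etag|last-modified)'
ETag: "v1-1723958400"
Last-Modified: Mon, 18 Aug 2025 02:43:00 GMT
$ curl -s -o /dev/null -w '%{http_code}\n' -H 'If-None-Match: "v1-1723958400"' https://example.org/blog/SomeEntry
304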

My best guess as to why Akkoma is polling the URL at all is that it's for "link previews". If you link to something in a Fediverse post, various Fediverse software will do the common social media thing of trying to embed some information about the target of the URL into the post as it presents it to local people; for plain links with no special handling, this will often show the page title. As far as the (rapid) polling goes, I can only guess that Akkoma has decided that it is extremely extra special and it must update its link preview information very rapidly should the linked URL do something like change the page title. However, other Fediverse server implementations manage to do link previews without repeatedly polling me (much less the HEAD then immediately a GET thing).

(On the global scale of things this amount of traffic is small beans, but it's my DWiki and I get to be irritated with bad behavior if I want to, even if it's small scale bad behavior.)

Getting Linux nflog and tcpdump packet filters to sort of work together

By: cks
17 August 2025 at 02:38

So, suppose that you have a brand new nflog version of OpenBSD's pflog, so you can use tcpdump to watch dropped packets (or in general, logged packets). And further suppose that you specifically want to see DNS requests to your port 53. So of course you do:

# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented

Perhaps we can get clever by reading from the interface in one tcpdump and sending it to another to be interpreted, forcing the pcap filter to be handled entirely in user space instead of the kernel:

# tcpdump --immediate-mode -w - -U -i nflog:30 | tcpdump -r - 'port 53'
tcpdump: listening on nflog:30, link-type NFLOG (Linux netfilter log messages), snapshot length 262144 bytes
reading from file -, link-type NFLOG (Linux netfilter log messages), snapshot length 262144
tcpdump: NFLOG link-layer type filtering not implemented

Alas we can't.

As far as I can determine, what's going on here is that the netfilter log system, 'NFLOG', uses a 'packet' format that isn't the same as any of the regular formats (Ethernet, PPP, etc) and adds some additional (meta)data about the packet to every packet you capture. I believe the various attributes this metadata can contain are listed in the kernel's nfnetlink_log.h.

(I believe it's not technically correct to say that this additional stuff is 'before' the packet; instead I believe the packet is contained in a NFULA_PAYLOAD attribute.)

Unfortunately for us, tcpdump (or more exactly libpcap) doesn't know how to create packet capture filters for this format, not even ones that are interpreted entirely in user space (as happens when tcpdump reads from a file).

I believe that you have two options. First, you can use tshark with a display filter, not a capture filter:

# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]

(Tshark capture filters are subject to the same libpcap inability to work on NFLOG formatted packets as tcpdump has.)

Alternately and probably more conveniently, you can tell tcpdump to use the 'IPV4' datalink type instead of the default, as mentioned in (opaque) passing in the tcpdump manual page:

# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
  NFLOG (Linux netfilter log messages)
  IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]

Of course this is only applicable if you're only doing IPv4. If you have some IPv6 traffic that you want to care about, I think you have to use tshark display filters (which means learning how to write Wireshark display filters, something I've avoided so far).

I think there is some potentially useful information in the extra NFLOG data, but to get it or to filter on it I think you'll need to use tshark (or Wireshark) and consult the NFLOG display filter reference, although that doesn't seem to give you access to all of the NFLOG stuff that 'tshark -i nflog:30 -V' will print about packets.

(Or maybe the trick is that you need to match 'nflog.tlv_type == <whatever> and nflog.tlv_value == <whatever>'. I believe that some NFLOG attributes are available conveniently, such as 'nflog.prefix', which corresponds to NFULA_PREFIX. See packet-nflog.c.)

PS: There's some information on the NFLOG format in the NFLOG linktype documentation and tcpdump's supported data link types in the link-layer header types documentation.

An interesting thing about people showing up to probe new DNS resolvers

By: cks
16 August 2025 at 02:48

Over on the Fediverse, I said something:

It appears to have taken only a few hours (or at most a few hours) from putting a new resolving DNS server into production to seeing outside parties specifically probing it to see if it's an open resolver.

I assume people are snooping activity on authoritative DNS servers and going from there, instead of spraying targeted queries at random IPs, but maybe they are mass scanning.

There turn out to be some interesting aspects to these probes. This new DNS server has two network interfaces, both firewalled off from outside queries, but only one is used as the source IP on queries to authoritative DNS servers. In addition, we have other machines on both networks, with firewalls, so I can get a sense of the ambient DNS probes.

Out of all of these various IPs, the IP that the new DNS server used for querying authoritative DNS servers, and only that IP, very soon saw queries that were specifically tuned for it:

124.126.74.2.54035 > 128.100.X.Y.53: 16797 NS? . (19)
124.126.74.2.7747 > 128.100.X.Y.7: UDP, length 512
124.126.74.2.54035 > 128.100.X.Y.53: 17690 PTR? Y.X.100.128.in-addr.arpa. (47)

This was a consistent pattern from multiple IPs; they all tried to query for the root zone, tried to check the UDP echo port, and then tried a PTR query for the machine's IP itself. Nothing else saw this pattern; not the machine's other IP on a different network, not another IP on the same network, and so on. This pattern, and the lack of it for other IPs, is what's led me to assume that people are somehow identifying probe targets based on what source IPs they see making upstream queries.

(There are a variety of ways that you could do this without having special access to DNS servers. APNIC has long used web ad networks and special captive domains and DNS servers for them to do various sorts of measurements, and you could do similar things to discover who was querying your captive DNS servers.)

How you want to have the Unbound DNS server listen on all interfaces

By: cks
15 August 2025 at 03:30

Suppose, not hypothetically, that you have an Unbound server with multiple network interfaces, at least two (which I will call A and B), and you'd like Unbound to listen on all of the interfaces. Perhaps these are physical interfaces and there are client machines on both, or perhaps they're virtual interfaces and you have virtual machines on them. Let's further assume that these are routed networks, so that in theory people on A can talk to IP addresses on B and vice versa.

The obvious and straightforward way to have Unbound listen on all of your interfaces is with a server stanza like this:

server:
  interface: 0.0.0.0
  interface: ::0
  # ... probably some access-control statements

This approach works 99% of the time, which is probably why it appears all over the documentation. The other 1% of the time is when a DNS client on network A makes a DNS request to Unbound's IP address on network B; when this happens, the network A client will not get any replies. Well, it won't get any replies that it accepts. If you use tcpdump to examine network traffic, you will discover that Unbound is sending replies to the client on network A using its network A IP address as the source address (which is the default behavior if you send packets to a network you're directly attached to; you normally want to use your IP on that network as the source IP). This will fail with almost all DNS client libraries because DNS clients reject replies from unexpected sources, which is to say any IP other than the IP they sent their query to.

(One way this might happen is if the client moves from network B to network A without updating its DNS configuration. Or you might be testing to see if Unbound's network B IP address answers DNS requests.)

The other way to listen on all interfaces in modern Unbound is to use 'interface-automatic: yes' (in server options), like this:

server:
  interface-automatic: yes

The important bit of what interface-automatic does for you is mentioned in passing in its documentation, and I've emphasized it here:

Listen on all addresses on all (current and future) interfaces, detect the source interface on UDP queries and copy them to replies.

As far as I know, you can't get this 'detect the source interface' behavior for UDP queries in any other way if you use 'interface: 0.0.0.0' to listen on everything. You get it if you listen on specific interfaces, perhaps with 'ip-transparent: yes' for safety:

server:
  interface: 127.0.0.1
  interface: ::1
  interface: <network A>.<my-A-IP>
  interface: <network B>.<my-B-IP>
  # insure we always start
  ip-transparent: yes

Since 'interface-automatic' is marked as an experimental option I'd love to be wrong, but I can't spot an option in skimming the documentation and searching on some likely terms.

(I'm a bit surprised that Unbound doesn't always copy the IP address it received UDP packets on and use that for replies, because I don't think things work if you have the wrong IP there. But this is probably an unusual situation and so it gets papered over, although now I'm curious how this interacts with default routes.)

Another reason to use expendable email addresses for everything

By: cks
14 August 2025 at 01:38

I'm a long time advocate of using expendable email addresses any time you have to give someone an email address (and then making sure you can turn them off or more broadly apply filters to them). However, some of the time I've trusted the people who were asking for an email address, didn't have an expendable address already prepared for them, and gave them my regular email address. Today I discovered (or realized) another reason to not do this and to use expendable addresses for absolutely everything, and it's not the usual reason of "the people you gave your email address to might get compromised and have their address collection extracted and sold to spammers". The new problem is mailing service providers, such as Mailchimp.

It's guaranteed that some amount of spammers make use of big mailing service providers, so you will periodically get spam email to any exposed email address, most likely including your real, primary one from such MSPs. At the same time, these days it's quite likely that anyone you give your email address to will at some point wind up using an MSP, if only to send out a cheerful notification of, say, "we moved from street address A to street address B, please remember for planning your next appointment" (because if you want to send out such a mass mailing, you basically have to outsource it to an MSP to get it done, even if you normally use, eg, GMail for your regular organizational activities).

If you've given innocent trustworthy organizations your main email address, it's potentially dangerous or impossible to block a particular MSP from sending email to it. In searching your email archive, you may find that such an organization is already using the MSP to send you stuff that you want, or for big MSPs you might decide that the odds are too bad. But if you've given separate expendable email addresses to all such organizations, you know that they're not going to be sending anything to your main email address, including through some MSP that you've just got spam from, and it's much safer to block that MSP's access to your main email address.

This issue hadn't occurred to me back when I apparently gave one organization my main email address, but it became relevant recently. So now I'm writing it down, if only for my future self as a reminder of why I don't want to do that.

Implementing a basic equivalent of OpenBSD's pflog in Linux nftables

By: cks
13 August 2025 at 02:38

OpenBSD's and FreeBSD's PF system has a very convenient 'pflog' feature, where you put in a 'log' bit in a PF rule and this dumps a copy of any matching packets into a pflog pseudo-interface, where you can both see them with 'tcpdump -i pflog0' and have them automatically logged to disk by pflogd in pcap format. Typically we use this to log blocked packets, which gives us both immediate and after the fact visibility of what's getting blocked (and by what rule, also). It's possible to mostly duplicate this in Linux nftables, although it takes more work and there's less documentation on it.

The first thing you need is nftables rules with one or two log statements of the form 'log group <some number>'. If you want to be able to both log packets for later inspection and watch them live, you need two 'log group' statements with different numbers; otherwise you only need one. You can use different (group) numbers on different nftables rules if you want to be able to, say, look only at accepted but logged traffic or only dropped traffic. In the end this might wind up looking something like:

tcp port ssh counter log group 30 log group 31 drop;

As the nft manual page will tell you, this uses the kernel 'nfnetlink_log' to forward the 'logs' (packets) to a netlink socket, where exactly one process (at most) can subscribe to a particular group to receive those logs (ie, those packets). If we want to both log the packets and be able to tcpdump them, we need two groups so we can have ulogd getting one and tcpdump getting the other.

To see packets from any particular log group, we use the special 'nflog:<N>' pseudo-interface that's hopefully supported by your Linux version of tcpdump. This is used as 'tcpdump -i nflog:30 ...' and works more or less like you'd want it to. However, as far as I know there's no way to see meta-information about the nftables filtering, such as what rule was involved or what the decision was; you just get the packet.

To log the packets to disk for later use, the default program is ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic as OpenBSD's and FreeBSD's pf logging; instead you have to configure it in /etc/ulogd.conf, and on Ubuntu make sure you have the 'ulogd2-pcap' package installed (along with ulogd2 itself). Based merely on getting it to work, what you want in /etc/ulogd.conf is the following three bits:

# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP

# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6

# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0

(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically rotated by logrotate, although I think by default it's only weekly, so you might want to make it daily.)

The ulogd documentation suggests that you will need to capture IPv4 and IPv6 traffic separately, but I've only used this on IPv4 traffic so I don't know. This may imply that you need separate nftables rules to log (and drop) IPv6 traffic so that you can give it a separate group number for ulogd (I'm not sure if it needs a separate one for tcpdump or if tcpdump can sort it out).

Ulogd can also log to many formats other than pcap, including JSON and databases. It's possible that there are ways to enrich the ulogd pcap logs, or maybe just the JSON logs, with additional useful information such as the network interface involved and other things. I find the ulogd documentation somewhat opaque on this (and also it's incomplete), and I haven't experimented.

(According to this, the JSON logs can be enriched or maybe default to that.)

Given the assorted limitations and other issues with ulogd, I'm tempted to not bother with it and only have our nftables setups support live tcpdump of dropped traffic with a single 'log group <N>'. This would save us from the assorted annoyances of ulogd2.

PS: One reason to log to pcap format files is that then you can use all of the tcpdump filters that you're already familiar with in order to narrow in on (blocked) traffic of interest, rather than having to put together a JSON search or something.

The 'nft' command may not show complete information for iptables rules

By: cks
12 August 2025 at 03:04

These days, nftables is the Linux network firewall system that you want to use, and especially it's the system that Ubuntu will use by default even if you use the 'iptables' command. The nft command is the official interface to nftables, and it has a 'nft list ruleset' sub-command that will list your NFT rules. Since iptables rules are implemented with nftables, you might innocently expect that 'nft list ruleset' will show you the proper NFT syntax to achieve your current iptables rules.

Well, about that:

# iptables -vL INPUT
[...] target prot opt in  out  source   destination         
[...] ACCEPT tcp  --  any any  anywhere anywhere    match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
      ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]

As they say, "yeah no". As the documentation tells you (eventually), somewhat reformatted:

xt TYPE NAME

TYPE := match | target | watcher

This represents an xt statement from xtables compat interface. It is a fallback if translation is not available or not complete. Seeing this means the ruleset (or parts of it) were created by iptables-nft and one should use that to manage it.

Nftables has a native set type (and also maps), but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets by the iptables compatibility layer. Instead the compatibility layer uses this 'xt match' magic that the nft command can only imperfectly tell you about. To nft's credit, it prints a warning comment (which I've left out) that the rules are being managed by iptables-nft and you shouldn't touch them. Here, all of the 'xt match "set"' bits in the nft output are basically saying "opaque stuff happens here".

This still makes me a little bit sad because it makes it that bit harder to bootstrap my nftables knowledge from what iptables rules convert into. If I wanted to switch to nftables rules and nftables sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative scratch instead of getting to clean up what the various translation tools would produce or report.

(As a side effect it makes it less likely that I'll convert various iptables things to being natively nft/nftables based, because I can't do a fully mechanical conversion. If they still work with iptables-nft, I'm better off leaving them as is. Probably this also means that iptables-nft support is likely to have a long, long life.)
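
For what it's worth, a hand-written native nftables version of a rule like the iptables one above would look roughly like the following sketch (untested, and with made-up set contents; the real nfsports and nfsclients sets obviously have different members):

table inet filter {
  set nfsports {
    type inet_service
    elements = { 2049, 111 }
  }
  set nfsclients {
    type ipv4_addr
    flags interval
    elements = { 192.0.2.0/24, 198.51.100.7 }
  }
  chain input {
    type filter hook input priority 0;
    tcp dport @nfsports ip saddr @nfsclients accept
  }
}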

Servers will apparently run for a while even when quite hot

By: cks
11 August 2025 at 03:26

This past Saturday (yesterday as I write this), a university machine room had an AC failure of some kind:

It's always fun times to see a machine room temperature of 54C and slowly climbing. It's not our machine room but we have switches there, and I have a suspicion that some of them will be ex-switches by the time this is over.

This machine room and its AC has what you could call a history; in 2011 it flooded partly due to an AC failure, then in 2016 it had another AC issue, and another in 2024 (and those are just the ones I remember and can find entries for).

Most of this machine room is a bunch of servers from another department, and my assumption is that they are what created all of the heat when the AC failed. Both we and the other department have switches in the room, but networking equipment is usually relatively low-heat compared to active servers. So I found it interesting that the temperature graph rises in a smooth arc to its maximum temperature (and then drops abruptly, presumably as the AC starts to get fixed). To me this suggests that many of the servers in the room kept running, despite the ambient temperature hitting 54C (and their internal temperatures undoubtedly being much higher). If some servers powered off from the heat, it wasn't enough to stabilize the heat level of the room; it was still increasing right up to when it started dropping rapidly.

(Servers may well have started thermally throttling various things, and it's possible that some of them crashed without powering off and thus potentially without reducing the heat load. I have second hand information that some UPS units reported battery overheating.)

It's one thing to be fairly confident that server thermal limits are set unrealistically high. It's another thing to see servers (probably) keep operating at 54C, rather than fall over with various sorts of failures. For example, I wouldn't have been surprised if power supplies overheated and shut down (or died entirely).

(I think desktop PSUs are often rated as '0C to 50C', but I suspect that neither end of that rating is actually serious, and this was over 50C anyway.)

I rather suspect that running at 50+C for a while has increased the odds of future failures and shortened the lifetime of everything in this machine room (our switches included). But it still amazes me a bit that things didn't fall over and fail, even above 50C.

(When I started writing this entry I thought I could make some fairly confident predictions about the servers keeping running purely from the temperature graph. But the more I think about it, the less I'm sure of that. There are a lot of things that could be going on, including server failures that leave them hung or locked up but still with PSUs running and pumping out heat.)

My policy of semi-transience and why I have to do it

By: cks
10 August 2025 at 03:05

Some time back I read Simon Tatham's Policy of transience (via) and recognized both points of similarity and points of drastic departure between Tatham and me. Both Tatham and I use transient shell history, transient terminal and application windows (sort of for me), and don't save our (X) session state, and in general I am a 'disposable' usage pattern person. However, I depart from Tatham in that I have a permanently running browser and I normally keep my login sessions running until I reboot my desktops. But broadly I'm a 'transient' or 'disposable' person, where I mostly don't keep inactive terminal windows or programs around in case I might want them again, or even immediately re-purpose them from one use to another.

(I do have some permanently running terminal windows, much like I have permanently present other windows on my desktop, but that's because they're 'in use', running some program. And I have one inactive terminal window but that's because exiting that shell ends my entire X session.)

The big way that I depart from Tatham is already visible in my old desktop tour, in the form of a collection of iconified browser windows (in carefully arranged spots so I can in theory keep track of them). These aren't web pages I use regularly, because I have a different collection of schemes for those. Instead they're a collection of URLs that I'm keeping around to read later or in general to do something with. This is anathema to Tatham, who keeps track of URLs to read in other ways, but I've found that it's absolutely necessary for me.

Over and over again I've discovered that if something isn't visible to me, shoved in front of my nose, it's extremely likely to drop completely out of my mind. If I file email into a 'to be dealt with' or 'to be read later' or whatever folder, or if I write down URLs to visit later and explanations of them, or any number of other things, I almost might as well throw those things away. Having a web page in an iconified Firefox window in no way guarantees that I'll ever read it, but writing its URL down in a list guarantees that I won't. So I keep an optimistic collection of iconified Firefox windows around (and every so often I look at some of them and give up on them).

It would be nice if I didn't need to do this and could de-clutter various bits of my electronic life. But by now I've made enough attempts over a long enough period of time to be confident that my mind doesn't work that way and is unlikely to ever change its ways. I need active, ongoing reminders for things to stick, and one of the best forms is to have those reminders right on my desktop.

(And because the reminders need to be active and ongoing, they also need to be non-intrusive. Mailing myself every morning with 'here are the latest N URLs you've saved to read later' wouldn't work, for example.)

PS: I also have various permanently running utility programs and their windows, so my desktop is definitely not minimalistic. A lot of this is from being a system administrator and working with a bunch of systems, where I want various sorts of convenient fast access and passive monitoring of them.

The problem of Python's version dependent paths for packages

By: cks
9 August 2025 at 03:00

A somewhat famous thing about Python is that more or less all of the official ways to install packages put them into somewhere on the filesystem that contains the Python series version (which is things like '3.13' but not '3.13.5'). This is true for site packages, for 'pip install --user' (to the extent that it still works), and for virtual environments, however you manage them. And this is a problem because it means that any time you change to a new release, such as going from 3.12 to 3.13, all of your installed packages disappear (unless you keep around the old Python version and keep your virtual environments and so on using it).
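
You can see the version dependency directly in the layout of a freshly created virtual environment (output shown for a hypothetical Python 3.13 install):

$ python3 -m venv /tmp/demo-venv
$ ls /tmp/demo-venv/lib
python3.13
$ ls /tmp/demo-venv/lib/python3.13
site-packages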

In general, a lot of people would like to update to new Python releases. Linux distributions want to ship the latest Python (and usually do), various direct users of Python would like the new features, and so on. But these version dependent paths and their consequences make version upgrades more painful and so to some extent cause them to be done less often.

In the beginning, Python had at least two reasons to use these version dependent paths. Python doesn't promise that either its bytecode (and thus the .pyc files it generates from .py files) or its C ABI (which is depended on by any compiled packages, in .so form on Linux) are stable from version to version. Python's standard installation and bytecode processing used to put both bytecode files and compiled files alongside the .py files rather than separating them out. Since pure Python packages can depend on compiled packages, putting the two together has a certain sort of logic; if a compiled package no longer loads because it's for a different Python release, your pure Python packages may no longer work.

(Python bytecode files aren't so tightly connected so some time ago Python moved them into a '__pycache__' subdirectory and gave them a Python version suffix, eg '<whatever>.cpython-312.pyc'. Since they're in a subdirectory, they'll get automatically removed if you remove the package itself.)

An additional issue is that even pure Python packages may not be completely compatible with a new version of Python (and often definitely not with a sufficiently old version). So updating to a new Python version may call for a package update as well, not just using the same version you currently have.

Although I don't like the current situation, I don't know what Python could do to make it significantly better. Putting .py files (ie, pure Python packages) into a version independent directory structure would work some of the time (perhaps a lot of the time if you only went forward in Python versions, never backward) but blow up at other times, sometimes in obvious ways (when a compiled package couldn't be imported) and sometimes in subtle ones (if a package wasn't compatible with the new version of Python).

(It would probably also not be backward compatible to existing tools.)

Abuse systems should handle email reports that use MIME message/rfc822 parts

By: cks
8 August 2025 at 03:19

Today I had reason to report spam to Mailchimp (some of you are laughing already, I know). As I usually do, I forwarded the spam message we'd received to them as a message/rfc822 MIME part, with a prequel plain text part saying that it was spam. Forwarding email as a MIME message/rfc822 part is unambiguously the correct way to do so. It's in the MIME RFCs, if done properly (by the client) it automatically includes all headers, and because it's a proper MIME part, tools can recognize the forwarded email message, scan over just it, and so on.
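
In skeleton form, such a forwarded report looks roughly like this (boundary and wording made up, most headers trimmed):

Content-Type: multipart/mixed; boundary="report-boundary"

--report-boundary
Content-Type: text/plain

The attached message is spam that was sent through your service.

--report-boundary
Content-Type: message/rfc822

[the complete original spam message, headers and body, verbatim]

--report-boundary--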

So of course Mailchimp sent me back an autoreply to the effect that they couldn't find any spam mail message in my report. They're not the only people who've replied this way, although sometimes the reply says "we couldn't handle this .eml attachment". So I had to re-forward the spam message in what I called literal plaintext format. This time around either some human or some piece of software found the information and maybe correctly interpreted it.

I think it's perfectly fine and maybe even praiseworthy when email abuse handling systems (and people) are willing to accept these literal plaintext format forwarded spam messages. The more formats you accept abuse reports in, the better. But every abuse handling system should accept MIME message/rfc822 format messages too, as a minimum thing. Not just because it's a standard, but also because it's what a certain amount of mail clients will produce by default if you ask them to forward a message. If you refuse to accept these messages, you're reducing the amount of abuse reports you'll accept, for arbitrary (but of course ostensibly convenient for you) reasons.

I know, I'm tilting at windmills. Mailchimp and all of the other big places doing this don't care one bit what I want and may or may not even do anything when I send them reports.

(I suspect that many places see reducing the number of 'valid' abuse reports they receive as a good thing, so the more hoops they can get away with and the more reports they can reject, the better. In theory this is self-defeating in the long run, but in practice that hasn't worked with the big offenders so far.)

Responsibility for university physical infrastructure can be complicated

By: cks
7 August 2025 at 02:56

One of the perfectly sensible reactions to my entry on realizing that we needed two sorts of temperature alerts is to suggest that we directly monitor the air conditioners in our machine rooms, so that we don't have to try to assess how healthy they are from second hand, indirect sources like the temperature of the rooms. There are some practical problems, but a broader problem is that by and large they're not 'our' air conditioners. By this I mean that while the air conditioners and the entire building belongs to the university, neither 'belong' to my department and we can't really do stuff to them.

There are probably many companies who have some split between who's responsible for maintaining a building (and infrastructure things inside it) and who is currently occupying (parts of) the building, but my sense is that universities (or at least mine) take this to a more extreme level than usual. There's an entire (administrative) department that looks after buildings and other physical infrastructure, and they 'own' much of the insides of buildings, including the air conditioning units in our machine rooms (including the really old one). Because those air conditioners belong to the building and the people responsible for it, we can't go ahead and connect monitoring up to the AC units or tap into any native monitoring they might have.

(Since these aren't our AC units, we haven't even asked. Most of the AC units are old enough that they probably don't have any digital monitoring, and for the new units the manufacturer probably considers that an extra cost option. Nor can we particularly monitor their power consumption; these are industrial units, with dedicated high-power circuits that we're not even going to get near. Only university electricians are supposed to touch that sort of stuff.)

I believe that some parts of the university have a multi-level division of responsibility for things. One organization may 'own' the building, another 'owns' the network wiring in the walls and is responsible for fixing it if something goes wrong, and a third 'owns' the space (ie, gets to use it) and has responsibility for everything inside the rooms. Certainly there's a lot of wiring within buildings that is owned by specific departments or organizations; they paid to put it in (although possibly through shared conduits), and now they're the people who control what it can be used for.

(We have run a certain amount of our own fiber between building floors, for example. I believe that things can get complicated when it comes to renovating space for something, but this is fortunately not one of the areas we have to deal with; other people in the department look after that level of stuff.)

I've been inside the university for long enough that all of this feels completely normal to me, and it even feels like it makes sense. Within a university, who is using space is something that changes over time, not just within an academic department but also between departments. New buildings are built, old buildings are renovated, and people move around, so separating maintaining the buildings from who occupies them right now feels natural.

(In general, space is a constant struggle at universities.)

My approach to testing new versions of Exim for our mail servers

By: cks
6 August 2025 at 03:39

When I wrote about how Exim's ${run ...} string expansion operator changed how it did quoting, I (sort of) mentioned that I found this when I tested a new version of Exim. Some people would do testing like this in a thorough, automated manner, but I don't go that far. Instead I have a written down test plan, with some resources set up for it in advance. Well, it's more accurate to say that I have test plans, because I have a separate test plan for each of our important mail servers because they have different features and so need different things tested.

In the beginning I simply tested all of the important features of a particular mail server by hand and from memory when I rebuilt it on a new version of Ubuntu. Eventually I got tired of having to reinvent my test process from scratch (or from vague notes) every time around (for each mail server), so I started writing it down. In the process of writing my test process down the natural set of things happened; I made it more thorough and systematic, and I set up various resources (like saved copies of the EICAR test file) to make testing more cut and paste. Having an organized, written down test plan, even as basic as ours is, has made it easier to test new builds of our Exim servers and made that testing more comprehensive.

I test most of our mail servers primarily by using swaks to send various bits of test email to them and then watching what happens (both in the swaks SMTP session and in the Exim logs). So a lot of the test plan is 'run this swaks command and ...', with various combinations of sending and receiving addresses, starting with the very most basic test of 'can it deliver from a valid dummy address to a valid dummy address'. To do some sorts of testing, such as DNS blocklist tests, I take advantage of the fact that all of the IP-based DNS blocklists we use include 127.0.0.2, so that part of the test plan is 'use swaks on the mail machine itself to connect from 127.0.0.2'.

(Some of our mail servers can apply different filtering rules to different local addresses, so I have various pre-configured test addresses set up to make it easy to test that per-address filtering is working.)
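
The swaks invocations involved are nothing fancy; they look roughly like the following (host names and addresses here are made up):

# basic delivery test against the server being rebuilt
swaks --server mailtest.example.org --from testfrom@example.org --to testdest@example.org

# DNS blocklist handling, run on the mail machine itself so the
# connection comes from 127.0.0.2 (which our IP-based DNSBLs all list)
swaks --server 127.0.0.1 --local-interface 127.0.0.2 --from testfrom@example.org --to testdest@example.org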

The actual test plans are mostly a long list of 'run more or less this swaks command, pointing it at your test server, to test this thing, and you should see the following result'. This is pretty close to cut and paste, which makes it relatively easy and fast for me to run through.

One qualification is that these test plans aren't attempting to be an exhaustive check of everything we do in our Exim configurations. Instead, they're mostly about making sure that the basics work, like delivering straightforward email, and that Exim can interact properly with the outside world, such as talking to ClamAV and rspamd or running external programs (which also tests that the programs themselves work on the new Ubuntu version). Testing every corner of our configurations would be exhausting and my feeling is that it would generally be pointless. Exim is stable software and mostly doesn't change or break things from version to version.

(Part of this is pragmatic experience with Exim and knowledge of what our configuration does conditionally and what it checks all of the time. If Exim does a check all of the time and basic mail delivery works, we know we haven't run into, say, an issue with tainted data.)

The unusual way I end my X desktop sessions

By: cks
5 August 2025 at 03:47

I use an eccentric X 'desktop' that is not really a desktop as such in the usual sense but instead a window manager and various programs that I run (as a sysadmin, there's a lot of terminal windows). One of the ways that my desktop is unusual is in how I exit from my X session. First, I don't use xdm or any other graphical login manager; instead I run my session through xinit. When you use an xinit based session, you give xinit a program or a script to run, and when the program exits, xinit terminates the X server and your session.

(If you gave xinit a shell script, whatever foreground program the script ended with was your keystone program.)

Traditionally, this keystone program for your X session was your window manager. At one level this makes a lot of sense; your window manager is basically the core of your X session anyway, so you might as well make quitting from it end the session. However, for a very long time I've used a do-nothing iconified xterm running a shell as my keystone program.

(If you look at FvwmIconMan's strip of terminal windows in my (2011) desktop tour, this is the iconified 'console-ex' window.)
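
In .xinitrc terms, the overall structure of this is roughly the following (a simplified sketch, not my actual session script):

#!/bin/sh
# X settings come first (xrdb and so on), then the window manager,
# which is started early but is not the keystone program.
xrdb -merge "$HOME/.Xresources"
fvwm &

# ... start various other long-running programs and windows here ...

# The keystone: an iconified xterm. The X session ends when its shell exits.
exec xterm -iconic -name console-ex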

The minor advantage to having an otherwise unused xterm as my session keystone program is that I can start my window manager basically at the start of my (rather complex) session startup, so that I can immediately have it manage all of the other things I start (technically I run a number of commands to set up X settings before I start fvwm, but it's the first program I start that will actually show anything on the screen). The big advantage is that using something else as my keystone program means that I can kill and restart my window manager if something goes badly wrong, and more generally that I don't have to worry about restarting it. This doesn't happen very often, but when it does happen I'm very glad that I can recover my session instead of having to abruptly terminate everything. And should I have to terminate fvwm, this 'console' xterm is a convenient idle xterm in which to restart it (or in general, any other program of my session that needs restarting).

(The 'console' xterm is deliberately placed up at the top of the screen, in an area that I don't normally put non-fvwm windows in, so that if fvwm exits and everything de-iconifies, it's highly likely that this xterm will be visible so I can type into it. If I put it in an ordinary place, it might wind up covered up by a browser window or another xterm or whatever.)

I don't particularly have to use an (iconified) xterm with a shell in it; I could easily have written a little Tk program that displayed a button saying 'click me to exit'. However, the problem with such a program (and the advantage of my 'console' xterm) is that it would be all too easy to accidentally click the button (and force-end my session). With the iconified xterm, I need to do a bunch of steps to exit; I have to deiconify that xterm, focus the window, and Ctrl-D the shell to make it exit (causing the xterm to exit). This is enough out of the way that I don't think I've ever done it by accident.

PS: I believe modern desktop environments like GNOME, KDE, and Cinnamon have moved away from making their window manager be the keystone program and now use a dedicated session manager program that things talk to. One reason for this may be that modern desktop shells seem to be rather more prone to crashing for various reasons, which would be very inconvenient if that ended your session. This isn't all bad, at least if there's a standard D-Bus protocol for ending a session so that you can write an 'exit the session' thing that will work across environments.

Understanding reading all available things from a Go channel (with a timeout)

By: cks
4 August 2025 at 03:33

Recently I saw this example Go code (via), and I had to stare at it a while in order to understand what it was doing and how it worked (and why it had to be that way). The goal of waitReadAll() is to either receive (read) all currently available items from a channel (possibly a buffered one) or to time out if nothing shows up in time. This requires two nested selects, with the inner one in a for loop.

The outer select has this form:

select {
  case v, ok := <- c:
    if !ok {
      return ...
    }
    [... inner code ...]

  case <- time.After(dur): // wants Go 1.23+
    return ...
}

This is doing three things. First (and last in the code), it's timing out if the duration expires before anything is received on the channel. Second, it's returning right away if the channel is closed and empty; in this case the channel receive from c will succeed, but ok will be false. And finally, in the code I haven't put in, it has received the first real value from the channel and now it has to read the rest of them.

The job of the inner code is to receive any (additional) currently ready items from the channel but to give up if the channel is closed or when there are no more items. It has the following form (trimmed of the actual code to properly accumulate things and so on, see the playground for the full version):

.. setup elided ..
for {
  select {
    case v, ok := <- c:
      if ok {
        // accumulate values
      } else {
        // channel closed and empty
        return ...
      }
    default:
      // out of items
      return ...
  }
}

There's no timeout in this inner code because the 'default' case means that we never wait for the channel to be ready; either the channel is ready with another item (or it's been closed), or we give up.

One of the reasons this Go code initially confused me is that I started out misreading it as receiving as much as it could from a channel until it reached a timeout. Code that did that would do a lot of the same things (obviously it needs a timeout and a select that has that as one of the cases), and you could structure it somewhat similarly to this code (although I think it's more clearly written without a nested loop).
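Putting the two pieces together, here's a minimal self-contained sketch of the whole pattern. This is my own assembly for illustration, not the linked playground code; the function name, the generic signature, and what exactly gets returned are all mine:

package main

import (
  "fmt"
  "time"
)

// waitReadAll waits up to dur for a first item to arrive on c, then
// drains whatever else is immediately available and returns it all.
// It returns early if c is closed.
func waitReadAll[T any](c <-chan T, dur time.Duration) []T {
  var items []T

  // Outer select: the first item, a closed channel, or a timeout.
  select {
  case v, ok := <- c:
    if !ok {
      return items // closed and empty
    }
    items = append(items, v)
  case <- time.After(dur):
    return items // nothing arrived in time
  }

  // Inner loop: take anything that's ready right now, never waiting.
  for {
    select {
    case v, ok := <- c:
      if !ok {
        return items // closed and empty
      }
      items = append(items, v)
    default:
      return items // no more items immediately ready
    }
  }
}

func main() {
  c := make(chan int, 4)
  c <- 1
  c <- 2
  c <- 3
  fmt.Println(waitReadAll(c, 100*time.Millisecond))
}

Against an empty channel this returns an empty slice after the timeout; against a closed one it returns immediately with whatever it could drain.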

(This is one of those entries that I write partly to better understand something myself. I had to read this code carefully to really grasp it and I found it easy to mis-read on first impression.)

Starting scripts with '#!/usr/bin/env <whatever>' is rarely useful

By: cks
3 August 2025 at 02:09

In my entry on getting decent error reports in Bash for 'set -e', I said that even if you were on a system where /bin/sh was Bash and so my entry worked if you started your script with '#!/bin/sh', you should use '#!/bin/bash' instead for various reasons. A commentator took issue with this direct invocation of Bash and suggested '#!/usr/bin/env bash' instead. It's my view that using env this way, especially for Bash, is rarely useful; most of the time it's unnecessary and pointless, and sometimes it's actively dangerous.

The only reason to start your script with '#!/usr/bin/env <whatever>' is if you expect your script to run on a system where Bash or whatever else isn't where you expect (or when it has to run on systems that have '<whatever>' in different places, which is probably most common for third party packages). Broadly speaking this only happens if your script is portable and will run on many different sorts of systems. If your script is specific to your systems (and your systems are uniform), this is pointless; you know where Bash is and your systems aren't going to change it, not if they're sane. The same is true if you're targeting a specific Linux distribution, such as 'this is intrinsically an Ubuntu script'.

(In my case, the script I was doing this to is intrinsically specific to Ubuntu and our environment. It will never run on anything else.)

It's also worth noting that '#!/usr/bin/env <whatever>' only works if (the right version of) <whatever> can be found on your $PATH, and in fact the $PATH of every context where you will run the script (including, for example, from cron). If the system's default $PATH doesn't include the necessary directories, this will likely fail some of the time. This makes using 'env' especially dangerous in an environment where people may install their own version of interpreters like Python, because your script's use of 'env' may find their Python on their $PATH instead of the version that you expect.

(These days, one of the dangers with Python specifically is that people will have a $PATH that (currently) points to a virtual environment with some random selection of Python packages installed and not installed, instead of the system set of packages.)
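As a quick illustration of that $PATH dependence (all of the paths here are hypothetical), 'command -v' does roughly the same $PATH search that env will do:

$ echo "$PATH"
/home/someone/venv/bin:/usr/local/bin:/usr/bin:/bin
$ command -v python3
/home/someone/venv/bin/python3

A script that starts with '#!/usr/bin/env python3' will run that virtual environment's Python, with whatever packages it happens to have, while '#!/usr/bin/python3' always gets the system Python.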

As a practical matter, pretty much every mainstream Linux distribution has a /bin/bash (assuming that you install Bash, and I'm sorry, Nix and so on aren't mainstream). If you're targeting Linux in general, assuming /bin/bash exists is entirely reasonable. If a Linux distribution relocates Bash, in my view the resulting problems are on them. A lot of the time, similar things apply for other interpreters, such as Python, Perl, Ruby, and so on. '#!/usr/bin/python3' on Linux is much more likely to get you a predictable Python environment than '#!/usr/bin/env python3', and if it fails it will be a clean and obvious failure that's easy to diagnose.

Another issue is that even if your script is fixed to use 'env' to run Bash, it may or may not work in such an alternate environment because other things you expect to find in $PATH may not be there. Unless you're actually testing on alternate environments (such as Nix or FreeBSD), using 'env' may suggest more portability than you're actually able to deliver.

My personal view is that for most people, '#!/usr/bin/env' is a reflexive carry-over that they inherited from a past era of multi-architecture Unix environments, when much less was shipped with the system and so much less was in predictable locations. In that past Unix era, using '#!/usr/bin/env python' was a reasonably sensible thing; you could hope that the person who wanted to run your script had Python, but you couldn't predict where. For most people, those days are over, especially for scripts and programs that are purely for your internal use and that you won't be distributing to the world (much less inviting people to run your 'written on X' script on a Y, such as a FreeBSD script being run on Linux).

The XLibre project is explicitly political and you may not like the politics

By: cks
2 August 2025 at 02:52

A commentator on my 2024 entry on the uncertain possible futures of Unix graphical desktops brought up the XLibre project. XLibre is ostensibly a fork of the X server that will be developed by a new collection of people, which on the surface sounds unobjectionable and maybe a good thing for people (like me) who want X to keep being viable; as a result it has gotten a certain amount of publicity from credulous sources who don't look behind the curtain. Unfortunately for everyone, XLibre is an explicitly political project, and I don't mean that in the sense of disagreements about technical directions (the sense that you could say that 'forking is a political action', because it's the manifestation of a social disagreement). Instead I mean it in the regular sense of 'political', which is that the people involved in XLibre (especially its leader) have certain social values and policies that they espouse, and the XLibre project is explicitly manifesting some of them.

(Plus, a project cannot be divorced from the people involved in it.)

I am not going to summarize here; instead, you should read the Register article and its links, and also the relevant sections of Ariadne Conill's announcement of Wayback and their links. However, even if you "don't care" about politics, you should see this correction to earlier XLibre changes where the person making the earlier changes didn't understand what '2^16' did in C (I would say that the people who reviewed the changes also missed it, but there didn't seem to be anyone doing so, which ought to raise your eyebrows when it comes to the X server).

Using XLibre, shipping it as part of a distribution, or advocating for it is not a neutral choice. To do so is to align yourself, knowingly or unknowingly, with the politics of XLibre and with the politics of its leadership and the people its leadership will attract to the project. This is always true to some degree with any project, but it's especially true when the project is explicitly manifesting some of its leadership's values, out in the open. You can't detach XLibre from its leader.

My personal view is that I don't want to have anything to do with XLibre and I will think less of any Unix or Linux distribution that includes it, especially ones that intend to make it their primary X server. At a minimum, I feel those distributions haven't done their due diligence.

In general, my personal guess is that a new (forked) standalone X server is also the wrong approach to maintaining a working X server environment over the long term. Wayback combined with XWayland seems like a much more stable base because each of them has more support in various ways (eg, there are a lot of people who are going to want old X programs to keep working for years or decades to come and so lots of demand for most of XWayland's features).

(This elaborates on my comment on XLibre in this entry. I also think that a viable X based environment is far more likely to stop working due to important programs becoming Wayland-only than because you can no longer get a working X server.)

Some practical challenges of access management in 'IAM' systems

By: cks
1 August 2025 at 03:14

Suppose that you have a shiny new IAM system, and you take the 'access management' part of it seriously. Global access management is (or should be) simple; if you disable or suspend someone in your IAM system, they should wind up disabled everywhere. Well, they will wind up unable to authenticate. If they have existing credentials that are used without checking with your IAM system (including things like 'an existing SSH login'), you'll need some system to propagate the information that someone has been disabled in your IAM to consumers and arrange that existing sessions, credentials, and so on get shut down and revoked.

(This system will involve both IAM software features and features in the software that uses the IAM to determine identity.)

However, this only covers global access management. You probably have some things that only certain people should have access to, or that treat certain people differently. This is where our experiences with a non-IAM environment suggest to me that things start getting complex. For pure access, the simplest thing probably is if every separate client system or application has a separate ID and directly talks to the IAM, and the IAM can tell it 'this person cannot authenticate (to you)' or 'this person is disabled (for you)'. This starts to go wrong if you ever put two or more services or applications behind the same IAM client ID, for example if you set up a web server for one application (with an ID) and then host another application on the same web server for convenience (your web server is already there and already set up to talk to the IAM and so on).

This gets worse if there is a layer of indirection involved, so that systems and applications don't talk directly to your IAM but instead talk to, say, an LDAP server or a Radius server or whatever that's fed from your IAM (or is the party that talks to your IAM). I suspect that this is one reason why IAM software has a tendency to directly support a lot of protocols for identity and authentication.

(One thing that's sort of an extra layer of indirection is what people are trying to do, since they may have access permission for some things but not others.)

Another approach is for your IAM to only manage what 'groups' people are in and provide that information to clients, leaving it up to clients to make access decisions based on group membership. On the one hand, this is somewhat more straightforward; on the other hand, your IAM system is no longer directly managing access. It has to count on clients doing the right thing with the group information it hands them. At a minimum this gives you much less central visibility into what your access management rules are.

People not infrequently want complicated access control conditions for individual applications (including things like privilege levels). In any sort of access management system, you need to be able to express these conditions in rules. There's no uniform approach or language for expressing access control conditions, so your IAM will use one, your Unix systems will use one (or more) that you probably get to craft by hand using PAM tricks, your web applications will use one or more depending on what they're written in, and so on and so forth. One of the reasons that these languages differ is that the capabilities and concepts of each system will differ; a mesh VPN has different access control concerns than a web application. Of course these differences make it challenging to handle all of their access management in one single spot in an IAM system, leaving you with a choice: either you keep everything in the IAM but can't express everything you want, or you accept partially distributed access management.

A change in how Exim's ${run ...} string expansion operator does quoting

By: cks
31 July 2025 at 03:09

The Exim mail server has, among other features, a string expansion language with quite a number of expansion operators. One of those expansion operators is '${run}', which 'expands' by running a command and substituting in its output. As is commonly the case, ${run} is given the command to run and all of its command line arguments as a single string, without any explicit splitting into separate arguments:

${run {/some/command -a -b foo -c ...} [...]}

Any time a program does this, a very important question to ask is how this string is split up into separate arguments in order to be exec()'d. In Exim's case, the traditional answer is that it was rather complicated and not well documented, in a way that required you to explicitly quote many arguments that came from variables. In my entry on this I called Exim's then-current behavior dangerous and wrong, but also said it was probably too late to change it. Fortunately, the Exim developers did not heed my pessimism.

In Exim 4.96, this behavior of ${run} changed. To quote from the changelog:

The ${run} expansion item now expands its command string elements after splitting. Previously it was before; the new ordering makes handling zero-length arguments simpler. The old ordering can be obtained by appending a new option "preexpand", after a comma, to the "run".

(The new way is more or less the right way to do it, although it can create problems with some sorts of command string expansions.)

This is an important change because it is not backward compatible if you used deliberate quoting in your ${run} command strings. For example, if you ever expanded a potentially dangerous Exim variable in a ${run} command (say, one that might have a space in it), you previously had to wrap it in ${quote}:

${run {/some/command \
         --subject ${quote:$header_subject:} ...

(As seen in my entry on our attachment type logging with Exim.)

In Exim 4.96 and later, this same ${run} string expansion will add spurious quote marks around the email message's Subject: header as your program sees it. This is because ${quote:...} will add them, since you asked it to generate a quoted version of its argument, and then ${run} won't strip them out when splitting the command string apart into arguments, because the splitting now happens before the ${quote:} expansion is done. What this shows is that you probably don't need explicit quoting in ${run} command strings any more, unless you're doing tricky expansions with string expressions (in which case you'll have to use the new 'preexpand' option to get the old behavior back).
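Concretely, under the new splitting order the natural rewrite of that example is simply to drop the ${quote:} wrapper (this is a sketch, with the same placeholder command as above):

${run {/some/command \
         --subject $header_subject: ...

(If you genuinely need the old expand-then-split behavior, that's what the 'preexpand' option mentioned in the changelog is for.)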

To be clear, I'm all for this change. It makes straightforward and innocent use of ${run} much safer and more reliable (and it plays better with Exim's new rules about 'tainted' strings from the outside world, such as the subject header). Having to remove my use of ${quote:...} is a minor price to pay, and learning this sort of stuff in advance is why I build test servers and have test plans.

(This elaborates on a Fediverse post of mine.)
