FediMeteo, timezones, and the art of not breaking what already works

I have already written about how FediMeteo was born, and about how HAProxy helps reduce the number of requests that reach snac.

Seen from the outside, FediMeteo almost seems still. There is a static homepage, regenerated every hour. There are the city pages, with their forecasts. There are RSS feeds waiting to be fetched, JSON objects waiting to be requested, Fediverse instances refreshing data, subscribing, unsubscribing, retrieving profiles, and reading notes.

That is the visible part.

Behind it, however, FediMeteo is much more than a homepage, a few ActivityPub accounts, and a well-behaved reverse proxy. It is a chain of small pieces, in proper Unix style, each trying to do one thing and do it as well as possible.

That chain, although almost invisible from the outside, was not born already tidy. It changed, was rewritten, adapted to new countries, timezones, ambiguous city names, external service limits, and also to my own mistakes.

Some mistakes were small. Others were much less so.

Because FediMeteo is a human project and, as such, imperfect. Imperfect in the way humans are imperfect, which today almost seems unfashionable. I like that.

The first version of the bot was almost embarrassingly simple, and I was proud of that.

It took a city name as input, asked Nominatim for the coordinates through geopy, called the Open-Meteo API for the current weather and the next several days, and printed a markdown block with current conditions, the forecast for today, the next twelve hours, and the coming days. The text was in Italian. The cities were Italian. The timezone was Europe/Rome. There was nothing to calculate.

Around the script, a small sh wrapper read a list of cities and, for each one, ran the Python program and piped its output into snac note_unlisted. A cron job ran the wrapper every six hours. The output was loose markdown, which snac happily renders, and the integration was: standard output goes into standard input. Nothing fancier than that.

I like this kind of design. It is the part of the Unix philosophy that survives even when fashions change.

When I started adding other European countries, I did not need to change much. I separated the operational logic from the localized strings, moved the strings into one JSON file per country, and spread the cron entries so that not every country posted in the same minute. Each country had its own snac instance, in its own FreeBSD jail, with its own dataset. The bot, internally, was almost the same script as before.

This worked because Europe is, in essence, two or three timezones across most of the countries I cared about.

Then I added Germany, and Germany taught me my first lesson about names.

There are several places called Neustadt in Germany. There is a Frankfurt am Main, and a Frankfurt an der Oder, and they are not the same city. There is a Halle in Saxony-Anhalt and a Halle in North Rhine-Westphalia. Asking Nominatim for "Frankfurt, Germany" produced one of the two, consistently, but not always the one I wanted. Some German users wrote to me, politely, to point out that the forecast for "their" Frankfurt was, in fact, for the other one.

I started thinking about disambiguation, but only enough to fix the immediate cases. The bot still took a single city name. The ambiguous ones I worked around by editing the cities file and hoping for the best.

In hindsight, this was the seed of what would happen later.

The United States broke every assumption the bot had grown up with.

The first problem was the number of cities. I wanted reasonable coverage at state level, which meant identifying the main cities for each of the fifty states. The list ended up at more than 1200 entries. That alone is more cities than every other country in the project combined.

The second problem was timezones. The contiguous United States covers four of them, and Alaska and Hawaii bring the total to six. A "current weather at 12:00" line generated at the same instant for New York and for Los Angeles is technically the same instant, but the two cities are living different parts of the day, and the forecast for "today" is not even quite the same window. A bot that pretended every city was on the same clock would be wrong, sometimes embarrassingly so, every single day.

The third problem was the name thing again, only larger. There are dozens of Springfields. There is a Portland in Oregon and a Portland in Maine. The Germany workaround - editing the cities file by hand and hoping Nominatim picked the right city - was clearly not going to scale to a country where a city name can also be the name of a state.

I sat with this for a couple of days before admitting what I already knew.

The bot needed to be rewritten.

What made this hard was not the rewriting itself. It was the requirement to do it without breaking everything else.

By the time I decided to add the United States, the infrastructure around the bot had grown into something I trusted. Jails, snapshots, backup jobs, cron schedules, snac instances on production paths, the HAProxy layer, the homepage cron that aggregated follower counts, and a long list of cities being processed in series every six hours. None of that knew or cared about the bot's internal shape. All of it cared, very much, about the bot's external behavior: a city name and a country code go in, valid markdown comes out, and that markdown ends up in a timeline.

So the contract was clear, even if I had never written it down anywhere. The command-line interface, the output format, the exit codes, the way the wrapper script invoked it, the structure of the JSON country configs - all of it had to keep working. Italian had to keep working. German had to keep working. The cron job that ran every six hours had to keep producing the same shape of output, just with new countries added.

What I changed was almost everything below the surface.

The city argument grew an optional __state suffix, with a double underscore as separator:

python3 main.py springfield__illinois us
python3 main.py springfield__massachusetts us
python3 main.py new_york__new_york us

A city without the suffix continued to work exactly as before, which is what every European country needed. The country config gained a timezone field that could be a fixed string or the literal "auto"; when it was "auto", the bot used timezonefinder against the resolved coordinates to determine the right zone for that specific city. Internally I separated the weather provider behind an interface, so Open-Meteo could remain the primary while MET Norway and wttr.in sat behind as alternatives, with automatic fallback when the primary failed. Units became configurable per country: temperature, wind speed, precipitation. The United States needed Fahrenheit, miles per hour, and inches. Most of Europe wanted Celsius, kilometers per hour, and millimeters. The bot now does either, on a per-country basis, without caring which is which.
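A minimal sketch of how these three pieces might look, not the bot's actual code. The function names are hypothetical, and the timezone lookup and the weather providers are passed in as plain callables so the example runs without network access. With timezonefinder, the lookup would be `TimezoneFinder().timezone_at(lng=lon, lat=lat)`.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def split_city_arg(arg: str) -> Tuple[str, Optional[str]]:
    """Split 'springfield__illinois' into ('springfield', 'illinois').

    A plain city name such as 'ferrara' yields ('ferrara', None), so the
    European invocations keep working unchanged.
    """
    city, sep, state = arg.partition("__")
    return city, (state if sep and state else None)


def resolve_timezone(
    configured: str,
    lat: float,
    lon: float,
    lookup: Callable[[float, float], Optional[str]],
) -> str:
    """Return the configured zone, or look it up when it is 'auto'."""
    if configured != "auto":
        return configured
    found = lookup(lat, lon)
    if found is None:
        raise ValueError(f"no timezone found for {lat}, {lon}")
    return found


def fetch_with_fallback(
    providers: Sequence[Tuple[str, Callable[[float, float], Dict]]],
    lat: float,
    lon: float,
) -> Tuple[str, Dict]:
    """Try each provider in order and return the first successful result."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(lat, lon)
        except Exception as exc:  # any provider failure triggers the fallback
            # Record only the exception type: messages may contain URLs.
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The providers list is ordered, so the primary comes first and the alternatives are only contacted when it fails.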

I am skipping a lot of small detail here, but the principle was always the same: every new degree of freedom had to be expressible as an optional field in the config or as an optional CLI flag. If a country did not set the new field, the old behavior continued, identical to before.
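The "optional field, old behavior by default" principle can be sketched as a simple overlay of the country config on a set of defaults. The field names and default values below are assumptions for illustration, not the project's real config schema.

```python
from typing import Any, Dict

# Hypothetical defaults: they stand in for the behavior of the first,
# Italy-only version of the bot. Field names are illustrative.
DEFAULTS: Dict[str, Any] = {
    "timezone": "Europe/Rome",
    "temperature_unit": "celsius",
    "wind_speed_unit": "kmh",
    "precipitation_unit": "mm",
}


def effective_config(country_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay a country config on the defaults without mutating either."""
    return {**DEFAULTS, **country_config}
```

A config that sets nothing new gets exactly the defaults, while a United States config would override only the fields it needs, for example `{"timezone": "auto", "temperature_unit": "fahrenheit"}`.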

I tested this by running the new bot against the old country configs and comparing the output line by line. Where it differed, it was a bug in the new bot. Not in the test.

The first cycle after deploying the rewrite was, for every country except the United States, indistinguishable from the cycle before. That was the point.

This is the part of the story I dislike telling, which is precisely why I should tell it.

At some point during the development, while debugging an Open-Meteo response that did not look right, I added a print statement to the error path that dumped the full request URL whenever something went wrong. The full URL of the Open-Meteo customer endpoint includes the apikey query parameter. The print was meant for development. I forgot to remove it.

I deployed.

The next time Open-Meteo had an outage - and small ones happen, sometimes for several minutes at a time - the bot dutifully printed the failing request URL into the post body. For every city. For every cycle that ran during the outage. The wrapper script piped the output into snac note_unlisted without complaint. The posts went out, federated across the Fediverse, with my API key sitting in the text for anyone who cared to read.

Some users were kind enough to write to me and tell me. Others were less kind, and made fun of me. Both groups were correct. This should not have happened.

I reported the incident to the Open-Meteo team, who were extremely understanding. They rotated the key immediately and gave me a fresh one. I removed the debug print, and then I did the slightly more useful thing, which was to add redaction at multiple layers - in the bot's output, in the daemon's logging, and in the debug helpers themselves. URL query parameters that look like API keys are masked. Environment variables and config keys named apikey or OPEN_METEO_APIKEY are redacted before any string reaches stdout or a log file. Even JSON-like fields that include open_meteo_apikey are scrubbed if they ever appear in something the program prints.

The lesson is not "be more careful." The lesson is that debug paths leak, sooner or later, so the secrets have to be unreachable from the debug paths in the first place. Now they are.
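One of those redaction layers could look roughly like this. It is a sketch under assumptions: only the key names (`apikey`, `OPEN_METEO_APIKEY`, `open_meteo_apikey`) come from the text above, while the regular expressions, the `[REDACTED]` marker, and the function name are illustrative.

```python
import re

# Each entry is (pattern, replacement). The patterns are an assumption;
# only the key names are taken from the article.
_PATTERNS = [
    # ?apikey=... or &apikey=... inside a URL
    (re.compile(r"(?i)([?&]apikey=)[^&\s\"']+"), r"\1[REDACTED]"),
    # OPEN_METEO_APIKEY=... as it would appear in an environment dump
    (re.compile(r"(?i)\b(open_meteo_apikey\s*=\s*)[^\s\"']+"), r"\1[REDACTED]"),
    # "apikey": "..." or "open_meteo_apikey": "..." in JSON-like text
    (re.compile(r'(?i)("(?:open_meteo_)?apikey"\s*:\s*")[^"]*'), r"\1[REDACTED]"),
]


def redact(text: str) -> str:
    """Mask anything that looks like an API key before it is printed or logged."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

The idea is that every string headed for stdout or a log file passes through `redact` first, so a forgotten debug print can no longer carry the key with it.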

That afternoon, when I realised what was happening, I closed everything for a minute and looked out of the window. Then I started fixing.

Nominatim is a public service, and it is generous, but it is not infinite. Every city in the project needs coordinates, and at the start of the project every cycle would re-ask Nominatim for every city. Most of the time this worked. Sometimes it did not.

There was one cycle, before I added caching, when Nominatim simply did not respond for one of my queries. The geopy call timed out. The bot raised an exception. The wrapper script gave up on that city and moved on to the next one. A few users noticed that a particular city had not received its forecast that day, and asked what had happened.

I added a coordinate cache, and I am still grateful that I did.

The cache is intentionally boring. The first time the bot resolves a city, it writes the latitude and longitude into a small file under /tmp, named after the city, and the state when present. Every subsequent run reads the file. If the file exists, no Nominatim call is made. If the file is missing, the bot calls Nominatim and writes the file. After the first successful lookup, the cache becomes the source of truth for the coordinates of that city.
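A minimal sketch of such a file-per-city cache, assuming a JSON file with `lat` and `lon` keys. The `.coords.json` file name and the function name are hypothetical, and the geocoder is passed in as a callable so the sketch does not depend on geopy or on the network.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


def cached_coordinates(
    city: str,
    state: Optional[str],
    geocode: Callable[[str, Optional[str]], Coordinates],
    cache_dir: Path = Path("/tmp"),
) -> Coordinates:
    """Return cached coordinates, calling the geocoder only on a miss."""
    name = city if state is None else f"{city}__{state}"
    cache_file = cache_dir / f"{name}.coords.json"  # hypothetical file name
    if cache_file.exists():
        data = json.loads(cache_file.read_text())
        return float(data["lat"]), float(data["lon"])
    lat, lon = geocode(city, state)  # only reached when the file is missing
    cache_file.write_text(json.dumps({"lat": lat, "lon": lon}))
    return lat, lon
```

Because the file is the source of truth once it exists, a failed or slow geocoder can only affect the very first lookup of a city.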

This is lighter on Nominatim, faster for every cycle, and much more resilient against transient failures. It is also nice for a reason I did not anticipate.

Nominatim is a geocoder, and like every geocoder it has opinions.

I live in Ferrara, so when I added Italy I made sure Ferrara was in the list, and I checked the first cycle to make sure everything looked right. The forecast came out fine. The temperature was reasonable. The icon matched the sky outside my window. I closed the laptop and forgot about it.

Then, one evening months later, I looked more carefully at the coordinates Nominatim had returned for "Ferrara, Italy", and I realised they did not point to the city. They pointed to a location closer to the centroid of the province, which is a much larger area and mostly countryside. The forecast had been, on average, for a field somewhere outside town, not for the city center.

I am not entirely sure why I had not noticed earlier. Probably because the weather in Ferrara and the weather in the fields outside Ferrara are, on most days, indistinguishable to anyone who is not paying attention. But this is the kind of detail I do not want to leave wrong, especially for my own city.

There are other places where geocoding lands slightly off. Sometimes it is a few kilometers, sometimes a different neighborhood, sometimes genuinely the wrong place.

Because the cache is just a file per city, the fix is also just a file per city. I open the cache file, replace the latitude and longitude with the correct values, save. The next cycle uses the corrected coordinates. No code change, no redeploy, no special tooling. I keep a small list of patched cities in a separate text file, so that if I ever rebuild the cache, I do not lose the manual corrections.

This is the kind of operational simplicity I like. A cache made of plain files costs almost nothing and quietly pays back every time a small problem appears.

For every report it generates, the bot also writes a simplified English text snapshot to /tmp/<city>.txt, or /tmp/<city>__<state>.txt when there is a state.

This is intentional, and it is not a debug artifact. I am not ready to say what I am doing with it yet, but it is part of a future direction for the project. Text is a useful intermediate format, and having a clean, language-neutral representation of every forecast sitting on disk costs almost nothing and might be worth a great deal later.

I prefer to let ideas mature in private before I commit to them in public. So I will leave it at this for the moment.

A full cycle for the United States takes hours.

It is not because the work is heavy. It is because I deliberately inserted a small sleep between cities, to give snac time to dispatch the previous post before the next one is generated. With more than 1200 cities in series, even a short pause adds up. I am not in a hurry. Forecasts that arrive a few minutes apart from each other are not a problem, and the bot was already a polite citizen elsewhere. A polite cycle is fine.

The problem with a slow cycle is not the duration. The problem is what happens to it.

In the original design, the cycle was launched by cron. Every six hours, cron called the wrapper script, the wrapper iterated through the cities file, and for each city it ran the bot and piped the output into snac. There was no scheduler in the project at all. Cron was the scheduler. The wrapper was just a loop.

Restarting snac was harmless. The wrapper would call snac note_unlisted per city, and if snac happened to be unavailable for a moment, that single call might fail, but the loop kept moving and snac was usually back within seconds. Snac itself was not what held the cycle together.

What held the cycle together was the wrapper process. And the wrapper process lived inside the jail.

If the FreeBSD jail was restarted while the wrapper was running, the loop stopped wherever it happened to be. The cron schedule did not care. Six hours later, the next cron tick started a new cycle from the first city, and the cities that had been about to be processed at the moment of the restart were simply skipped for that window. For the United States, this could mean several hundred cities going without an update.

There was a worse case, and it took me longer than it should have to recognise it. If the host was rebooting exactly in the minute when cron should have fired, cron simply did not fire. There was no daemon waiting to pick up the missed tick. The cycle never even started. Six hours of forecasts would be lost, in silence, with nothing in any log to suggest anything had gone wrong.

I lived with this for a long time. Reboots were rare, the impact was limited, and adding state was the kind of thing I always meant to do "next week."

What finally changed it was not a dramatic incident. It was the slow accumulation of small ones. A scheduled VPS reboot. A jail restart after an upgrade. Each one on its own was nothing. Together, they were a steady drip of missed cycles.

So I wrote a daemon.

The crontab entries for the bot went away. There is now a long-running process inside the jail, started at boot, and it does the scheduling itself. The schedule is a list of hours and a minute, read from a JSON config. The daemon wakes up once a minute, checks whether it is time to start a cycle, and either starts one or waits.
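The once-a-minute check might be sketched as follows. The function name and the `last_started` guard (which prevents starting the same slot twice within one minute) are assumptions, as is the example schedule in the usage note.

```python
from datetime import datetime
from typing import Iterable, Optional


def is_cycle_due(
    now: datetime,
    hours: Iterable[int],
    minute: int,
    last_started: Optional[datetime] = None,
) -> bool:
    """True when `now` falls on a scheduled slot that has not started yet."""
    if now.hour not in set(hours) or now.minute != minute:
        return False
    slot = now.replace(second=0, microsecond=0)
    # A cycle already started inside this slot must not be started again.
    return last_started is None or last_started < slot
```

With a hypothetical config of `{"hours": [0, 6, 12, 18], "minute": 5}`, the daemon would call `is_cycle_due(datetime.now(), cfg["hours"], cfg["minute"], last_started)` on every wake-up and sleep again when it returns `False`.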

The interesting part is the state file.

As the daemon walks through the cities file, it writes its position to a small JSON file: which cities file it is processing, and the index of the next city to handle. The write happens at the boundary between one city and the next, because that is the only place where resuming makes sense. If the daemon is interrupted mid-city, that city is retried on resume; no half-finished post escapes.

When the daemon starts, it reads the state file. If it finds one matching the current cities file, it resumes from the saved index. If the cities file has changed since the state was written, the daemon starts fresh. The check is deliberately conservative: a renamed or modified cities file is treated as a different cycle, because the indices would otherwise be meaningless.
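The save-and-resume logic can be sketched like this. It assumes the state file records the cities file path, a SHA-256 of its content, and the next index; the field names, the hashing choice, and the function names are illustrative rather than the daemon's real format.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Callable, Dict


def fingerprint(cities_file: Path) -> Dict[str, str]:
    """Identify a cities file by path and content (an assumed scheme)."""
    digest = hashlib.sha256(cities_file.read_bytes()).hexdigest()
    return {"path": str(cities_file), "sha256": digest}


def save_state(state_file: Path, cities_file: Path, next_index: int) -> None:
    """Write the position atomically, so a crash never leaves half a file."""
    payload = {**fingerprint(cities_file), "next_index": next_index}
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, state_file)


def load_resume_index(state_file: Path, cities_file: Path) -> int:
    """Return the saved index, or 0 when the state is missing or stale."""
    try:
        saved = json.loads(state_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return 0
    current = fingerprint(cities_file)
    if saved.get("path") != current["path"] or saved.get("sha256") != current["sha256"]:
        return 0  # renamed or modified cities file: start fresh
    return int(saved.get("next_index", 0))


def run_cycle(cities_file: Path, state_file: Path, handle_city: Callable[[str], None]) -> None:
    cities = cities_file.read_text().splitlines()
    start = load_resume_index(state_file, cities_file)
    for index in range(start, len(cities)):
        handle_city(cities[index])
        # Written only at the boundary between one city and the next.
        save_state(state_file, cities_file, index + 1)
    state_file.unlink(missing_ok=True)  # cycle finished: reset the state
```

If `handle_city` is interrupted, the saved index still points at that city, so it is retried on the next start.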

The result is the behavior I should have had from the start. If the host reboots while the United States cycle is running, the daemon comes back up with the jail, reads the state, and continues from where it left off. Every city still gets its update, just with a small gap corresponding to the reboot itself. The cycle finishes. The state file is reset. Life goes on.

And the worst case from the cron days is gone. The daemon does not need anyone to fire it. As long as the jail is running, the daemon is running, and the next scheduled cycle will happen when its hour comes, regardless of what was happening at any specific minute.

Of all the changes I have made to the project, this is the one I like most. It is not exciting work. It is the kind of thing that earns no applause because, when it works, it produces no visible event. But it removes a whole class of small daily annoyances, and it makes a slow process robust against the boring kind of failure: the kind nobody plans for, but that always eventually happens.

The current bot does considerably more than the original Italian script. It handles per-city timezones, three weather providers with automatic fallback, unit conversion for temperature, wind, and precipitation, optional air quality, pressure trend indicators when the provider supplies pressure data, a simplified English text snapshot for future use, a coordinate cache that can be patched by hand, secret redaction at multiple layers, a heartbeat that adapts to whichever HTTP client is installed on the host, and a scheduler-and-resume daemon that survives reboots.

But from the outside, almost nothing has changed.

The European country configs work the same way they always did. The wrapper scripts are unchanged. The snac integration is the same one-line pipe. The HAProxy layer in front does not know or care that the bot was rewritten. The homepage cron that counts followers and regenerates the static page works exactly as before.

The original Italian script does not exist as a file anymore, but it survives as a default. A country config with timezone set to Europe/Rome and no special options behaves, today, exactly as the first version of the bot would have. Everything else is opt-in.

I like this kind of work.

FediMeteo, HAProxy, and the art of not wasting snac threads

When I wrote about FediMeteo for the first time, I told the story from the beginning: the idea born almost by chance while checking the weather for a holiday, the memory of my grandfather, who for years had been my personal meteorologist, the decision to build something small and useful, and then the surprise of seeing people actually use it. What began as a personal experiment quickly became a small global service, still running with the same philosophy: FreeBSD, jails, simple scripts, snac, text, emoji, and a lot of small pieces doing their work quietly.

That article was mostly about the birth and growth of the project. This one is about one of the less romantic parts of the same story, although I have to admit that I find a certain beauty in it too: keeping the service light as it grows.

FediMeteo is still intentionally simple from the outside. A homepage, some numbers, a list of countries, and many ActivityPub accounts publishing weather forecasts. The posts are text and emoji. There is no JavaScript requirement to read the pages, no heavy frontend, no unnecessary media attached to every forecast, and no dynamic homepage recalculated at every visit just to show the same numbers. This is not accidental. It is the way I wanted the service to behave from the beginning.

But the more the service is used, the more the small details matter. A request that looks harmless when there are ten followers may become a repeated request when there are thousands of followers, remote instances, crawlers, previews, and other servers fetching the same public objects. In the Fediverse, the same small thing can be asked many times by many different places, each one with a perfectly legitimate reason. The backend doesn't care: it just needs to deal with the requests.

And in FediMeteo, the backend is snac.

I like snac very much precisely because it is small, clear, and efficient. It is not a giant application that tries to be everything. It does a focused job and does it well. But this also means that I want to respect its shape. I do not want to waste its threads on work that the reverse proxy can safely do. A snac thread serving the same public avatar again and again is not a tragedy, but it is still a waste. A snac thread answering the same public ActivityPub object several times in the same minute is doing real work, but often not necessary work.

This is the reason behind the HAProxy tuning I am currently using in front of FediMeteo.

It is not about making the configuration look clever. It is about keeping snac quiet.

A continuation of the same idea

I had already explored the same problem with snac and nginx in two previous posts: Improving snac Performance with Nginx Proxy Cache and Caching snac Proxied Media with Nginx. In both cases, the idea was that the reverse proxy should absorb repeated public requests instead of letting them consume snac resources.

This is especially important because snac uses a limited number of threads. I like that. Limits are healthy. They force us to understand what the service is doing, and they prevent a small program from pretending to be an infinite resource. But limits also make waste visible. If a few threads are busy serving files that could have been served from cache, those threads are not available for something more useful.

With FediMeteo the implementation is different because the reverse proxy is HAProxy, but the reasoning is the same. I have many small snac instances, each one in its own FreeBSD (Bastille) jail, and one public entry point that has to route, terminate TLS, compress, cache, and generally remove as much repetitive work as possible from the backends.

This is, in a way, the natural continuation of the original FediMeteo design. In the first article I wrote that I wanted to manage everything according to the Unix philosophy: small pieces working together. This is another piece of that same puzzle. HAProxy does the edge work. snac does the ActivityPub work. Scripts generate forecasts. cron launches updates. ZFS gives me snapshots. FreeBSD jails keep countries separated. Nothing is particularly heroic by itself, but the whole system becomes pleasant because each part has a clear responsibility.

Why there is almost no media

Before talking about HAProxy, it is worth mentioning one of the most important optimizations, which is not in the proxy configuration at all.

FediMeteo does not use media in its forecasts.

No images attached to the posts, no generated weather cards, no maps for each city, no decorative banners. The forecasts are text and emoji. This was a deliberate decision. Weather information does not become more useful just because it is put inside an image, and every media file used by the service would become something to store, serve, cache, federate, expire, back up, and occasionally debug.

Text and emoji are enough. They are accessible, light, readable in text browsers, friendly to timelines, and understandable even when someone does not know the local language perfectly. This was one of the original design principles of FediMeteo, and it also helps the infrastructure. Less media means less work, fewer cache entries, fewer repeated fetches, fewer surprises.

There is one exception: the avatar.

All FediMeteo accounts use the same avatar, and this is also intentional. I could have used a different avatar for each country, or for each city, or created something visually richer. It would have been nicer in some screenshots, perhaps. It would also have been operationally worse.

With one shared avatar, the reverse proxy has one very useful object to cache. It is public, identical for everyone, small, requested often, and therefore almost always hot in cache. HAProxy can serve it directly instead of asking each snac instance to return the same file. Since avatars are requested by remote instances, browsers, profile previews, and all sorts of federation-related fetches, this single decision removes a surprising amount of pointless backend traffic.

So the avatar is not only a visual identity. It is part of the architecture.

This is the kind of optimization I like most, because it starts before the software. It starts with deciding not to create a problem.

The homepage is static because it can be static

The main homepage follows the same logic.

It is a static HTML page generated from a template. Once per hour, a cron script updates the numbers and statistics. It counts the data I want to show, regenerates the page, and then the page remains static until the next run.

This is not because I cannot make a dynamic page. It is because I do not need one. Boring is good.

The homepage does not need to query all the country instances on every visit. It does not need a database request for each user who opens it. It does not need to ask snac anything in real time. The numbers are useful, but they do not need to be updated every second. Once per hour is enough, and it also fits the spirit of the whole project: do the work when it is needed, then serve the result cheaply.

I have seen too many small services become heavy because the first implementation was convenient rather than appropriate. A cron job and a template are not fashionable, but they are often exactly what a page like this needs.

Many countries, one entry point

FediMeteo is made of many country instances. Each one runs in its own jail and listens on its own internal address and port. From the outside, however, they all live under the same domain structure:

fedimeteo.com
www.fedimeteo.com
it.fedimeteo.com
uk.fedimeteo.com
jp.fedimeteo.com
us.fedimeteo.com
usa.fedimeteo.com
can.fedimeteo.com
canada.fedimeteo.com

And many more.

At the beginning, it is always tempting to write one ACL after another in the HAProxy frontend. It is quick, it is explicit, and for five hostnames it is perfectly fine. But FediMeteo did not remain at five hostnames. As countries and aliases grew, a long chain of ACLs would have turned the frontend into a list of names instead of a description of how the proxy behaves.

So I moved the hostname-to-backend mapping into a map file:

fedimeteo.com        backend_fedimeteo
www.fedimeteo.com    backend_fedimeteo
it.fedimeteo.com     backend_it
uk.fedimeteo.com     backend_uk
jp.fedimeteo.com     backend_jp
us.fedimeteo.com     backend_us
usa.fedimeteo.com    backend_us
can.fedimeteo.com    backend_ca
canada.fedimeteo.com backend_ca

The frontend then needs only one rule:

use_backend %[req.hdr(host),field(1,:),lower,map(/usr/local/etc/fedimeteo.map,backend_fedimeteo)]

This reads the Host header, removes the port if present, lowercases the result, and looks it up in /usr/local/etc/fedimeteo.map. If nothing matches, it falls back to the main FediMeteo backend.
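To make the lookup steps concrete, here is a small Python model of what that rule does. It is only an illustration of the logic, not how HAProxy implements maps, and the function names are hypothetical.

```python
from typing import Dict

DEFAULT_BACKEND = "backend_fedimeteo"


def parse_map(text: str) -> Dict[str, str]:
    """Parse 'key value' lines, skipping blank lines and '#' comments."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)
        table[key] = value.strip()
    return table


def pick_backend(host_header: str, table: Dict[str, str]) -> str:
    host = host_header.split(":", 1)[0]        # field(1,:) drops the port
    host = host.lower()                        # lower
    return table.get(host, DEFAULT_BACKEND)    # map(file, default)
```

For example, a `Host` header of `IT.FediMeteo.com:443` resolves to `backend_it`, while an unknown hostname falls back to `backend_fedimeteo`.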

I like this because it keeps the configuration honest. The frontend contains the policy. The map contains the data. Adding a country means adding an entry to the map and defining a backend. I do not need to make the frontend more complicated every time the service grows.

Backends as small compartments

The country backends are deliberately plain:

backend backend_it
    mode http
    http-reuse safe
    server srv1 10.0.0.2:8001 maxconn 30

backend backend_uk
    mode http
    http-reuse safe
    server srv1 10.0.0.7:8001 maxconn 30

backend backend_jp
    mode http
    http-reuse safe
    server srv1 10.0.0.32:8001 maxconn 30

One backend, one jail, one snac instance. This is exactly the same organizational principle as the rest of the project. If I need to reason about Italy, I look at the Italian jail. If I need to reason about the United Kingdom, I look at the UK jail. If one day I need to move a country elsewhere, the separation is already there.

The maxconn 30 value is not a magic number. It is a ceiling. I want each small backend to have a visible limit in front of it. If something starts hammering a country instance, I prefer the pressure to appear at the HAProxy layer instead of becoming unlimited concurrent work inside snac.

http-reuse safe lets HAProxy reuse backend connections where appropriate. This is another small reduction in unnecessary work. Opening connections repeatedly is not the biggest problem in the world, but avoiding it is still better, especially when many small services sit behind the same proxy.

The front door

The HTTPS frontend listens on IPv4 and IPv6 and offers both HTTP/2 and HTTP/1.1:

frontend https_in
    bind :::443 v4v6 ssl crt /usr/local/etc/certs/ alpn h2,http/1.1
    mode http
    option http-keep-alive

TLS defaults are set globally:

ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

Port 80 only redirects to HTTPS, except for Let's Encrypt challenges:

acl letsencrypt-acl path_beg /.well-known/acme-challenge/
http-request redirect scheme https code 301 unless letsencrypt-acl
use_backend letsencrypt-backend if letsencrypt-acl

In the HTTPS frontend I also set the usual forwarding headers:

http-request set-header X-Real-IP %[src]
http-request set-header X-Forwarded-Proto https

And I add HSTS:

http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"

None of this is unusual, and that is fine. The interesting parts of an infrastructure are not always the parts that should be unusual.

Two caches, because the requests are different

The HAProxy configuration defines two caches:

cache mediacache
  total-max-size 128
  max-object-size 10000000
  max-age 3600
  process-vary on
  max-secondary-entries 12

cache jsoncache
  total-max-size 16
  max-object-size 1000000
  max-age 60
  process-vary on
  max-secondary-entries 12

I keep media and ActivityPub JSON separate because they are not the same kind of traffic.

The media cache is larger and has a longer maximum age. In FediMeteo, this mostly means the shared avatar and a few static-looking objects. Since there is intentionally almost no media, the important cached object is requested very often and remains warm.

The JSON cache is smaller and short-lived. It is there for public ActivityPub GET requests, not to store federation state forever. A 60 second cache is enough to collapse many repeated requests that arrive close together in time, without pretending that ActivityPub responses should be treated like immutable files.

This distinction is important. Caching is not one decision. It is a set of small decisions about what a response means, who can see it, how often it changes, and what happens if it is served again.
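The effect of a 60 second microcache can be illustrated with a small conceptual model. This is not HAProxy's cache implementation: it ignores size limits and Vary handling, and the class and parameter names are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class MicroCache:
    """Serve a stored response for up to `max_age` seconds per key."""

    def __init__(self, max_age: float, clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str, fetch: Callable[[], bytes]) -> bytes:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]          # hit: the backend is not contacted
        body = fetch()               # miss or expired: one backend request
        self._entries[key] = (now, body)
        return body
```

Many requests for the same public object arriving within the same minute cost the backend a single fetch, and the stored copy is never older than `max_age`.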

Recognizing media

For media, the ACL is based on file extensions:

acl is_media path_end -i .jpg .jpeg .png .gif .webp .svg .ico .mp4 .webm .mp3 .ogg .wav .flac .mov .avi .mkv .m4v

Then I store the result in a transaction variable:

http-request set-var(txn.is_media) bool(true) if is_media

The cache lookup is straightforward:

http-request cache-use mediacache if { var(txn.is_media) -m bool true }

And on the response side:

http-response set-header Cache-Control "max-age=3600, public" if { var(txn.is_media) -m bool true }
http-response del-header Set-Cookie if { var(txn.is_media) -m bool true }
http-response del-header Vary if { var(txn.is_media) -m bool true }
http-response cache-store mediacache if { var(txn.is_media) -m bool true }

The Cache-Control header makes the intent explicit. Set-Cookie is removed because a public media object should not carry session information. Vary is removed because I do not want the same avatar to fragment into many cache entries because of harmless header differences.

This is aggressive only if removed from its context. In this service, with this media policy, it is a reasonable choice. FediMeteo is not serving private media under these paths. It is mostly serving the same public avatar over and over.

For the same reason, I clean the request before it reaches the backend:

http-request del-header Authorization if { var(txn.is_media) -m bool true }
http-request del-header Cookie        if { var(txn.is_media) -m bool true }

I would not do this globally. I do it after deciding that the request is media. Scope is what makes these rules safe.

The result is exactly what I want: the shared avatar becomes an almost perfect cache object. Small, public, repeatedly requested, and served by HAProxy instead of snac.

ActivityPub JSON microcaching

The ActivityPub side starts from the Accept header:

acl is_ap_json   req.hdr(Accept),lower -m sub application/activity+json
acl is_ap_ldjson req.hdr(Accept),lower -m sub application/ld+json
acl is_outbox    path_end /outbox
acl is_get       method GET
acl has_auth     req.hdr(Authorization) -m found
acl has_cookie   req.hdr(Cookie) -m found

This part matters because ActivityPub uses content negotiation. The same path may return HTML to a browser and JSON to a remote instance. If the proxy pretends that a URL is always one thing, it will eventually cache the wrong representation.

So I only mark public ActivityPub GET requests as cacheable:

http-request set-var(txn.is_activitypub) bool(true) if is_get !is_outbox is_ap_json !has_auth !has_cookie
http-request set-var(txn.is_activitypub) bool(true) if is_get !is_outbox is_ap_ldjson !has_auth !has_cookie

There are several decisions here, all important.

It must be a GET, because I am not caching deliveries or anything that changes state. It must not be /outbox, because outbox collections are not the traffic I want to cache here. It must not have Authorization, and it must not have cookies, because authenticated or user-specific requests do not belong in a shared public cache.
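As a rough illustration, the same decision can be written as a small Python function. This is only a sketch of the logic in the ACLs above, not part of the HAProxy configuration; the function name and the sample paths are made up for the example.

```python
def is_public_activitypub_get(method, path, headers):
    """Mirror the is_get, is_outbox, is_ap_json/is_ap_ldjson, has_auth
    and has_cookie checks from the HAProxy rules (illustrative only)."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    accept = h.get("accept", "").lower()
    wants_activitypub = (
        "application/activity+json" in accept
        or "application/ld+json" in accept
    )
    return (
        method == "GET"                      # no deliveries, nothing that changes state
        and not path.endswith("/outbox")     # outbox collections are excluded
        and wants_activitypub                # content negotiation: JSON, not HTML
        and "authorization" not in h         # no authenticated requests
        and "cookie" not in h                # no user-specific requests
    )


# Hypothetical requests: only the first one is a candidate for the shared cache.
print(is_public_activitypub_get("GET", "/roma", {"Accept": "application/activity+json"}))
print(is_public_activitypub_get("GET", "/roma", {"Accept": "text/html"}))
print(is_public_activitypub_get("GET", "/roma/outbox", {"Accept": "application/activity+json"}))
```

Every condition is a reason to say no. A request reaches the shared cache only when none of them applies.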

Then the cache can be used and populated:

http-request cache-use jsoncache if { var(txn.is_activitypub) -m bool true }

http-response set-header Cache-Control "max-age=60, public" if { var(txn.is_activitypub) -m bool true }
http-response cache-store jsoncache if { var(txn.is_activitypub) -m bool true }

Sixty seconds is short, but useful. Federation often creates small clusters of identical requests. A remote server fetches an actor, another fetches the same actor, something asks for the same object, something retries. I do not need to cache these responses for hours. I only need HAProxy to answer the second and third identical request during the same small burst.

This is microcaching in the most practical sense. It reduces repeated work without changing the nature of the service.
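To make the effect of a 60 second window concrete, here is a minimal simulation in Python. It is a toy model, not how HAProxy implements its cache: the class name, the "/actor" key and the timestamps are assumptions for the example, and a real cache key would also include the host and any Vary-related information.

```python
class MicroCache:
    """A tiny TTL cache: the first request in a window reaches the backend,
    identical requests within max_age seconds are answered from memory."""

    def __init__(self, max_age=60):
        self.max_age = max_age
        self.entries = {}        # key -> (stored_at, body)
        self.backend_hits = 0    # how many requests reached the backend

    def get(self, key, now, fetch):
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]      # still fresh: served from cache
        self.backend_hits += 1   # expired or never seen: ask the backend
        body = fetch(key)
        self.entries[key] = (now, body)
        return body


cache = MicroCache(max_age=60)
# Five requests for the same actor arrive at t = 0, 1, 2, 30 and 59 seconds,
# then one more arrives at t = 61, after the entry has expired.
for t in (0, 1, 2, 30, 59, 61):
    cache.get("/actor", now=t, fetch=lambda key: "actor-json")

print(cache.backend_hits)  # 2
```

Six requests, two backend fetches. The cached copy is never more than a minute old, which is the whole point of keeping the window short.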

Static media paths

There is also a rule for static paths:

acl is_short_path path_reg ^/[^/]+/s/
http-request cache-use mediacache if is_short_path

This comes from the same observation that led me to cache snac media with nginx. snac uses static media paths, and those paths often represent the kind of public, repeatable traffic that should not consume backend threads if the proxy can serve it. I call them "short", not because they are, but because the first time I saw them, I thought the 's' stood for "short", not "static". The name just stuck.

In FediMeteo this is less central than on a normal social instance, because I deliberately do not use media except for the avatar and basic static objects. Still, the rule fits the general policy: let HAProxy handle repeatable edge work, and let snac spend its threads where they are actually needed.

Vary, but not without limits

Both caches have:

process-vary on
max-secondary-entries 12

I want HAProxy to process Vary, because content negotiation is real, especially when ActivityPub is involved. But I also want variation to be bounded. If every slightly different header creates another cache entry, the cache becomes a complicated way to miss.

For media, I remove Vary before storing the response. A shared avatar does not need to vary by Accept. For ActivityPub JSON, I am more careful because the representation matters.

Again, the important thing is not the number itself. It is the decision to make variation explicit and limited.

Seeing whether it works

During rollout, I like to expose a very small diagnostic header:

http-response set-header X-Cache-Status HIT if !{ srv_id -m found }
http-response set-header X-Cache-Status MISS if { srv_id -m found }

This is intentionally simple. If HAProxy selected a backend server, I call it a miss. If no backend server was selected, the response came from cache, so I call it a hit. It is not a complete observability system, but it is enough to answer the first question I usually have after changing a cache rule.

Did this request reach snac?

A test can be as simple as:

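# Issue a real GET (curl -I would send HEAD) and print only the response headers
curl -s -o /dev/null -D - https://it.fedimeteo.com/path/to/avatar.png
curl -s -o /dev/null -D - https://it.fedimeteo.com/path/to/avatar.png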

The second request should be a hit.

For ActivityPub JSON, the test must use the right Accept header:

# curl -I would send HEAD, which the is_get ACL does not match, so issue a GET
curl -s -o /dev/null -D - \
  -H 'Accept: application/activity+json' \
  https://it.fedimeteo.com/some/activitypub/object

And I also want to verify that cookies and authorization prevent public caching:

# GET requests again (not curl -I), so that only the extra header changes the outcome
curl -s -o /dev/null -D - \
  -H 'Cookie: test=value' \
  -H 'Accept: application/activity+json' \
  https://it.fedimeteo.com/some/activitypub/object

curl -s -o /dev/null -D - \
  -H 'Authorization: Bearer fake' \
  -H 'Accept: application/activity+json' \
  https://it.fedimeteo.com/some/activitypub/object

A cache that works should be visible. A cache that is invisible can be correct, but it can also be silently wrong. I prefer to know.

Compression and operational paths

HAProxy also handles gzip compression:

filter compression
compression algo gzip
compression type text/css text/html text/javascript application/javascript text/plain text/xml application/json application/activity+json

This keeps another common responsibility at the edge. The country instances can stay focused on snac and the forecast data, while HAProxy deals with client-facing compression for HTML, JSON, and ActivityPub responses.

There is also a local Prometheus exporter:

frontend prometheus
  bind 127.0.0.1:8405
  mode http
  http-request use-service prometheus-exporter
  no log

And I keep internal operational paths, such as statistics and Grafana, handled before the hostname map. These are small details, but ordering matters. Special paths should be explicit and early. The hostname map is for FediMeteo routing, not for every internal tool I happen to expose behind the same proxy.

What this changes in practice

The nice thing about this configuration is that none of its parts is particularly surprising.

The map keeps hostname routing manageable. The backend definitions keep each country isolated and limited. The static homepage avoids dynamic work for something that changes once per hour. The shared avatar gives HAProxy one very hot media object to serve directly. The media cache keeps public files away from snac. The JSON microcache absorbs short ActivityPub bursts. Header cleanup prevents useless variation. Connection reuse avoids unnecessary backend connection churn.

But all of this is only a longer way of saying one thing:

fewer requests reach snac.

That is the metric I care about here.

Not because snac is slow. If anything, FediMeteo exists in its current form because snac is efficient enough to make this kind of project possible on a very small VPS. But precisely because the whole architecture is small and pleasant, I do not want to waste resources where there is no need.

This is also consistent with the rest of the project. Forecasts are serialized by scripts. Updates happen every six hours. The homepage is regenerated hourly. Countries live in separate jails. Snapshots and backups are handled outside the application. No single component tries to be the entire system.

HAProxy is just another small piece, but it sits in the right place to remove a lot of repeated work.

Caveats

This configuration is not a universal HAProxy recipe for ActivityPub services.

It matches FediMeteo as it is now: almost no media, one shared avatar, static homepage, public forecasts, many small snac instances, and ActivityPub traffic that can benefit from a short public cache when there are no cookies or authorization headers.

If I decide one day to use media in forecasts, the media cache rules will need to be reviewed. If I use different avatars for each city or country, the cache will still work, but I will lose the very nice property of one shared, always-hot avatar. If ActivityPub responses become actor-dependent, public JSON caching must be reconsidered. If one country grows a very different traffic pattern from the others, it may deserve a different limit or policy.

This is why I do not like presenting configurations as magic. A good configuration is a written form of the assumptions behind a service. When the assumptions change, the configuration must change too.

Conclusion

FediMeteo started as a small idea and became larger than I expected, but I still want it to feel small in the right ways. Small does not mean fragile. Small means understandable. It means that each part has a reason to exist, and that unnecessary work is removed before it becomes a problem.

The HAProxy layer follows this idea. It terminates TLS, routes hostnames through a map, reuses backend connections, serves the shared avatar from cache, microcaches public ActivityPub JSON, avoids authenticated and cookie-based traffic, and gives me a small diagnostic header to see what is happening.

There is no single brilliant directive here. There is only the usual work of matching infrastructure to reality.

FediMeteo publishes weather forecasts as text and emoji. The homepage is static HTML updated every hour. The accounts share the same avatar because it is enough, and because it is better for the cache. Each country has its own snac instance in its own FreeBSD jail. HAProxy stands in front of them and tries, quietly, not to bother them unless it has to.

I like this kind of infrastructure.

Not because it is invisible, but because when it works well, it leaves very little to say.

Anti-robot techniques can be nice but the problem is, they're not static

By: cks

I've recently come up with what I expect would be a quite good anti-robot, anti-crawler tactic, which I will give the snappy label and summary of "robots don't POST". Simply require a HTTP cookie to see your web pages and then if visitors don't have the cookie, put up an interstitial page with a HTML form that requires them to POST it to get the cookie. All the form needs is a "click me to get your entrance cookie" button, because right now, few or no robots or crawlers will make that HTTP POST request; they only do HTTP GETs. To distract bad crawlers you might need some other links on the interstitial page, optionally going to content tarpits.

(If you're going to do this in practice you'll want to exempt syndication feed requests and perhaps requests from bingbot, Googlebot, and so on. Although maybe not Googlebot any more.)
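A minimal sketch of the idea as a Python WSGI application, under some loud assumptions: the cookie name, the '/enter' path and the page contents are all invented for the example, and the cookie value is a fixed '1', where a real deployment would want something a crawler can't simply guess and send. It also leaves out the exemptions for syndication feeds and well-behaved search crawlers.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "entrance"   # hypothetical cookie name
INTERSTITIAL = (
    b"<form method='post' action='/enter'>"
    b"<button>Click me to get your entrance cookie</button>"
    b"</form>"
)


def app(environ, start_response):
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "/")

    if method == "POST" and path == "/enter":
        # The visitor submitted the form: hand out the cookie and send them on.
        start_response("303 See Other", [
            ("Location", "/"),
            ("Set-Cookie", COOKIE_NAME + "=1; Path=/; HttpOnly"),
        ])
        return [b""]

    if COOKIE_NAME not in cookies:
        # No cookie yet: show the interstitial instead of the content.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [INTERSTITIAL]

    # Cookie present: serve the real page.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"the real page\n"]
```

For a quick local experiment it can be served with wsgiref.simple_server.make_server from the standard library. A plain GET without the cookie only ever sees the form.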

The obvious problem with this technique is that if people start doing it in any quantity, the "robots don't POST" thing won't last. Bad crawlers will start hitting POST endpoints for forms that just have a "click me" button, and then POST endpoints for forms that have an "I am human" tick box to mark or a field to fill in or whatever elaboration people come up with, and so on. Bad crawlers are in an arms race with websites and this is a problem.

Arms races require two active participants. An inactive participant in an arms race usually loses by default. In today's environment with aggressively bad crawlers, you can't simply set up a website and walk away from it, not if you want it to survive; you're forced to participate in the arms race. Your website may be static but your operation of your website increasingly can't be, not unless you want to wake up one day and discover that you don't have a website, you have a smoking hole in the ground and perhaps a big bandwidth bill from your hosting provider.

I don't have any answers to this. Instead, it feels like this whole situation is another obstacle in the way of people having their own low-attention websites (after the comment spammers made it impossible to have your own low-attention comment system). Someone has to pay attention, so that's either you or someone you outsource it to, and that someone is most likely going to need to be paid sooner or later.

(There are exceptions, but they're rare. Also, if you run your own website you sort of have to maintain the software involved, but automatic updates (and static websites) have mostly made that easier.)

Notes on respectfully getting a personal copy of a website's contents

By: cks

Suppose, hypothetically, that you want to have a personal copy of the content of some website that you feel is important (to you). There are perfectly good reasons to want such a copy; websites go away all the time on the Internet, and not everyone is online all of the time. It's generally possible to do this (and it's certainly possible to do this with Wandering Thoughts), but there's some things the hypothetical you is going to more or less need to do. These things will be work, but that's the difference between successfully getting a personal copy and turning a brute force crawler loose and then getting ratelimited and blocked. It's also the difference between being polite and being rude, and hopefully you care about that.

(With the increasing decay of Internet search engines, you might also want to build your own personal index of useful website content.)

First, you need to work out the URLs for the real content of the website. Many websites of interest have some mixture of real pages and various sorts of indexes and other aggregations of those real pages, and it's not uncommon for the index pages to outnumber the real pages, sometimes vastly. Your personal copy of the website contents doesn't need all of those index pages, you probably don't want them because they'll inflate the size of your copy, and the website itself will probably be unhappy that you're fetching a ton of redundant index pages.

(The amount of index pages varies with site design. Static sites are usually much friendlier than dynamic sites because it's more work to have a lot of index pages in a static site.)

If you're extremely lucky, the website will have an accurate, up to date (XML) sitemap and will put a tag mentioning this in the HTML <head> of its pages. If you're not so lucky you will have to manually look around to see if it has any particular index pages that you can mine for URLs (eg) and then work out what additional links and pages you need to also fetch to get what you consider a full copy (for example, to also get comments or 'talk' pages or the like, or to fetch images used in the web pages). In less friendly cases you'll have to go through a whole collection of category pages to accumulate the URLs.

(It's possible that the website supports paged syndication feeds and you can go back through its syndication feed to collect a full set of initial URLs, but I suspect that's not any more likely than a discoverable sitemap.)

Having accumulated your list of URLs, it's time to start fetching them, respectfully. Respectful fetching means doing two things: working slowly, and having an honest HTTP User-Agent. Working slowly means that getting a full copy will take a significant amount of time, but unless you think the website is going to go away tomorrow, you have that time. By 'slowly' I mean a request rate of one every 30 seconds or every minute, and if you get HTTP 429s or other indications of rate limits, you should slow down, even if you think this is absurdly slow. In my view, an honest HTTP User-Agent admits to what you're doing and optionally names the software you're using to do the fetching, because the web site operator cares much more about why these requests are happening than that you're using curl, wget, or whatever to make them.

(You especially shouldn't pretend to be a regular browser, or directly use a headless one. In these days of aggressive stealth crawlers, that makes you look extremely suspicious and may well get you blocked rapidly.)

Once you start fetching, you should monitor your fetching for problem indicators. Basically anything other than a HTTP 200 success may be a sign that either you have the wrong URLs or that you're in some way not welcome to do what you're doing. Continuing despite a spate of HTTP redirections or HTTP errors isn't particularly useful for your content copying project; you're only going to have to weed the results out of your copy.

(Also, continuing when a website is telling you 'no' is being rude. You're saying that your desires are more important than the website's views, and this generally makes you a certain sort of person.)
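As a sketch of what this could look like in practice, here is a small Python fetch loop using only the standard library. Everything named here is an assumption for illustration: the User-Agent string, the contact address, the 30 second default and the 'three problems and stop' threshold are placeholders to adapt. Note that urllib follows redirects on its own, so this loop only notices HTTP error statuses, not redirections.

```python
import time
import urllib.error
import urllib.request

# Hypothetical User-Agent: says what is happening and how to reach you.
USER_AGENT = "personal-copy-fetcher/0.1 (one-time personal copy; contact: you@example.org)"


def http_get(url):
    """Fetch one URL with an honest User-Agent; return (status, body)."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def fetch_politely(urls, delay=30, max_problems=3, get=http_get, sleep=time.sleep):
    """Fetch urls one at a time, slowly, and stop when the site pushes back."""
    pages, problems = {}, 0
    for url in urls:
        status, body = get(url)
        if status == 200:
            pages[url] = body
        else:
            problems += 1
            if status == 429:
                delay *= 2      # we were told to slow down, so slow down
            if problems >= max_problems:
                break           # too many non-200 answers: stop and find out why
        sleep(delay)
    return pages
```

The get and sleep parameters exist so the loop can be tried against canned responses before it is pointed at a real site. Stopping after a few problems is deliberate, since at that point a person should look at what went wrong rather than have the script carry on.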

What all of this will get you is a personal copy of the website's content, possibly in addition to a skeletal set of index pages that you can use to navigate through it (you collected these pages when you built the initial URL set). It won't get you a complete archive of the website in HTML form that you could stick up somewhere else. A full website archive is a different thing, one that websites may be much more hostile to depending (in part) on how much redundant content you will wind up crawling in order to assemble your 'complete' version.

(Even if what you want is a full archive of everything, including index pages, starting with the important content first gets you the important content if something goes wrong.)

PS: Wandering Thoughts has a sitemap, which I bashed together many years ago to make Google happy and then found it was convenient for testing because it gave me a list of all pages that I really cared about the HTML rendering of. Interested parties can access it by putting a '?sitemap' on any directory URL. It's not (currently) in the HTML <head> of any pages because when I set it up, that wasn't really a thing. Given the modern web environment, I'm not certain I'll ever make it visible in the HTML <head> because I'm not certain I want to hand every abusive crawler a nice obvious map to the juicy bits.

(I have no idea how long it's been since Google accessed the sitemap; I suspect it's been years. But then, I increasingly don't care about Googlebot, although that's another entry.)

Browsers, OCSP, and a view of the web in practice

By: cks

I recently read Geoff Huston's Revocation of X.509 certificates, which in part talks about OCSP's failure. One of the pragmatic reasons for OCSP being dead is that Chrome dropped support for it more than a decade ago. Specifically, Chrome's replacement for certificate revocation was for Chrome to have an internal set of revoked certificates. Recently, Firefox has adopted a similar approach (with a different technical implementation).

One of my views of this is that it shows browsers recognizing and accepting that if they want something, they have to do it themselves and they can't rely on the behavior of outside parties, especially the behavior of a lot of outside parties. Another way to put it is that browsers can change themselves to get something done but they often have a hard time getting other people to change.

OCSP had two groups of outside parties; Certificate Authorities for direct CA OCSP checks, and web servers for OCSP stapling, and in the end browsers clearly couldn't rely on either group. In my own experience, direct use of CA OCSP checks by Firefox failed so often because of problems with CA OCSP servers that turning it off was my first reaction any time I ran into a TLS problem (cf). When you think about it, browsers clearly couldn't count on other parties to run high volume, critical services with no economic model that were guaranteed to be both reliable and private.

(The kindest thing you can say about OCSP is that it was created in a long ago world where probably no one expected that HTTPS would become as prevalent and as critical as it has. In a world where HTTPS was only used when paying for your shopping cart and interacting with parts of your bank, both the volume and the privacy impacts of OCSP would be much, much lower.)

The answer to the problems with direct OCSP checks with Certificate Authorities was supposed to be OCSP Stapling. However, this had its own problem, which was that for it to really work, all (HTTPS) web servers had to upgrade. This was never really likely to happen, especially on a timely basis, and it probably became obvious fairly soon that it wasn't going to happen in practice (partly because it's hard, also).

So one way to view Chrome's decision to drop support for OCSP (and quite early) is as a recognition that they couldn't count on any other party to handle certificate revocation for them. If Chrome wanted certificate revocation to work, they had to own their own mechanism for it (even if that mechanism was only used to a limited extent for high priority revocations). Browsers building their own mechanism also meant that browsers could handle the situation where a Certificate Authority was slow to handle a revocation for one reason or another, since the revocation data doesn't have to come only from CAs.

(The browsers require Certificate Authorities to promptly handle revocations, but if a CA doesn't do it in practice, resolving this is generally a long process involving people arguing over things, not an immediate thing where browsers remove the Certificate Authority. Immediate removal is reserved for a crisis, such as the Certificate Authority being compromised entirely.)

PS: For similar reasons I think that browsers relying on DNSSEC for TLS security properties in modern web PKI is a non-starter, even beyond all of the other DNSSEC problems in practice.

Hiding the option to leave comments from some visitors to here

By: cks

In a comment on a recent entry, Verisimilitude noticed a feature that I quietly added to here not too long ago:

I've noticed the Add Comment button is now conditionally excluded; that's a neat trick.

I've long had precautions against comment spam and they've mostly worked. But not entirely, and so there have always been some network areas that I disallowed comments from even if they didn't run into those precautions. And if a (bad) network area was a sufficiently high source of automatically blocked comment spam attempts, I would add it to the list of blocked areas in case the software doing the comment spam got smart enough to get past my other precautions.

For a long time the only thing this blocked was direct access to the specific URLs used to write comments here (where the 'add comment' links point to). Then, recently, I realized that it made very little sense to give people and their software the link then block them when they used the link, and it would be better not to give them the link in the first place (as well as still blocking direct access). Among other things, I can hope that this stops software from crawling Wandering Thoughts to collect all the 'add comment' links that it will hit later through, for example, a proxy network.

Adding this feature was made easier because DWiki, the wiki software behind Wandering Thoughts, already had a permissions system for whether or not people could leave comments (and who could). As part of that permission system, DWiki had always done the obvious thing of not generating an 'Add Comment' link unless you had commenting permission. So all I had to do was extend the permissions check a little bit.

(The actual implementation has a collection of markers that can be set during processing of the request to influence what additional links are provided and not provided. For example, if you're a known robot, you don't see links to my syndication feeds because I don't allow known robots to request those. So I have a whole set of what is effectively middleware that scrutinizes the request and decides what should be allowed and not allowed, and then the final, low level dynamic page rendering looks at the result and includes or doesn't include various things.)

So if you visit Wandering Thoughts entries and they don't include an 'add comment' link, that's a sign that something about your request is making my anti-various-things precautions block comments (it might be your IP address or it might be something else).

The general idea strikes me as obvious in retrospect. If you're going to block direct use of something for some request source, you almost certainly want to not serve links to it either. And it's probably a better and less frustrating experience for any innocent bystanders caught up in a 'you can't comment' area. Previously it would have looked like they can comment, but any attempt would fail; now, they don't see the link at all so they can't get mysterious failures.

(Fortunately, DWiki always blocked all access to the 'add comment' link, even the initial one, so no one ever faced the really frustrating experience of writing a comment only to have posting it fail mysteriously.)

Apache 2.4, ETag values, and (HTTP) response compression

By: cks

One of the things that Apache and other web servers have been able to do for a long time is to compress responses when the requesting agent indicates that it supports this. Accepting compressed responses is so common that not doing so is potentially a bad sign, although a distressing number of syndication feed fetchers don't request (or accept) compressed responses. Apache is sophisticated enough that it can compress output on the fly and do it for unpredictable sources of dynamic content, such as CGIs and Django web applications (and requests it acts as a reverse proxy for, as far as I know).

Another thing that the web has is the ETag header. An ETag header is supposed to be a unique identifier for a specific version of a 'resource', ie a URL. The place I normally think of ETags being used is in conditional GETs, but it also has a lesser appreciated (by me) role in HTTP caching, and as I understand it, that creates a little problem.

An opportunistic cache is allowed to use the same ETag and If-None-Match headers for cache validation. When an ETag value is only used by the origin server for conditional GET, we generally would prefer that the ETag value not vary based on the compression. However, when an intermediate cache uses an ETag for validation, it's apparently more convenient if the ETag is specific to the compression. As a result, RFC 9110's specification for ETag specifically requires that the ETag vary based on the response compression, not just its contents.

In Apache 2.2, Apache ignored this requirement (at least by default). Especially, it ignored this requirement if you provided the ETag in dynamically generated content, such as CGI output. Apache 2.2 would give your ETag to everyone regardless of the compression it did, and then everyone would make the same If-None-Match query to you and you'd be happy because the ETag you (re)generated was matching their If-None-Match and so lots of people were making, for example, conditional syndication feed fetches.

In Apache 2.4, Apache apparently decided that this was no good and it needed to do better. In ETag values you provide (and at least sometimes ETag values it generates), Apache 2.4 sticks on a suffix, such as '-gzip', to make them unique to the Apache-chosen compression. People who receive these altered ETag values then dutifully copy them into their If-None-Match header, which Apache 2.4 passes back unaltered to your CGI or other web application, and then if you're unaware of this you will compare their modified value to your unmodified value and conclude that almost no one is making valid conditional requests any more (for some reason, starting when you upgraded to Apache 2.4).

This behavior is actually something you can change if you want to, through the mod_deflate directive DeflateAlterEtag and the mod_brotli directive BrotliAlterEtag. But it's more correct and probably better in the long run to adjust your CGI or web application code to deal with these altered If-None-Match values, although it would be nice if Apache did it for you somehow. Since I looked it up in relatively current Apache 2.4 source code, the two ETag suffixes you're likely to see in the wild are '-gzip' and '-br'.
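One way to adjust application code for this, sketched in Python. This is an illustration under assumptions, not what DWiki or any particular application does: the function names are invented, it only knows the two suffixes mentioned above, it ignores the weak 'W/' prefix on both sides, and it splits If-None-Match on commas, which is naive if your ETag values can themselves contain commas.

```python
COMPRESSION_SUFFIXES = ("-gzip", "-br")


def normalise(etag):
    """Reduce an ETag to a comparable form: drop a leading W/ and any
    compression suffix added just inside the closing quote."""
    etag = etag.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    if len(etag) >= 2 and etag[0] == '"' and etag[-1] == '"':
        value = etag[1:-1]
        for suffix in COMPRESSION_SUFFIXES:
            if value.endswith(suffix):
                value = value[:-len(suffix)]
                break
        etag = '"' + value + '"'
    return etag


def if_none_match_matches(header, our_etag):
    """True if any ETag listed in If-None-Match corresponds to ours."""
    if header.strip() == "*":
        return True
    ours = normalise(our_etag)
    return any(normalise(part) == ours for part in header.split(","))


# A client echoing back the altered value still matches our unaltered one.
print(if_none_match_matches('"v1-gzip"', '"v1"'))   # True
print(if_none_match_matches('"v2-gzip"', '"v1"'))   # False
```

If your own ETag values can legitimately end in '-gzip' or '-br', this approach needs more care, since both sides get the suffix stripped.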

Web server ratelimits are a precaution to let me stop worrying

By: cks

These days, Wandering Thoughts has some hacked together HTTP request rate limits. They don't exist for strong technical reasons; my blog engine setup here can generally stand up to even fairly extreme traffic floods (through an extensive series of hacks). It's definitely possible to overwhelm Wandering Thoughts with a high enough request volume, and HTTP rate limits will certainly help with that, but that's not really why they exist. My HTTP rate limits exist for ultimately social reasons and because they let me stop worrying and stop caring about certain sorts of abuse.

As we all know by now here in 2026, abuse definitely happens, even if it isn't killing your web servers. There are things out there who think nothing of making thousands or tens of thousands of requests to your web server a day. Some of them are people running crawlers and other undesired things, and some of them are syndication feed fetchers with very fast polling intervals (which is why the first ratelimits I implemented were syndication feed rate limits). Usually the level of excess requests is moderate. Large abuse doesn't happen very often on typical sites like mine, but it does happen every so often.

The advantage of having HTTP request rate limiting, even in the fallible form I have on Wandering Thoughts, is that I don't have to worry or really care about it. I'll never wake up in the morning to discover that something has made tens of thousands of requests overnight, because these days all but the first few of those requests will have been choked off. I also don't have to be annoyed by badly behaved syndication feed readers and consider various things to maybe get them to behave better, because all that sort of excessive, antisocial behavior gets blocked now.

(I have had the experience of discovering thousands of requests from a single source in the past and not particularly enjoyed it, even if nothing noticed in terms of load and response time and so on.)

For me, HTTP ratelimits have become something that give me peace of mind. I don't expect them to trigger very often (and generally they don't), but despite their infrequent activation I find them valuable and reassuring. They're a precaution against something that I hope is infrequent or, ideally, nonexistent.

(The corollary of this is that I don't regret the programming effort to add them to DWiki, the engine behind Wandering Thoughts, or even how moderately messy and hacked in the change is. For some changes you do care how often they get used and feel annoyed if they aren't used as often as you expected, but for me ratelimits aren't one of those.)

Making empirical decisions about web access (here in 2026)

By: cks

Recently, Denis Warburton wrote in a comment on my entry on how HTTP results today depend on what HTTP User-Agent you use:

Making decisions based on user-provided information is unwise in 2026. The originating ip address is the only source of "truth" ... and even then, that information needs to be further examined before discerning whether or not it is a valid piece of communication.

It's absolutely true that everything except the source IP address is under the control of an attacker (and it always has been), and in one sense you can't trust it. But this doesn't mean you can't use information that's under the attacker's control in making decisions about whether to allow access to something; instead, it means that you have to be thoughtful about how you use the information and what for.

In practice, web agents emit a lot of data in their HTTP headers and requests. Some of these signals are complicated, such as browser version numbers, and some of them require work to use, but this doesn't mean that there's no signal at all that can be derived from all of the data that a web agent emits. For example, consider a web agent that uses the HTTP User-Agent of:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This web agent is telling you that it's claiming to be Googlebot. Under the right circumstances this can be a valuable signal of malfeasance and worth denying access.

Similarly, a web agent that emits user agent hints while its HTTP User-Agent is claiming to be an authentic version of Firefox 147 is giving you the signal that it's not an unaltered, standard version of Firefox, because standard versions of Firefox 147 don't do that. It's most likely something built on Chromium, but in any case you might decide that this signal means it is suspicious enough to be denied access. Neither the User-Agent nor the Sec-CH-UA headers create true facts to definitively identify the browser and both could be faked by the attacker, but the inconsistency is real.
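As a minimal sketch of using this kind of inconsistency as a signal (this is an illustration, not the code behind Wandering Thoughts), the check below flags a request whose User-Agent claims Firefox while it also sends any Sec-CH-UA header. The function name and the exact pattern used to recognize a Firefox User-Agent are assumptions made for the example.

```python
import re

def ua_hint_inconsistency(headers):
    """Return a reason string if the User-Agent and the user agent hint
    headers disagree, or None if nothing inconsistent was seen."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")
    has_hints = any(name.startswith("sec-ch-ua") for name in h)
    # Assumption: a genuine Firefox has "Firefox/<version>" in its
    # User-Agent and sends no Sec-CH-UA headers at all.
    claims_firefox = re.search(r"\bFirefox/\d+", ua) is not None
    if claims_firefox and has_hints:
        return "claims Firefox but sends Sec-CH-UA headers"
    return None
```

What you do with a non-None result (block, challenge, or just log) is a separate policy decision that depends on what your legitimate visitors look like.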

What an attacker tells you (deliberately or accidentally) is a signal, and it's up to you to interpret and use that signal (which I think you should these days). This is an empirical thing, something that depends on the surrounding environment (for example, you have to interpret the attacker's signal in terms of its difference from the signals of legitimate visitors), what you're doing, and what you care about, but then security is always ultimately people, not math, even though tech loves to avoid this sort of empiricism (which is a bad thing).

As a pragmatic thing, it's usually easier to use attacker signals if you allow things by default rather than deny them by default. If you allow by default, your primary concern is false positives (legitimate visitors who are emitting signals you find too suspicious), rather than false negatives, because an attacker that wants to work hard enough can always obtain access. Conveniently, public web sites (such as Wandering Thoughts) are exactly such an allow by default environment, which is why these days I use a lot of signals here when deciding what to accept or block (including IP addresses and networks).

(If you need a deny by default environment with real security, you need to use something that attackers can't fake. IP addresses can be one option in the right circumstances, but they aren't the only one.)

On today's web, HTTP results depend on the HTTP User-Agent you use

By: cks

Back in the old days, search engines mostly crawled your sites with their regular, clearly identifying HTTP User-Agent headers, but once in a while they would switch up to fetching with a browser's User-Agent. What they were trying to detect was if you served one set of content to "Googlebot" but another set of content to "Firefox", and if you did they tended to penalize you; you were supposed to serve the same content to both, not SEO-bait to Googlebot and wall to wall ads to browsers. Googlebot identified itself as a standard courtesy, not so you could handle it differently.

Obviously those days are long over. It's now routine and fully accepted to serve different things to Googlebot and to regular browsers. Generally websites offer Googlebot more access and plain text, and browsers less access (even paywalls) and JavaScript encrusted content (leading to people setting their User-Agent to Googlebot to bypass paywalls). Since people give Googlebot special access, people impersonate it and other well accepted crawlers and other people (like me) block that impersonation.

This is part of an increasingly common general pattern, which is that different HTTP User-Agents get different results for the same URL. Especially, some HTTP User-Agents will get errors, HTTP redirections, or challenge pages, and other User-Agents won't; instead they'll get the real content. What this means in concrete terms is it's increasingly bad to take the results from one HTTP User-Agent and assume they apply for another. This isn't just me and Wandering Thoughts; for example, if a site has a standard configuration of Anubis, having a User-Agent that includes 'Mozilla' will cause you to get a challenge page instead of the actual page (cf).

(One of the amusing effects of this is what it does to 'link previews', which require the website displaying the preview to fetch a copy of the URL from the original site. On the Fediverse, fairly often the link preview I see is just some sort of a challenge page.)

In practice, you're probably reasonably safe if you're doing close variations of what's fundamentally the same distinctive User-Agent. But you're living dangerously if you try this with browser-like User-Agent values, either two different ones or a browser-like User-Agent and a distinctive non-browser one, because those are the ones that are most frequently forged and abused by covert web crawlers and other malware. Everyone who wants to look normal is imitating a browser, which means looking like a browser is a bad idea today.

Unfortunately, however bad an idea it is, people seem to keep trying fetches with multiple User-Agent header values and then taking a result from one User-Agent and using it in the context of another. Especially, feed reader companies seem to do it, first Feedly and now Inoreader.

If there are URLs in your HTTP User-Agent, they should exist and work

By: cks

One of the things people put in their HTTP User-Agent header for non-browser software is a URL for their software, project, or whatever (I'm all for this). This is a good thing, because it allows people operating web servers to check out who and what you are and decide for themselves if they're going to allow it. Increasingly (and partly for social reasons), I block many 'generic' User-Agent values that come to my attention, for example through their volume.

(I don't block all of them, but if your User-Agent shows up and I can't figure out what it is and whether or not it's legitimate and used by real people, that's probably a block.)

However, there's an important and obvious thing about any URLs in your HTTP User-Agent, which is that they should actually work. The domain or host should exist, the URL should exist in the web server, and the URL's contents should actually explain the software, project, or organization involved. Plus, if you use a HTTPS website, the TLS certificate should be valid.

(A related thing is a generic URL that doesn't give me anything to go on. For example, the URL is your account on a code forge, and either it's not obvious which one of your repositories is doing things or you don't have any public repositories.)

For me, a non-working URL is much more suspicious than a missing URL. HTTP User-Agents without URLs are reasonably common (especially in feed readers), so I don't find them immediately suspicious. Non-working URLs in mysterious User-Agents certainly look like you're attempting to distract me with the appearance of a proper web agent but without the reality of it. If a User-Agent with such a non-working URL comes to my attention, I'm very likely to block it in some way (unless it's very clear that it's a legitimate program used by real people, and it merely has bad habits with its User-Agent).

You would think that people wouldn't make this sort of mistake, but I regret to say that I've seen it repeatedly, in all of the variations. One interesting version I've seen is User-Agent strings with the various 'example.<TLD>' domains in their URLs. I suspect that this comes from software that has some sort of 'operator URL' setting and provides a default value if you don't set one explicitly. I've also seen .lan and .local URLs in User-Agents, which takes somewhat more creativity.

As usual, my view is that software shouldn't provide this sort of default value; instead, it should refuse to work until you configure your own value. However, this makes it slightly more annoying to use, so it will be less popular than more accommodating software. Of course, we can change that calculation by blocking everything that mentions 'example.com', 'example.org', 'example.net' and so on in its User-Agent.
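A rough sketch of what such a check might look like is below. It only looks for placeholder and local-only host names in User-Agent URLs; it doesn't test whether a URL actually works. The function name, the URL-matching pattern, and the exact list of suffixes are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: flag 'example.<TLD>' hosts (and their subdomains) plus the
# .lan and .local names mentioned in the entry.
PLACEHOLDER_RE = re.compile(r"(^|\.)example\.[a-z]+$")
LOCAL_SUFFIXES = (".lan", ".local")

def placeholder_urls(user_agent):
    """Return the URLs in a User-Agent whose host is a placeholder or
    local-only name."""
    bad = []
    # Simplified URL matching: stop at whitespace and common delimiters.
    for url in re.findall(r"https?://[^\s;,()<>\"']+", user_agent):
        host = (urlsplit(url).hostname or "").lower()
        if PLACEHOLDER_RE.search(host) or host.endswith(LOCAL_SUFFIXES):
            bad.append(url)
    return bad
```

A User-Agent for which this returns a non-empty list is a candidate for blocking; an empty list says nothing about whether the URLs it does contain are real.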

The importance of limiting syndication feed requests in some way

By: cks

People sometimes wonder why I care so much about HTTP conditional GETs and rate limiting for syndication feed fetchers. There are multiple reasons, including social reasons to establish norms, but one obvious one is transfer volumes. To illustrate that, I'll look at the statistics for yesterday for feed fetches of the main syndication feed for Wandering Thoughts.

Yesterday there were 7492 feed requests that got HTTP 200 responses, 9419 feed requests that got HTTP 304 Not Modified responses, and 11941 requests that received HTTP 429 responses. The HTTP 200 responses amounted to about 1.26 GBytes, with the average response size being 176 KBytes. This average response size is actually a composite; typical compressed syndication feed responses are on the order of 160 KBytes, while uncompressed ones are on the order of 540 KBytes (but there look to have been only 313 of them, which is fortunate; even still they're 12% of the transfer volume).

If feed readers didn't do any conditional GETs and I didn't have any rate limiting (and all of the requests that got HTTP 429s would still have been made), the additional feed requests would have amounted to about another 3.5 GBytes of responses sent out to people. Obviously feed readers did do conditional GETs, and 56% of their non rate limited requests were successful conditional GETs. A HTTP 200 response ratio of 44% is probably too pessimistic once we include rate limited requests, so as an extreme approximation we'll guess that 33% of the rate limited requests would have received HTTP 200 responses with a changed feed; that would amount to another 677 MBytes of response traffic (which is less than I expected). If we use the 44% HTTP 200 ratio, it's still only 903 MBytes more.
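The arithmetic behind these figures can be reproduced approximately from the request counts and the 176 KByte average response size. The sketch below assumes binary units (1 GByte = 1024 * 1024 KBytes), which is a guess about how the original numbers were computed; with that assumption it lands close to the 1.26 GBytes, 677 MBytes and 903 MBytes figures, and at roughly 3.5 to 3.6 GBytes for the no-conditional-GET, no-rate-limit case.

```python
ok_200 = 7492        # feed requests that got HTTP 200
not_modified = 9419  # feed requests that got HTTP 304
limited = 11941      # feed requests that got HTTP 429
avg_kbytes = 176     # average HTTP 200 response size, in KBytes

def gbytes(requests):
    # Assumption: binary units, 1 GByte = 1024 * 1024 KBytes.
    return requests * avg_kbytes / (1024 * 1024)

def mbytes(requests):
    return requests * avg_kbytes / 1024

served = gbytes(ok_200)                           # what was actually sent
worst_case_extra = gbytes(not_modified + limited) # every 304 and 429 as a 200
ratio_200 = ok_200 / (ok_200 + not_modified)      # share of non rate limited requests
extra_at_33 = mbytes(limited * 0.33)              # 33% of the 429s become 200s
extra_at_44 = mbytes(limited * 0.44)              # 44% of the 429s become 200s

print(f"served: {served:.2f} GBytes")
print(f"worst case extra: {worst_case_extra:.2f} GBytes")
print(f"HTTP 200 ratio: {ratio_200:.0%}")
print(f"extra at 33%: {extra_at_33:.0f} MBytes, at 44%: {extra_at_44:.0f} MBytes")
```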

(This 44% rate may sound high but my syndication feed changes any time someone leaves a comment on a recent entry, because the syndication feed of entries includes a comment count for every entry.)

Another statistic is that 41% of syndication feed requests yesterday got HTTP 429 responses. The most prolific single IP address received 950 HTTP 429s, which maps to an average request interval of less than two minutes between requests. Another prolific source made 779 requests, which again amounts to an interval of just less than two minutes. There are over 20 single IPs that received more than 96 HTTP 429 responses (which corresponds to an average interval of 15 minutes). There is a lot of syndication feed fetching software out there that is fetching quite frequently.
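These intervals are simply the number of minutes in a day divided by the request count, which assumes the requests were spread evenly over the full 24 hours. A small sketch (the function name is made up for the example):

```python
MINUTES_PER_DAY = 24 * 60

def average_interval_minutes(requests_per_day):
    """Average gap between requests, assuming they are spread evenly
    across the whole day."""
    return MINUTES_PER_DAY / requests_per_day

# Share of all feed requests that were rate limited (200s + 304s + 429s).
limited_share = 11941 / (7492 + 9419 + 11941)

for count in (950, 779, 96):
    print(count, round(average_interval_minutes(count), 2))
print(f"{limited_share:.0%}")
```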

(Trying to figure out how many HTTP 429 sources did conditional requests is too complex with my current logs, since I don't directly record that information.)

You can avoid the server performance impact of lots of feed fetching by arranging to serve syndication feeds from static files instead of a dynamic system (and then you can limit how frequently you update those files, effectively forcing a maximum number of HTTP 200 fetches per time interval on anything that does conditional GETs). You can't avoid the bandwidth effects, and serving from static files generally leaves you with only modest tools for rate limiting.

PS: The syndication feeds for Wandering Thoughts are so big because I've opted to default to 100 entries in them, but I maintain you should be able to do this sort of thing without having your bandwidth explode.

Sometimes giving syndication feed readers good errors is a mistake

By: cks

Yesterday I wrote about the problem of giving feed readers error messages that people will actually see, because you can't just give them HTML text; in practice you have to wrap your HTML text up in a stub, single-entry syndication feed (and then serve it with a HTTP 200 success code). In many situations you're going to want to do this by replying to the initial feed request with a HTTP 302 temporary redirection that winds up on your stub syndication feed (instead of, say, a general HTML page explaining things, such as "this resource is out of service but you might want to look at ...").

Yesterday I put this into effect for certain sorts of problems, including claimed HTTP User-Agents that are for old browsers. Then several people reported that this had caused Feedly to start presenting my feed as the special 'your feed reader is (claiming to be) a too-old browser' single entry feed. The apparent direct cause of this is that Feedly made some syndication feed requests with HTTP User-Agent headers of old versions of Chrome and Firefox, which wound up getting a series of HTTP 302 temporary redirections to my new 'your feed reader is a too-old browser' stub feed. Feedly then decided to switch its main feed fetcher over to directly using this new URL for various feeds, despite the HTTP redirections being temporary (and not served for its main feed fetcher, which uses "Feedly/1.0" for its User-Agent).

Feedly has been making these fake browser User-Agent syndication feed fetch attempts for some time, and for some time they've been getting HTTP 302 redirections. However, up until late yesterday, what Feedly wound up on was a regular HTML web page. I have to assume that since this wasn't a valid syndication feed, Feedly ignored it. Only when I did the right thing to give syndication feed readers a good, useful error result did Feedly receive a valid syndication feed and go over the cliff.

Providing a stub syndication feed to communicate errors and problems to syndication feed fetchers is clearly the technically correct answer. However, I'm now somewhat less convinced that it's the most useful answer in practice. In practice, plenty of syndication feed fetchers keep fetching and re-fetching these stub feeds from me, suggesting that people either aren't seeing them or aren't doing anything about it. And now I've seen a feed reader malfunction spectacularly and in a harmful way because I gave it a valid syndication feed result at the end of a temporary HTTP redirection.
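For contrast, here is a sketch of the behavior one would usually expect from a feed fetcher: only permanent redirections change the URL it remembers for a feed, and a temporary redirection is followed for the current fetch but never stored. This is an illustration of that general rule, not a description of how Feedly or any other fetcher is implemented; the function name and the shape of the redirect chain are assumptions.

```python
PERMANENT_REDIRECTS = {301, 308}

def url_to_remember(stored_url, redirect_chain):
    """Decide which URL to store for future fetches.

    redirect_chain is the list of (status, location) hops seen while
    fetching stored_url, in order. Only an unbroken run of permanent
    redirections from the start updates the stored URL; anything after
    the first temporary redirection is ignored for storage purposes.
    """
    url = stored_url
    for status, location in redirect_chain:
        if status not in PERMANENT_REDIRECTS:
            break
        url = location
    return url
```

With this rule, a fetch that hits a HTTP 302 to a stub feed would display the stub entry but keep asking for the original feed URL next time.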

(I will probably stick to the current situation, partly because I no longer feel like accepting bad behavior from web agents.)

PS: If you're a feed fetching system, please give your feeds IDs that you put in the User-Agent, so that when they all wind up shifted to the same URL through some misfortune, the website involved can sort them out and redirect them back to the proper URLs.

The problem of delivering errors to syndication feed readers

By: cks

Suppose, not hypothetically, that there are some feed readers (or at least things fetching your syndication feeds) that are misbehaving or blocked for one reason or another. You could just serve these feed readers HTTP 403 errors and stop there, but you'd like to be more friendly. For regular web browsers, you can either serve a custom HTTP error page that explains the situation or answer with a HTTP 302 temporary redirection to a regular HTML page with the explanation. Often the HTTP 302 redirection will be easier because you can use various regular means to create the HTML pages (and even host them elsewhere if you want). Unfortunately, this probably leaves syndication feed readers out in the cold.

(This can also come up if, for example, you decommission a syndication feed but want to let people know more about the situation than a simple HTTP 404 would give them.)

As far as I know, most syndication feed readers expect that the reply to their HTTP feed fetching request is in some syndication feed format (Atom, RSS, etc), which they will parse, process, and display to the person involved. If they get a reply in a different format, such as text/html, this is an error and it won't be shown to the person. Possibly the HTML <title> element will make it through, or the HTTP status code response for an error, or maybe both. But your carefully written HTML error page is unlikely to be seen.

(Since syndication feed readers need to be able to display HTML in general, they could do something to show people at least the basic HTML text they got back. But I don't think this is very common.)

As a practical thing, if you want people using blocked syndication feed readers to have a chance to see your explanation, you need to reply with a syndication feed with an entry that is your (HTML) message to them (either directly or through HTTP 302 redirections). Creating this stub feed and properly serving it to appropriate visitors may be anywhere from annoying to challenging. Also, you can't reply with HTTP error statuses (and the feed) even though that's arguably the right thing to do. If you want syndication feed readers to process your stub feed, you need to provide it as part of a HTTP 200 reply.

(Speaking from personal experience I can say that hand-writing stub Atom syndication feeds is a pain, and it will drive you to put very little HTML in the result. Which is okay, you can make it mostly a link to your regular HTML page about whatever issue it is.)
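If hand-writing the Atom is the painful part, generating it is one way out. The following is a minimal sketch using Python's standard library, assuming a single-entry Atom feed is what you want; the function name, its parameters, and the placeholder author name are invented for the example, and the result still has to be served with a HTTP 200 and an appropriate feed content type.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"

def stub_atom_feed(feed_id, title, message_html, info_url, updated=None):
    """Build a single-entry Atom feed whose entry carries an HTML message."""
    updated = (updated or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = updated.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Serialize Atom elements in the default namespace (no prefix).
    ET.register_namespace("", ATOM_NS)

    def add(parent, tag, text=None, **attrs):
        el = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}", attrs)
        el.text = text
        return el

    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    add(feed, "id", feed_id)
    add(feed, "title", title)
    add(feed, "updated", stamp)
    author = add(feed, "author")
    add(author, "name", "Site operator")  # placeholder author name
    add(feed, "link", href=info_url, rel="alternate")

    entry = add(feed, "entry")
    add(entry, "id", feed_id + "#notice")
    add(entry, "title", title)
    add(entry, "updated", stamp)
    add(entry, "link", href=info_url)
    # type="html" means the content is escaped HTML; ElementTree does
    # the escaping when it serializes the text.
    add(entry, "content", message_html, type="html")
    return ET.tostring(feed, encoding="unicode")
```

Keeping the message to a short paragraph plus a link to the regular HTML explanation page means the feed text rarely needs to change.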

If you're writing a syndication feed reader, I urge you to optionally display the HTML of any HTTP error response or regular HTML page that you receive. If I was writing some sort of blog system today, I would make it possible to automatically generate a syndication feed version of any special error page the software could serve to people (probably through some magic HTTP redirection). That way people can write each explanation only once and have it work in both contexts.

A surprising path to accessing localhost URLs and HTTP services

By: cks

One of the classic challenges in web security is DNS rebinding. The simple version is that you put some web service on localhost in order to keep outside people from accessing it, and then some joker out in the world makes 'evil.example.org' resolve to 127.0.0.1 and arranges to get you to make requests to it. Sometimes this is through JavaScript in a browser, and sometimes this is by getting you to fetch things from URLs they supply (because you're running a service that fetches and processes things from external URLs, for example).

One way people defend against this is by screening out 127.0.0.0/8, IPv6's ::1, and other dangerous areas of IP address space from DNS results (either in the DNS resolver or in your own code). And you can also block URLs with these as explicit IP addresses, or 'localhost' or the like. Sometimes you might add extra security restrictions to a process or an environment through means like Linux eBPF to screen out which IP addresses you're allowed to connect to (cf, and I don't know whether systemd's restrictions would block this).

As I discovered the other day, if you connect to INADDR_ANY, you connect to localhost (which any number of people already knew). Then in a comment Kevin Lyda reminded me that INADDR_ANY is also known as 0.0.0.0, and '0' is often accepted as a name that will turn into it, resulting in 'ssh 0' working and also (in some browsers) 'http://0:<port>/'. The IPv6 version of INADDR_ANY is also an all-zero address, and '::0' and '::' are both accepted as names for it, and then of course it's easy to create DNS records that resolve to either the IPv4 or IPv6 versions. As I said on the Fediverse:

Surprise: blocking DNS rebinding to localhost requires screening out more than 127/8 and ::1 answers. This is my face.

It turns out that this came up in mid 2024 in the browser context, as '0.0.0.0 Day' (cf). Modern versions of Chrome and Safari apparently explicitly block requests to 0.0.0.0 (and presumably also the IPv6 version), while Firefox will still accept it. And of course your URL-fetching libraries will almost certainly also accept it, especially through DNS lookups of ordinary looking but attacker controlled hostnames.
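In code that screens addresses before connecting, the fix is to reject the unspecified address as well as loopback. A minimal sketch using Python's standard ipaddress module follows; the function name is made up, and it only covers the cases discussed here, not the other 'dangerous areas of IP address space' a real deployment would probably also want to exclude. It needs to be applied to the addresses that DNS resolution actually returns, not just to the host name in the URL.

```python
import ipaddress

def is_local_target(addr):
    """True if addr is a loopback address or the unspecified
    (INADDR_ANY style) address, in either IPv4 or IPv6 form."""
    ip = ipaddress.ip_address(addr)
    # IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1 wrap an IPv4
    # address; check the wrapped address instead. IPv4 address objects
    # have no ipv4_mapped attribute, hence the getattr default.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return ip.is_loopback or ip.is_unspecified

for candidate in ("127.0.0.1", "0.0.0.0", "::1", "::", "::0", "192.0.2.10"):
    print(candidate, is_local_target(candidate))
```

Note that a bare '0' is not accepted by ipaddress.ip_address(); names like that are turned into 0.0.0.0 elsewhere, which is one more reason to check the resolved addresses.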

In my view, it's not particularly anyone's fault that this slipped through the cracks, both in browsers and in tools that handle fetching content from potentially hostile URLs. The reality of life is that how IP behaves in practice is complicated and some of it is historical practice that's been carried forward and isn't necessarily obvious or well known (and certainly isn't standardized). Then URLs build on top of this somewhat rickety foundation and surprises happen.

(This is related to the issue of browsers being willing to talk to 'local' IPs, which Chrome once attempted to start blocking (and I believe that shipped, but I don't use Chrome any more so I don't know what the current state is).)

Single sign on systems versus X.509 certificates for the web

By: cks

Modern single sign on specifications such as OIDC and SAML and systems built on top of them are fairly complex things with a lot of moving parts. It's possible to have a somewhat simple surface appearance for using them in web servers, but the actual behind the scenes implementation is typically complicated, and of course you need an identity provider server and its supporting environment as well (which can get complicated). One reaction to this is to suggest using X.509 certificates to authenticate people (as a recent comment on this entry did).

There are a variety of technical considerations here, like to what extent browsers (and other software) might support personal X.509 certificates and make them easy to use, but to my mind there's also an overriding broad consideration that makes the two significantly different. Namely, people can remember passwords but they have to store X.509 certificates. OIDC and SAML may pass around tokens and programs dealing with them may store tokens, but the root of everything is in passwords, and you can recover all the tokens from there. This is not true with X.509 certificates; the certificate is the thing.

(There are also challenges around issuing, managing, checking, and revoking personal X.509 certificates, but let's ignore them.)

To make using X.509 certificates practical for authenticating people, people have to be able to use them on multiple devices and move them between browsers. Many people have multiple devices and people do change what browsers they use (for all that browser and platform vendors like them not to, or at least the ones that are currently popular are often all for that). Today, there is basically nothing that helps people deal with this, and as a result X.509 certificates are at best awkward for people to use (and remember, security is people).

(In common use, it's easy to move passwords between browsers and devices because they're in your head (excluding password managers, which are still not used by a lot of people).)

Of course you could develop standards and software for moving and managing X.509 certificates. In many ways, passkeys show what's possible here, and also show many of the hazards of using things for authentication that can't be memorized (or copied) by people in order to transport them between environments. However, no such standards and software exist today, and no one has ever shown much interest in developing them, even back in the days when personal X.509 certificates were close to your only game in town.

(You could also develop much better browser UIs for dealing with personal X.509 certificates, something that was extremely under-developed back in the days when they were sometimes in use. Even importing such a certificate into your browser could be awkward, never mind using it.)

In the past, people have authenticated to web applications through the use of personal X.509 certificates (as a more secure form of passwords). As far as I know, pretty much everyone has given up on that and moved to better options, first passwords (sometimes plus some form of additional confirmation) and then these days trying to get people to use passkeys. One reason they gave up was that actually using X.509 certificates in practice was awkward and something that people found quite annoying.

(I had to use a personal X.509 certificate for a while in order to get free TLS certificates for our servers. It wasn't a particularly great experience and I'm not in the least bit surprised that everyone ditched it for single sign on systems.)

PS: It's no good saying that X.509 certificates would be great if all of the required technology was magically developed, because that's not going to just happen. If you want personal X.509 certificates to be a thing, you have a great deal of work ahead of you and there is no guarantee you'll be successful. No one else is going to do that work for you.

PPS: You can imagine a system where people use their passwords and other multi-factor authentication to issue themselves new personal X.509 certificates signed by your local Certificate Authority, so they can recover from losing the X.509 certificate blob (or get a new certificate for a new device). Congratulations, you have just re-invented a manual version of OIDC tokens (also, it's worse in various ways).

What 24 hours of traffic looks like to our main web server in January 2026

By: cks

One of the services we operate for the department is a traditional Apache-based shared web server, with things like people's home pages (eg), pages for various groups, and so on (we call this our departmental web server). This web server has been there for a very long time and its URLs have spread everywhere, and in the process it's become quite popular for some things. These days there are a lot of things crawling everything in sight, and our server has no general defenses against them (we don't even have much of a robots.txt).

(Technically our perimeter firewall has basic HTTP and HTTPS brute-force connection rate limits, but people typically have to really work to trigger them and they mostly don't. That said, now that I look at yesterday, more IPs wound up listed than I expected, although listings normally last at most five minutes.)

The first, very noticeable thing that we have is people who do very slow downloads from us. Our server rolls over the logs at midnight, but Apache only writes a log record when a HTTP request completes, possibly to the old log file. Yesterday (Tuesday), the last log record was written at 05:24, for a request that started at 22:44. Over the 24 hours that requests were initiated in, we saw 1.2 million requests.

The two most active User-Agents were (in somewhat rounded numbers):

426000 "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.0 Mobile/15E148 Safari/604.1"
424000 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0 Safari/537.36"

The most active thing that was willing to admit it wasn't a human with a browser was "ChatGPT-User", with just under 20,000 requests. After that came "GoogleOther" and "Amazonbot", at about 12,000 requests each, then "Googlebot" with 10,000 and bingbot with about 6,000. Of course, some of those could be people impersonating the real Googlebot and bingbot.

To my surprise, the most popular HTTP result code by far was HTTP 301 Moved Permanently, at 844,000 responses (HTTP 200s were 347,000, everything else was small by comparison). And most of the requests by those two most active User-Agents got HTTP 301 responses (roughly 418,000 each). I don't know what's going on there, but someone seems to have latched on to a lot of URLs that require redirects (which include things like directory URLs without the '/' on the end). On the positive side, most of those requests will have been pretty cheap for Apache to handle.
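Counts like these can be pulled out of an access log with a few lines of code. The sketch below assumes Apache's 'combined' log format (the entry doesn't say which format this server uses) and simplifies by ignoring escaped quotes inside quoted fields; the sample log lines are made up for the example.

```python
import re
from collections import Counter

# Matches the tail of a combined-format line:
#   "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

def tally(lines):
    """Count HTTP status codes and User-Agents over log lines,
    skipping any line that doesn't parse."""
    statuses, agents = Counter(), Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m is None:
            continue
        statuses[m["status"]] += 1
        agents[m["agent"]] += 1
    return statuses, agents

# Hypothetical sample lines, not real log data.
sample = [
    '192.0.2.1 - - [13/Jan/2026:22:44:01 -0500] "GET /~user HTTP/1.1" 301 240 "-" "ExampleBrowser/1.0"',
    '192.0.2.2 - - [13/Jan/2026:22:44:02 -0500] "GET /~user/ HTTP/1.1" 200 5120 "-" "ExampleBot/2.0"',
    '192.0.2.1 - - [13/Jan/2026:22:44:03 -0500] "GET /group HTTP/1.1" 301 238 "-" "ExampleBrowser/1.0"',
]
statuses, agents = tally(sample)
print(statuses.most_common())
print(agents.most_common(2))
```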

A single DigitalOcean IP claiming to be running Chrome 61 on 'Windows NT 10.0' made 11,000 requests, most of which got HTTP 404 errors because it was requesting URLs like '/wp-login.php'. There's no point complaining to hosting providers about this sort of thing, it's just background noise. No other single IP stood out to that degree (well, our monitoring system made over 10,000 requests, but that's expected). Google mostly crawled from a few IPs, with large counts, but other crawlers were more spread out.

To find out more traffic information, we need to look at Autonomous System Numbers (ASNs), using asncounter. This reports:

 count   percent ASN     AS
 463536  36.55   210906  BITE-US, LT
 152237  12.0    212286  LONCONNECT, GB
 65064   5.13    3257    GTT-BACKBONE GTT, US
 53927   4.25    7385    ABUL-14-7385, US
 45255   3.57    8075    MICROSOFT-CORP-MSN-AS-BLOCK, US
 32557   2.57    7029    WINDSTREAM, US
 32101   2.53    55286   SERVER-MANIA, CA
 30037   2.37    15169   GOOGLE, US
 24412   1.92    239     UTORONTO-AS, CA
 21745   1.71    7015    COMCAST-7015, US
 16311   1.29    64200   VIVIDHOSTING, US
 [...]

And then for prefixes:

 count   percent prefix  ASN     AS
 64312   5.07    138.226.96.0/20 3257    GTT-BACKBONE GTT, US
 43459   3.43    85.254.128.0/22 210906  BITE-US, LT
 43161   3.4     185.47.92.0/22  210906  BITE-US, LT
 43111   3.4     45.131.216.0/22 212286  LONCONNECT, GB
 43040   3.39    45.145.136.0/22 212286  LONCONNECT, GB
 42998   3.39    45.138.248.0/22 212286  LONCONNECT, GB
 42870   3.38    185.211.96.0/22 210906  BITE-US, LT
 32365   2.55    85.254.112.0/22 210906  BITE-US, LT
 26937   2.12    66.249.64.0/20  15169   GOOGLE, US
 23785   1.88    128.100.0.0/16  239     UTORONTO-AS, CA
 23088   1.82    45.154.148.0/22 212286  LONCONNECT, GB
 21767   1.72    85.254.42.0/23  210906  BITE-US, LT
 [and then five more BITE-US prefixes at the same
  volume level, then many more prefixes]

Given that we have two extremely prolific User-Agents, let's look at where those requests came from in specific, and you will probably not be surprised at the results:

 count   percent ASN     AS
 462925  54.37   210906  BITE-US, LT
 152155  17.87   212286  LONCONNECT, GB
 64321   7.55    3257    GTT-BACKBONE GTT, US
 53649   6.3     7385    ABUL-14-7385, US
 32287   3.79    7029    WINDSTREAM, US
 31955   3.75    55286   SERVER-MANIA, CA
 21710   2.55    7015    COMCAST-7015, US
 16304   1.92    64200   VIVIDHOSTING, US
 [...]

If you have the ability to block traffic by ASN and you don't need to accept requests from clouds and your traffic is anything like this, you can probably drop a lot of it quite easily.

I can ask a different question: if we exclude those two popular User-Agents and look only at successful requests (HTTP 200 responses), where do they come from?

 count   percent ASN     AS
 38821   11.61   8075    MICROSOFT-CORP-MSN-AS-BLOCK, US
 25510   7.63    15169   GOOGLE, US
 16968   5.07    239     UTORONTO-AS, CA
 12816   3.83    14618   AMAZON-AES, US
 11529   3.45    396982  GOOGLE-CLOUD-PLATFORM, US
 [...]

(There are about 334,000 of these in total.)

The 'UTORONTO-AS' listing includes our own monitoring, with its 10,000 odd requests. Many of Google's requests come from their 66.249.64.0/20 prefix, which is mostly or entirely used by various Google crawlers.

Around 138,000 requests were for a set of commonly used ML training data, and they probably account for most of the bandwidth used by this web server (which typically averages 40 Mbytes/sec of outgoing bandwidth all of the time on weekdays).

(I've previously done HTTP/2 stats for this server as of mid 2025.)

Some notes on using the Sec-CH-UA HTTP headers that Chrome supports

By: cks

A while back, Chrome proposed and implemented what are called user agent hints, which are a collection of Sec-CH-UA HTTP headers that can provide you with additional information about the browser beyond what the HTTP User-Agent header provides. As mentioned, only Chrome and browsers derived from Chromium (or if you prefer, 'Blink') support these headers, and only since early 2021 (for Chrome; later for some others). However, Chrome is what a lot of people use. More to the point, Chrome is what a lot of bad crawlers claim to be in their User-Agent header. As has been written up by other people, you can use these headers to detect inconsistencies that give away crawlers.

In an ideal world, it would be enough to detect a recent enough Chrome version and then require it to be consistent between the User-Agent, the platform from Sec-CH-UA-Platform, and the version information from Sec-CH-UA. We don't live in an ideal world. The first issue is that some versions of Chrome don't send these user agent hints by default (I've seen this specifically from Android Pixel devices). To get them to do so, you must reply with a HTTP 307 redirection that includes Accept-CH and Critical-CH headers for the Sec-CH-UA headers you care about. I'm not sure if you can redirect the browser to the current URL; I opt to redirect to the URL with a special query parameter added, which then redirects back to the original version of the URL.

(One advantage of this is that in my HTTP request handling, I can reject a request with the special query parameter if it still doesn't include the Sec-CH-UA headers I ask for. This avoids infinite redirect loops and lets me log definite failures. Chrome browser setups that refuse to provide them even when requested are currently redirected to an error page explaining the situation.)
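The redirect dance described above can be sketched roughly as follows. This is not DWiki's code; the marker parameter name `chua-retry`, the `decide()` function, and the choice of a plain 403 for the failure case (where the post redirects to an error page) are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical marker parameter; the post does not name the one it uses.
MARKER = "chua-retry"
# The hint headers we insist on (an assumption; pick the ones you care about).
WANTED = ["Sec-CH-UA", "Sec-CH-UA-Platform"]


def _has_hints(headers):
    """True if every wanted hint header is present (case-insensitive)."""
    present = {name.lower() for name in headers}
    return all(h.lower() in present for h in WANTED)


def decide(url, headers):
    """Return (status, extra_headers) for a request that claims to be Chrome."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    marked = any(key == MARKER for key, _ in query)

    if _has_hints(headers):
        if marked:
            # Hints arrived after our redirect: send the client back to the
            # original URL without the marker.
            clean = urlencode([(k, v) for k, v in query if k != MARKER])
            return 307, {"Location": urlunsplit(parts._replace(query=clean))}
        return 200, {}

    if marked:
        # We already asked once and still got nothing: a definite failure,
        # and stopping here avoids an infinite redirect loop.
        return 403, {}

    # First attempt: ask for the hints and redirect to the marked URL.
    hint_list = ", ".join(WANTED)
    marked_query = urlencode(query + [(MARKER, "1")])
    return 307, {
        "Location": urlunsplit(parts._replace(query=marked_query)),
        "Accept-CH": hint_list,
        "Critical-CH": hint_list,
    }
```

The marker in the URL is what lets the second request be told apart from the first without any server-side state.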

Cross checking the browser version from Sec-CH-UA against the 'browser version' in the User-Agent is complicated by the question of what is a browser version. This is especially the case because the 'brand names' used in Sec-CH-UA aren't necessarily the '<whatever>/<ver>' names used in the User-Agent; for example, Microsoft Edge will report itself as 'Microsoft Edge' in Sec-CH-UA but 'Edg/' in the User-Agent. Some browsers based on Chrome will report a Chrome version that is the same as their brand name version (this appears to be true for Edge, for example), but others definitely won't, so you may need a mapping table from brand name to User-Agent name if you want to go that far. Sometimes the best you can do is verify the claimed 'Chromium' version against the 'Chrome/' version from the User-Agent.
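The fallback check mentioned last, comparing the claimed 'Chromium' brand version against the 'Chrome/' version in the User-Agent, might look something like this minimal Python sketch. The function names are hypothetical, and the regular expression is a simplification rather than a full structured-header parser.

```python
import re

# Matches entries such as "Chromium";v="141" in a Sec-CH-UA value.
# This is a simplified pattern, not a full structured-field parser.
_BRAND_RE = re.compile(r'"((?:[^"\\]|\\.)*)"\s*;\s*v="(\d+)')
_CHROME_UA_RE = re.compile(r"Chrome/(\d+)")


def parse_sec_ch_ua(value):
    """Map brand name -> major version from a Sec-CH-UA header value."""
    return {brand: int(ver) for brand, ver in _BRAND_RE.findall(value)}


def chromium_consistent(sec_ch_ua, user_agent):
    """Compare the 'Chromium' brand version with the UA's Chrome/ version.

    Returns True or False, or None when either side is missing and no
    comparison is possible.
    """
    brands = parse_sec_ch_ua(sec_ch_ua)
    match = _CHROME_UA_RE.search(user_agent)
    if "Chromium" not in brands or match is None:
        return None
    return brands["Chromium"] == int(match.group(1))
```

Checking a brand such as 'Microsoft Edge' against 'Edg/' would need the brand-to-User-Agent mapping table discussed above, which this sketch deliberately leaves out.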

Platform names definitely require a mapping from the Sec-CH-UA-Platform value to what appears in the User-Agent. On top of that, sometimes browsers will change their User-Agent platform name without changing Sec-CH-UA-Platform. One case I know of is that some versions of Android Opera (and perhaps Chrome) will change their User-Agent to say they're on Linux if you have them ask for the 'desktop' version of a site, but still report the Android values in their Sec-CH-UA headers (and say that they aren't a mobile device in Sec-CH-UA-Mobile, which is fair enough). It's hard to object to this behavior in a world where User-Agent sniffing is one way that websites decide on regular versus 'mobile' versions.

My use of Sec-CH-UA checks here on Wandering Thoughts has turned up several sorts of bad behavior in crawlers (so far). As I sort of expected, the most common behavior is crawlers that claim to be Chrome in their User-Agent (or something derived from it) but don't supply any Sec-CH-UA headers; this is now a straightforward bad idea even if you mention your crawler in your User-Agent. Some crawlers report one Chrome version in Sec-CH-UA but another one in their User-Agent, usually with the User-Agent version being older. I suspect that these crawlers are based on Chromium and periodically update their Chromium version, but statically configure their User-Agent and don't update it. Some of these crawlers also report a different platform between Sec-CH-UA-Platform and their User-Agent (so far all of them have been running on macOS but saying they were Windows 10 or 11 machines in their User-Agent). The third case is things that report they are headless Chrome in their Sec-CH-UA header (and I reject them).

(This is where the Internet Archive gets a dishonorable mention; currently their crawling often has mismatched User-Agent and Sec-CH-UA headers. Sometimes they have a special marker in the User-Agent and sometimes it's just mismatched Chrome information.)

I've also seen some weird cases so far where a crawler provided Sec-CH-UA headers despite claiming to be Firefox in its User-Agent. My data so far is incomplete, but some of these have had mismatches between Sec-CH-UA-Platform and the User-Agent, while another claimed to be Chrome 88 (which in theory is before Chrome supported them) while saying it was Firefox 120 in its User-Agent. I've improved my logging and error reporting so I may get slightly better data on this in a while.

At the same time, checking Sec-CH-UA headers (and checking them against User-Agent headers) will definitely not defeat all bad crawlers. Some crawlers are clearly using either real browsers or software that fakes everything together properly. I suspect the latter because the most recent case involves a horde of IPs claiming to be Chrome 142 on macOS 10.15.7, which I doubt is so universal a configuration (especially on datacenter VPSes and servers). As with email spam, all of this is a constant race of heuristics against the bad actors.

(It's hard to judge my new Sec-CH-UA checks compared to my existing header checks because of check ordering. If I was sufficiently energetic I'd try to do all of the checks before rejecting anything and log all failed checks, but as it is I do checks one by one and reject (or redirect with Critical-CH) at the first failed one.)

Browser version numbers are a bit complicated (for server code)

By: cks

Suppose, not entirely hypothetically, that you're writing code that for some reason wants to determine a 'browser version' from something and then cross-check it against other sources of browser version information. Possibly you also want to notice when you're not working with real browsers and not apply your version consistency checks to them. When you're starting out, it looks like what your code should do is return a browser name and version number. Unfortunately, this is a naive view, partly because of all of the browsers based on Chrome (or Chromium) and partly because of mobile device WebViews, which reuse a browser engine without being the browser.

The theoretically correct and maximally flexible approach would be to parse all possible version indicators of everything from whatever source of information you're using, such as the browser User-Agent or user agent client hints, and return them as a big map, possibly augmented with your best guess at what the 'browser' as such is. If applied to a User-Agent string such as this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36 OPR/125.0.0.0

Parsing this might give you identifiers and versions of AppleWebKit 537.36, Chrome 141, Safari 537.36, and OPR 125, and you'd guess that the browser is Opera and it's based on Chromium 141 (which is potentially important for what features and behavior should be present). There are complications in parsing this, because sometimes you'll see "Mobile Safari/537.36", and sometimes you'll see mysterious additions like 'Version/4.0' or 'ABB/133.0.6943.51' (and I haven't even gone into what you might see on iOS). Simply fully parsing the User-Agent string is complicated (although there are projects that do this for you, such as the User Agent String Parser and the Python user-agents package).
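To make the 'big map' idea concrete, here is a deliberately naive Python sketch that pulls `name/version` tokens out of a User-Agent and makes a best guess at the browser. It ignores everything inside parentheses, drops bare words such as 'Mobile', and uses a short, hypothetical marker list; it is nowhere near a complete parser.

```python
import re

# Parenthesised comments such as "(Windows NT 10.0; Win64; x64)".
_COMMENT_RE = re.compile(r"\([^()]*\)")

# Hypothetical marker list, most specific first; real logs will need more.
_BROWSER_MARKERS = ["OPR", "Edg", "FxiOS", "CriOS", "Firefox", "Chrome"]


def parse_products(user_agent):
    """Return a {name: version} map of the product tokens in a User-Agent."""
    without_comments = _COMMENT_RE.sub(" ", user_agent)
    products = {}
    for token in without_comments.split():
        name, sep, version = token.partition("/")
        if sep and name and version:
            products[name] = version
    return products


def guess_browser(products):
    """Best guess at (marker, version) for the browser, or None."""
    for marker in _BROWSER_MARKERS:
        if marker in products:
            return marker, products[marker]
    return None


def chromium_major(products):
    """Major version from the Chrome/ token, assumed to be the Chromium version."""
    major = products.get("Chrome", "").split(".")[0]
    return int(major) if major.isdigit() else None
```

Applied to the Opera User-Agent above, `parse_products()` yields entries for Mozilla, AppleWebKit, Chrome, Safari, and OPR, and `guess_browser()` picks OPR because it comes first in the marker list.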

(For instance, did you know that Firefox reports its Gecko version in at least two ways? On desktop Firefox, it's always 'Gecko/20100101'. On Android Firefox, it can be 'Gecko/146.0', perhaps always matching the Firefox/ version.)

One problem is that a giant map is not necessarily entirely useful to code that wants to use browser version information, especially since the browser names in data may not match the common names you know them by. For example, on iOS devices Firefox reports 'FxiOS' and Chrome reports 'CriOS', which is in one sense accurate because these two iOS browsers don't have the behavior of their regular counterparts since they're built on top of Apple's WebKit, not their own browser engines (and as a result Chrome on iOS doesn't report user agent client hints). Do you want to treat FxiOS as a different browser from Firefox or not? That depends.

Currently, the minimum information I think you want to provide is the name and version of both the browser engine and the 'browser' itself. Given WebViews, Chromium, and other similar situations, you may not be able to reliably determine the browser, and sometimes you won't have either. When parsing the User-Agent string for Chrome, you don't get an explicit version for Chromium, so you have to assume it's the same as the Chrome version; for Chrome derived browsers I think you can assume that the 'Chrome/...' version reported is the version of their underlying Chromium. If present, the HTTP Sec-CH-UA header can give you the Chromium version directly and also perhaps tell you if you have a genuine Chrome or another brand where you (or your User-Agent parser) don't recognize their User-Agent marker.

It's now a bad idea to look like a browser in your HTTP User-Agent

By: cks

Once upon a time, something like the following was a perfectly decent User-Agent header string for a web crawler or a web fetching agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36 (compatible; Yourbot; +https://some/url)

You weren't hiding, after all, you called yourself 'Yourbot', and for the rest, you were asking for people to serve you pages like you were Chrome. Well, I'm not too sad to say, those days are over.

They're over because an increasing number of websites are increasingly requiring that anything that looks like a browser in its User-Agent also act like a browser, in specific the browser and browser version it's saying it is, and there are a lot of picky details around other HTTP headers (also). For example, often simply having 'Mozilla' in your User-Agent will cause Anubis to challenge your crawler (cf). And the version of Chrome being asserted here is new enough that it should be reporting a Sec-CH-UA-Platform header, among other Sec-CH- headers.

(Claiming to be a really old version of Chrome without those features is likely to be worse.)

Now, you can certainly pin your hopes on the idea that people who are writing header checking code will pay attention to the presence of the 'compatible;' and the URL in your User-Agent, and realize that you're not actually a browser despite you having a fairly good imitation of a Chrome User-Agent. However, you're not Google(bot). People have to make exceptions for Googlebot (to some degree), but they don't have to make exceptions for you and they probably won't.

The User-Agent you should instead use today is something like, for example:

Fedithing/4.5.1 (library/1.2.3; +https://some/url)

You don't start with a superstitious invocation of 'Mozilla/5.0', you don't claim to be like any version of any browser, and you put in the basics of identifying your software and yourself so no one can accuse you of hiding. No one is going to match your User-Agent against detectors for old versions of browsers, or things claiming to be browsers but lacking their headers, and so on, because you haven't put in the names of any browsers.

PS: Googlebot and Bingbot and a few others still use User-Agent strings very much like my first example, but they're Googlebot (and Bingbot) and to a fair extent they do get their HTTP headers relatively authentic.

Fake "web browsers" and their (lack of) HTTP headers: some notes

By: cks

It's hopefully not news to people that there is a plague of disguised web crawlers that are imitating web browsers (and not infrequently crawling from residential IPs, through various extremely questionable methods). However, many of these crawlers have only a skin-deep imitation of browsers, primarily done through their HTTP User-Agent header. This creates a situation where some of these crawlers can currently be detected (and blocked) because they either lack entirely or have non-browser values for other HTTP headers. I've been engaged in a little campaign to reduce the crawler presence here on Wandering Thoughts, so I've been experimenting with a number of HTTP header checks.

Headers I'm currently looking at include:

  • The CF-Worker header is set for all requests from Cloudflare Workers. Anubis blocks all requests with this header set by default (cf), and I decided to copy it. This occasionally blocks things trying to scrape Wandering Thoughts.

  • As I discovered, you can't block requests with X-Forwarded-For headers because people really do set these headers on real, non-malicious requests.

  • The Sec-Fetch-Mode header is sent by every modern browser and is sent by almost no bad crawlers. However, checking things claiming to be Safari is a little bit complicated, since Sec-Fetch-Mode support was only added in early 2023 (in 16.4) and there are still older Safari versions out there (including earlier 16.x versions). This is a quite effective check in my environment.

    (I got this trick from here, although apparently there may be trouble with mobile WebView interfaces, which might come about through in-app navigation if someone sends a URL around.)

  • Every mainstream browser sends an Accept-Encoding header and has for a long time. If it's missing for a fetch of a regular HTML page, you have an imposter. Unless you like maintaining a list of old browsers and other programs that don't send Accept-Encoding, you probably want to limit requiring the header to things claiming to be at least a bit like mainstream browsers.

  • Some bad bots are sending an Accept-Encoding of 'identity' in what is apparently an attempt to avoid being fed compression bombs by people (I can't find my source for this). No mainstream browser should do this and in general most things fetching web pages from you should accept compressed responses if they advertise an Accept-Encoding at all.

    Sadly, the exception to this is syndication feed fetchers, some of which refuse to do compression. Whether you keep supporting such feed fetchers is up to you. Wandering Thoughts still does so far, although it's getting tempting to say that enough is enough, especially with the size of syndication feeds here.

  • Some or perhaps many bad crawlers set a HTTP Accept header of '*/*' on HTML requests, which isn't something that real browsers do (source). Unfortunately, browser-based syndication feed fetchers will send this value, so you can only do this check on HTML pages, and also bingbot and Googlebot (at least) will sometimes also send this Accept value. Some things seem to not send an Accept header at all, too.

    Based on monitoring the results so far, there may be something funny going on; I've seen the same IP and User-Agent making an initial request that is fine and then one or more re-requests for the same URL that have 'Accept: */*' and fail.

  • A number of bad crawlers make HTTP/1.0 requests while claiming to be mainstream browsers, all of which have supported HTTP/1.1 for a very long time, and these days I block such requests. Although it's tempting to reject all HTTP/1.0 requests, some text-mode browsers still make them (the ones I know of are Lynx and w3m, including inside GNU Emacs). The HTTP version isn't really a HTTP header, but close enough.

Some of these checks overlap with each other. For example, the crawler with a bad Accept: HTTP header wasn't sending Sec-Fetch-Mode either.
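Several of the checks in the list above can be sketched as a single Python function. This is an illustration rather than DWiki's actual logic: the function name and problem labels are made up, it assumes the caller has already decided the request claims to be a mainstream browser fetching an HTML page, and it does not handle the exceptions noted above (older Safari versions without Sec-Fetch-Mode, Googlebot and bingbot sending 'Accept: */*', and so on).

```python
def browser_header_problems(http_version, headers):
    """Return a list of failed checks for a browser-claiming HTML request.

    http_version is a string such as "HTTP/1.1"; headers is a mapping of
    header names to values. An empty list means nothing looked wrong.
    """
    h = {name.lower(): value.strip() for name, value in headers.items()}
    problems = []

    # Requests from Cloudflare Workers carry this header.
    if "cf-worker" in h:
        problems.append("cf-worker")

    # Modern browsers send Sec-Fetch-Mode; most bad crawlers do not.
    if "sec-fetch-mode" not in h:
        problems.append("no-sec-fetch-mode")

    # Mainstream browsers always send Accept-Encoding, and never 'identity'.
    encoding = h.get("accept-encoding")
    if encoding is None:
        problems.append("no-accept-encoding")
    elif encoding.lower() == "identity":
        problems.append("identity-encoding")

    # Real browsers do not send a bare '*/*' Accept for HTML pages.
    if h.get("accept") == "*/*":
        problems.append("accept-any")

    # Mainstream browsers have spoken HTTP/1.1 or later for a long time.
    if http_version == "HTTP/1.0":
        problems.append("http-1.0")

    return problems
```

Returning every failed check instead of stopping at the first one makes it easier to see in logs how much the checks overlap.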

Many of these HTTP headers are only sent by relatively mainstream browsers and environments that have added support for recent HTTP headers. For example, people still use text-based browsers and most of them don't send headers like Sec-Fetch-Mode; other programs that make HTTP requests through various packages and libraries probably won't either.

There are probably other useful header differences between crawlers imitating mainstream browsers and actual browsers (and, apparently, between headless browsers being driven by automation and real ones being used by people). You could probably discover some of them by collecting enough of a data set of request headers and then doing some sort of statistical analysis to discover correlations and clusters.

PS: The big offenders for requesting uncompressed syndication feeds appear to be Tiny Tiny RSS, Selfoss, and Nextcloud-News. Some browser based syndication feed readers also appear to do it, as does some curl-based syndication feed fetching that people are doing here.

Sidebar: What is a (mainstream) browser-like User-Agent?

It depends on how restrictive you want to be. There are a lot of options:

  • Just look for "Mozilla/5.0 (" at the start of the User-Agent.
  • Also look for " Chrome/", " Firefox/", or " AppleWebKit/" in the User-Agent
  • Try to specifically match a Firefox or WebKit-based browser User-Agent format, which will cause you to learn a lot about what WebKit-based user agents appear in your logs.

  • Potentially exclude things that mark themselves as robots or crawlers, for example by having 'compatible;' in their User-Agent, or 'robot', or a URL. Anything with these markers is not trying to exactly be a browser User-Agent, although they may be looking generally like one.

I use different versions of these for different checks in DWiki's steadily growing pile of hacks to detect bad crawlers. Currently the most specific matching is reserved for blocking claimed browsers from cloud/server space, which catches a significant amount even with a limited selection of cloud and VPS provider space that it applies to.

(Some cloud space is blocked entirely; blocking only things that claim to be browsers is a lesser step.)
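The first, second, and last options in the list above are simple enough to sketch in Python. The function names and the exact patterns are assumptions made for illustration; the third option (matching specific Firefox or WebKit formats) is deliberately left out because it depends heavily on what shows up in your own logs.

```python
import re

_BROWSER_TOKEN_RE = re.compile(r" (?:Chrome|Firefox|AppleWebKit)/")
# Markers of something that admits to being a robot: 'compatible;',
# the word 'robot', or a URL.
_ROBOT_RE = re.compile(r"compatible;|robot|https?://", re.IGNORECASE)


def starts_like_browser(user_agent):
    """Loosest test: the User-Agent starts with 'Mozilla/5.0 ('."""
    return user_agent.startswith("Mozilla/5.0 (")


def names_browser_engine(user_agent):
    """Stricter test: also mentions Chrome/, Firefox/, or AppleWebKit/."""
    return starts_like_browser(user_agent) and bool(
        _BROWSER_TOKEN_RE.search(user_agent)
    )


def self_identified_robot(user_agent):
    """True if the User-Agent carries an obvious robot or crawler marker."""
    return bool(_ROBOT_RE.search(user_agent))


def browser_like(user_agent, strict=True):
    """Combine the tests, excluding things that mark themselves as robots."""
    if strict:
        base = names_browser_engine(user_agent)
    else:
        base = starts_like_browser(user_agent)
    return base and not self_identified_robot(user_agent)
```

Different checks can then call whichever level of strictness suits them.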

Self-hosting your Mastodon media with SeaweedFS


Mastodon 4.5.0 is here, and with it come some interesting changes that, in my opinion, might encourage more people to consider it for self-hosting their Fediverse community.

While it may not be as lightweight and simple as other solutions (like snac or GoToSocial or many others), I believe it remains one of the best platforms for managing a medium-sized Fediverse community, thanks in part to the direct feedback that many admins have provided to the developers.

I have previously written about how to install Mastodon in a FreeBSD jail and how to modify its character and poll limits.

One of the most critical initial decisions (which can be changed later, but with extra work) is where to store your media files. Mastodon downloads and re-processes all media it encounters from other instances for three main reasons:

  • Local Caching: Your users connect to your media server, reducing the load on the original instance.
  • Security: Re-processing media helps to remove any potential "impurities" before they reach the user's device.
  • Privacy: It prevents disclosing your users' IP addresses to other instances. A user will only connect to their own instance to fetch all data, including remote content.

At least initially, media files will be the largest part of your instance's storage footprint. It is therefore essential to plan where to store them and to add a regular cleanup script; otherwise, they will keep growing without limit.

Mastodon supports uploading media to external S3-compatible solutions, and many admins use the usual commercial providers, paying for data uploads and transfers.

I am a firm believer in "Own Your Data", so I have always used my own self-hosted S3 servers. I initially started with Minio, but over time, I realized that, by design, it doesn't perform well with a multitude of small files (performance degrades). After running some tests, I decided to switch to SeaweedFS.

SeaweedFS "is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek..." - this, combined with the fact that it is a mature and proven piece of software, was enough for me to give it a try. The result? Excellent. The I/O and CPU load on my media server dropped drastically, making SeaweedFS an incredibly suitable solution. Furthermore, some of its features (like the ability to run a filer.sync) allow for efficient and fast replication to other storage, another host, or... anything else.

SeaweedFS works perfectly with Mastodon, and I will explain the steps to get it into production.

I will install SeaweedFS in a dedicated jail and use a dedicated subdomain. This ensures that the media server can be moved to another host at any time without reconfiguring everything or changing domains. SeaweedFS has its own FreeBSD package, installable via pkg, or can be downloaded directly from the project's website.

In either case, I will describe a "test" setup - which can also be used in production without issues. However, I highly recommend diving deeper into the tool, as it is incredibly powerful and flexible and can solve many more problems than one might imagine.

Setting up the SeaweedFS Jail

First, let's create a dedicated jail with BastilleBSD:

bastille create media 14.3-RELEASE 10.0.0.66 bastille0

Now, let's enter the jail and install SeaweedFS (and tmux, which can be useful):

bastille console media
pkg install -y tmux seaweedfs

I suggest launching SeaweedFS in a tmux session so you can monitor its output. Later, you should configure an automatic startup method, such as using the included rc.d file or any other method you prefer.

Create a directory for the data and start SeaweedFS as the "seaweedfs" user:

mkdir -p /seaweedfs/data
chown -R seaweedfs /seaweedfs
su -m seaweedfs
cd /seaweedfs/
/usr/local/bin/weed server -dir /seaweedfs/data -s3

At this point, SeaweedFS will start and create everything it needs to function, including the S3 server.

Configuring Buckets and Users

Now, let's open the weed shell to create the necessary bucket and users:

weed shell
s3.bucket.create -name mastomedia

Still in the weed shell, create a user for Mastodon and grant read permissions for unauthenticated users (which is necessary to serve media to the world):

s3.configure -access_key=mastomedia -secret_key=CHANGEME -buckets=mastomedia -user=mastodon -actions=Read,Write,List,Tagging,Admin -apply
s3.configure -buckets=mastomedia -user=anonymous -actions=Read -apply
s3.configure -buckets=mastomedia -actions=Read -apply

Security Tip: For the -secret_key, avoid using a simple password. You can generate a strong, random key directly from your shell with a command like openssl rand -base64 32.

Done. SeaweedFS is now ready to receive (and serve) media. The next step is to set up a reverse proxy to serve everything over HTTPS. My preferred approach is to configure the system as if it were external, even if the services are in adjacent jails. This might use slightly more resources, but the time and trouble it saves in the future are well worth it.

Nginx Reverse Proxy Configuration

The reverse proxy can be configured something like this:

[...]

server {
   server_name  media.mastodon.example.com;

   ignore_invalid_headers off;
   client_max_body_size 0; # Allow large file uploads without Nginx limits

   location / {
      proxy_set_header Host $http_host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Forwarded-Proto $scheme;

      proxy_connect_timeout 300;
      proxy_http_version 1.1;
      proxy_set_header Connection "";
      chunked_transfer_encoding off;

      expires 1y;
      add_header Cache-Control public;

      add_header X-Cache-Status $upstream_cache_status;
      add_header X-Content-Type-Options nosniff;

      proxy_pass http://10.0.0.66:8333;
   }

# ... other server configurations like SSL ...

}

Mastodon Configuration

Now let's configure Mastodon. If you are running the setup wizard for the first time, here is a summary of the options:

[...]
Do you want to store uploaded files on the cloud? yes
Provider Minio
Minio endpoint URL: https://media.mastodon.example.com
Minio bucket name: mastomedia
Minio access key: mastomedia
Minio secret key: CHANGEME
Do you want to access the uploaded files from your own domain? Yes
Domain for uploaded files: media.mastodon.example.com

If Mastodon is already active, or once the setup is complete, the options in your .env.prod file should be modified to be consistent with what SeaweedFS expects:

S3_ENABLED=true
S3_PROTOCOL=https
S3_REGION=us-east-1
S3_ENDPOINT=https://media.mastodon.example.com
S3_HOSTNAME=media.mastodon.example.com
S3_BUCKET=mastomedia
AWS_ACCESS_KEY_ID=mastomedia
AWS_SECRET_ACCESS_KEY=CHANGEME
S3_FORCE_SINGLE_REQUEST=true
# remove the S3_ALIAS_HOST if it is set

IMPORTANT NOTE: If both services are in jails on the same host (i.e., SeaweedFS is on the same host as Mastodon), you should ensure that the Mastodon jail can reach the SeaweedFS jail through the reverse proxy and not via the external IP. To do this, add the following line to the /etc/hosts file of the Mastodon jail:

10.0.0.1        media.mastodon.example.com

In this example, the reverse proxy is at 10.0.0.1. If you are not using a separate reverse proxy but are exposing Nginx directly from the jail (as described in my Mastodon installation article), use the IP of the Mastodon jail itself instead (e.g., 10.0.0.42).

With this setup, Mastodon will be able to upload media to the SeaweedFS server and generate the correct links for other instances, public visitors, and users of your own instance.

Have fun with SeaweedFS!

Web Development Tip: Disable Pointer Events on Link Images

By: Nick Heer

Good tip from Jeff Johnson:

My business website has a number of “Download on the App Store” links for my App Store apps. Here’s an example of what that looks like:

[…]

The problem is that Live Text, “Select text in images to copy or take action,” is enabled by default on iOS devices (Settings → General → Language & Region), which can interfere with the contextual menu in Safari. Pressing down on the above link may select the text inside the image instead of selecting the link URL.

I love the Live Text feature, but it often conflicts with graphics like these. There is a good, simple, two-line CSS trick for web developers that should cover most situations. Also, if you rock a user stylesheet (and I think you should), it seems to work fine as a universal solution. Any issues I have found have been minor and not worth noting. I say give it a shot.

Update: Adding Johnson’s CSS to a user stylesheet mucks up the layout of Techmeme a little bit. You can exclude it by adding div:not(.ii) > before a:has(> img) { display: inline-block; }.


Do you care about (all) HTTP requests from cloud provider IP address space?

By: cks

About a month ago Mike Hoye wrote Raised Shields, in which Hoye said, about defending small websites from crawler abuse in this day and age:

If you only care about humans I strongly advise you to block every cloudhost subnet you can find, pretty easy given the effort they put into finding you. Most of the worst actors out there are living comfortably on Azure, GCP, Yandex and sometimes Huawei’s servers.

(As usual, there's no point in complaining about abusive crawlers to the cloud providers.)

I've said something similar on the Fediverse:

Today's idle thought: how many small web servers actually have any reason to accept requests from AWS or Google Cloud IP address space? If you search through your logs with (eg) grepcidr, you may find that there's little or nothing of value coming from there, and they sure are popular with LLM crawlers these days.

You definitely want to search your logs before doing this, and you may find that you want to make some exceptions even if you do opt for it. For example, you might want or need to let cloud-hosted things fetch your syndication feeds, because there are a fair number of people and feed readers that do their fetching from the cloud. Possibly you'll find that you have a significant number of real visitors that are using do it yourself personal VPN setups that have cloud exit points.

(How many exceptions you want to make may depend on how much of a hard line you want to take. I suspect that Mike Hoye's line is much harder than mine.)

However, I think that for a lot of small, personal web servers and web sites you'll find that almost nothing of genuine value comes from the big cloud provider networks, from AWS, Google Cloud, Azure, Oracle, and so on. You're probably not getting real visitors from these clouds, people who are interested in reading your work and engaging with it. Instead you'll most likely see an ever-growing horde of obvious crawlers, increasingly suspicious user agents, claims to be things that they aren't, and so on.

On the one hand, it's in some sense morally pure to not block these cloud areas unless they're causing your site active harm; it's certainly what the ethos was on the older Internet, and it was a good and useful ethos for those times. On the other hand, that view is part of what got us here. More and more, these days are the days of Raised Shields, as we react to the new environment (much as email had to react to the new environment of ever increasing spam).

If you're doing this, one useful trick you can play if you have the right web server environment is to do your blocking with HTTP 429 Too Many Requests responses. Using this HTTP code is in some sense inaccurate, but it has the useful effect that very few things will take it as a permanent error the way they may take, for example, HTTP 403 (or HTTP 404). This gives you a chance to monitor your web server logs and add a suitable exemption for traffic that you turn out to want after all, without your error responses doing anything permanent (like potentially removing your pages from search engine indexes). You can also arrange to serve up a custom error page for this case, with an explanation or a link to an explanation.
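A minimal Python sketch of this approach, assuming your request handling lets you run a check before serving a page. The network list uses documentation-only address ranges as placeholders, and the exempt path prefixes and function name are hypothetical; substitute the cloud ranges and exceptions that your own logs justify.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), standing in for
# whatever cloud provider ranges you decide to block.
CLOUD_NETS = [
    ipaddress.ip_network(n) for n in ("198.51.100.0/24", "203.0.113.0/24")
]

# Hypothetical exemptions, for example syndication feeds that cloud-hosted
# feed readers legitimately fetch.
EXEMPT_PATH_PREFIXES = ("/feed", "/atom")


def cloud_response(client_ip, path):
    """Return (status, headers) to send immediately, or None to carry on."""
    ip = ipaddress.ip_address(client_ip)
    if not any(ip in net for net in CLOUD_NETS):
        return None
    if path.startswith(EXEMPT_PATH_PREFIXES):
        return None
    # 429 rather than 403 or 404, so that clients are unlikely to treat
    # the block as a permanent error.
    return 429, {"Content-Type": "text/html; charset=utf-8"}
```

The body served with the 429 can be your custom error page with an explanation or a link to one.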

(My view is that serving a 400-series HTTP error response is better than a HTTP 302 temporary redirect to your explanation, for various reasons. Possibly there are clever things you can do with error pages in general.)

People are sending HTTP requests with X-Forwarded-For across the Internet

By: cks

Over on the Fediverse, I shared a discovery that came from turning over some rocks here on Wandering Thoughts:

This is my face when some people out there on the Internet send out HTTP requests with X-Forwarded-For headers, and maybe even not maliciously or lying. Take a bow, ZScaler.

The HTTP X-Forwarded-For header is something that I normally expect to see only on something behind a reverse proxy, where the reverse proxy frontend is using it to tell the backend the real originating IP (which is otherwise not available when the HTTP requests are forwarded with HTTP). As a corollary of this usage, if you're operating a reverse proxy frontend you want to remove or rename any X-Forwarded-For headers that you receive from the HTTP client, because it may be trying to fool your backend about who it is. You can use another X- header name for this purpose if you want, but using X-Forwarded-For has the advantage that it's a de-facto standard and so random reverse proxy aware software is likely to have an option to look at X-Forwarded-For.

(See, for example, the security and privacy concerns section of the MDN page.)
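
What a reverse proxy frontend should do with the header can be sketched in a few lines of Python. This is an illustration, not any particular proxy's implementation; the function name and the list-of-pairs header representation are assumptions.

```python
def headers_for_backend(client_headers, client_ip):
    """Build the header list a reverse proxy frontend passes to its backend.

    Any X-Forwarded-For supplied by the client is dropped, then a single
    one is added based on the address the frontend actually saw.
    """
    cleaned = [
        (name, value)
        for name, value in client_headers
        if name.lower() != "x-forwarded-for"  # header names are case-insensitive
    ]
    cleaned.append(("X-Forwarded-For", client_ip))
    return cleaned
```

The backend then only ever sees the frontend's own claim about the origin IP, whatever the client tried to assert.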

Wandering Thoughts doesn't run behind a reverse proxy, and so I assumed that I wouldn't see X-Forwarded-For headers if I looked for them. More exactly, I assumed that I could take the presence of an X-Forwarded-For header as an indication of a bad request. As I found out, this doesn't seem to be the case; one source of apparently legitimate traffic to Wandering Thoughts appears to attach what are probably legitimate X-Forwarded-For headers to requests going through it. I believe this particular place operates partly as a (forward) HTTP proxy; if they aren't making up the X-Forwarded-For IP addresses, they're willing to leak the origin IPs of people using them to third parties.

All of this makes me more curious than usual to know what HTTP headers and header values show up on requests to Wandering Thoughts. But not curious enough to stick in logging, because that would be quite verbose unless I could narrow things down to only some requests. Possibly I should stick in logging that can be quickly turned on and off, so I can dump header information only briefly.
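
One possible shape for logging that can be quickly turned on and off, as a sketch that assumes a WSGI-style environ dictionary: headers are only written while a flag file exists. The flag file path and function names are hypothetical, and this is not how DWiki does anything today.

```python
import os

# Hypothetical flag file: create it to start logging, remove it to stop.
FLAG_FILE = "/tmp/log-http-headers"


def request_headers(environ):
    """Recover HTTP header names and values from a WSGI environ dict."""
    headers = {}
    for key, value in environ.items():
        if key.startswith("HTTP_"):
            # HTTP_X_FORWARDED_FOR -> X-Forwarded-For
            name = key[5:].replace("_", "-").title()
            headers[name] = value
    return headers


def maybe_log_headers(environ, log, flag_file=FLAG_FILE):
    """Write the request's headers to `log` only while the flag file exists."""
    if not os.path.exists(flag_file):
        return False
    for name, value in sorted(request_headers(environ).items()):
        log.write(f"{name}: {value}\n")
    return True
```

The cost when logging is off is one existence check per request, and turning it on or off doesn't need a restart.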

(These days I've periodically wound up in a mood to hack on DWiki, the underlying engine behind Wandering Thoughts. It reminds me that I enjoy programming.)

Getting feedback as a small web crawler operator

By: cks

Suppose, hypothetically, that you're trying to set up a small web crawler for a good purpose. These days you might be focused on web search for text focused sites, or small human written sites, or similar things, and certainly given the bad things that are happening with the major crawlers we could use them. As a small crawler, you might want to get feedback and problem reports from web site operators about what your crawler is doing (or not doing). As it happens, I have some advice and views on this.

  • Above all, remember that you are not Google or even Bing. Web site operators need Google to crawl them, and they have no choice but to bend over backward for Google and to send out plaintive signals into the void if Googlebot is doing something undesirable. Since you're not Google and you need websites much more than they need you, the simplest thing for website operators to do with and about your crawler is to ignore the issue, potentially block you if you're causing problems, and move on.

    You cannot expect people to routinely reach out to you. Anyone who does reach out to you is axiomatically doing you a favour, at the expense of some amount of their limited time and at some risk to themselves.

  • Website operators have no reason to trust you or trust that problem reports will be well received. This is a lesson plenty of people have painfully learned from reporting spam (email or otherwise) and other abuse; a lot of the time your reports can wind up in the hands of people who aren't well intentioned toward you (either going directly to them or 'helpfully' being passed on by the ISP). At best you confirm that your email address is alive and get added to more spam address lists; at worst you get abused in various ways.

    The consequence of this is that if you want to get feedback, you should make it as low-risk as possible for people. The lowest risk way (to website operators) is for you to have a feedback form on your site that doesn't require email or other contact methods. If you require that website operators reveal their email addresses, social media handles, or whatever, you will get much less feedback (this includes VCS forge handles if you force them to make issue reports on some VCS forge).

    (This feedback form should be easy to find, for example being directly linked from the web crawler information URL in your User-Agent.)

  • As far as feedback goes, both your intentions and your views on the reasonableness of what your web crawler is doing (and how someone's website behaves) are irrelevant. What matters is the views of website operators, who are generally doing you a favour by not simply blocking or ignoring your crawler and moving on. If you disagree with their feedback, the best thing to do is be quiet (and maybe say something neutral if they ask for a reply). This is probably most important if your feedback happens through a public VCS forge issue tracker, where future people who are thinking about filing an issue the way you asked may skim over past issues to see how they went.

    (You may or may not ignore website operator feedback that you disagree with depending on how much you want to crawl (all of) their site.)

At the moment, most website operators who notice a previously unknown crawler will likely assume that it's an (abusive) LLM crawler. One way to lower the chances of this is to follow social conventions around crawlers for things like crawler User-Agents and not setting the Referer header. I don't think you have to completely imitate how Googlebot, bingbot, Applebot, the archive.org bot and so on format their User-Agent strings, but it's going to help to generally look like them and clearly put the same sort of information into yours. Similarly, if you can it will help to crawl from clearly identified IPs with reverse DNS. The more that people think you're legitimate and honest, the more likely they are to spend the time and take the risk to give you feedback; the more sketchy or even uncertain you look, the less likely you are to get feedback.

(In general, any time you make website operators uncertain about an aspect of your web crawler, some number of them will not be charitable in their guess. The more explicit and unambiguous you are in the more places, the better.)
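
As a small illustration of these conventions using Python's standard urllib, here is a sketch of a request that carries an identifying User-Agent with an information URL and sets no Referer. The crawler name and URL are invented placeholders, and the exact User-Agent format is only one plausible imitation of what the established crawlers do.

```python
import urllib.request

# Hypothetical crawler name and information URL; substitute your own.
CRAWLER_UA = "ExampleCrawler/0.1 (+https://crawler.example.org/about.html)"


def build_request(url):
    """Build a GET request with an identifying User-Agent and no Referer."""
    return urllib.request.Request(url, headers={"User-Agent": CRAWLER_UA})
```

The page at the information URL is where the feedback form would be linked from.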

Building and running a web crawler is not an easy thing on today's web. It requires both technical knowledge of various details of HTTP and how you're supposed to react to things (eg), and current social knowledge of what is customary and expected of web crawlers, as well as what you may need to avoid (for example, you may not want to start your User-Agent with 'Mozilla/5.0' any more, and in general the whole anti-crawling area is rapidly changing and evolving right now). Many website operators revisit blocks and other reactions to 'bad' web crawlers only infrequently, so you may only get one chance to get things right. This expertise can't be outsourced to a random web crawling library because many of them don't have it either.

(While this entry was sparked by a conversation I had on the Fediverse, I want to be explicit that it is in no way intended as a subtoot of that conversation. I just realized that I had some general views that didn't fit within the margins of Fediverse posts.)

Firefox's sudden weird font choice and fixing it

By: cks

Today, while I was in the middle of using my normal browser instance, it decided to switch from DejaVu Sans to Noto Sans as my default font:

Dear Firefox: why are you using Noto Sans all of a sudden? I have you set to DejaVu Sans (and DejaVu everything), and fc-match 'sans' and fc-match serif both say they're DejaVu (and give the DejaVu TTF files). This is my angry face.

This is a quite noticeable change for me because it changes the font I see on Wandering Thoughts, my start page, and other things that don't set any sort of explicit font. I don't like how Noto Sans looks and I want DejaVu Sans.

(I found out that it was specifically Noto Sans that Firefox was using all of a sudden through the Web Developer tools 'Font' information, and confirmed that Firefox should still be using DejaVu through the way to see this in Settings.)

After some flailing around, it appears that what I needed to do to fix this was explicitly set about:config's font.name.serif.x-western, font.name.sans-serif.x-western, and font.name.monospace.x-western to specific values instead of leaving them set to nothing, which seems to have caused Firefox to arrive at Noto Sans through some mysterious process (since the generic system font name 'sans' was still mapping to DejaVu Sans). I don't know if these are exposed through the Fonts advanced options in Settings → General, which are (still) confusing in general. It's possible that these are what are used for 'Latin'.

(I used to be using the default 'sans', 'serif', and 'monospace' font names that cascaded through to the DejaVu family. Now I've specifically set everything to the DejaVu set, because if something in Fedora or Firefox decides that the default mapping should be different, I don't want Firefox to follow it, I want it to stay with DejaVu.)

I don't know why Firefox would suddenly decide these pages are 'western' instead of 'unicode'; all of them are served as or labeled as UTF-8, and nothing about that has changed recently. Unfortunately, as far as I know there's no way to get Firefox to tell you what font.name preference name it used to pick (default) fonts for a HTML document. When it sends HTTP 304 Not Modified responses, Wandering Thoughts doesn't include a Content-Type header (with the UTF-8 character set), but as far as I know that's a standard behavior and browsers presumably cope with it.

(Firefox does see 'Noto Sans' as a system UI font, which it uses on things like HTML form buttons, so it didn't come from nowhere.)

It makes me sad that Firefox continues to have no global default font choice. You can set 'Unicode' but as I've just seen, this doesn't make what you set there the default for unset font preferences, and the only way to find out what unset font preferences you have is to inspect about:config.

PS: For people who aren't aware of this, it's possible for Firefox to forget some of your about:config preferences. Working around this probably requires using Firefox policies (via), which can force-set arbitrary about:config preferences (among other things).

A HTTP User-Agent that claims to be Googlebot is now a bad idea

By: cks

Once upon a time, people seem to have had a little thing for mentioning Googlebot in their HTTP User-Agent header, much like browsers threw in claims to make them look like Firefox or whatever (the ultimate source of the now-ritual 'Mozilla/5.0' at the start of almost every browser's User-Agent). People might put in 'allow like Googlebot' or just say 'Googlebot' in their User-Agent. Some people are still doing this today, for example:

Gwene/1.0 (The gwene.org rss-to-news gateway) Googlebot

This is now an increasingly bad idea on the web and if you're doing it, you should stop. The problem is that there are various malicious crawlers out there claiming to be Googlebot, and Google publishes their crawler IP address ranges. Anything claiming to be Googlebot that is not from a listed Google IP is extremely suspicious and in this day and age of increasing anti-crawler defenses, blocking all 'Googlebot' activity that isn't from one of their listed IP ranges is an obvious thing to do. Web sites may go even further and immediately taint the IP address or IP address range involved in impersonating Googlebot, blocking or degrading further requests regardless of the User-Agent.

(Gwene is not exactly claiming to be Googlebot but they're trying to get simple Googlebot-recognizers to match them against Googlebot allowances. This is questionable at best. These days such attempts may do more harm than good as they get swept up in precautions against Googlebot forgery, or rules that block Googlebot from things it shouldn't be fetching, like syndication feeds.)

A similar thing applies to bingbot and the User-Agent of any other prominent web search engines, and Bing does publish their IP address ranges. However, I don't think I've ever seen someone impersonate bingbot (which probably doesn't surprise anyone). I don't know if anyone ever impersonates Archive.org (no one has in the past week here), but it's possible that crawler operators will fish to see if people give special allowances to them that can be exploited.

(The corollary of this is that if you have a website, an extremely good signal of bad stuff is someone impersonating Googlebot and maybe you could easily block that. I think this would be fairly easy to do in an Apache <If> clause that then Allow's from Googlebot's listed IP addresses and Denies everything else, but I haven't actually tested it.)
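
The same check can be sketched outside of Apache. The Python below is an untested-in-production illustration, not the Apache <If> version: it flags any request whose User-Agent mentions Googlebot but whose source IP is not in a list of ranges. The ranges here are placeholders from the documentation address blocks; a real deployment would load the ranges Google publishes.

```python
import ipaddress

# Placeholder ranges (documentation address blocks), NOT Google's real ones.
LISTED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_forged_googlebot(user_agent, remote_ip, ranges=LISTED_RANGES):
    """True if the User-Agent mentions Googlebot but the IP is not listed."""
    if "googlebot" not in user_agent.lower():
        return False
    address = ipaddress.ip_address(remote_ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return not any(address in network for network in ranges)
```

Note that a simple substring match like this also catches User-Agents such as the Gwene one above, which is exactly the sort of collateral damage described.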

Trying to understand Firefox's approaches to tracking cookie isolation

By: cks

As I learned recently, modern versions of Firefox have two different techniques that try to defeat (unknown) tracking cookies. As covered in the browser addon JavaScript API documentation, in Tracking protection, these are called first-party isolation and dynamic partitioning (or storage partitioning, the documentation seems to use both). Of these two, first party isolation is the easier to describe and understand. To quote the documentation:

When first-party isolation is on, cookies are qualified by the domain of the original page the user visited (essentially, the domain shown to the user in the URL bar, also known as the "first-party domain").

(In practice, this appears to be the site's registrable domain, what is often called eTLD+1, not necessarily the site's own hostname. For example, Cookie Manager reports that a cookie set from '<...>.cs.toronto.edu' has the first party domain 'toronto.edu'.)

Storage partitioning is harder to understand, and again I'll quote the Storage partitioning section of the cookie API documentation:

When using dynamic partitioning, Firefox partitions the storage accessible to JavaScript APIs by top-level site while providing appropriate access to unpartitioned storage to enable common use cases. [...]

Generally, top-level documents are in unpartitioned storage, while third-party iframes are in partitioned storage. If a partition key cannot be determined, the default (unpartitioned storage) is used. [...]

If you read non-technical writeups like Firefox rolling out Total Cookie Protection (from 2022), it certainly sounds like they're describing first-party isolation. However, if you check things like Status of partitioning in Firefox and the cookies API documentation on first-party isolation, as far as I can tell what Firefox actually normally uses for "Total Cookie Protection" is storage partitioning.

Based on what I can decode from the two descriptions and from the fact that Tor Browser defaults to first-party isolation, it appears that first-party isolation is better and stricter than storage partitioning. Presumably it also causes problems on more websites, enough so that Firefox either no longer uses it for Total Cookie Protection or never did, despite their description sounding like first-party isolation.

(So far I haven't run into any issues with first-party isolation in my cookie-heavy browser environment. It's possible that websites have switched how they do things to avoid problems.)

First-party isolation can be enabled in about:config by setting privacy.firstparty.isolate to true. If and when you do this, the normal Settings → Privacy and Security will show a warning banner at the top to the effect of:

You are using First Party Isolation (FPI), which overrides some of Firefox’s cookie settings.

All of this is relevant to me because one of my add-ons, Cookie AutoDelete, probably works with first-party isolation but almost certainly doesn't work with storage partitioning (ie, it will fail to delete some cookies under storage partitioning, although I believe it can still delete unpartitioned cookies). Given what I've learned, I'm likely to turn on first-party isolation in my main browser environment soon.

If Cookie Manager is reporting correct information to me, it's possible to have cookies that are both first-party isolated and partitioned; the one I've seen so far is from Youtube. Cookie Manager can't seem to remove these cookies. Based on what I've read about (storage or dynamic) partitioned cookies, I suspect that these are created by embedded iframes.

(Turning on or off first-party isolation effectively drops all of the cookies you currently have, so it's probably best to do it when you restart your browser.)

Firefox, the Cookie AutoDelete add-on, and "Total Cookie Protection"

By: cks

In a comment on my entry on flailing around with Firefox's Multi-Account Containers, Ian Z aka nobrowser asked a good question:

The Cookie Autodelete instructions with respect to Total Cookie Protection mode are very confusing. Reading them makes me think this extension is not for me, as I have Strict Mode on in all windows, private or not. [...]

This is an interesting question (and, it turns out, relevant to my usage too) so I did some digging. The short answer is that I suspect the warning on Cookie AutoDelete's add-on page is out of date and it works fine. The long answer starts with the history of HTTP cookies.

Back in the old days, HTTP cookies were global, which is to say that browsers kept a global pool of HTTP cookies (both first party, from the website you were on, and third-party cookies), and it would send any appropriate cookie on any HTTP request to its site. This enabled third-party tracking cookies and a certain amount of CSRF attacks, since the browser would happily send your login cookies along with that request initiated by the JavaScript on some sketchy website you'd accidentally wound up on (or JavaScript injected through an ad network).

This was obviously less than ideal and people wound up working to limit the scope of HTTP cookies, starting with things like Firefox's containers and eventually escalating to first-party cookie isolation, where a cookie is restricted to whatever the first-party domain was when it was set. If you're browsing example.org and the page loads google.com/tracker, which sets a tracker cookie, that cookie will not be sent when you browse example.com and the page also loads google.com/tracker; the first tracking cookie is isolated to example.org.
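
To make the example.org versus example.com case concrete, here is a toy model in Python. It is only a sketch of the idea, not how Firefox stores cookies: the class and method names are invented, and the only point is that the lookup key includes the first-party domain.

```python
class IsolatedCookieJar:
    """Toy cookie store keyed by (first-party domain, cookie domain)."""

    def __init__(self):
        self._cookies = {}

    def set_cookie(self, first_party, cookie_domain, name, value):
        """Store a cookie under the first-party domain being browsed."""
        self._cookies.setdefault((first_party, cookie_domain), {})[name] = value

    def cookies_for(self, first_party, cookie_domain):
        """Return only the cookies set under this same first-party domain."""
        return dict(self._cookies.get((first_party, cookie_domain), {}))


jar = IsolatedCookieJar()
# Browsing example.org, the page loads google.com/tracker, which sets a cookie.
jar.set_cookie("example.org", "google.com", "tracker", "abc123")
# Back on example.org the tracker sees its cookie again...
on_org = jar.cookies_for("example.org", "google.com")
# ...but on example.com it starts from nothing.
on_com = jar.cookies_for("example.com", "google.com")
```

A global cookie pool is the same structure with the first-party domain dropped from the key, which is why the tracker could follow you between sites.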

(There is also storage isolation for cookies, but I think that's been displaced by first-party cookie isolation.)

However, first-party isolation has the possibility to break things you expect to work, as covered in this Firefox FAQ. As a result of this, my impression is that browsers have been cautious and slow to roll out first-party isolation by default. However, they have made it available as an option or part of an option. Firefox calls this Total Cookie Protection (also, also).

(Firefox is working to go even further, blocking all third-party cookies.)

Firefox add-ons have special APIs that allow them to do privileged things, and these include an API for dealing with cookies. When first-party cookie isolation came to pass, these APIs needed to be updated to deal with such isolated cookies (and cookie tracking protection in general). For instance, cookies.remove() has to be passed a special parameter to remove a first-party isolated cookie. As covered in the documentation, an add-on using the cookies APIs without the necessary updates would only see non-isolated cookies, if there were any. So at the time the message on Cookie AutoDelete's add-on page was written, I suspect that it hadn't been updated for first-party isolation. However, based on checking the source code of Cookie AutoDelete, I believe that it currently supports first-party isolation for cookies, and in fact may have done so for some time, perhaps v3.5.0, or v3.4.0 or even earlier.

(It's also possible that this support is incomplete or buggy, or that there are still some things that you can't easily do through it that matter to Cookie AutoDelete.)

Cookie AutoDelete itself is potentially useful even if you have Firefox set to block all third-party cookies, because it will also clean up unwanted first-party cookies (assuming that it truly works with first-party isolation). Part of my uncertainty is that I'm not sure how you reliably find out what cookies you have in a browser world with first-party isolation. There's theoretically some information about this in Settings → Privacy & Security → Cookies and Site Data → "Manage Data...", but since that's part of the normal Settings UI that normal people use, I'm not sure if it's simplifying things.

PS: Now that I've discovered all of this, I'm not certain if my standard Cookie Quick Manager add-on properly supports first-party isolated cookies. There's this comment on an issue that suggests it does support first-party isolation but not storage partitioning (also). The available Firefox documentation and Settings UI is not entirely clear about whether first-party isolation is now on more or less by default.

(That comment points to Cookie Manager as a potential partition-aware cookie manager.)

My flailing around with Firefox's Multi-Account Containers

By: cks

I have two separate Firefox environments. One of them is quite locked down so that it blocks JavaScript by default, doesn't accept cookies, and so on. Naturally this breaks a lot of things, so I have a second "just make it work" environment that runs all the JavaScript, accepts all the cookies, and so on (although of course I use uBlock Origin, I'm not crazy). This second environment is pretty risky in the sense that it's going to be heavily contaminated with tracking cookies and so on, so to mitigate the risk (and make it a better environment to test things in), I have this Firefox set to discard cookies, caches, local storage, history, and so on when it shuts down.

In theory how I use this Firefox is that I start it when I need to use some annoying site I want to just work, use the site briefly, and then close it down, flushing away all of the cookies and so on. In practice I've drifted into having a number of websites more or less constantly active in this "accept everything" Firefox, which means that I often keep it running all day (or longer at home) and all of those cookies stick around. This is less than ideal, and is a big reason why I wish Firefox had an 'open this site in a specific profile' feature. Yesterday, spurred on by Ben Zanin's Fediverse comment, I decided to make my "accept everything" Firefox environment more complicated in the pursuit of doing better (ie, throwing away at least some cookies more often).

First, I set up a combination of Multi-Account Containers for the basic multi-container support and FoxyTab to assign wildcarded domains to specific containers. My reason to use Multi-Account Containers and to confine specific domains to specific containers is that both M-A C itself and my standard Cookie Quick Manager add-on can purge all of the cookies and so on for a specific container. In theory this lets me manually purge undesired cookies, or all cookies except desired ones (for example, my active Fediverse login). Of course I'm not likely to routinely manually delete cookies, so I also installed Cookie AutoDelete with a relatively long timeout and with its container awareness turned on, and exemptions configured for the (container-confined) sites that I'm going to want to retain cookies from even when I've closed their tab.

(It would be great if Cookie AutoDelete supported different cookie timeouts for different containers. I suspect it's technically possible, along with other container-aware cookie deletion, since Cookie AutoDelete applies different retention policies in different containers.)

In FoxyTab, I've set a number of my containers to 'Limit to Designated Sites'; for example, my 'Fediverse' container is set this way. The intention is that when I click on an external link in a post while reading my Fediverse feed, any cookies that external site sets don't wind up in the Fediverse container; instead they go either in the default 'no container' environment or in any specific container I've set up for them. As part of this I've created a 'Cookie Dump' container that I've assigned as the container for various news sites and so on where I actively want a convenient way to discard all their cookies and data (which is available through Multi-Account Containers).

Of course if you look carefully, much of this doesn't really require Multi-Account Containers and FoxyTab (or containers at all). Instead I could get almost all of this just by using Cookie AutoDelete to clean out cookies from closed sites after a suitable delay. Containers do give me a bit more isolation between the different things I'm using my "just make it work" Firefox for, and maybe that's important enough to justify the complexity.

(I still have this Firefox set to discard everything when it exits. This means that I have to re-log-in every so often even for the sites where I have Cookie AutoDelete keep cookies, but that's fine.)

I wish Firefox Profiles supported assigning websites to profiles

By: cks

One of the things that Firefox is working on these days is improving Firefox's profiles feature so that it's easier to use them. Firefox also has an existing feature that is similar to profiles, in containers and the Multi-Account Containers extension. The reason Firefox is tuning up profiles is that containers only separate some things, while profiles separate pretty much everything. A profile has a separate set of about:config settings, add-ons, add-on settings, memorized logins, and so on. I deliberately use profiles to create two separate and rather different Firefox environments. I'd like to have at least two or three more profiles, but one reason I've been lazy is that the more profiles I have, the more complex getting URLs into the right profile is (even with tooling to help).

This leads me to my wish for profiles, which is for profiles to support the kind of 'assign website to profile' and 'open website in profile' features that you currently have with containers, especially with the Multi-Account Containers extension. Actually I would like a somewhat better version than Multi-Account Containers currently offers, because as far as I can see you can't currently say 'all subdomains under this domain should open in container X' and that's a feature I very much want for one of my use cases.

(Multi-Account Containers may be able to do wildcarded subdomains with an additional add-on, but on the other hand apparently it may have been neglected or abandoned by Mozilla.)

Another way to get much of what I want would be for some of my normal add-ons to be (more) container aware. I could get a lot of the benefit of profiles (although not all of them) by using Multi-Account Containers with container aware cookie management in, say, Cookie AutoDelete (which I believe does support that, although I haven't experimented). Using containers also has the advantage that I wouldn't have to maintain N identical copies of my configuration for core extensions and bookmarklets and so on.

(I'm not sure what you can copy from one profile to a new one, and you currently don't seem to get any assistance from Firefox for it, at least in the old profile interface. This is another reason I haven't gone wild on making new Firefox profiles.)

What little I want out of web "passkeys" in my environment

By: cks

WebAuthn is yet another attempt to do an API for web authentication that doesn't involve passwords but that instead allows browsers, hardware tokens, and so on to do things more securely. "Passkeys" (also) is the marketing term for a "WebAuthn credential", and an increasing number of websites really, really want you to use a passkey for authentication instead of any other form of multi-factor authentication (they may or may not still require your password).

Most everyone that wants you to use passkeys also wants you to specifically use highly secure ones. The theoretically most secure are physical hardware security keys, followed by passkeys that are stored and protected in secure enclaves in various ways by the operating system (provided that the necessary special purpose hardware is available). Of course the flipside of 'secure' is 'locked in', whether locked in to your specific hardware key (or keys, generally you'd better have backups) or locked in to a particular vendor's ecosystem because their devices are the only ones that can possibly use your encrypted passkey vault.

(WebAuthn neither requires nor standardizes passkey export and import operations, and obviously security keys are built to not let anyone export the cryptographic material from them, that's the point.)

I'm extremely not interested in the security versus availability tradeoff that passkeys make in favour of security. I care far more about preserving availability of access to my variety of online accounts than about nominal high security. So if I'm going to use passkeys at all, I have some requirements:

Linux people: is there a passkeys implementation that does not use physical hardware tokens (software only), is open source, works with Firefox, and allows credentials to be backed up and copied to other devices by hand, without going through some cloud service?

I don't think I'm asking for much, but this is what I consider the minimum for me actually using passkeys. I want to be 100% sure of never losing them because I have multiple backups and can use them on multiple machines.

Apparently KeePassXC more or less does what I want (when combined with its Firefox extension), and it can even export passkeys in a plain text format (well, JSON). However, I don't know if anything else can ingest those plain text passkeys, and I don't know if KeePassXC can be told to only do passkeys with the browser and not try to take over passwords.

(But at least a plain text JSON backup of your passkeys can be imported into another KeePassXC instance without having to try to move, copy, or synchronize a KeePassXC database.)

Normally I would ignore passkeys entirely, but an increasing number of websites are clearly going to require me to use some form of multi-factor authentication, no matter how stupid this is (cf), and some of them will probably require passkeys or at least make any non-passkey option very painful. And it's possible that reasonably integrated passkeys will be a better experience than TOTP MFA with my janky minimal setup.

(Of course KeePassXC also supports TOTP, and TOTP has an extremely obvious import process that everyone supports, and I believe KeePassXC will export TOTP secrets if you ask nicely.)

While KeePassXC is okay, what I would really like is for Firefox to support 'memorized passkeys' right along with its memorized passwords (and support some kind of export and import along with it). Should people use them? Perhaps not. But it would put that choice firmly in the hands of the people using Firefox, who could decide on how much security they did or didn't want, not in the hands of websites who want to force everyone to face a real risk of losing their account so that the website can conduct security theater.

(Firefox will never support passkeys this way for an assortment of reasons. At most it may someday directly use passkeys through whatever operating system services expose them, and maybe Linux will get a generic service that works the way I want it to. Nor is Firefox ever going to support 'memorized TOTP codes'.)

We need to start doing web blocking for non-technical reasons

By: cks

My sense is that for a long time, technical people (system administrators, programmers, and so on) have seen the web as something that should be open by default and by extension, a place where we should only block things for 'technical' reasons. Common technical reasons are a harmful volume of requests or clear evidence of malign intentions, such as probing for known vulnerabilities. Otherwise, if it wasn't harming your website and wasn't showing any intention to do so, you should let it pass. I've come to think that in the modern web this is a mistake, and we need to be willing to use blocking and other measures for 'non-technical' reasons.

The core problem is that the modern web seems to be fragile and is kept going in large part by a social consensus, not technical things such as capable software and powerful servers. However, if we only react to technical problems, there's very little that preserves and reinforces this social consensus, as we're busy seeing. With little to no consequences for violating the social consensus, bad actors are incentivized to skate right up to and even over the line of causing technical problems. When we react by taking only narrow technical measures, we tacitly reward the bad actors for their actions; they can always find another technical way. They have no incentive to be nice or to even vaguely respect the social consensus, because we don't punish them for it.

So I've come to feel that if something like the current web is to be preserved, we need to take action not merely when technical problems arise but also when the social consensus is violated. We need to start blocking things for what I called editorial reasons. When software or people do things that merely show bad manners and don't yet cause us technical problems, we should still block them, either soft (temporarily, perhaps with HTTP 429 Too Many Requests) or hard (permanently). We need to take action to create the web that we want to see, or we aren't going to get it or keep it.

To put it another way, if we want to see good, well behaved browsers, feed readers, URL fetchers, crawlers, and so on, we have to create disincentives for ones that are merely bad (as opposed to actively damaging). In its own way, this is another example of the refutation of Postel's Law. If we accept random crap to be friendly, we get random crap (and the quality level will probably trend down over time).

To answer one potential criticism, it's true that in some sense, blocking and so on for social reasons is not good and is in some theoretical sense arguably harmful for the overall web ecology. On the other hand, the current unchecked situation itself is also deeply harmful for the overall web ecology and it's only going to get worse if we do nothing, with more and more things effectively driven off the open web. We only get to pick the poison here.

A Firefox issue and perhaps how handling scaling is hard

By: cks

Over on the Fediverse I shared a fun Firefox issue I've just run into:

Today's fun Firefox bug: if I move my (Nightly) Firefox window left and right across my X display, the text inside the window reflows to change its line wrapping back and forth. I have a HiDPI display with non-integer scaling and some other settings, so I'm assuming that Firefox is now suffering from rounding issues where the exact horizontal pixel position changes its idea of the CSS window width, triggering text reflows as it jumps back and forth by a CSS pixel.

(I've managed to reproduce this in a standard Nightly, although so far only with some of my settings.)

Close inspection says that this isn't quite what's happening, and the underlying problem is happening more often than I thought. What is actually happening is that as I move my Firefox window left and right, a thin vertical black line usually appears and disappears at the right edge of the window (past a scrollbar if there is one). Since I can see it on my HiDPI display, I suspect that this vertical line is at least two screen pixels wide. Under the right circumstances of window width, text size, and specific text content, this vertical black bar takes enough width away from the rest of the window to cause Firefox to re-flow and re-wrap text, creating easily visible changes as the window moves.

A variation of this happens when the vertical black bar isn't drawn but things on the right side of the toolbar and the URL bar area will shift left and right slightly as the window is moved horizontally. If the window is showing a scrollbar, the position of the scroll target in the scrollbar will move left and right, with the right side getting ever so slightly wider or returning back to being symmetrical. It's easiest to see this if I move the window sideways slowly, which is of course not something I do often (usually I move windows rapidly).

(This may be related to how X has a notion of sizing windows in non-pixel units if the window asks for it. Firefox in my configuration definitely asks for this; it asserts that it wants to be resized in units of 2 (display) pixels both horizontally and vertically. However, I can look at the state of a Firefox window in X and see that the window size in pixels doesn't change between the black bar appearing and disappearing.)

All of this is visible partly because under X and my window manager, windows can redisplay themselves even during an active move operation. If the window contents froze while I dragged windows around, I probably wouldn't have noticed this for some time. Text reflowing as I moved a Firefox window sideways created a quite attention-getting shimmer.

It's probably relevant that I need unusual HiDPI settings and I've also set Firefox's layout.css.devPixelsPerPx to 1.7 in about:config. That was part of why I initially assumed this was a scaling and rounding issue, and why I still suspect that area of Firefox a bit.

(I haven't filed this as a Firefox bug yet, partly because I just narrowed down what was happening in the process of writing this entry.)

Apache .htaccess files are important because they enable delegation

By: cks

Apache's .htaccess files have a generally bad reputation. For example, lots of people will tell you that they can cause performance problems and you should move everything from .htaccess files into your main Apache configuration, using various pieces of Apache syntax to restrict what configuration directives apply to. The result can even be clearer, since various things can be confusing in .htaccess files (eg rewrites and redirects). Despite all of this, .htaccess files are important and valuable because of one property, which is that they enable delegation of parts of your server configuration to other people.

The Apache .htaccess documentation even spells this out in reverse, in When (not) to use .htaccess files:

In general, you should only use .htaccess files when you don't have access to the main server configuration file. [...]

If you operate the server and would be writing the .htaccess file, you can put the contents of the .htaccess in the main server configuration and make your life easier and Apache faster (and you probably should). But if the web server and its configuration aren't managed as a unitary whole by one group, then .htaccess files allow the people managing the overall Apache configuration to safely delegate things to other people on a per-directory basis, using Unix ownership. This can both enable people to do additional things and reduce the amount of work the central people have to do, letting things scale better.

(The other thing that .htaccess files allow is dynamic updates without having to restart or reload the whole server. In some contexts this can be useful or important, for example if the updates are automatically generated at unpredictable times.)

I don't think it's an accident that .htaccess files emerged in Apache, because one common environment Apache was initially used in was old fashioned multi-user Unix web servers where, for example, every person with a login on the web server might have their own UserDir directory hierarchy. Hence features like suEXEC, so you could let people run CGIs without those CGIs having to run as the web user (a dangerous thing), and also hence the attraction of .htaccess files. If you have a bunch of (graduate) students with their own web areas, you definitely don't want to let all of them edit your departmental web server's overall configuration.

(Apache doesn't solve all your problems here, at least not in a simple configuration; you're still left with the multiuser PHP problem. Our solution to this problem is somewhat brute force.)

These environments are uncommon today but they're not extinct, at least at universities like mine, and .htaccess files (and Apache's general flexibility) remain valuable to us.

Syndication feed fetchers, HTTP redirects, and conditional GET

By: cks

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
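As a rough illustration of "per-URL, not per-feed", here is a minimal sketch of a validator store keyed by the exact URL that produced the values. The class and method names (`ValidatorStore`, `remember`, `conditional_headers`) and the example URLs are hypothetical, not taken from any real feed reader.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Validators:
    """Cache validators observed for one specific URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


class ValidatorStore:
    """Hypothetical store that keys validators by URL, never by feed."""

    def __init__(self) -> None:
        self._by_url: Dict[str, Validators] = {}

    def remember(self, url: str, etag: Optional[str],
                 last_modified: Optional[str]) -> None:
        # Record what the server returned for exactly this URL.
        self._by_url[url] = Validators(etag, last_modified)

    def conditional_headers(self, url: str) -> Dict[str, str]:
        # A URL we have never fetched (such as a new redirect target)
        # gets no conditional headers, so the GET is unconditional.
        saved = self._by_url.get(url)
        headers: Dict[str, str] = {}
        if saved is None:
            return headers
        if saved.etag:
            headers["If-None-Match"] = saved.etag
        if saved.last_modified:
            headers["If-Modified-Since"] = saved.last_modified
        return headers


store = ValidatorStore()
store.remember("https://example.org/feed", '"abc"',
               "Mon, 01 Jan 2024 00:00:00 GMT")
known = store.conditional_headers("https://example.org/feed")
unknown = store.conditional_headers("https://example.org/static-copy")
```

Here `known` carries both conditional headers for URL A, while `unknown` is empty because nothing was ever saved for the redirect target.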

Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

  • You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.

  • You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.

  • You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed but the target of the HTTP redirection may be 'more recent' than it for various reasons (even if it's a valid feed; if it's not a valid feed then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

By: cks

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. This is very much an issue if you make the mistake of doing the same thing with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document, in a way that ETag values can often be.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)
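The failure mode can be shown with a toy server that decides purely on an exact timestamp match. This is only a sketch: the paths, bodies, and the `respond` function are made up for illustration, and the shared timestamp stands in for a common element that was last changed at that moment.

```python
# Two different pages whose Last-Modified happens to be identical because
# a shared element (assumed here) was last changed at that moment.
SHARED = "Tue, 02 Jan 2024 10:00:00 GMT"
PAGES = {
    "/blog/a": {"body": "entry A", "last_modified": SHARED},
    "/blog/b": {"body": "entry B", "last_modified": SHARED},
}


def respond(path, if_modified_since=None):
    """Return (status, body), comparing timestamps and nothing else."""
    page = PAGES[path]
    if if_modified_since == page["last_modified"]:
        return 304, ""
    return 200, page["body"]


# Fetch URL A, keep its Last-Modified, then wrongly reuse it for URL B.
status_a, body_a = respond("/blog/a")
saved = PAGES["/blog/a"]["last_modified"]
status_b, body_b = respond("/blog/b", if_modified_since=saved)
```

The second request gets a 304 even though the client has never seen the contents of `/blog/b`.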

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that are URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, potentially unless the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, through repeating it while following a HTTP redirection). It's still technically wrong, though, and it may cause problems someday.
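The difference between the two styles of ETag can be sketched in a few lines. The choice of SHA-256, the truncation, and the function names are assumptions for illustration; the entry doesn't say which hash or format DWiki actually uses.

```python
import hashlib


def content_etag(body: bytes) -> str:
    # ETag derived from the response body: different contents give
    # different ETags regardless of which URL served them.
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def revision_etag(revision: str) -> str:
    # ETag derived from a repository-wide revision: every page served
    # from that revision shares the same value.
    return '"' + revision + '"'


page_a = b"<p>entry A</p>"
page_b = b"<p>entry B</p>"
hashed_differ = content_etag(page_a) != content_etag(page_b)
revision_same = revision_etag("rev-1234") == revision_etag("rev-1234")
```

With content hashes, an If-None-Match carried over to the wrong URL only matches if the contents really are identical; with a shared revision, it matches every page.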

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

If-None-Match values must come from the actual URL itself

By: cks

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.

This probably clashes with many low level HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, APIs that hide redirects this way are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.
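One way to picture the redirect handling is a fetch loop that follows redirects by hand and rebuilds If-None-Match for each URL it visits. This is a sketch under assumptions: `send` is a hypothetical callable standing in for an HTTP client with automatic redirects turned off, `Response` is a made-up container, and `Location` values are assumed to be absolute URLs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


# Hypothetical transport: takes a URL and request headers and returns a
# Response without following redirects itself.
Send = Callable[[str, Dict[str, str]], Response]


def fetch_feed(url: str, etags: Dict[str, str], send: Send,
               max_redirects: int = 5) -> Tuple[str, Response]:
    """Follow redirects manually, using only the ETag saved for each URL."""
    for _ in range(max_redirects + 1):
        headers: Dict[str, str] = {}
        if url in etags:
            # Conditional only when we hold an ETag for this exact URL.
            headers["If-None-Match"] = etags[url]
        resp = send(url, headers)
        if resp.status in (301, 302, 303, 307, 308) and "Location" in resp.headers:
            # Start over for the target; nothing is copied across.
            url = resp.headers["Location"]
            continue
        if resp.status == 200 and "ETag" in resp.headers:
            # A real fetcher should first check that the body is a valid
            # feed before saving anything; that check is omitted here.
            etags[url] = resp.headers["ETag"]
        return url, resp
    raise RuntimeError("too many redirects")
```

The function returns the final URL along with the response, so the caller can also notice permanent redirects and consider updating its stored feed URL.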

A hack: outsourcing web browser/client checking to another web server

By: cks

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of a service is geoblocking that's maintained by someone else who's paid to be on top of it and be accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.
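To give a feel for how little has to travel between the two websites, here is a minimal sketch of a signed, expiring token that carries the client IP the verifier saw. Everything here is an assumption for illustration (the token layout, the field names `ip` and `exp`, the shared secret, the function names); it is not a vetted design, and a well-proven option such as OIDC remains the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def make_token(secret: bytes, client_ip: str, ttl: int = 300,
               now: Optional[float] = None) -> str:
    """Verifier side: sign the client IP it saw plus an expiry time."""
    issued = time.time() if now is None else now
    payload = json.dumps({"ip": client_ip, "exp": int(issued) + ttl},
                         sort_keys=True).encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def check_token(secret: bytes, token: str, client_ip: str,
                now: Optional[float] = None) -> bool:
    """Main website side: accept only fresh tokens for the same client IP."""
    current = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:
        # Wrong number of parts or invalid base64.
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return claims["ip"] == client_ip and claims["exp"] > current
```

The main website would redirect unverified clients to the verifier, which hands the token back (for example in a redirect URL); both sides only need to share the secret.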

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

Why Firefox's media autoplay settings are complicated and imperfect

By: cks

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling playing media simple and blocking autoplay reliable; you'd ignore the autoplay attribute and the person using the browser would directly trigger playing media by interacting with things that the browser directly controlled and so the browser could know for sure that a person had directly clicked on them and the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know something of how you interacted with a web page, such as whether you clicked or scrolled or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

Since Firefox only blocking the actual autoplay attribute and allowing JavaScript to trigger media playing any time it wants to would be a pretty obviously bad 'Block Autoplay' experience, it must try harder. Firefox's approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' is drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript event handler. However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

HTTP headers that tell syndication feed fetchers how soon to come back

By: cks

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can instead set an Expires header, which gives an absolute expiry time, but you shouldn't set both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).
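The relationship between the enforced limit and the advertised interval can be written down directly. This is a small sketch; the function name and the idea of expressing the gap as a separate "margin" parameter are assumptions for illustration.

```python
def feed_cache_headers(rate_limit_seconds: int, margin_seconds: int) -> dict:
    """Advertise a max-age somewhat larger than the enforced rate limit."""
    max_age = rate_limit_seconds + margin_seconds
    return {"Cache-Control": f"max-age={max_age}"}


# A 45 minute enforced limit with a 15 minute margin advertises one hour.
headers = feed_cache_headers(45 * 60, 15 * 60)
```

A feed reader that comes back a few minutes before the advertised hour is up still lands comfortably past the 45 minute limit.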

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.
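The two ways of picking a Retry-After value from the example above can be sketched as follows. The function names are made up, and the "precise" variant assumes a rate limit that counts from the last successful fetch.

```python
def retry_after_constant(limit_seconds: int) -> str:
    # The simple approach: always advertise the full rate limit period.
    return str(limit_seconds)


def retry_after_precise(limit_seconds: int, seconds_since_success: int) -> str:
    # The alternative: advertise only the time that actually remains.
    return str(max(0, limit_seconds - seconds_since_success))


# Half hour limit, feed fetcher polls again after 20 minutes.
simple = retry_after_constant(30 * 60)
exact = retry_after_precise(30 * 60, 20 * 60)
```

Here `simple` is '1800' while `exact` is '600', the "just over ten minutes" after which a retry could already succeed.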

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or if only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) or succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.
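The difference between the two policies is easy to see in a small simulation. This is a sketch with times in minutes; the `simulate` function is hypothetical and not how any particular rate limiter is implemented.

```python
from typing import List, Optional


def simulate(fetch_minutes: List[int], limit: int, strict: bool) -> List[bool]:
    """Return, for each fetch time, whether the fetch is allowed."""
    results = []
    reference: Optional[int] = None  # the time the limit counts from
    for t in fetch_minutes:
        allowed = reference is None or t - reference >= limit
        results.append(allowed)
        # Strict limits count from every attempt; otherwise only
        # successful fetches move the reference point.
        if allowed or strict:
            reference = t
    return results


lenient = simulate([0, 20, 33], limit=30, strict=False)
strict = simulate([0, 20, 33], limit=30, strict=True)
# A reader that stubbornly polls every 20 minutes against a strict limit.
stubborn = simulate([0, 20, 40, 60, 80], limit=30, strict=True)
```

With the T, T+20, T+33 pattern, `lenient` is `[True, False, True]` and `strict` is `[True, False, False]`. The `stubborn` reader gets its first fetch and then never succeeds again, which is the behavior you probably don't want.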

Mapping from total requests per day to average request rates

By: cks

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.
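The arithmetic is just the number of seconds in a day divided by the request count. A minimal sketch (the function name is made up):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def average_interval_seconds(requests_per_day: int) -> float:
    """Average gap between requests if they were evenly spread over a day."""
    return SECONDS_PER_DAY / requests_per_day


seconds = average_interval_seconds(557)
minutes = seconds / 60
```

For 557 requests this comes out at roughly 155 seconds, or a bit under 2.6 minutes between requests.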

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests   Theoretical interval (rate)
6                Four hours
12               Two hours
24               One hour
32               45 minutes
48               30 minutes
96               15 minutes
144              10 minutes
288              5 minutes
360              4 minutes
480              3 minutes
720              2 minutes
1440             One minute
2880             30 seconds
5760             15 seconds
8640             10 seconds
17280            5 seconds
43200            2 seconds
86400            One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).
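For request counts that aren't in the table, the arithmetic is just the number of seconds in a day divided by the number of requests. A minimal sketch, using a made-up function name and the request counts mentioned in this entry:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_interval_seconds(requests_per_day: int) -> float:
    """Average gap between requests if they were spread evenly over 24 hours."""
    return SECONDS_PER_DAY / requests_per_day


for count in (557, 10_570, 20_240):
    interval = average_interval_seconds(count)
    print(f"{count:>6} requests/day -> one every {interval:.1f} seconds")
```

For 557 requests this comes out to roughly 155 seconds, which is the "about two and a half minutes" figure.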

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

I wish Firefox had some way to permanently mute a website

By: cks

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.

Why Wandering Thoughts has fewer comment syndication feeds than yesterday

By: cks

Over on the Fediverse I said:

My techblog used to offer Atom syndication feeds for the comments on individual entries. I just turned that off because it turns out to be a bad idea on the modern web when you have many years of entries. There are (were) any number of 'people' (feed things) that added the comment feeds for various entries years ago and then never took them out, despite those entries being years old and in some cases never having gotten comments in the first place.

DWiki, the engine behind Wandering Thoughts, is nothing if not general. Syndication feeds, for example, are a type of 'view' over a directory hierarchy, and are available for both pages and comments. A regular (page) syndication feed view can only be done over (on) a directory, because if it was applied to an individual page the feed would only ever contain that page. However, when I wrote DWiki it was obvious that a comment syndication feed for a particular page made sense; it would give you all of the comments 'under' that page (ie, on it). And so for almost all of the time that Wandering Thoughts has been in operation, you could have looked down to the bottom of an entry's page (on the web) and seen in small type 'Atom Syndication: Recent Comments' (with the 'recent comments' being a HTML link giving you the URL of that page's comment feed).

(The comment syndication feed for a directory is all comments on all pages underneath the directory.)

That's gone now, because I decided that it didn't make sense in what Wandering Thoughts has become and because I was slowly accumulating feed readers that were pulling the comment syndication feeds for more and more entries. This is exactly the behavior I should have expected from feed readers from the start; once someone puts a feed in, that feed is normally forever even if it's extremely inactive or has never had an entry. The feed reader will dutifully poll every feed for years to come (well, certainly every feed that responds with HTTP success and a valid syndication feed, which all of my comment feeds did).

(There weren't very many pages having their comment syndication feeds hit, but there were enough that I kept noticing them, especially when I added things like hacky rate limiting for feed fetching. I actually put in some extra hacks to deal with how requests for these feeds interacted with my rate limiting.)

There are undoubtedly places on the Internet where discussion (in the form of comments) continues on for years on certain pages, and so a comment feed for an individual page could make sense; you really might keep up (in your feed reader) with a slow moving conversation that lasts years. Other places on the Internet put definite cut-offs on further discussion (comments) on individual pages, which provides a natural deadline to turn off the page's comment syndication feed. But neither of those profiles describes Wandering Thoughts, where my entries remain open for comments more or less forever (and sometimes people do comment on quite old entries), but comments and discussions don't tend to go on for very long.

Of course, the other thing that this change does is stop (LLM) web crawlers from trying to crawl all of those URLs for comment syndication feeds. You can't crawl URLs that aren't advertised any more and no longer exist (well, sort of, they technically exist but the code for handling them arranges to return 404s if the new 'no comment feeds for actual pages' configuration option is turned on).

Websites and web developers mostly don't care about client-side problems

By: cks

In response to my entry on the fragility of the web in the face of the crawler plague, Jukka said in a comment:

While I understand the server-side frustrations, I think the corresponding client-side frustrations have largely been lacking from the debates around the Web.

For instance, CloudFlare now imposes heavy-handed checks that take a few seconds to complete. [...]

This is absolutely true but it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (including on desktops). Just recently (as of August 2025), GitHub put out a major update that many people are finding immensely slow even on developer desktops. If we can't get web developers to care about common or majority experiences for their UI, which in some sense has relatively little on the line, the odds of web site operators caring when their servers are actually experiencing problems (or at least annoyances) are basically nil.

Much like browsers have most of the power in various relationships with, for example, TLS certificate authorities, websites have most of the power in their relationship to clients (ie, us). If people don't like what a website is doing, their only option is generally a boycott. Based on the available evidence, any boycotts over things like CAPTCHA challenges have been ineffective so far. GitHub can afford to give people a UI with terrible performance because the switching costs are sufficiently high that they know most people won't leave.

(Another view is that the server side mostly doesn't notice or know that they're losing people; the lost people are usually invisible, with websites only having much visibility into the people who stick around. I suspect that relatively few websites do serious measurement of how many people bounce off or stop using them.)

Thus, in my view, it's not so much that client-side frustrations have been 'lacking' from debates around the web, which makes it sound like client side people haven't been speaking up, as that they've been actively ignored because, roughly speaking, no one on the server side cares about client-side frustrations. Maybe they vaguely sympathize, but they care a lot more about other things. And it's the web server side who decides how things operate.

(The fragility exposed by LLM crawler behavior demonstrates that clients matter in one sense, but it's not a sense that encourages website operators to cooperate or listen. Rather the reverse.)

I'm in no position to throw stones here, since I'm actively making editorial decisions that I know will probably hurt some real clients. Wandering Thoughts has never been hammered by crawler load the way some sites have been; I merely decided that I was irritated enough by the crawlers that I was willing to throw a certain amount of baby out with the bathwater.

The current (2025) crawler plague and the fragility of the web

By: cks

These days, more and more people are putting more and more obstacles in the way of the plague of crawlers (many of them apparently doing it for LLM 'AI' purposes), me included. Some of these obstacles involve attempting to fingerprint unusual aspects of crawler requests, such as using old browser User-Agents or refusing to accept compressed things in an attempt to avoid gzip bombs; other obstacles may involve forcing visitors to run JavaScript, using CAPTCHAs, or relying on companies like Cloudflare to block bots with various techniques.

On the one hand, I sort of agree that these 'bot' (crawler) defenses are harmful to the overall ecology of the web. On the other hand, people are going to do whatever works for them for now, and none of the current alternatives are particularly good. There's a future where much of the web simply isn't publicly available any more, at least not to anonymous people.

One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.

(This isn't new fragility; the fragility was always there.)

Unfortunately, I don't see a technical way out from this (and I'm not sure I see any realistic way in general). There's no magic wand that we can wave to make all of the existing websites, web apps, and so on not get impaired by LLM crawlers when the crawlers persist in visiting everything despite being told not to, and on top of that we're not going to make bandwidth free. Instead I think we're looking at a future where the web ossifies for and against some things, and more and more people see catgirls.

(I feel only slightly sad about my small part in ossifying some bits of the web stack. Another part of me feels that a lot of web client software has gotten away with being at best rather careless for far too long, and now the consequences are coming home to roost.)

How not to check or poll URLs, as illustrated by Fediverse software

By: cks

Over on the Fediverse, I said some things:

[on April 27th:]
A bit of me would like to know why the Akkoma Fediverse software is insistently polling the same URL with HEAD then GET requests at five minute intervals for days on end. But I will probably be frustrated if I turn over that rock and applying HTTP blocks to individual offenders is easier.

(I haven't yet blocked Akkoma in general, but that may change.)

[the other day:]
My patience with the Akkoma Fediverse server software ran out so now all attempts by an Akkoma instance to pull things from my techblog will fail (with a HTTP redirect to a static page that explains that Akkoma mis-behaves by repeatedly fetching URLs with HEAD+GET every few minutes). Better luck in some future version, maybe, although I doubt the authors of Akkoma care about this.

(The HEAD and GET requests are literally back to back, with no delay between them that I've ever observed.)

Akkoma is derived from Pleroma and I've unsurprisingly seen Pleroma also do the HEAD then GET thing, but so far I haven't seen any Pleroma server showing up with the kind of speed and frequency that (some) Akkoma servers do.

These repeated HEADs and GETs are for Wandering Thoughts entries that haven't changed. DWiki is carefully written to supply valid HTTP Last-Modified and ETag, and these values are supplied in replies to both HEAD and GET requests. Despite all of this, Akkoma is not doing conditional GETs and is not using the information from the HEAD to avoid doing a GET if neither header has changed its value from the last time. Since Akkoma is apparently completely ignoring the result of its HEAD request, it might as well not make the HEAD request in the first place.

If you're going to repeatedly poll a URL, especially every five or ten minutes, and you want me to accept your software, you must do conditional GETs. I won't like you and may still arrange to give you HTTP 429s for polling so fast, but I most likely won't block you outright. Polling every five or ten minutes without conditional GET is completely unacceptable, at least to me (other people probably don't notice or care).
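For anyone writing polling software, a conditional GET is not much work. Here is a minimal sketch using Python's standard library; the URL and validator values are made up for illustration, and the function names are hypothetical. It only builds the request, on the assumption that you saved the ETag and Last-Modified values from your previous successful fetch.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server answer 304 Not Modified.

    etag and last_modified are the ETag and Last-Modified values
    saved from the previous successful response for this URL.
    """
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def needs_reprocessing(status: int) -> bool:
    """304 means the cached copy is still current; anything else needs a look."""
    return status != 304


# Hypothetical URL and validators, for illustration only.
req = build_conditional_request(
    "https://example.org/blog/some-entry",
    etag='"abc123"',
    last_modified="Sat, 26 Apr 2025 12:00:00 GMT",
)
# urllib stores header names in capitalized form, hence the odd casing here.
print(req.get_header("If-none-match"))
print(req.get_header("If-modified-since"))
```

If the server replies with 304, nothing has changed and there is no body to download, which is the entire point.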

My best guess as to why Akkoma is polling the URL at all is that it's for "link previews". If you link to something in a Fediverse post, various Fediverse software will do the common social media thing of trying to embed some information about the target of the URL into the post as it presents it to local people; for plain links with no special handling, this will often show the page title. As far as the (rapid) polling goes, I can only guess that Akkoma has decided that it is extremely extra special and it must update its link preview information very rapidly should the linked URL do something like change the page title. However, other Fediverse server implementations manage to do link previews without repeatedly polling me (much less the HEAD then immediately a GET thing).

(On the global scale of things this amount of traffic is small beans, but it's my DWiki and I get to be irritated with bad behavior if I want to, even if it's small scale bad behavior.)

Introducing the illumos Cafe: Another Cozy Corner for OS Diversity

illumos Cafe logo - a coffee cup with an illumos logo

From the BSD Cafe to illumos Cafe

The idea for this new project was born from the success of the BSD Cafe, an initiative I introduced to the world in July 2023, which received an incredibly positive response. Far more than I ever anticipated. The BSD community already had its well-established hubs: in the Fediverse, places like bsd.network, exquisite.social, and others were already thriving, not to mention all the forums, channels, and Reddit communities.

But in my vision, something was still missing: a hub of services with a positive spirit, built exclusively with open-source tools, where people could come to share, learn, and experience technology with a positive mindset. The BSD Cafe is therefore not just an instance, but a true Cafe - I’ll be speaking more about the BSD Cafe in detail at the next EuroBSDCon.

Why Another Cafe?

In a world increasingly dominated by centralized services under the control (or lack thereof) of the usual big players, it has become essential to create free, independent communities, devoid of the algorithmic and commercial controls that influence our overall experience. From day one, the BSD Cafe has embodied this spirit.

Linux is a good kernel, and there are excellent distributions based on it (some using the GNU userland, others only partially, like Alpine Linux), but it cannot and should not become a monoculture. The alternatives are extremely capable, and for many use cases - in my opinion and experience - they are even more suitable. BSD systems have served me exceptionally well for over 20 years, providing stability and security. At the same time, many other operating systems are renowned for their robustness, reliability, and the quality of their design and implementation.

Why illumos?

illumos is one of them. As the open-source descendant of OpenSolaris, it is an operating system known for its enterprise-grade stability and innovative technologies like ZFS, DTrace, and "zones". It was born from the solid foundations of Solaris and has evolved over time while remaining true to many of its core principles. I have always seen illumos and its distributions as kindred spirits to the BSDs, despite their differences. The philosophy is one of evolution without revolution, of guaranteeing long-term continuity and reliability rather than chasing the latest hype. This is precisely why, for some time now (and thanks in part to the inspiring posts by Joel Carnat, which further sparked my curiosity), I have been running OmniOS and SmartOS alongside my BSD-based setups for certain workloads.

However, there is very little information online about services running on them. So, a few months ago, I began to consider a new project: the illumos Cafe.

The illumos Cafe Project

The illumos Cafe is a project similar to the BSD Cafe (though perhaps less complex, at least initially). It shares the same spirit of positivity and inclusivity and aims to provide services running on illumos-based operating systems to demonstrate that there are no reasons not to use them. Just like with the BSD Cafe, diversifying the operating systems we use - even while using the same platforms - is fundamental to improving the reliability and resilience of the Internet. The Internet was born as a decentralized network, but for most people, it has sadly become just a tool to access the services of big players.

Community and Philosophy

But we want to connect. We want relationships with people, between people. We don't want algorithms. We don't want our data to be monetized by "us and our 65535 partners". We want a network that serves us, an OS that serves us - not an OS that just serves as a vehicle to store our data in "someone else's house". The illumos Cafe, therefore, aims to be a home for anyone interested in developing or using illumos-based operating systems, or who is simply curious about them.

Technical Setup

As with the BSD Cafe, the entire setup will be documented. For now, it is very simple: there is a VM (running on FreeBSD and bhyve, on hardware I manage) where I have installed SmartOS. The physical host also runs the reverse proxy (in a jail). Inside the SmartOS VM, there are a series of zones:

  • Zone 1: nginx (Web Server) - Currently serving the project's homepage.

  • Zone 2: Mastodon (Social) - Hosting the Mastodon instance and its dependencies at https://mastodon.illumos.cafe.

  • Zone 3: PostgreSQL (Database) - The Mastodon database, on a dedicated zone.

  • Zone 4: Redis (Cache) - The Mastodon cache, on a dedicated zone.

  • Zone 5: snac (LX Zone) - Currently in an LX zone (Alpine) as I ran into some issues getting it to work in a native zone. It will be moved to a native zone as soon as I resolve them. It's serving the snac instance at https://snac.illumos.cafe

Media files are stored on an external physical server (running FreeBSD, the same one as the BSD Cafe, but in a dedicated jail) with SeaweedFS. I was able to compile and run SeaweedFS on illumos without any problems, but at the moment, I don't have a host with enough storage space for the media.

Available Services

More services will arrive over time. For now, two gateways to the Fediverse are already available:

  • Mastodon - https://mastodon.illumos.cafe

  • snac - https://snac.illumos.cafe

Both instances share the same rules as the BSD Cafe. Positivity. Supporters, not haters. I want them to be places of enjoyment, not venting. Of friendship, not hate.

Registrations and Logo

Registrations for the Mastodon instance are now open, and the available themes are the default ones plus the colorful TangerineUI - whose orange hue echoes the illumos logo.

The project's logo was not generated by an AI. I made it myself by hastily sticking the illumos SVG onto a coffee cup. Basic, perhaps. But authentic.

Looking Ahead

The BSD Cafe will, of course, remain my primary home. But I want to bring illumos into the Fediverse and provide a home for anyone who wishes to share their interest in this excellent OS.

I will document the entire process, just as I did with Mastodon on FreeBSD, as it is a bit more intricate. Because in my dreams, I see Fediverse statistics showing instances spread fairly evenly across the major open-source operating systems. Because relying on a single OS, even if it's open-source, and ceasing to support the others is also a single point of failure.

Typepad Is Shutting Down Next Month

By: Nick Heer

Typepad:

After September 30, 2025, access to Typepad – including account management, blogs, and all associated content – will no longer be available. Your account and all related services will be permanently deactivated.

I have not thought about Typepad in years, and I am certain I am not alone. That is not a condemnation; Typepad occupies a particular time and place on the web. As with anything hosted, however, users are unfortunately dependent on someone else’s interest in maintaining it.

If you have anything hosted at Typepad, now is a good time to back it up.

Interview With MacSurfer’s New Owner, Ken Turner

By: Nick Heer

Nice scoop from Eric Schwarz:

Over the past week, I’ve been working to track down the new owner of MacSurfer’s Headline News, a beloved site that shut down in 2020 and has recently had somewhat mysterious revival. Fortunately, after some digging that didn’t really lead anywhere, I received an email from its new owner, Ken Turner, and he graciously took the time to answer a few questions about the new project.

Turner sounds like a great steward to carry on the MacSurfer legacy. Even in an era of well-known aggregators like Techmeme and massive forums like Hacker News and Reddit, I think there is still a role for a smaller and more focused media tracking site.

I am uncertain what the role of BackBeat Media is in all this. I have not heard from Dave Hamilton or anyone there to confirm if they even have a role.

MacSurfer Returns

By: Nick Heer

Five years ago, Apple and tech news aggregator MacSurfer announced it was shutting down. The site was still accessible albeit in a stopped-time state, and it seemed that is how it would sit until the server died.

In June, though, MacSurfer was relaunched. The design has been updated and it is no longer as technically simple as it once was but, charmingly, the logo appears to be the exact same static GIF as always. I cannot find any official announcement of its return.

Eric Schwarz:

It looks like Macsurfer is coming back, but I can’t find any details or who’s behind it? I really hope it’s not AI slop or someone trying to make a buck off nostalgia like iLounge or TUAW.

I had the same question, so I started digging. MxToolbox reveals a TXT record on the domain for validating with Google apps, registered to BackBeat Media. BackBeat’s other properties include the Mac Observer, AppleInsider, and PowerPage. A review of historical MacSurfer TXT records using SecurityTrails indicates the site has been with BackBeat Media since at least 2011, although BackBeat’s site did not list MacSurfer even when it was actively updated.

I cannot confirm the ownership is the same yet but I have asked Dave Hamilton, of BackBeat, and will update this if I hear back.

New Article on BSD Cafe Journal: WordPress on FreeBSD with BastilleBSD

Web Text - a terminal

New Article Published

I'm excited to announce that I have published a new, in-depth article on the BSD Cafe Journal: "WordPress on FreeBSD with BastilleBSD: A Secure Alternative to Linux/Docker".

This piece explores how to create a robust and secure WordPress installation on FreeBSD using BastilleBSD, leveraging the power and isolation of FreeBSD jails as a compelling alternative to the more common Linux and Docker stack.

Future Technical Content

I'm excited to announce that I'm expanding my writing to a new platform! From now on, some of my more technical, long-form articles and tutorials will be published on The BSD Cafe Journal, a fantastic hub for BSD-related content that I'm happy to now contribute to.

This new collaboration complements the work I do here. My personal blog will continue to be my home base, and you won't miss a thing! I'll still be posting my own articles and announcements right here, and I'll always include a direct link to any new content I publish elsewhere. This space will remain as active as ever.

Thank you for reading

A logic to Apache accepting query parameters for static files

By: cks

One of my little web twitches is the lax handling of unknown query parameters. As part of this twitch I've long been a bit irritated that Apache accepts query parameters even on static files, when they definitely have no meaning at all. You could say that this is merely Apache being accepting in general, but recently I noticed a combination of Apache features that can provide an additional reason for Apache to do this.

Apache has various features to redirect from old URLs on your site to new URLs, such as Redirect and RewriteRule. As covered in the relevant documentation for each of them, these rewrites preserve query parameters (although for RewriteRule you can turn that off with the QSD flag). This behavior makes sense in a lot of cases; if you've moved an application from one URL to another (or from one host to another) and it uses query parameters, you almost certainly want the query parameters to carry over with the HTTP redirection that people using old URLs will get.

(Here by 'an application' I mean anything that accepts and acts on query parameters. It might be a CGI, a PHP page or set of pages, a reverse proxy to something else, a Django application implemented with mod_wsgi, or various other things.)

A lot of the time if you use a redirect in Apache on URLs for an application, you'll be sending people to the new location of that application or its replacement. However, some of the time you'll be redirecting from an application to a static page, for example a page that says "this application has gone away". At least by default, your redirection from the application to the static page will carry query parameters along with it, and it would be a bad experience (for the people visiting and you) if the default result was that Apache served some sort of error page because it received query parameters on a static file.

(A closely related change is replacing a single-URL application, such as a basic CGI, with a static web page. Maybe the whole thing is no longer supported, or maybe everything now has a single useful response regardless of query parameters. Here again you can legitimately receive query parameters on a static file.)
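As an illustration of what carrying the query parameters over means in practice, here is a small sketch that builds a redirect destination the way a query-preserving redirect would. This is not Apache's implementation; `redirect_target` and the URLs are made up, and the `keep_query=False` case merely stands in for what discarding the query string (as with the QSD flag) would produce.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(old_url: str, new_path: str, keep_query: bool = True) -> str:
    """Return the redirect destination, optionally carrying the query string over."""
    parts = urlsplit(old_url)
    query = parts.query if keep_query else ""
    return urlunsplit((parts.scheme, parts.netloc, new_path, query, ""))


old = "https://example.org/cgi-bin/search?q=zfs&page=2"
print(redirect_target(old, "/gone.html"))
# https://example.org/gone.html?q=zfs&page=2
print(redirect_target(old, "/gone.html", keep_query=False))
# https://example.org/gone.html
```

The first result is the sort of URL a static 'this application has gone away' page ends up being asked for.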

Realizing this made me more sympathetic to Apache's behavior of accepting query parameters on static files. It's a relatively reasonable pragmatic choice even if (like me) you're not one of the people who feel unknown query parameters should always be ignored (which is the de facto requirement on the modern web, so my feelings about it are irrelevant).

Two tools I've been using to look into my web traffic volume

By: cks

These days, there's an unusually large plague of web crawlers, many of them attributed to LLM activities and most of them acting anonymously, with forged user agents and sometimes widely distributed source IPs. Recently I've been using two tools more and more to try to identify and assess suspicious traffic sources.

The first tool is Anarcat's asncounter. Asncounter takes IP addresses, for example from your web server logs, and maps them to ASNs (roughly who owns an IP address) and to CIDR netblocks that belong to those ASNs (a single ASN can have a lot of netblocks). This gives you information like:

count   percent ASN     AS
1460    7.55    24940   HETZNER-AS, DE
[...]
count   percent prefix  ASN     AS
1095    5.66    66.249.64.0/20  15169   GOOGLE, US
[...]
85      0.44    49.13.0.0/16    24940   HETZNER-AS, DE
85      0.44    65.21.0.0/16    24940   HETZNER-AS, DE
82      0.42    138.201.0.0/16  24940   HETZNER-AS, DE
71      0.37    135.181.0.0/16  24940   HETZNER-AS, DE
68      0.35    65.108.0.0/16   24940   HETZNER-AS, DE
[...]

While Hetzner is my biggest traffic source by ASN, it's not my biggest source by 'prefix' (a CIDR netblock), because this Hetzner traffic is split up across a bunch of their networks. Since most software operates by CIDR netblocks, not by ASNs, this difference can be important (and unfortunate if you want to block all traffic from a particular ASN).

The second tool is grepcidr. Grepcidr will let you search through a log file, such as your web server logs, for traffic from any particular netblock (or a group of netblocks), such as Google's '66.249.64.0/20'. This lets me find out what sort of requests came from a potentially suspicious network block, for example 'grepcidr 49.13.0.0/16 /var/log/...'. If what I see looks suspicious and has little or no legitimate traffic, I can consider taking steps against that netblock.
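If grepcidr isn't available, the core idea is easy to approximate with Python's standard ipaddress module. The following is a rough sketch, not a replacement for grepcidr: it assumes the client IP address is the first whitespace-separated field of each log line, and the function name and sample log lines are made up.

```python
import ipaddress


def lines_from_netblocks(log_lines, cidrs):
    """Yield log lines whose first field is an IP address inside any given netblock."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if any(address in network for network in networks):
            yield line


# Made-up log lines in a common-log-like format, for illustration only.
sample_log = [
    '49.13.5.9 - - [01/Aug/2025:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512',
    '66.249.64.12 - - [01/Aug/2025:10:00:01 +0000] "GET /feed HTTP/1.1" 200 2048',
    '192.0.2.7 - - [01/Aug/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 128',
]
for hit in lines_from_netblocks(sample_log, ["49.13.0.0/16"]):
    print(hit)
```

Only the first sample line is printed, since it is the only one with a source address inside 49.13.0.0/16.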

Asncounter is probably not (yet) packaged in your Linux distribution. Grepcidr may be, but if it's not it's a C program and simple to compile.

(It wouldn't be too hard to put together an 'asngrep' that would cut out the middleman, but I've so far not attempted to do this.)

PS: Both asncounter and grepcidr can be applied to other sorts of logs with IP addresses, for example sources of SSH brute force password scans. But my web logs are all that I've used them for so far.

Doing web things with CGIs is mostly no longer a good idea

By: cks

Recently I saw Serving 200 million requests per day with a cgi-bin (via, and there's a follow-up), which talks about how fast modern CGIs can be in compiled languages like Rust and Go (Rust more so than Go, because Go has a runtime that it has to start every time a Go program is executed). I'm a long standing fan of CGIs (and Wandering Thoughts, this blog, runs as a CGI some of the time), but while I admire these articles, I think that you mostly shouldn't consider trying to actually write a CGI these days.

Where and how CGI programs shine is when they have a simple deployment and development model. You write a little program, you put the little program somewhere, and it just works (and it's not going to be particularly slow these days). The programs run only when they get used, and if you're using Apache, you can also make these little programs run as the user who owns that web area instead of the web server user.

Where CGI programs fall down today is that they're unpopular, no longer well supported in various programming environments and frameworks, and they don't integrate with various other tools because these days the tools expect to operate as HTTP (reverse) proxies in front of your HTTP service (for example, Anubis for anti-crawler protections). It's easy to write, for example, a Go HTTP based web service; you can find lots of examples of how to do it (and the pieces are part of Go's standard library). If you want to write a Go CGI, you're actually in luck because Go put that in the standard library, but you're not going to find anywhere near as many examples and of course you won't get that integration with other HTTP reverse proxy tools. Other languages are not necessarily going to be as friendly as Go (including Python, which has removed the 'cgi' standard library package in 3.13).
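To give a sense of how little a CGI needs even without the removed 'cgi' package, here is a minimal Python sketch that relies only on the CGI environment variables and `urllib.parse`. The `name` query parameter and the `build_response` function are made up for illustration; a real CGI would also need to deal with POST bodies, errors, and output encoding.

```python
import os
import sys
from urllib.parse import parse_qs

def build_response(environ):
    """Build a full CGI response (headers, blank line, body) from CGI environment variables."""
    # The web server passes the URL query string in QUERY_STRING.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    body = f"Hello, {name}!\n"
    # A CGI writes its headers, then an empty line, then the body.
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body

if __name__ == "__main__":
    sys.stdout.write(build_response(os.environ))
```

Keeping the logic in a function that takes the environment as an argument means it can be exercised without a web server, which offsets a little of the missing framework support.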

(Similarly, many modern web servers are less friendly to CGIs than Apache is and will make you assemble more pieces to run them, reducing a number of the deployment advantages of CGIs.)

Only running these 'backend' HTTP server programs when they're needed is not easy today (although it's possible with systemd), so if you have a lot of little things that you can't bundle together into one server program, CGIs may still make sense despite what is generally the extra hassle of developing and running them. But otherwise, a HTTP based service that you run behind your general purpose web server is what modern web development is steering you toward and it's almost certainly going to be the easiest path.

(There's also a lot of large scale software support for deploying things that are HTTP services, with things like load balancers and smart routing frontends and so on and so forth, never mind containers and orchestration environments. If you want to use CGIs in this environment you basically get to add in a little web server as the way the outside world invokes them.)

Quick numbers on how common HTTP/2 is on our departmental web server

By: cks

Our general purpose departmental web server has supported HTTP/2 for a while. When we added HTTP/2 support it was basically because it was there; HTTP/2 was the new and shiny thing, our Apache configuration could support it, and so it seemed like a friendly gesture to turn HTTP/2 on. Until now, I've never looked at the statistics for how many HTTP requests use HTTP/2 and how many use other HTTP versions.

Our general purpose web server supports both HTTP access and HTTPS access, unless people opt to forcefully redirect their own pages from one to the other (we have plenty of old pages with mixed content problems, so we can't do such a redirection globally). However, these days that may not be much of an issue and browsers may force HTTPS on the initial connection, which will succeed with our server. I mention all of this because unfortunately our logs don't let me see how many requests are HTTP versus HTTPS. In some environments I could assume that all HTTP/2.0 requests were HTTPS, but the standard Ubuntu Apache HTTP/2 configuration enables h2c so I believe we can do HTTP/2.0 over HTTP connections without any sign of this in our current logs.

The overall number is that about 55% of the requests are HTTP/2.0 and all but a tiny trace of the remaining 45% are HTTP/1.1. However, this isn't uniform. For instance, we've somehow become a load bearing source of commonly used ML training data, and requests for this data are about 70% HTTP/2.0. Meanwhile, a URL hierarchy that maps to our anonymous FTP area sees much less activity, probably much of it from automated crawlers, and only 21% of the requests were HTTP/2.0.
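Numbers like these can be pulled out of an access log with very little code. The following is a minimal Python sketch, assuming Apache common or combined log format where the quoted request line ends with the protocol (for example "GET / HTTP/2.0"); the function name and regular expression are my own and not taken from any particular tool.

```python
import re
from collections import Counter

# Matches the quoted request line, e.g. "GET /path HTTP/1.1", and captures the protocol.
REQUEST_RE = re.compile(r'"[A-Z]+ \S+ (HTTP/[0-9.]+)"')

def protocol_share(lines):
    """Return a dict mapping protocol version to its percentage of parseable requests."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {proto: 100.0 * n / total for proto, n in counts.items()}
```

Running the same function over only the lines for one URL hierarchy (filtered beforehand) gives per-area figures of the kind quoted above.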

If I look at the claimed User-Agents for HTTP/1.1 requests, some things jump out. A lot of requests come from 'pytorch/vision', along with 'GoogleOther', GPTBot, something claiming to be Chrome 83, PetalBot, Applebot, no User-Agent at all, Scrapy, and a whole menagerie of other crawlers. User-Agent values that are probably from authentic browsers are mostly absent, which isn't really a big surprise since I think browsers aggressively do HTTP/2.0 these days.

(A lot of those 'pytorch/vision' requests were for that commonly used ML training data, but they seem to have been dwarfed by the HTTP/2.0 requests from browsers.)

Given even this cursory log analysis, I suspect that for our web server, HTTP/1.1 requests are significantly correlated with access from non-browsers, including crawlers (both overt and covert). Again this isn't really a surprise if modern browsers are trying to use HTTP/2 as much as possible, since most people are running modern browsers (especially Chrome).

What would a multi-user web server look like? (A thought experiment)

By: cks

Every so often my thoughts turn to absurd ideas. Today's absurd idea is sparked by my silly systemd wish for moving processes between systemd units, which in turn was sparked by a local issue with Apache CGIs (and suexec). This got me thinking about what a modern 'multi-user' web server would look like, where by multi-user I mean a web server that's intended to serve content operated by many different people (such as many different people's CGIs). Today you can sort of do this for CGIs through Apache suexec, but as noted this has limits.

The obvious way to implement this would be to run a web server process for every different person's web area and then reverse proxy to the appropriate process. Since there might be a lot of people and not all of them are visited very often, you would want these web server processes to be started on demand and then shut down automatically after a period of inactivity, rather than running all of the time (on Linux you could sort of put this together with systemd socket units). These web server processes would run as appropriate Unix UIDs, not as the web server UID, and on Linux under appropriate systemd hierarchies with appropriate limits set.

(Starting web server units through systemd would also mean that your main web server process didn't have to be privileged or have a privileged helper, as Apache does with suexec. You could have the front end web server do the process starting and supervision itself, but then it would also need the privileges to change UIDs and the support for setting other per-user context information, some of which is system dependent.)

Although I'm not entirely fond of it, the simplest way to communicate between the main web server and the per-person web server would be through HTTP. Since HTTP reverse proxies are widely supported, this would also allow people to choose what program they'd use as their 'web server', rather than your default. However, you'd want to provide a default simple web server to handle static files, CGIs, and maybe PHP (which would be even simpler than my idea of a modern simple web server).

The main (or front-end) web server would still want to have a bunch of features like global rate limiting, since it's the only thing in a position to see aggregate requests across everyone's individual server. If you wanted to make life more complicated but also potentially more convenient, you could choose different protocols to handle different people's areas. One person could be handled via a HTTP reverse proxy, but another person might be handled through FastCGI because they purely use PHP and that's most convenient for them (provided that their FastCGI server could handle being started on demand and then stopping later).

While I started thinking of this in the context of personal home pages and personal CGIs, as we support on our main web server, you could also use this for having different people and groups manage different parts of your URL hierarchy, or even different virtual hosts (by making the URL hierarchy of the virtual host that was handed to someone be '(almost) everything').

With a certain amount of work you could probably build this today on Linux with systemd (Unix) socket activation, although I don't know what front-end or back-end web server you'd want to use. To me, it feels like there's a certain elegance to the 'everyone gets their own web server running under their own UID, go wild' aspect of this, rather than having to try to make one web server running as one UID do everything.
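For the per-person back-end side, the socket activation protocol is small: systemd sets LISTEN_PID and LISTEN_FDS in the environment and passes the listening sockets starting at file descriptor 3. Here is a minimal Python sketch of how a per-person server might pick those sockets up. The function names are my own, and this deliberately ignores extras such as LISTEN_FDNAMES and the idle-timeout shutdown that the thought experiment would also need.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor in the socket activation protocol

def inherited_fds(environ, pid):
    """Return the file descriptors passed by socket activation, or [] if none are meant for us."""
    # LISTEN_PID must name this process; otherwise the variables were
    # inherited from a parent and are not intended for us.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))

def inherited_sockets():
    """Wrap each inherited descriptor in a socket object (family and type are auto-detected)."""
    return [socket.socket(fileno=fd) for fd in inherited_fds(os.environ, os.getpid())]
```

A back-end built this way never binds a port itself, which fits the idea of the front end (or systemd) owning the listening sockets while each person's server runs under their own UID.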

Make Your Own Internet Presence with NetBSD and a 1 euro VPS – Part 1: Your Blog

Photo: Terminal screen with htop

Why NetBSD?

For many years, I've been using (and appreciating) NetBSD because it's stable, efficient, and reliable. The codebase has proven its reliability, running without reboots for years without issues. It supports ZFS (though differently than FreeBSD), LVM (useful for those accustomed to it on Linux), the ability to take filesystem snapshots (UFS2, making ZFS less crucial), and it's an excellent virtualization platform. Installation and updates are easy (including via sysupgrade - which I'll cover in a future article). Since it focuses on portable and optimized code (running on ancient architectures requires cleanliness and correctness), it's particularly efficient on low-power devices, like embedded systems or cheap VMs. Therefore, it's one of the best solutions for a small personal setup that can still deliver excellent results and simple management.

Indeed, the market offers very cheap VPS, often with just a single core and little RAM. But a modern single core packs power that a multi-core from just a few years ago could only dream of, and often, the I/O of these machines (a bottleneck for many services) is still decent. I personally use 1 euro per month VPS (VAT included - for those not subject to it, that's less than one euro per month!) with a public IPv4 address and (often) a /64 IPv6 block, ensuring full reachability across the entire network. I'm not providing direct links as I have no affiliations, but netcup's "piko" VPS are among the types I use most often (a 4 euro/month netcup VM handles the entire FediMeteo project), and this type of VM is ideal for our purpose because some providers (like netcup) allow you to upload your own ISO and install your preferred operating system. On VPS like these, I've installed everything - including OmniOS and SmartOS - without problems. And even such a small VPS, with an efficient operating system, can be extremely satisfying.

Why BSSG?

In this article, I'll describe how to create and publish a blog using BSSG as it exemplifies my concept of portability and minimalism. BSSG on NetBSD currently doesn't leverage parallelism provided by tools like GNU Parallel, but for small to medium-sized blogs, this won't be an issue, especially considering these small VMs only have 1 core. Obviously, you can use any Static Site Generator (SSG) (like Hugo, Nikola, 11ty, Pelican, Zola, etc.) - the important thing is to have a static site served by a simple web server.

Let's Start with the Installation

Installing NetBSD is quite straightforward and is clearly covered, complete with explanatory screenshots, in the excellent official NetBSD documentation, which I recommend using as a reference during the process, especially if it's your first time.

In my case, I accepted the proposed disk geometry and the standard automatic partitioning, but enabled the "log" and "noatime" options for the filesystem. Both options noticeably help I/O, especially with BSSG: the first enables journaling and the second avoids updating a file's access time every time it is read. BSSG is more I/O bound than CPU bound, so any optimization is beneficial.

Moving forward, I also recommend configuring the network (although installation can be done from packages on the installation ISO). For netcup, you can use DHCPv4 (even though it's a bit slow and sometimes seems to fail, the DHCP client will continue running in the background and eventually work).

For IPv6, I usually configure it manually later, so I'll describe that further down.

I also recommend enabling SSH, adding a regular user (and adding them to the wheel group so they can gain root privileges) - in this case, I'll call the user blog. Also, enable the installation of binary packages, as it will be convenient later to use pkgin to install and update all necessary packages. All these steps are described clearly and in detail in the guide, so I won't detail them here. But they are simple and logical, like all operations on BSD systems.

After installation, reboot. If everything went correctly, you should be able to log in via console or SSH using the "blog" user (or whatever you named it).

First, I suggest configuring the IPv6 address and installing the necessary packages.

For IPv6, in the case of netcup, simply add one of the assigned addresses to the interface. In NetBSD, network interface configurations are stored (similar to OpenBSD) in specific files. For the first virtio interface, the file will be /etc/ifconfig.vioif0.

You need to elevate your privileges to root, open that file with your preferred editor, and add the configuration to the file itself:

nb1euro$ su -l
nb1euro# vi /etc/ifconfig.vioif0

inet6 your-ipv6-addr/64
up

To test everything, perform a reboot and try pinging an IPv6 address (I often use ping6 google.com).

If all goes well, after a few seconds, you should see ping replies, confirming everything is configured correctly.

Regarding packages, the only two strictly necessary ones are bash and a markdown processor (by default, BSSG will use commonmark; otherwise, it can be configured to use pandoc or Markdown.pl). rsync can be useful for deployment. sudo (or doas) can be useful for elevating privileges for certain operations, at least at this stage.

nb1euro$ su -l
nb1euro# pkgin in bash cmark rsync sudo

If you're used to Linux, you can also install the "nano" editor:

nb1euro# pkgin in nano

If sudo was installed, it's now appropriate to grant users in the "wheel" group (like the regular user created during installation) the ability to elevate privileges. Edit the sudoers file (I suggest using the visudo command) and uncomment this line:

## Uncomment to allow members of group wheel to execute any command
%wheel ALL=(ALL:ALL) ALL

At this point, you can switch back to operating as the regular user, downloading and unpacking BSSG:

nb1euro$ ftp https://brew.bsd.cafe/stefano/BSSG/archive/0.15.1.tar.gz
nb1euro$ tar zxfv 0.15.1.tar.gz

Now that BSSG is ready, just initialize a directory with the structure for the new site:

nb1euro$ cd bssg
nb1euro$ ./bssg.sh init /home/blog/myblog

Everything is set to start generating your blog. I recommend reading BSSG's README.md. There are many options, themes, etc., but to get started, you just need to set the site's public URL. For example, if the site will be published as myblog.example.com - just create a file at /home/blog/myblog/config.sh.local (the path defined by the init command) and set the public URL:

SITE_URL="https://myblog.example.com"

This way, all URLs will be absolute URLs, which is necessary to ensure the correct functioning of RSS feeds, sitemaps, etc. This setting assumes HTTPS - if you just want to test the site over HTTP, simply use http and then, optionally, change it to https and regenerate the site later.

You can already create your first test post, directly from the BSSG directory:

nb1euro$ ./bssg.sh post

The system will use nano if it's installed, otherwise it will use vi. Don't worry, in the latter case, BSSG will write the procedure for exiting vi as the post's text 🙂

Once you save the post, BSSG will automatically generate the site. If everything went well, the /home/blog/myblog/output directory will contain the final result. We are therefore ready for the first deployment, which can be done in many different ways. I will cover three:

  • Using bozohttpd, present by default in NetBSD's base system. It can be used via inetd (launching an httpd process for each connection) or as a daemon. I'll describe the first option, showing in the final benchmarks how, even when used as a daemon, it remains a less performant solution.

  • Using nginx

  • Using Caddy

First, it's advisable to obtain a certificate to configure and use HTTPS. If you only want to test using HTTP, this part can be safely skipped. For the first two options (bozohttpd and nginx), I'll use certbot, which is well-known to many users with Linux experience. Caddy, on the other hand, manages certificates automatically, so there's no need to install certbot.

nb1euro$ sudo pkgin in py313-certbot

To use bozohttpd, no further installation is necessary. At this point, the options diverge.

Using NetBSD's Integrated httpd

bozohttpd is integrated into NetBSD and, by default, can be launched directly via inetd. This solution, while not extremely efficient or scalable, is simple and requires few resources. It's fine if you expect only a few visits per day, but when used via inetd, the initial latency for each connection is tangible. It can still be useful for some tests or small deployments.

The /etc/inetd.conf file already contains the options to handle this situation:

#http           stream  tcp     nowait:600      _httpd  /usr/libexec/httpd      httpd /var/www
#http           stream  tcp6    nowait:600      _httpd  /usr/libexec/httpd      httpd /var/www

By uncommenting these two lines and restarting inetd (service inetd restart), the server will start responding to HTTP requests on both IPv4 and IPv6.

If you want to add HTTPS support, no problem. Just request a certificate via certbot and specify the webroot.

Run:

nb1euro$ sudo certbot-3.13 certonly

Choose option 2 - the one where you specify the webroot - enter the domain, and when prompted, provide /var/www/ as the webroot.

The certificate will be created. Then, modify the /etc/inetd.conf file to also include support for HTTPS, adding two lines similar to these (obviously, change the certificate paths):

https            stream  tcp     nowait:600      _httpd  /usr/libexec/httpd      httpd -Z /usr/pkg/etc/letsencrypt/live/myblog.example.com/fullchain.pem /usr/pkg/etc/letsencrypt/live/myblog.example.com/privkey.pem /var/www
https            stream  tcp6    nowait:600      _httpd  /usr/libexec/httpd      httpd -Z /usr/pkg/etc/letsencrypt/live/myblog.example.com/fullchain.pem /usr/pkg/etc/letsencrypt/live/myblog.example.com/privkey.pem /var/www

Warning: httpd will run with the permissions of the _httpd user, so make sure all certificates are readable by that user:

nb1euro# chown -R _httpd /usr/pkg/etc/letsencrypt/

Restart inetd, and the server will also respond over HTTPS.

To make your blog public, simply copy the files from the site's output directory to /var/www/ - this time using sudo to bypass permission issues:

nb1euro$ sudo rsync -avhHPx /home/blog/myblog/output/ /var/www/

The site will be immediately visible.

Using nginx

Nginx is fast and efficient, and the performance difference is noticeable (some benchmarks follow below). For an efficient setup ready for a high number of visits, it's advisable to use a web server designed for that purpose, such as nginx.

First, install nginx and the certbot plugin for nginx. This will simplify the installation and renewal of certificates:

nb1euro$ sudo pkgin in py313-certbot-nginx nginx

Copy the startup script to /etc/rc.d - as indicated by the post-installation message. In NetBSD, this operation must be done manually, but it's always pointed out:

nb1euro$ sudo cp /usr/pkg/share/examples/rc.d/nginx /etc/rc.d

Warning: If you previously used httpd from inetd following the previous solution, you must disable it in inetd.conf and restart inetd to free up ports 80 and 443.

Now you can create a virtual host for our new site.

nb1euro$ sudo vi /usr/pkg/etc/nginx/nginx.conf

and add, at the end of the file and before the final closing curly brace:

server {
        listen 80;
        # If you also have configured IPv6 support
        listen [::]:80;

        root /var/www;
        index index.html index.htm;

        server_name myblog.example.com;

        # If you want a long cache for media and css - be careful, this means that if you change to a new theme, it might not be visible immediately as the browser might still use the old cached one
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
            expires 30d;
            add_header Cache-Control "public, no-transform";
        }

        location / {
                try_files $uri $uri/ =404;
        }
}

Now, it's time to configure the system to enable nginx. Just edit /etc/rc.conf:

nb1euro$ sudo vi /etc/rc.conf

and add:

nginx=YES

Now, you can start nginx:

nb1euro$ sudo service nginx start

Nginx will start listening on port 80. Generating and installing the certificate is very simple:

nb1euro$ sudo certbot-3.13 --nginx -d myblog.example.com

This command will request the certificate and install it, so nginx will already be configured to use it.

As with the previous method, to make your blog public, simply copy the files from the site's output directory to /var/www/ - using sudo to bypass permission issues:

nb1euro$ sudo rsync -avhHPx /home/blog/myblog/output/ /var/www/

The site will be immediately visible.

Using Caddy

Caddy is a convenient and all-in-one solution, efficient and fast. It's packaged for NetBSD and allows you to go online in a flash. I won't delve into the configuration because there are many tutorials (including the official ones), but you just need to install it and run it:

nb1euro$ sudo pkgin in caddy

Once installed, go to the directory you want to serve (e.g., /var/www or directly /home/blog/myblog/output) and run:

nb1euro$ sudo caddy file-server --domain myblog.example.com

Caddy will start, request the certificate, and begin serving your blog over HTTPS as well. To install Caddy as a service (i.e., with a configuration file, etc.), you can proceed similarly to how it's done on Linux. The NetBSD Caddy package doesn't include the rc.d script, but you can copy and paste one (into /etc/rc.d/caddy) from a thread posted on UnitedBSD.

Performance Comparison

I performed some performance tests on these solutions. Here are the results, on a single-core 1 euro/month VPS, from my home connection (which also has its own limitations):

  • NetBSD httpd via inetd:
Running 10s test @ https://myblog.example.com/
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   213.52ms  173.10ms   1.11s    76.01%
    Req/Sec    12.92      9.19    50.00     75.91%
  371 requests in 10.10s, 1.39MB read
Requests/sec:     36.72
Transfer/sec:    140.65KB

These numbers are quite poor, linked to high latency caused by having to launch bozohttpd for each incoming connection.

  • NetBSD httpd as a daemon:
Running 10s test @ https://myblog.example.com/
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.74ms    6.96ms 108.80ms   81.36%
    Req/Sec    18.29      9.45    50.00     70.88%
  676 requests in 10.10s, 2.53MB read
Requests/sec:     66.92
Transfer/sec:    256.32KB

Here the situation is decidedly better, but not exceptional. httpd isn't designed for high loads or performance.

  • Nginx as a daemon, 1 worker:
Running 10s test @ https://myblog.example.com/
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    30.69ms    4.87ms  64.14ms   66.01%
    Req/Sec   379.39     65.94   464.00     90.91%
  15026 requests in 10.04s, 56.50MB read
Requests/sec:   1496.65
Transfer/sec:      5.63MB

Here we are on another level, showing truly solid performance. This type of result can handle significantly high loads without particular difficulty. The efficiency of both NetBSD and nginx pays off.

  • Caddy:
Running 10s test @ https://myblog.example.com/
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.10ms    5.75ms  95.04ms   87.44%
    Req/Sec   362.74     64.29   434.00     91.67%
  14374 requests in 10.05s, 54.63MB read
Requests/sec:   1430.82
Transfer/sec:      5.44MB

Caddy shows results comparable to nginx, so the choice between them depends solely on the type of configuration you want to achieve and the experience each person has with the specific platforms.

Conclusion: Efficient Minimalism

We've seen how it's possible to create a personal, professional, and performant online presence with minimal investment. This solution, based on NetBSD and a 1€/month VPS, offers several advantages:

  • Negligible Cost: For 12€ per year, you can have a website (and more!) completely under your control.
  • Surprising Performance: As demonstrated by the benchmarks, excellent performance can be achieved even with limited resources (up to 1400-1500 requests/second with nginx or Caddy).
  • Security and Stability: NetBSD is renowned for its reliability and security, fundamental characteristics for any online service.
  • Total Control: Unlike free blogging platforms, you have full control over every aspect of your site.
  • Learning Experience: Managing a BSD system allows you to acquire valuable system administration skills.

This minimalist configuration demonstrates that you don't need to invest in expensive cloud solutions or oversized VPS to have a quality online presence. In an era where the tendency is to think "moooar powaaaar = better results", NetBSD reminds us that efficiency and good design can yield excellent results even with limited resources.

After all, you don't need a thousand-node cloud to write something worth reading.

In the upcoming articles in this series, we will explore how to expand this basic installation with other useful services and how to keep the system updated and secure over time.

My blocking of some crawlers is an editorial decision unrelated to crawl volume

By: cks

Recently I read a lobste.rs comment on one of my recent entries that said, in part:

Repeat after me everyone: the problem with these scrapers is not that they scrape for LLM’s, it’s that they are ill-mannered to the point of being abusive. LLM’s have nothing to do with it.

This may be some people's view but it is not mine. For me, blocking web scrapers here on Wandering Thoughts is partly an editorial decision of whether I want any of my resources or my writing to be fed into whatever they're doing. I will certainly block scrapers for doing what I consider an abusive level of crawling, and in practice most of the scrapers that I block come to my attention due to their volume, but I will block low-volume scrapers because I simply don't like what they're doing it for.

Are you a 'brand intelligence' firm that scrapes the web and sells your services to brands and advertisers? Blocked. In general, do you charge for access to whatever you're generating from scraping me? Probably blocked. Are you building a free search site for a cause (and with a point of view) that I don't particularly like? Almost certainly blocked. All of this is an editorial decision on my part on what I want to be even vaguely associated with and what I don't, not a technical decision based on the scraping's effects on my site.

I am not going to even bother trying to 'justify' this decision. It's a decision that needs no justification to some and to others, it's one that can never be justified. My view is that ethics matter. Technology and our decisions of what to do with technology are not politically neutral. We can make choices, and passively not doing anything is a choice too.

(I could say a lot of things here, probably badly, but ethics and politics are in part about what sort of a society we want, and there's no such thing as a neutral stance on that. See also.)

I would block LLM scrapers regardless of how polite they are. The only difference them being politer would make is that I would be less likely to notice (and then block) them. I'm probably not alone in this view.

A thought on JavaScript "proof of work" anti-scraper systems

By: cks

One of the things that people are increasingly using these days to deal with the issue of aggressive LLM and other web scrapers is JavaScript based "proof of work" systems, where your web server requires visiting clients to run some JavaScript to solve a challenge; one such system (increasingly widely used) is Xe Iaso's Anubis. One of the things that people say about these systems is that LLM scrapers will just start spending the CPU time to run this challenge JavaScript, and LLM scrapers may well have lots of CPU time available through means such as compromised machines. One of my thoughts is that things are not quite as simple for the LLM scrapers as they look.

An LLM scraper is operating in a hostile environment (although its operator may not realize this). In a hostile environment, dealing with JavaScript proof of work systems is not as simple as just running them, because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. Letting your scraper run JavaScript means that it can also run JavaScript for other purposes, for example for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or simply have you run JavaScript for as long as you'll let it keep going (perhaps because they've recognized you as an LLM scraper and want to waste as much of your CPU as possible).

An LLM scraper can try to recognize a JavaScript proof of work system but this is a losing game. The other parties have every reason to make themselves look like a proof of work system, and the proof of work systems don't necessarily have an interest in being recognized (partly because this might allow LLM scrapers to short-cut their JavaScript with optimized host implementations of the challenges). And as both spammers and cryptocurrency miners have demonstrated, there is no honor among thieves. If LLM scrapers dangle free computation in front of people, someone will spring up to take advantage of it. This leaves LLM scrapers trying to pick a JavaScript runtime limit that doesn't cut them off from too many sites, while sites can try to recognize LLM scrapers and increase their proof of work difficulty if they see a suspect.
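To make the 'difficulty' knob concrete, here is a minimal Python sketch of a generic hashcash-style proof of work: the client searches for a nonce such that the SHA-256 hash of the challenge and nonce starts with a given number of zero bits. This is an illustrative scheme under my own assumptions (the `challenge:nonce` format and the function names are made up), not a description of how Anubis or any other specific system works.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count the number of leading zero bits in a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge, nonce, difficulty):
    """Cheap server-side check: one hash, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

def solve(challenge, difficulty):
    """Expensive client-side search: try nonces until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

Each extra bit of difficulty roughly doubles the expected number of hashes the client must compute, while verification stays at a single hash. That asymmetry is what lets a site raise the cost for a suspected scraper, and it is also why a scraper's fixed runtime limit translates fairly directly into a difficulty level beyond which it gets cut off.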

(This is probably not an original thought, but it's been floating around my head for a while.)

PS: JavaScript proof of work systems aren't the greatest thing, but they're going to happen unless someone convincingly demonstrates a better alternative.

What keeps Wandering Thoughts more or less free of comment spam (2025 edition)

By: cks

Like everywhere else, Wandering Thoughts (this blog) gets a certain amount of automated comment spam attempts. Over the years I've fiddled around with a variety of anti-spam precautions, although not all of them have worked out over time. It's been a long time since I've written anything about this, because one particular trick has been extremely effective ever since I introduced it.

That one trick is a honeypot text field in my 'write a comment' form. This field is normally hidden by CSS, and in any case the label for the field says not to put anything in it. However, for a very long time now, automated comment spam systems seem to operate by stuffing some text into every (text) form field that they find before they submit the form, which always trips over this. I log the form field's text out of curiosity; sometimes it's garbage and sometimes it's (probably) meaningful for the spam comment that the system is trying to submit.
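The server-side check for this sort of honeypot can be tiny. Here is a minimal Python sketch, not the actual code behind Wandering Thoughts; the field name `leave_blank` and the function are hypothetical, and the form is assumed to arrive as a plain dictionary of field names to submitted strings.

```python
def check_comment_form(form, honeypot_field="leave_blank"):
    """Return (accepted, honeypot_text) for a submitted comment form.

    The submission is accepted only if the hidden honeypot field is empty.
    The honeypot text is returned so that the caller can log it.
    """
    honeypot_text = form.get(honeypot_field, "").strip()
    return (honeypot_text == "", honeypot_text)
```

A person using a normal browser never sees the field and leaves it empty, while software that fills in every text field it finds gets rejected.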

Obviously this doesn't stop human-submitted spam, which I get a small amount of every so often. In general I don't expect anything I can reasonably do to stop humans who do the work themselves; we've seen this play out in email and I don't have any expectations that I can do better. It also probably wouldn't work if I was using a popular platform that had this as a general standard feature, because then it would be worth the time of the people writing automated comment spam systems to automatically recognize it and work around it.

Making comments on Wandering Thoughts also has an additional small obstacle in the way of automated comment spammers, which is that you must initially preview your comment before you can submit it (although you don't have to submit the comment that you previewed, you can edit it after the first preview). Based on a quick look at my server logs, I don't think this matters to the current automated comment spam systems that try things here, as they only appear to try submitting once. I consider requiring people to preview their comment before posting it to be a good idea in general, especially since Wandering Thoughts uses a custom wiki-syntax and a forced preview gives people some chance of noticing any mistakes.

(I think some amount of people trying to write comments here do miss this requirement and wind up not actually posting their comment in the end. Or maybe they decide not to after writing one version of it; server logs give me only so much information.)

In a world that is increasingly introducing various sorts of aggressive precautions against LLM crawlers, including 'proof of work' challenges, all of this may become increasingly irrelevant. This could go either way; either the automated comment spammers die off as more and more systems have protections that are too aggressive for them to deal with, or the automated systems become increasingly browser-based and sidestep my major precaution because they no longer 'see' the honeypot field.

Thinking about what you'd want in a modern simple web server

By: cks

Over on the Fediverse, I said:

I'm currently thinking about what you'd want in a simple modern web server that made life easy for sites that weren't purely static. I think you want CGI, FastCGI, and HTTP reverse proxying, plus process supervision. Automatic HTTPS of course. Rate limiting support, and who knows what you'd want to make it easier to deal with the LLM crawler problem.

(This is where I imagine a 'stick a third party proxy in the middle' mode of operation.)

What I left out of my Fediverse post is that this would be aimed at small scale sites. Larger, more complex sites can and should invest in the power, performance, and so on of headline choices like Apache, Nginx, and so on. And yes, one obvious candidate in this area is Caddy, but at the same time something that has "more scalable" (than alternatives) as a headline feature is not really targeting the same area that I'm thinking of.

This goal of simplicity of operation is why I put "process supervision" into the list of features. In a traditional reverse proxy situation (whether this is FastCGI or HTTP), you manage the reverse proxy process separately from the main webserver, but that requires more work from you. Putting process supervision into the web server has the goal of making all of that more transparent to you. Ideally, in common configurations you wouldn't even really care that there was a separate process handling FastCGI, PHP, or whatever; you could just put things into a directory or add some simple configuration to the web server and restart it, and everything would work. Ideally this would extend to automatically supporting PHP by just putting PHP files somewhere in the directory tree, just like CGI; internally the web server would start a FastCGI process to handle them or something.

(Possibly you'd implement CGI through a FastCGI gateway, but if so this would be more or less pre-configured into the web server and it'd ship with a FastCGI gateway for this (and for PHP).)

This is also the goal for making it easy to stick a third party filtering proxy in the middle of processing requests. Rather than having to explicitly set up two web servers (a frontend and a backend) with an anti-LLM filtering proxy in the middle, you would write some web server configuration bits and then your one web server would split itself into a frontend and a backend with the filtering proxy in the middle. There's no technical reason you can't do this, and even control what's run through the filtering proxy and what's served directly by the front end web server.

This simple web server should probably include support for HTTP Basic Authentication, so that you can easily create access restricted areas within your website. I'm not sure if it should include support for any other sort of authentication, but if it did it would probably be OpenID Connect (OIDC), since that would let you (and other people) authenticate through external identity providers.

It would be nice if the web server included some degree of support for more or less automatic smart in-memory (or on-disk) caching, so that if some popular site linked to your little server, things wouldn't explode (or these days, if a link to your site was shared on the Fediverse and all of the Fediverse servers that it propagated to immediately descended on your server). At the very least there should be enough rate limiting that your little server wouldn't fall over, and perhaps some degree of bandwidth limits you could set so that you wouldn't wake up to discover you had run over your outgoing bandwidth limits and were facing large charges.

I doubt anyone is going to write such a web server, since this isn't likely to be the kind of web server that sets the world on fire, and probably something like Caddy is more or less good enough.

(Doing a good job of writing such a server would also involve a fair amount of research to learn what people want to run at a small scale, how much they know, what sort of server resources they have or want to use, what server side languages they wind up using, what features they need, and so on. I certainly don't know enough about the small scale web today.)

PS: One reason I'm interested in this is that I'd sort of like such a server myself. These days I use Apache and I'm quite familiar with it, but at the same time I know it's a big beast and sometimes it has entirely too many configuration options and special settings and so on.

In Apache, using OIDC instead of SAML makes for easier testing

By: cks

In my earlier installment, I wrote about my views on the common Apache modules for SAML and OIDC authentication, where I concluded that OpenIDC was generally easier to use than Mellon (for SAML). Recently I came up with another reason to prefer OIDC, one strong enough that we converted one of our remaining Mellon uses over to OIDC. The advantage is that OIDC is easier to test if you're building a new version of your web server under another name.

Suppose that you're (re)building a version of your Apache based web server with authentication on, for example, a new version of Ubuntu, using a test server name. You want to test that everything still works before you deploy it, including your authentication. If you're using Mellon, as far as I can see you have to generate an entirely new SP configuration using your test server's name and then load it into your SAML IdP. You can't use your existing SAML SP configuration from your existing web server, because it specifies the exact URL the SAML IdP needs to use for various parts of the SAML protocol, and of course those URLs point to your production web server under its production name. As far as I know, to get another set of URLs that point to your test server, you need to set up an entirely new SP configuration.

OIDC has an equivalent thing in its redirect URI, but the OIDC redirect URL works somewhat differently. OIDC identity providers typically allow you to list multiple allowed redirect URIs for a given OIDC client, and it's the client that tells the server what redirect URI to use during authentication. So when you need to test your new server build under a different name, you don't need to register a new OIDC client; you can just add some more redirect URIs to your existing production OIDC client registration to allow your new test server to provide its own redirect URI. In the OpenIDC module, this will typically require no Apache configuration changes at all (from the production version), as the module automatically uses the current virtual host as the host for the redirect URI. This makes testing rather easier in practice, and it also generally tests the Apache OIDC configuration you'll use in production, instead of a changed version of it.

(You can put a hostname in the Apache OIDCRedirectURI directive, but it's simpler to not do so. Even if you did use a full URL in this, that's a single change in a text file.)
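To make the difference concrete, here is a minimal sketch of the check an OIDC identity provider conceptually performs on the redirect URI that the client sends. This is an illustration only, not the code of any particular IdP or of the OpenIDC module; the hostnames and the `/redirect_uri` path are made up, and exact string matching is an assumption about how the provider compares URIs.

```python
from typing import Iterable

# Hypothetical redirect URIs registered for one OIDC client: the production
# server plus a test build of the same server under another name.
REGISTERED_REDIRECT_URIS = {
    "https://www.example.org/redirect_uri",
    "https://test.example.org/redirect_uri",
}


def redirect_uri_allowed(requested: str,
                         registered: Iterable[str] = REGISTERED_REDIRECT_URIS) -> bool:
    """Accept the client-supplied redirect URI only if it was registered."""
    return requested in set(registered)
```

Adding the test server is one more entry in the registered set; the client registration itself (and the Apache configuration) stays the same.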

The HTTP status codes of responses from about 22 hours of traffic to here (part 2)

By: cks

A few months ago, I wrote an entry about this topic, because I'd started putting in some blocks against crawlers, including things that claimed to be old versions of browsers, and I'd also started rate-limiting syndication feed fetching. Unfortunately, my rules at the time were flawed, rejecting a lot of people that I actually wanted to accept. So here are some revised numbers from today, a day when my logs suggest that I've seen what I'd call broadly typical traffic and traffic levels.

I'll start with the overall numbers (for HTTP status codes) for all requests:

  10592 403		[26.6%]
   9872 304		[24.8%]
   9388 429		[23.6%]
   8037 200		[20.2%]
   1629 302		[ 4.1%]
    114 301
     47 404
      2 400
      2 206

This is a much more balanced picture of activity than the last time around, with a lot less of the overall traffic being HTTP 403s. The HTTP 403s are from aggressive blocks, the HTTP 304s and HTTP 429s are mostly from syndication feed fetchers, and the HTTP 302s are mostly from things with various flaws that I redirect to informative static pages instead of giving HTTP 403s. The two HTTP 206s were from Facebook's 'externalhit' agent on a recent entry. A disturbing amount of the HTTP 403s were from Bing's crawler and almost 500 of them were from something claiming to be an Akkoma Fediverse server. 8.5% of the HTTP 403s were from something using Go's default User-Agent string.

The most popular User-Agent strings today for successful requests (of anything) were for versions of NetNewsWire, FreshRSS, and Miniflux, then Googlebot and Applebot, and then Chrome 130 on 'Windows NT 10'. Although I haven't checked, I assume that all of the first three were for syndication feeds specifically, with few or no fetches of other things. Meanwhile, Googlebot and Applebot can only fetch regular pages; they're blocked from syndication feeds.

The picture for syndication feeds looks like this:

   9923 304		[42%]
   9535 429		[40%]
   1984 403		[ 8.5%]
   1600 200		[ 6.8%]
    301 302
     34 301
      1 404

On the one hand it's nice that 42% of syndication feed fetches successfully did a conditional GET. On the other hand, it's not nice that 40% of them got rate-limited, or that there were clearly more explicitly blocked requests than there were HTTP 200 responses. On the sort of good side, 37% of the blocked feed fetches were from one IP that's using "Go-http-client/1.1" as its User-Agent (and which accounts for 80% of the blocks of that). This time around, about 58% of the requests were for my syndication feed, which is better than it was before but still not great.

These days, if certain problems are detected in a request I redirect the request to a static page about the problem. This gives me some indication of how often these issues are detected, although crawlers may be re-visiting the pages on their own (I can't tell). Today's breakdown of this is roughly:

   78%  too-old browser
   13%  too generic a User-Agent
    9%  unexpectedly using HTTP/1.0

There were slightly more HTTP 302 responses from requests to here than there were requests for these static pages, so I suspect that not everything that gets these redirects follows them (or at least doesn't bother re-fetching the static page).

I hope that the better balance in HTTP status codes here is a sign that I have my blocks in a better state than I did a couple of months ago. It would be even better if the bad crawlers would go away, but there's little sign of that happening any time soon.

Chrome and the burden of developing a browser

By: cks

One part of the news of the time interval is that the US courts may require Google to spin off Chrome (cf). Over on the Fediverse, I felt this wasn't a good thing:

I have to reluctantly agree that separating Chrome from Google would probably go very badly¹. Browsers are very valuable but also very expensive public goods, and our track record of funding and organizing them as such in a way to not wind up captive to something is pretty bad (see: Mozilla, which is at best questionable on this). Google is not ideal but at least Chrome is mostly a sideline, not a main hustle.

¹ <Lauren Weinstein Fediverse post> [...]

One possible reaction to this is that it would be good for everyone if people stopped spending so much money on browsers and so everything involving them slowed down. Unfortunately, I don't think that this would work out the way people want, because popular browsers are costly beasts. To quote what I said on the Fediverse:

I suspect that the cost of simply keeping the lights on in a modern browser is probably on the order of plural millions of dollars a year. This is not implementing new things, this is fixing bugs, keeping up with security issues, monitoring CAs, and keeping the development, CI, testing, and update infrastructure running. This has costs for people, for servers, and for bandwidth.

The reality of the modern Internet is that browsers are load bearing infrastructure; a huge amount of things run through them, including and especially on minority platforms. Among other things, no browser is 'secure' and all of them are constantly under attack. We want browser projects that are used by lots of people to have enough resources (in people, build infrastructure, update servers, and so on) to be able to rapidly push out security updates. All browsers need a security team and any browser with addons (which should be all of them) needs a security team for monitoring and dealing with addons too.

(Browsers are also the people who keep Certificate Authorities honest, and Chrome is very important in this because of how many people use it.)

On the whole, it's a good thing for the web that Chrome is in the hands of an organization that can spend tens of millions of dollars a year on maintaining it without having to directly monetize it in some way. It would be better if we could collectively fund browsers as the public good that they are without having corporations in the way, because Google absolutely corrupts Chrome (also) and Mozilla has stumbled spectacularly (more than once). But we have to deal with the world that we have, not the world that we'd like to have, and in this world no government seems to be interested in seriously funding obvious Internet public goods (not only browsers but also, for example, free TLS Certificate Authorities).

(It's not obvious that a government funded browser would come out better overall, but at least there would be a chance of something different than the narrowing status quo.)

PS: Another reason that spending on browsers might not drop is that Apple (with Safari) and Microsoft (with Edge) are also in the picture. Both of these companies might take the opportunity to slow down, or they might decide that Chrome's potentially weak new position was a good moment to push for greater dominance and maybe lock-in through feature leads.

The appeal of serving your web pages with a single process

By: cks

As I slowly work on updating the software behind this blog to deal with the unfortunate realities of the modern web (also), I've found myself thinking (more than once) how much simpler my life would be if I was serving everything through a single process, instead of my eccentric, more or less stateless CGI-based approach. The simple great thing about doing everything through a single process (with threads, goroutines, or whatever inside it for concurrency) is that you have all the shared state you could ever want, and that shared state makes it so easy to do so many things.

Do you have people hitting one URL too often from a single IP address? That's easy to detect, track, and return HTTP 429 responses for until they cool down. Do you have an IP making too many requests across your entire site? You can track that sort of volume information. There's all sorts of potential bad stuff that it's at least easier to detect when you have easy shared global state. And the other side of this is that it's also relatively easy to add simple brute force caching in a single process with global state.

(Of course you have some practical concerns about memory and CPU usage, depending on how much stuff you're keeping track of and for how long.)
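As a sketch of how little code the single-process version needs, here is a sliding-window limiter keyed on (IP, URL) that lives entirely in process memory. The class name, limits, and window are hypothetical choices for illustration; a real server would also need locking if it uses threads, and some way to expire idle keys so memory doesn't grow forever.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class RateLimiter:
    """Allow at most max_requests per (ip, url) within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        # Shared in-process state: recent request times per (ip, url).
        self.hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, url: str, now: Optional[float] = None) -> bool:
        """Return True to serve the request, False to answer with HTTP 429."""
        now = time.monotonic() if now is None else now
        recent = self.hits[(ip, url)]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Tracking per-IP volume across the whole site is the same structure keyed on the IP alone.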

You can do a certain amount of this detection with a separate 'database' process of some sort (or a database file, like sqlite), and there's various specialized software that will let you keep this sort of data in memory (instead of on disk) and interact with it easily. But this is an extra layer or two of overhead over simply updating things in your own process, especially if you have to set up things like a database schema for what you're tracking or caching.

(It's my view that ease of implementation is especially useful when you're not sure what sort of anti-abuse measures are going to be useful. The easier it is to implement something and at least get logs of what and how much it would have done, the more you're going to try and the more likely you are to hit on something that works for you.)

Unfortunately it seems like we're only going to need more of this kind of thing in our immediate future. I don't expect the level of crawling and abuse to go down any time soon; if anything, I expect it to keep going up, especially as more and more websites move behind effective but heavyweight precautions and the crawlers turn more of their attention to the rest of us.

Mandatory short duration TLS certificates are probably coming soon

By: cks

The news of the time interval is that the maximum validity period for TLS certificates will be lowered to 47 days by March 2029, unless the CA/Browser Forum changes its mind (or is forced to) before then. The details are discussed in SC-081. In skimming the mailing list thread on the votes, a number of organizations that voted to abstain seem unenthused (and uncertain that it can actually be implemented), so this may not come to pass, especially on the timeline proposed here.

If and when this comes to pass, I feel confident that this will end manual certificate renewals at places that are still doing them. With that, it will effectively end Certificate Authorities that don't have an API that you can automatically get certificates through (not necessarily a free or public API). I'm not sure what it's going to do to the Certificate Authority business models for commercial CAs, but I also don't think the browsers care about that issue and the browsers are driving.

This will certainly cause pain. I know of places around the university that are still manually handling one-year TLS certificates; those places will have to change over the course of a few years. This pain will arrive well before 2029; based on the proposed changes, starting March 15, 2027, the maximum certificate validity period will be 100 days, which is short enough to be decidedly annoying. Even a 200 day validity period (starting March 15, 2026) will be somewhat painful to do by hand.
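To put rough numbers on the pain, here is a small worked calculation of how many certificates per year one host needs under the 200, 100, and 47 day limits. The `renew_fraction` parameter is my own assumption for illustration (renewing after some fraction of the lifetime instead of on the very last day); it is not part of the proposal.

```python
import math


def renewals_per_year(max_days: int, renew_fraction: float = 1.0) -> int:
    """Minimum certificates per year if you renew after
    renew_fraction of the maximum validity period has passed."""
    interval_days = max_days * renew_fraction
    return math.ceil(365 / interval_days)


# Maximum validity periods mentioned in the text, in days.
for max_days in (200, 100, 47):
    print(max_days, renewals_per_year(max_days))
```

Even when every certificate is used to its last day, this goes from two renewals a year at 200 days to four at 100 days and eight at 47 days. Renewing with some safety margin only pushes those counts up.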

I expect one consequence to be that some number of (internal) devices stop having valid TLS certificates, because they can only have certificates loaded into them manually and no one is going to do that every 40-odd or even every 90-odd days. You might manually get and load a valid TLS certificate every year; you certainly won't do it every three months (well, almost no one will).

I hope that this will encourage the creation and growth of more alternatives to Let's Encrypt, even if not all of them are free, since more and more CAs will be pushed to have an API and one obvious API to adopt is ACME.

(I can also imagine ways to charge for an ACME based API, even with standard ACME clients. One obvious way would be to only accept ACME requests for domains that the CA had some sort of site license with. You'd establish the site license through out of band means, not ACME.)

Launching BSSG - My Journey from Dynamic CMS to Bash Static Site Generator

Photo by Patrick Fore on Unsplash

I've had my own website practically forever. Back in the late '90s, I already had a web page on my ISP's server, and since at least 2001, I've had my own homepage on my own server. I've never been a great graphic designer, let alone a skilled webmaster, so I've always tried to keep things minimal and compatible.

Initially, like many others, I wrote HTML pages by hand. Then I used WYSIWYG creation tools, and eventually, I landed on CMS (Content Management Systems).

The Era of Dynamic CMS

I liked CMS because they allowed me to focus on the content and not on the correctness of the generated HTML. Thanks to them, I started writing my first blog shortly afterward.

Over the years, I've used many tools like PHPNuke, FlatNuke (created and developed by my friend Simone Vellei), eventually moving through Joomla and WordPress. WordPress always seemed like the most suitable tool for the job, and I used it for many years. Even today, mainly on the sysadmin side, I manage hundreds of WordPress sites, and they are reasonably reliable, aside from the plugins (because the problem with WordPress isn't the software itself, but many of the external plugins).

But this is precisely the problem: all dynamic CMS require constant and continuous security updates because, without them, the chances of defacement are extremely high.

Discovering Static Site Generators

And that's precisely why, when I discovered Carlos Fenollosa's bashblog in 2014, it immediately became clear that, indeed, there was no reason to continue down the path of dynamic CMS. I don't write often, I don't update often, there's no reason to regenerate all the content with every visit. Sure, WordPress caching plugins are often quite effective, but they are still add-ons that need to be kept up to date. And I'm not a fan of adding things to streamline. Often, less is more.

So, I started using bashblog for some 'secondary' projects until, in 2015, I migrated my 'old' Italian blog from WordPress to Pelican. Shortly after, I moved from Pelican to Nikola, and that blog is still generated by Nikola, although (that blog's) updates are now extremely rare (so much so that I consider it almost abandoned). I also created the first Docker container for Nikola and, for a long time, it was listed among the deployment methods on their site.

Building My Own: BSSG

But bashblog continued to fascinate me. So in 2015, for fun, I started developing my own Static Site Generator from scratch. I called it (with little imagination), BSSG - Bash Static Site Generator. The plan was for it to be compatible with the main OSes I use, to remain sufficiently simple and straightforward (!!!), and to be tailored to my needs. I intended to use it only and exclusively for small private things, starting with a sort of diary of mine - more professional than personal - and leave the 'official' blogs to more tested and 'professional' tools.

As time went by, I added some small features I liked: theming support, archives, tags (initially absent). Over time, many functions were added, and the script grew large - large enough to make me pause and ask myself some questions about the long-term stability of this solution. So, it remained only for my 'diary', which, however, grew year after year to the point where I needed to devise some kind of optimization. I then developed (more for fun than out of real necessity) a caching system. On rebuild, only what needs to be rebuilt is reconstructed, making the operation sufficiently fast even as the number of posts grows. Obviously, there are limits: using bash and external tools, the efficiency cannot be compared to that of a proper programming language.
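The general idea behind this kind of cache can be sketched in a few lines. This is not BSSG's actual implementation (which is written in bash); it is a minimal Python illustration that assumes change detection by content hash, with hypothetical function names and an in-memory dictionary standing in for the cache on disk.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Fingerprint a post's source, similar in spirit to running md5sum on it."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def posts_to_rebuild(posts: Dict[str, str], cache: Dict[str, str]) -> List[str]:
    """Return the posts whose content changed since the last build,
    updating the cache with their new fingerprints."""
    stale = []
    for name, text in sorted(posts.items()):
        digest = content_hash(text)
        if cache.get(name) != digest:
            stale.append(name)
            cache[name] = digest
    return stale
```

On the first run everything is stale; on later runs only new or edited posts come back, so build time follows the number of changes and not the size of the archive.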

Brief Detour: ITNBlog

And it's here that I decided, in preparation for opening a new blog (this one), to create a new tool called ITNBlog. I would develop it in Python and focus a bit more on performance and completeness. But ITNBlog stalled very quickly: time was limited, I'm not a full-time developer, so I realized I would spend too much time on development and too little on content creation.

Therefore, in 2018, I launched this blog but using Ghost, a solution that gave me good results, including performance-wise. I chose Ghost because I thought that, writing content also from my phone while on the go, a real CMS would be useful. Spoiler: no, it didn't turn out that way, so a few years later I decided to migrate this blog to Hugo. Nevertheless, I continued to develop ITNBlog on and off, as a hobby, without any particular ambitions.

At some point, however, Hugo deprecated some features, and the theme I had chosen moved forward. I ended up in an unpleasant situation: using the latest version of Hugo and the current version of the theme would produce unacceptable output; staying with the old version of Hugo while waiting for the theme update meant making a compromise. I actually build the blog from different devices, and they all have different versions of Hugo installed. Change the theme? Feasible, but I would have had to modify almost the entire site.

I considered migrating to manpageblog by gyptazy - I personally love its simplicity and retro look, and it was the main candidate to replace Hugo. I also created a script and migrated all my posts into the correct format.

BSSG to the Rescue (and ITNBlog's Role)

That's when I realized: I would implement the few missing features needed to make ITNBlog sufficiently complete, and this blog would be published using it, ensuring I'd be committed to its development. However, ITNBlog is not mature enough to be released publicly, so for now, it will remain the engine just for my blog. Then I thought again about BSSG - development had stalled some time ago, but it was still in use - and figured that perhaps, with a little tidying up, I could release it.

Because I'm tired of seeing people use dynamic CMS even to implement primarily static blogs or websites - and BSSG, despite its limitations and inefficiencies, works. And there are many themes to choose from. In short, you can install it and generate your blog in seconds.

Why Choose BSSG?

BSSG is the result of a 10-year evolution. The code isn't extremely consistent, some interesting features are missing (which I plan to implement), and it could use refactoring as the build script is monstrously large. But it works, it's portable (and much of the complexity increased precisely because of portability), and it generates sites that achieve very high accessibility and speed scores.

Here are some highlights:

  • ✅ Portability: Uses native OS tools (e.g., md5sum on Linux, md5 on OpenBSD and NetBSD). Portability itself added much of the complexity!
  • ✅ Simple Theming: Themes are just simple CSS files, so the structure remains the same - simplifying theme switching or creating new ones. More than 50 themes are already available!
  • ✅ Essential Features: Supports RSS feed generation, sitemap.xml, OpenGraph tags (to improve social sharing), internationalization (the blog can be in languages other than English - but not multilingual, at least for now), etc.
  • ✅ Built-in Backup and Restore script: It will just copy the configuration file, posts, and pages. Nothing else.
  • ✅ Minimal Dependencies.
  • ✅ Markdown Support: Posts and pages are in Markdown (CommonMark, Pandoc, and markdown.pl are supported).
  • ✅ Feature Images.
  • ✅ Optional GNU Parallel Integration: To speed up build times when there are many posts. This feature significantly impacts the code and has caused me numerous headaches over time. But it's optional (if parallel isn't found, it proceeds traditionally) and only provides benefits when the number of posts increases: with few posts, performance actually degrades.
  • ✅ High Accessibility and Performance Scores: Sites built with BSSG achieve excellent scores.
  • ✅ BSD Licensed: Released under a BSD license.

One of the problems I've always had with all CMS and SSGs has been choosing a theme. In some cases (like Hugo), the theme heavily influences the output, which is both good and bad. Good because it makes each site unique, but bad because it makes switching themes difficult. In the past, I've sometimes found myself having to change themes because they were abandoned and no longer updated. BSSG works differently: theming comes from using a different CSS file, which makes its structure more rigid, but switching from one theme to another is trivial. To help with the choice, I created a script that will build your site using all the themes present in the themes directory, just like on the examples page of the official website. This way, it will be easy to see and test your site with all available themes. If you want to add a touch of originality, you can choose the 'random' theme, and one will be chosen randomly from the list at each site regeneration.

Admin Interface (Experimental)

BSSG is in production use by some clients (for their internal sites), for whom I also created a basic admin interface (using Node Express, partly to chew on a bit of Node), but I don't feel ready to release it immediately as it's not sufficiently tested. It has an integrated Markdown editor and allows post scheduling, generating the files and launching BSSG with the right options at the right time. This could be that connecting link between traditional CMS and SSGs. There are others, but this one is tightly integrated with BSSG.

BSSG is Available Today

Starting today, BSSG is publicly available. It's not perfect, it probably doesn't make sense to do something of this complexity in bash, development will proceed slowly - but it's here, available to anyone who might find it useful.

Happy blogging everyone!

Trapping Misbehaving Bots in an A.I. Labyrinth

By: Nick Heer

Reid Tatoris, Harsh Saxena, and Luis Miglietti, of Cloudflare:

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

Two thoughts:

  1. This is amusing. Nothing funnier than using someone’s own words or, in this case, technology against them.

  2. This is surely going to lead to the same arms race as exists now between privacy protections and hostile adtech firms. Right?

⌥ Permalink

Doing multi-tag matching through URLs on the modern web

By: cks

So what happened is that Mike Hoye had a question about a perfectly reasonable idea:

Question: is there wiki software out there that handles tags (date, word) with a reasonably graceful URL approach?

As in, site/wiki/2020/01 would give me all the pages tagged as 2020 and 01, site/wiki/foo/bar would give me a list of articles tagged foo and bar.

I got nerd-sniped by a side question but then, because I'd been nerd-sniped, I started thinking about the whole thing and it got more and more hair-raising as a thing done in practice.

This isn't because the idea of stacking selections like this is bad; 'site/wiki/foo/bar' is a perfectly reasonable and good way to express 'a list of articles tagged foo and bar'. Instead, it's because of how everything on the modern web eventually gets visited combined with how, in the natural state of this feature, 'site/wiki/bar/foo' is just as valid a URL for 'articles tagged both foo and bar'.

The combination, plus the increasing tendency of things on the modern web to rattle every available doorknob just to see what happens, means that even if you don't advertise 'bar/foo', sooner or later things are going to try it. And if you do make the combinations discoverable through HTML links, crawlers will find them very fast. At a minimum this means crawlers will see a lot of essentially duplicated content, and you'll have to go through all of the work to do the searches and generate the page listings and so on.

If I was going to implement something like this, I would define a canonical tag order and then, as early in request processing as possible, generate a HTTP redirect from any non-canonical ordering to the canonical one. I wouldn't bother checking if the tags existed or anything, just determine that they are tags, put them in canonical order, and if the request order wasn't canonical, redirect. That way at least all of your work (and all of the crawler attention) is directed at one canonical version. Smart crawlers will notice that this is a redirect to something they already have (and hopefully not re-request it), and you can more easily use caching.

(And if search engines still matter, the search engines will see only your canonical version.)
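A minimal sketch of that canonicalization step, under some assumptions that are not in the original: the canonical order is taken to be plain sorted order, duplicate tags are collapsed, and the '/wiki/' prefix and the function name are hypothetical. It only decides whether a redirect is needed; it doesn't look up whether the tags exist.

```python
from typing import Optional


def canonical_redirect(path: str, prefix: str = "/wiki/") -> Optional[str]:
    """Return the canonical path to redirect to, or None if no redirect is needed.

    Assumption: 'canonical order' means sorted order with duplicates removed.
    The tags are not checked for existence, only reordered.
    """
    if not path.startswith(prefix):
        return None  # not a tag URL at all
    tags = [t for t in path[len(prefix):].split("/") if t]
    canonical = sorted(set(tags))
    if tags == canonical:
        return None  # already canonical, serve it directly
    return prefix + "/".join(canonical)


if __name__ == "__main__":
    for p in ("/wiki/foo/bar", "/wiki/bar/foo", "/wiki/2020/01"):
        print(p, "->", canonical_redirect(p))
```

A caller would answer with an HTTP redirect whenever this returns a path, and only do the actual tag search when it returns None.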

This probably holds just as true for doing this sort of tag search through query parameters on GET queries; if you expose the result in a URL, you want to canonicalize it. However, GET query parameters are probably somewhat safer if you force people to form them manually and don't expose links to them. So far, web crawlers seem less likely to monkey around with query parameters than with URLs, based on my limited experience with the blog.

Some views on the common Apache modules for SAML or OIDC authentication

By: cks

Suppose that you want to restrict access to parts of your Apache based website but you want something more sophisticated and modern than Apache Basic HTTP authentication. The traditional reason for this was to support 'single sign on' across all your (internal) websites; the modern reason is that a central authentication server is the easiest place to add full multi-factor authentication. The two dominant protocols for this are SAML and OIDC. There are commonly available Apache authentication modules for both protocols, in the form of Mellon for SAML and OpenIDC for OIDC.

I've now used or at least tested the Ubuntu 24.04 version of both modules against the same SAML/OIDC identity provider, primarily because when you're setting up a SAML/OIDC IdP you need to be able to test it with something. Both modules work fine, but after my experiences I'm more likely to use OpenIDC than Mellon in most situations.

Mellon has two drawbacks and two potential advantages. The first drawback is that setting up a Mellon client ('SP') is more involved. Most of the annoying stuff is automated for you with the mellon_create_metadata script (which you can get from the Mellon repository if it's not in your Mellon package), but you still have to give your IdP your XML blob and get their XML blob. The other drawback is that Mellon isn't integrated into the Apache 'Require' framework for authorization decisions; instead you have to make do with Mellon-specific directives.

The first potential advantage is that Mellon has a straightforward story for protecting two different areas of your website with two different IdPs, if you need to do that for some reason; you can just configure them in separate <Location> or <Directory> blocks and everything works out. If anything, it's a bit non-obvious how to protect various disconnected bits of your URL space with the same IdP without having to configure multiple SPs, one for each protected section of URL space. The second potential advantage is that in general SAML has an easier story for your IdP giving you random information, and Mellon will happily export every SAML attribute it gets into the environment your CGI or web application gets.

The first advantage of OpenIDC is that it's straightforward to configure when you have a single IdP, with no XML and generally low complexity. It's also straightforward to protect multiple disconnected URL areas with the same IdP but possibly different access restrictions. A third advantage is that OpenIDC is integrated into Apache's 'Require' system, although you have to use OpenIDC specific syntax like 'Require claim groups:agroup' (see the OpenIDC wiki on authorization).

In exchange for this, it seems to be quite involved to use OpenIDC if you need to use multiple OIDC identity providers to protect different bits of your website. It's apparently possible to do this in the same virtual host but it seems quite complex and requires a lot of parts, so if I was confronted with this problem I would try very hard to confine each web thing that needed a different IdP into a different virtual host. And OpenIDC has the general OIDC problem that it's harder to expose random information.

(All of the important OpenIDC Apache directives about picking an IdP can't be put in <Location> or <Directory> blocks, only in a virtual host as a whole. If you care about this, see the wiki on Multiple Providers and also access to different URL paths on a per-provider basis.)

We're very likely to only ever be working with a single IdP, so for us OpenIDC is likely to be easier, although not hugely so.

Sidebar: The easy approach for group based access control with either

Both Mellon and OpenIDC work fine together with the traditional Apache AuthGroupFile directive, provided (of course) that you have or build an Apache format group file using what you've told Mellon or OpenIDC to use as the 'user' for Apache authentication. If your IdP is using the same user (and group) information as your regular system is, then you may well already have this information around.

(This is especially likely if you're migrating from Apache Basic HTTP authentication, where you already needed to build this sort of stuff.)

Building your own Apache group file has the additional benefit that you can augment and manipulate group information in ways that might not fit well into your IdP. Your IdP has the drawback that it has to be general; your generated Apache group file can be narrowly specific for the needs of a particular web area.
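As a rough illustration of generating such a file, here is a small sketch that renders a mapping of groups to user names into the 'group: user1 user2' line format that AuthGroupFile reads. The function name and the example groups are hypothetical; where the membership data comes from (your IdP, your system's group database, a hand-maintained list) is left open.

```python
from typing import Dict, Iterable


def render_group_file(groups: Dict[str, Iterable[str]]) -> str:
    """Render groups as Apache group file text, one 'name: members' line per group.

    The user names must match whatever Mellon or OpenIDC has been told to
    use as the Apache 'user'.
    """
    lines = []
    for name in sorted(groups):
        members = sorted(set(groups[name]))  # de-duplicate, stable order
        lines.append(f"{name}: {' '.join(members)}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical groups, narrowly specific to one web area.
    print(render_group_file({"wikiadmins": ["alice"], "staff": ["bob", "alice"]}), end="")
```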

The web browser as an enabler of minority platforms

By: cks

Recently, I got involved in a discussion on the Fediverse over what I will simplify to the desirability (or lack of it) of cross platform toolkits, including the browser, and how they erase platform personality and opinions. This caused me to have a realization about what web browser based applications are doing for me, which is that being browser based is what lets me use them at all.

My environment is pretty far from being a significant platform; I think Unix desktop share is in the low single percent under the best of circumstances. If people had to develop platform specific versions of things like Grafana (which is a great application), they'd probably exist for Windows, maybe macOS, and at the outside, tablets (some applications would definitely exist on phones, but Grafana is a bit of a stretch). They probably wouldn't exist on Linux, especially not for free.

That the web browser is a cross platform environment means that I get these applications (including the Fediverse itself) essentially 'for free' (which is to say, it's because of the efforts of web browsers to support my platform and then give me their work for free). Developers of web applications don't have to do anything to make them work for me, not even so far as making it possible to build their software on Linux; it just happens for them without them even having to think about it.

Although I don't work in the browser as much as some people do, looking back the existence of implicitly cross platform web applications has been a reasonably important thing in letting me stick with Linux.

This applies to any minority platform, not just Linux. All you need is a sufficiently capable browser and you have access to a huge range of (web) applications.

(Getting that sufficiently capable browser can be a challenge on a sufficiently minority platform, especially if you're not on a major architecture. I'm lucky in that x86 Linux is a majority minority platform; people on FreeBSD or people on architectures other than x86 and 64-bit ARM may be less happy with the situation.)

PS: I don't know if what we have used the web for really counts as 'applications', since they're mostly HTML form based things once you peel a few covers off. But if they do count, the web has been critical in letting us provide them to people. We definitely couldn't have built local application versions of them for all of the platforms that people here use.

(I'm sure this isn't a novel thought, but the realization struck (or re-struck) me recently so I'm writing it down.)

HTTP connections are part of the web's long tail

By: cks

I recently read an article that, among other things, was apparently seriously urging browser vendors to deprecate and disable plain text HTTP connections by the end of October of this year (via, and I'm deliberately not linking directly to the article). While I am a strong fan of HTTPS in general, I have some feelings about a rapid deprecation of HTTP. One of my views is that plain text HTTP is part of the web's long tail.

As I'm using the term here, the web's long tail is the huge mass of less popular things that are individually less frequently visited but which in aggregate amount to a substantial part of the web. The web's popular, busy sites are frequently updated and can handle transitions without problems. They can readily switch to using modern HTML, modern CSS, modern JavaScript, and so on (although they don't necessarily do so), and along with that update all of their content to HTTPS. In fact they mostly or entirely have done so over the last ten to fifteen years. The web's long tail doesn't work like that. Parts of it use old JavaScript, old CSS, old HTML, and these days, plain HTTP (in addition to the people who have objections to HTTPS and deliberately stick to HTTP).

The aggregate size and value of the long tail is part of why browsers have maintained painstaking compatibility back to old HTML so far, including things like HTML Image Maps. There's plenty of parts of the long tail that will never be updated to have HTTPS or work properly with it. For browsers to discard HTTP anyway would be to discard that part of the long tail, which would be a striking break with browser tradition. I don't think this is very likely and I certainly hope that it never comes to pass, because that long tail is part of what gives the web its value.

(It would be an especially striking break since a visible percentage of page loads still happen with HTTP instead of HTTPS. For example, Google's stats say that globally 5% of Windows Chrome page loads apparently still use HTTP. That's roughly one in twenty page loads, and the absolute number is going to be very large given how many page loads happen with Chrome on Windows. This large number is one reason I don't think this is at all a serious proposal; as usual with this sort of thing, it ignores that social problems are the ones that matter.)

PS: Of course, not all of the HTTP connections are part of the web's long tail as such. Some of them are to, for example, manage local devices via little built in web servers that simply don't have HTTPS. The people with these devices aren't in any rush to replace them just because some people don't like HTTP, and the vendors who made them aren't going to update their software to support (modern) HTTPS even for the devices which support firmware updates and where the vendor is still in business.

(You can view them as part of the long tail of 'the web' as a broad idea and interface, even though they're not exposed to the world the way that the (public) web is.)

More potential problems for people with older browsers

By: cks

I've written before that keeping your site accessible to very old browsers is non-trivial because of issues like them not necessarily supporting modern TLS. However, there's another problem that people with older browsers are likely to be facing, unless circumstances on the modern web change. I said on the Fediverse:

Today in unfortunate web browser developments: I think people using older versions of browsers, especially Chrome, are going to have increasing problems accessing websites. There are a lot of (bad) crawlers out there forging old Chrome versions, perhaps due to everyone accumulating AI training data, and I think websites are going to be less and less tolerant of them.

(Mine sure is currently, as an experiment.)

(By 'AI' I actually mean LLM.)

I covered some request volume information yesterday and it (and things I've seen today) strongly suggest that there is a lot of undercover scraping activity going on. Much of that scraping activity uses older browser User-Agents, often very old, which means that people who don't like it are probably increasingly going to put roadblocks in the way of anything presenting those old User-Agent values (there are already open source projects designed to frustrate LLM scraping and there will probably be more in the future).

(Apparently some LLM scrapers start out with honest User-Agents but then switch to faking them if you block their honest versions.)

There's no particular reason why scraping software can't use current User-Agent values, but it probably has to be updated every so often when new browser versions come out and people haven't done that so far. Much like email anti-spam efforts changing email spammer behavior, this may change if enough websites start reacting to old User-Agents, but I suspect that it will take a while for that to come to pass. Instead I expect it to be a smaller scale, distributed effort from 'unimportant' websites that are getting overwhelmed, like LWN (see the mention of this in their 'what we haven't added' section).

Major websites probably won't outright reject old browsers, but I suspect that they'll start throwing an increased amount of blocks in the way of 'suspicious' browser sessions with those User-Agents. This is probably likely to include CAPTCHAs and other such measures that they already use some of the time. CAPTCHAs aren't particularly effective at stopping bad actors in practice but they're the hammer that websites already have, so I'm sure they'll be used on this nail.

Another thing that I suspect will start happening is that more sites will start insisting that you run some JavaScript to pass a test in order to access them (whether this is an explicit CAPTCHA or just passive JavaScript that has to execute). This will stop LLM scrapers that don't run JavaScript, which is not all of them, and force the others to spend a certain amount of CPU and memory, driving up the aggregate cost of scraping your site dry. This will of course adversely affect people without JavaScript in their browser and those of us who choose to disable it for most sites, but that will be seen as the lesser evil by people who do this. As with anti-scraper efforts, there are already open source projects for this.

(This is especially likely to happen if LLM scrapers modernize their claimed User-Agent values to be exactly like current browser versions. People are going to find some defense.)

PS: I've belatedly made the Wandering Thoughts blocks for old browsers now redirect people to a page about the situation. I've also added a similar page for my current block of most HTTP/1.0 requests.

The HTTP status codes of responses from about 21 hours of traffic to here

By: cks

You may have heard that there are a lot of crawlers out there these days, many of them apparently harvesting training data for LLMs. Recently I've been getting more strict about access to this blog, so for my own interest I'm going to show statistics on what HTTP status codes all of the requests to here got in the past roughly 21 hours and a bit. I think this is about typical, although there may be more blocked things than usual.

I'll start with the overall numbers for all requests:

 22792 403      [45%]
  9207 304      [18.3%]
  9055 200      [17.9%]
  8641 429      [17.1%]
   518 301
    58 400
    33 404
     2 206
     1 302

HTTP 403 is the error code that people get on blocked access; I'm not sure what's producing the HTTP 400s. The two HTTP 206s were from LinkedIn's bot against a recent entry and completely puzzle me. Some of the blocked access is major web crawlers requesting things that they shouldn't (Bing is a special repeat offender here), but many of them are not. Between HTTP 403s and HTTP 429s, 62% or so of the requests overall were rejected and only 36% got a useful reply.
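The percentages quoted here can be reproduced from the table above. This is just the arithmetic written out as a short sketch; the counts are copied from the overall table and the helper name is made up for illustration.

```python
# Overall request counts by HTTP status code, copied from the table above.
counts = {403: 22792, 304: 9207, 200: 9055, 429: 8641,
          301: 518, 400: 58, 404: 33, 206: 2, 302: 1}

total = sum(counts.values())


def share(*codes: int) -> float:
    """Percentage of all requests that got one of the given status codes."""
    return 100.0 * sum(counts.get(c, 0) for c in codes) / total


rejected = share(403, 429)   # blocked or rate-limited
useful = share(200, 304)     # full replies plus successful conditional GETs

if __name__ == "__main__":
    print(f"total={total} rejected={rejected:.1f}% useful={useful:.1f}%")
```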

(With less thorough and active blocks, that would be a lot more traffic for Wandering Thoughts to handle.)

The picture for syndication feeds is rather different, as you might expect, but not quite as different as I'd like:

  9136 304    [39.5%]
  8641 429    [37.4%]
  3614 403    [15.6%]
  1663 200    [ 7.2%]
    19 301

Some of those rejections are for major web crawlers and almost a thousand are for a pair of prolific, repeat high volume request sources, but a lot of them aren't. Feed requests account for 23073 requests out of a total of 50307, or about 45% of the requests. To me this feels quite low for anything plausibly originated from humans; most of the time I expect feed requests to significantly outnumber actual people visiting.

(In terms of my syndication feed rate limiting, there were 19440 'real' syndication feed requests (84% of the total attempts), and out of them 44.4% were rate-limited. That's actually a lower level of rate limiting than I expected; possibly various feed fetchers have actually noticed it and reduced their attempt frequency. 46.9% made successful conditional GET requests (ones that got a HTTP 304 response) and 8.5% actually fetched feed data.)

DWiki, the wiki engine behind the blog, has a concept of alternate 'views' of pages. Syndication feeds are alternate views, but so are a bunch of other things. Excluding syndication feeds, the picture for requests of alternate views of pages is:

  5499 403
   510 200
    39 301
     3 304

The most blocked alternate views are:

  1589 ?writecomment
  1336 ?normal
  1309 ?source
   917 ?showcomments

(The most successfully requested view is '?showcomments', which isn't really a surprise to me; I expect search engines to look through that, for one.)

If I look only at plain requests, not requests for syndication feeds or alternate views, I see:

 13679 403   [64.5%]
  6882 200   [32.4%]
   460 301
    68 304
    58 400
    33 404
     2 206
     1 302

This means the breakdown of traffic is 21183 normal requests (42%), 45% feed requests, and the remainder for alternate views, almost all of which were rejected.

Out of the HTTP 403 rejections across all requests, the 'sources' break down something like this:

  7116 Forged Chrome/129.0.0.0 User-Agent
  1451 Bingbot
  1173 Forged Chrome/121.0.0.0 User-Agent
   930 PerplexityBot ('AI' LLM data crawler)
   915 Blocked sources using a 'Go-http-client/1.1' User-Agent

Those HTTP 403 rejections came from 12619 different IP addresses, in contrast to the successful requests (HTTP 2xx and 3xx codes), which came from 18783 different IP addresses. After looking into the ASN breakdown of those IPs, I've decided that I can't write anything about them with confidence, and it's possible that part of what is going on is that I have mis-firing blocking rules (alternately, I'm being hit from a big network of compromised machines being used as proxies, perhaps the same network that is the Chrome/129.0.0.0 source). However, some of the ASNs that show up highly are definitely ones I recognize from other contexts, such as attempted comment spam.

Update: Well that was a learning experience about actual browser User-Agents. Those 'Chrome/129.0.0.0' User-Agents may well not have been so forged (although people really should be running more current versions of Chrome). I apologize to the people using real current Chrome versions that were temporarily unable to read the blog because of my overly-aggressive blocks.

Web application design and the question of what is a "route"

By: cks

So what happened is that Leah Neukirchen ran a Fediverse poll on how many routes your most complex web app had, and I said that I wasn't going to try to count how many DWiki had and then gave an example of combining two things in a way that I felt was a 'route' (partly because 'I'm still optimizing the router' was one poll answer). This resulted in a discussion where one of the questions I draw from it is "what is a route, exactly".

At one level counting up routes in your web application seems simple. For instance, in our Django application I could count up the URL patterns listed in our 'urlpatterns' setting (which gives me a larger number than I expected for what I think of as a simple Django application). Pattern delegation may make this a bit tedious, but it's entirely tractable. However, I think that this only works for certain sorts of web applications that are designed in a particular way, and as it happens I have an excellent example of where the concept of "route" gets fuzzy.

DWiki, the engine behind this blog, is actually a general filesystem based wiki (engine). As a filesystem based wiki, what it started out doing was to map any URL path to a filesystem object and then render the filesystem object in some appropriate way; for example, directories turn into a listing of their contents. With some hand-waving you could say that this is one route, or two once we throw in an optional system for handling static assets. Alternately you could argue that this is two (or three) routes, one route for directories and one route for files, because the two are rendered differently (although that's actually implemented in templates, not in code, so maybe they're one route after all).

Later I added virtual directories, which are added to the end of directory paths and are used to restrict what things are visible within the directory (or directory tree). Both the URL paths involved and the actual matching against them look like normal routing (although they're not handled through a traditional router approach), so I should probably count them as "routes", adding four or so more routes, so you could say that DWiki has somewhere between five and seven routes (if you count files and directories separately and throw in a third route for static asset files).

However, I've left out a significant detail, which is visible in how both the blog's front page and the Atom syndication feed of the blog use the same path in their URLs, and the blog's front page looks nothing like a regular directory listing. What's going on is that how DWiki presents both files and especially directories depends on the view they're shown in, and DWiki has a bunch of views; all of the above differences are because of different views being used. Standard blog entry files can be presented in (if I'm counting right) five different views. Directories have a whole menagerie of views that they support, including a 'blog' view. Because views are alternate presentations of a given filesystem object and thus URL path, they're provided as a query parameter, not as part of the URL's path.

Are DWiki's views routes, and if they are, how do we count them? Is each unique combination of a page type (including virtual directories) and a view a new route? One thing that may affect your opinion of this is that a lot of the implementation of views is actually handled in DWiki's extremely baroque templates, not code. However, DWiki's code knows a full list of what views exist (and templates have to be provided or you'll get various failures).

(I've also left out a certain amount of complications, like redirections and invalid page names.)

The broad moral I draw from this exercise is that the model of distinct 'routes' is one that only works for certain sorts of web application design. When and where it works well, it's a quite useful model and I think it pushes you toward making good decisions about how to structure your URLs. But in any strong form, it's not a universal pattern and there are ways to go well outside it.

(Interested parties can see a somewhat out of date version of DWiki's code and many templates, although note that both contain horrors. At some point I'll probably update both to reflect my recent burst of hacking on DWiki.)

Web spiders (or people) can invent unfortunate URLs for your website

By: cks

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.

If you visit the blog front page or similar things, the obvious and discoverable range links you get are for ten entries. You can under some situations get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to create URLs by hand you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges, and the blog engine provided them. Based on some additional log information, it looks like it's not the first time that giant ranges have been requested. One of those IPs was an AWS IP, for which my default assumption is that this is a web spider of some sort. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so it's most likely being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because it puts a big load on DWiki, I've now hacked in the feature mentioned above to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)
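A minimal sketch of this sort of range restriction, not DWiki's actual code: the regular expression, the MAX_RANGE value, and the function name are all assumptions for illustration, and the size of a range is taken to be end minus start. Returning None stands in for 'answer with HTTP 404'.

```python
import re
from typing import Optional, Tuple

MAX_RANGE = 100  # hypothetical limit; the real one isn't stated

# Matches a trailing '/range/<start>-<end>' virtual directory.
_RANGE_RE = re.compile(r"/range/(\d+)-(\d+)/?$")


def parse_range(path: str, max_size: int = MAX_RANGE) -> Optional[Tuple[int, int]]:
    """Return (start, end) for an acceptable range URL, else None.

    None covers non-range URLs, backwards ranges, and ranges larger than
    max_size; a caller can turn None into a 404 response.
    """
    m = _RANGE_RE.search(path)
    if m is None:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    if end < start or (end - start) > max_size:
        return None
    return start, end


if __name__ == "__main__":
    for p in ("/blog/range/20-30", "/blog/range/1-1000"):
        print(p, "->", parse_range(p))
```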

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page' and in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See as Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog, instead of a directory, and the answer is a hack.

Some learning experiences with HTTP cookies in practice

By: cks

Suppose, not hypothetically, that you have a dynamic web site that makes minor use of HTTP cookies in a way that varies the output, and also this site has a caching layer. Naturally you need your caching layer to only serve 'standard' requests from cache, not requests that should get something non-standard. One obvious and simple approach is to skip your cache layer for any request that has a HTTP cookie. If you (I) do this, I have bad news about HTTP requests in practice, at least for syndication feed fetchers.

(One thing you might do with HTTP cookies is deliberately bypass your own cache, for example to ensure that someone who posts a new comment can immediately see their own comment, even if an older version of the page is in the cache.)

The thing about HTTP cookies is that the HTTP client can send you anything it likes as a HTTP cookie and unfortunately some clients will. For example, one feed reader fetcher deliberately attempts to bypass Varnish caches by sending a cookie with all fetch requests, so if the presence of any HTTP cookie causes you to skip your own cache (and other things you do that use the same logic), well, feeder.co is bypassing your caching layer too. Another thing that happens is that some syndication feed fetching clients appear to sometimes leak unrelated cookies into their HTTP requests.

(And of course if your software is hosted alongside other software that might set unrestricted cookies for the entire website, those cookies may leak into requests made to your software. For feed fetching specifically, this is probably most likely in feed readers that are browser addons.)

The other little gotcha is that you shouldn't rely on merely the presence or absence of a 'Cookie:' header in the request to tell you if the request has cookies, because a certain number of HTTP clients appear to send a blank Cookie: header (ie, just 'Cookie:'). You might be doing this directly in a CGI by checking for the presence of $HTTP_COOKIE, or you might be doing this indirectly by parsing any Cookie: header in the request into a 'Cookies' object of some sort (even if the value is blank), in which case you'll wind up with an empty Cookies object.

(You can also receive cookies with a blank value in a Cookie: header, eg 'JSESSIONID=', which appears to be a deliberate decision by the software involved, and seems to be to deal with a bad feed source.)

If you actually care about all of this, as I do now that I've discovered it all, you'll want to specifically check for the presence of your own cookies and ignore any other cookies you see, as well as a blank 'Cookie:' HTTP header. Doing extra special things if you see a 'bypass_varnish=1' cookie is up to you.
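A minimal sketch of that check using Python's standard http.cookies module. The cookie name 'dwiki_login' is hypothetical (the real cookie names aren't given), and treating a blank value for your own cookie as 'not present' is an assumption, not something the original specifies.

```python
from http.cookies import CookieError, SimpleCookie
from typing import Optional

# Hypothetical names of the cookies this site itself sets.
OWN_COOKIES = {"dwiki_login"}


def has_own_cookie(cookie_header: Optional[str]) -> bool:
    """True only if the Cookie: header carries one of our cookies with a value.

    A missing header, a blank 'Cookie:' header, and cookies set by other
    software (or invented by the client) all count as 'no cookies'.
    """
    if not cookie_header or not cookie_header.strip():
        return False
    jar = SimpleCookie()
    try:
        jar.load(cookie_header)
    except CookieError:
        return False  # garbage from the client; treat as no cookies
    return any(name in jar and jar[name].value for name in OWN_COOKIES)


if __name__ == "__main__":
    for hdr in (None, "", "bypass_varnish=1", "JSESSIONID=", "dwiki_login=abc; bypass_varnish=1"):
        print(repr(hdr), "->", has_own_cookie(hdr))
```

With this, the caching layer would be skipped only when has_own_cookie() is true, rather than whenever any cookie shows up.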

(In theory I knew that the HTTP Cookie: header was untrusted client data and shouldn't be trusted, and sometimes even contained bad garbage (which got noted every so often in my logs). In practice I didn't think about the implications of that for some of my own code until now.)

Syndication feeds here are now rate-limited on a per-IP basis

By: cks

For a long time I didn't look very much at the server traffic logs for Wandering Thoughts, including what was fetching my syndication feeds and how, partly because I knew that looking at web server logs invariably turns over a rock or two. In the past few months I started looking at my feed logs, and then I spent some time trying to get some high traffic sources to slow down on an ad-hoc basis, which didn't have much success (partly because browser feed reader addons seem bad at this). Today I finally gave in to temptation and added general per-IP rate limiting for feed requests. A single IP that requests a particular syndication feed too soon after its last successful request will receive a HTTP 429 response.

(The actual implementation is a hack, which is one reason I didn't do it before now; DWiki, the engine behind Wandering Thoughts, doesn't have an easy place for dynamically updated shared state.)

This rate-limiting will probably only moderately reduce the load on Wandering Thoughts, for various reasons, but it will make me happier. I'm also looking forward to having a better picture of what I consider 'actual traffic' to Wandering Thoughts, including actual User-Agent usage, without the distortions added by badly behaved browser addons (I'm pretty sure that my casual view of Firefox's popularity for visitors has been significantly distorted by syndication feed over-fetching).

In applying this rate limiting, I've deliberately decided not to exempt various feed reader providers like NewsBlur, Feedbin, Feedly, and so on. Hopefully all of these places will react properly to receiving periodic HTTP 429 responses and not, say, entirely give up fetching my feeds after a while because they're experiencing 'too many errors'. However, time will tell if this is correct (and if my HTTP 429 responses cause them to slow down their often quite frequent syndication feed requests).

In general I'm going to have to see how things develop, and that's a decent part of why I'm doing this at all. I'm genuinely curious how clients will change their behavior (if they do) and what will emerge, so I'm doing a little experiment (one that's nowhere near as serious and careful as rachelbythebay's ongoing work).

PS: The actual rate limiting applies a much higher minimum interval for unconditional HTTP syndication feed requests than for conditional ones, for the usual reason that I feel repeated unconditional requests for syndication feeds are rather antisocial, and if a feed fetcher is going to be antisocial I'm not going to talk to it very often.
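To make the shape of this concrete, here is a minimal in-memory Python sketch of per-IP, per-feed rate limiting with a longer minimum interval for unconditional requests. It is not how DWiki does it (that is a hack, as mentioned); the class name, the interval values, and the idea of returning a suggested retry delay are all assumptions for illustration.

```python
import math
import time


class FeedRateLimiter:
    """Track the last allowed request per (IP, feed) and refuse early repeats."""

    def __init__(self, conditional_min=600, unconditional_min=3600,
                 clock=time.monotonic):
        # Both intervals are made-up numbers, in seconds.
        self.conditional_min = conditional_min
        self.unconditional_min = unconditional_min
        self._clock = clock
        self._last_ok = {}  # (ip, feed) -> time of last allowed request

    def check(self, ip, feed, conditional):
        """Return (allowed, retry_after_seconds).

        When allowed is False the caller would answer with HTTP 429.
        Only allowed requests update the stored timestamp.
        """
        now = self._clock()
        minimum = self.conditional_min if conditional else self.unconditional_min
        key = (ip, feed)
        last = self._last_ok.get(key)
        if last is not None and now - last < minimum:
            return False, math.ceil(minimum - (now - last))
        self._last_ok[key] = now
        return True, 0
```

A real version would also need to expire old entries and share state between server processes, which is exactly the part that's awkward in a mostly stateless engine.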

More features for web page generation systems doing URL remapping

By: cks

A few years ago I wrote about how web page generation systems should support remapping external URLs (this includes systems that convert some form of wikitext to HTML). At the time I was mostly thinking about remapping single URLs and mentioned things like remapping prefixes (so you could remap an entire domain into web.archive.org) as something for a fancier version. Well, the world turns and things happen and I now think that such prefix remapping is essential; even if you don't start out with it, you're going to wind up with it in the longer term.

(To put it one way, the reality of modern life is that sometimes you no longer want to be associated with some places. And some day, my Fediverse presence may also move.)

In light of a couple of years of churn in my website landscape (after what was in hindsight a long period of stability), I now have revised views on the features I want in a (still theoretical) URL remapping system for Wandering Thoughts. The system I want should be able to remap individual URLs, entire prefixes, and perhaps regular expressions with full scale rewrites (or maybe some scheme with wildcard matching), although I don't currently have a use for full scale regular expression rewrites. As part of this, there needs to be some kind of priority or hierarchy between different remappings that can all potentially match the same URL, because there's definitely at least one case today where I want to remap 'asite/a/*' somewhere and all other 'asite/*' URLs to something else. While it's tempting to do something like 'most specific thing matches', working out what is most specific from a collection of different sorts of remapping rules seems a bit hard, so I'd probably just implement it as 'first match wins' and manage things by ordering matches in the configuration file.

('Most specific match wins' is a common feature in web application frameworks for various reasons, but I think it's harder to implement here, especially if I allow arbitrary regular expression matches.)
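A minimal Python sketch of the 'first match wins' approach, with exact, prefix, and regular expression rules kept in one ordered list. All of the domains and targets here are hypothetical, and the helper names are mine, not part of any existing system.

```python
import re


def exact(old, new):
    """Remap one specific URL."""
    return lambda url: new if url == old else None


def prefix(old, new):
    """Remap everything under a prefix, keeping the rest of the URL."""
    return lambda url: new + url[len(old):] if url.startswith(old) else None


def regex(pattern, replacement):
    """Remap via a regular expression rewrite."""
    compiled = re.compile(pattern)

    def rule(url):
        if compiled.search(url) is None:
            return None
        return compiled.sub(replacement, url, count=1)

    return rule


def remap(url, rules):
    """Apply the first rule that matches; unmatched URLs pass through."""
    for rule in rules:
        result = rule(url)
        if result is not None:
            return result
    return url


# Order matters: the narrower 'asite/a/*' rule must come before 'asite/*'.
RULES = [
    prefix("https://asite.example/a/", "https://newhome.example/a/"),
    prefix("https://asite.example/", "https://archive.example/asite/"),
    regex(r"^http://old\.example/(\d+)\.html$", r"https://new.example/posts/\1"),
]
```

Here remap("https://asite.example/a/page", RULES) goes to the new home while every other asite.example URL goes to the archive copy, purely because of the ordering.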

Obviously the remapping configuration file should support comments (every configuration system needs to). Less obviously, I'd support file inclusion or the now common pattern of a '<whatever>.d' directory for drop in files, so that remapping rules can be split up by things like the original domain rather than having to all be dumped into an ever-growing single configuration file.
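As a sketch of what this could look like, here is some Python that reads a main rules file plus a drop-in directory. The file format ('KIND OLD NEW' per line, with '#' comment lines) and the '*.conf' naming are assumptions invented for this example.

```python
from pathlib import Path

KINDS = ("exact", "prefix", "regex")


def parse_rules(text):
    """Parse 'KIND OLD NEW' lines, skipping blank lines and '#' comments."""
    rules = []
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3 or fields[0] not in KINDS:
            raise ValueError(f"line {lineno}: expected 'KIND OLD NEW'")
        rules.append(tuple(fields))
    return rules


def load_rules(main_file, dropin_dir):
    """Load the main file first, then drop-in files in sorted name order."""
    rules = []
    main = Path(main_file)
    if main.is_file():
        rules.extend(parse_rules(main.read_text(encoding="utf-8")))
    directory = Path(dropin_dir)
    if directory.is_dir():
        for path in sorted(directory.glob("*.conf")):
            rules.extend(parse_rules(path.read_text(encoding="utf-8")))
    return rules
```

Loading drop-in files in sorted order keeps rule ordering predictable, which matters if rule priority is decided by position.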

(Since more and more links rot as time passes, we can pretty much guarantee that the number of our remappings is going to keep growing.)

Along with the remapping, I may want something (ie, a tiny web application) that dynamically generates some form of 'we don't know where you can find this now but here is what the URL used to be' page for any URL I feed it. The obvious general reason for this is that sometimes old domain names get taken over by malicious parties and the old content is nowhere to be found, not even on web.archive.org. In that case you don't want to keep a link to what's now a malicious site, but you also don't have any other valid target for your old link. You could rewrite the link to some invalid domain name and leave it to the person visiting you and following the link to work out what happened, but it's better to be friendly.

(This is where you want to be careful about XSS and other hazards of operating what is basically an open 'put text in and we generate a HTML page with it shown in some way' service.)
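A minimal Python sketch of the page generation, assuming the simplest possible design: escape the old URL and show it as plain text, never as a live link. The function name and the page wording are made up.

```python
import html


def gone_page(old_url):
    """Build a tiny 'this link is gone' page for an untrusted URL string.

    The URL is HTML-escaped and shown inside <code>, deliberately not as
    an <a href>, so nothing the caller supplies becomes markup or a link.
    """
    shown = html.escape(old_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>Link no longer available</title></head>\n"
        "<body>\n"
        "<p>We don't know where you can find this now, "
        "but the URL used to be:</p>\n"
        f"<p><code>{shown}</code></p>\n"
        "</body></html>\n"
    )
```

Escaping handles the basic injection case; a real service would probably also want to limit URL length and only generate pages for URLs that actually appear in its remapping rules.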

The programmable web browser was and is inevitable

By: cks

In a comment on my entry on why the modern web is why web browsers can't have nice things, superkuh wrote in part:

In the past it was seen as crazy to open every executable file someone might send you over the internet (be it email, ftp, web, or whatever). But sometime in the 2010s it became not only acceptable, but standard practice to automatically run every executable sent to you by any random endpoint on the internet.

For 'every executable' you should read 'every piece of JavaScript', which is executable code that is run by your browser as a free and relatively unlimited service provided to every web page you visit. The dominant thing restraining the executables that web pages send you is the limited APIs that browsers provide, which is why they provide such limited APIs. This comment sparked a chain of thoughts that led to a thesis.

I believe that the programmable web browser was (and is) inevitable. I don't mean this just in the narrow sense that if it hadn't been JavaScript it would have been Flash or Java applets or Lua or WASM or some other relatively general purpose language that the browser wound up providing. Instead, I mean it in a broad and general sense, because 'programmability' of the browser is driven by a general and real problem.

For almost as long as the web has existed, people have wanted to create web pages that had relatively complex features and interactions. They had excellent reasons for this; they wanted drop-down or fold-out menus to save screen space so that they could maximize the amount of space given to important stuff instead of navigation, and they wanted to interactively validate form contents before submission for fast feedback to the people filling them in, and so on. At the same time, browser developers didn't want to (and couldn't) program every single specific complex feature that web page authors wanted, complete with bespoke HTML markup for it and so on. To enable as many of these complex features as possible with as little work on their part as possible, browser developers created primitives that could be assembled together to create more sophisticated features, interactions, layouts, and so on.

When you have a collection of primitives that people are expected to use to create their specific features, interactions, and so on, you have a programming language and a programming environment. It doesn't really matter if this programming language is entirely declarative (and isn't necessarily Turing complete), as in the case of CSS; people have to program the web browser to get what they want.

So my view is that we were always going to wind up with at least one programming language in our web browsers, because a programming language is the meeting point between what web page authors want to have and what browser developers want to provide. The only question was (and is) how good of a programming language (or languages) we were going to get. Or perhaps an additional question was whether the people designing the 'programming language' were going to realize that they were doing so, or if they were going to create one through an accretion of features.

(My view is that CSS absolutely is a programming language in this sense, in that you must design and 'program' it in order to achieve the effects you want, especially if you want sophisticated ones like drop down menus. Modern CSS has thankfully moved beyond the days when I called it an assembly language.)

(This elaborates on a Fediverse post.)

The modern web is why web browsers don't have "nice things" (platform APIs)

By: cks

Every so often I read something that says or suggests that the big combined browser and platform vendors (Google, Apple, and to a lesser extent Microsoft) have deliberately limited their browsers' access to platform APIs that would put "progressive web applications" on par with native applications. While I don't necessarily want to say that these vendors are without sin, in my view this vastly misses the core reason web browsers have limited and slow moving access to platform APIs. To put it simply, it's because of what the modern web has turned into, namely "a hive of scum and villainy" to sort of quote a famous movie.

Any API the browser exposes to web pages is guaranteed to be used by bad actors, and this has been true for a long time. Bad actors will use these APIs to track people, to (try to) compromise their systems, to spy on them, or basically for anything that can make money or gain information. Many years ago I said this was why native applications weren't doomed and basically nothing has changed since then. In particular, browsers are no better at designing APIs that can't be abused or blocking web pages that abuse these APIs, and they probably never will be.

(One of the problems is the usual one in security; there are a lot more attackers than there are browser developers designing APIs, and the attackers only have to find one oversight or vulnerability. In effect attackers are endlessly ingenious while browser API designers have finite time they can spend if they want to ship anything.)

The result of this is that announcements of new browser APIs are greeted not with joy but with dread, because in practice they will mostly be yet another privacy exposure and threat vector (Chrome will often ship these APIs anyway because in practice as demonstrated by their actions, Google mostly doesn't care). Certainly there are some web sites and in-browser applications that will use them well, but generally they'll be vastly outnumbered by attackers that are exploiting these APIs. Browser vendors (even Google with Chrome) are well aware of these issues, which is part of why they create and ship so few APIs and often don't give them very much power.

(Even native APIs are increasingly restricted, especially on mobile devices, because there are similar issues on those. Every operating system vendor is more and more conscious of security issues and the exposures that are created for malicious applications.)

You might be tempted to say that the answer is forcing web pages to ask for permission to use these APIs. This is a terrible idea for at least two reasons. The first reason is alert (or question) fatigue; at a certain point this becomes overwhelming and people stop paying attention. The second reason is that people generally want to use websites that they're visiting, and if faced with a choice between denying a permission and being unable to use the website or granting the permission and being able to use the website, they will take the second choice a lot of the time.

(We can see both issues in effect in mobile applications, which have similar permissions requests and create similar permissions fatigue. And mobile applications ask for permissions far less often than web pages would, because most people visit a lot more web pages than they install applications.)

Thinking about how to tame the interaction of conditional GET and caching

By: cks

Due to how I do caching here, Wandering Thoughts has a long standing weird HTTP behavioral quirk where a non-conditional GET for a syndication feed here can get a different answer than a conditional GET. One (technical) way to explain this issue is that the cache validity interval for non-conditional GETs is longer than the cache validity interval for conditional GETs. In theory this could be the complete explanation of the issue, but in practice there's another part to it, which is that DWiki doesn't automatically insert responses into the cache on a cache miss.

(The cache is normally only filled for responses that were slow to generate, either due to load or because they're expensive. Otherwise I would rather dynamically generate the latest version of something and not clutter up cache space.)
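The quirk is easier to see in a toy model. This Python sketch assumes a cache with two validity intervals and a 'only cache slow responses' rule; the class, the numbers, and the injected clock are all illustrative assumptions, not DWiki's actual code.

```python
import time


class FeedCache:
    """Toy cache: short validity for conditional GETs, long for unconditional."""

    def __init__(self, cond_ttl=60, uncond_ttl=3600, slow_threshold=2.0,
                 clock=time.monotonic):
        self.cond_ttl = cond_ttl
        self.uncond_ttl = uncond_ttl
        self.slow_threshold = slow_threshold
        self.clock = clock
        self._entries = {}  # key -> (stored_at, body)

    def get(self, key, conditional, generate):
        """Return (body, source) where source is 'cache' or 'generated'."""
        now = self.clock()
        ttl = self.cond_ttl if conditional else self.uncond_ttl
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < ttl:
            return entry[1], "cache"
        body = generate()
        finished = self.clock()
        # A cache miss only fills the cache if generation was slow.
        if finished - now >= self.slow_threshold:
            self._entries[key] = (finished, body)
        return body, "generated"
```

Once a slow request has filled the cache, a later conditional GET past the short interval gets a freshly generated feed (which isn't stored, because it was fast), while an unconditional GET at the same moment still gets the older cached copy.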

There are various paths that I could take, but which ones I want to take depends on what my goals are and I'm actually not entirely certain about that. If my goal is to serve responses to unconditional GETs that are as fresh as possible but come from cache for as long as possible, what I should probably do is make conditional GETs update the cache when the cached version of the feed exists and would still have been served to an unconditional GET. I've already paid the cost to dynamically generate the feed, so I might as well serve it to unconditional GET requests. However, in my current cache architecture this would have the side effect of causing conditional GETs to get that newly updated cached copy for the conditional GET cache validity period, instead of generating the very latest feed dynamically (what would happen today).

(A sleazy approach would be to backdate the newly updated cache entry by the conditional GET validity interval. My current code architecture doesn't allow for that, so I can avoid the temptation.)

On the other hand, the entire reason I have a different (and longer) cache validity interval for unconditional GET requests is that in some sense I want to punish them. It's a deliberate feature that unconditional GETs receive stale responses, and in some sense the more stale the response the better. Even though updating the cache with a current response I've already generated is in some sense free, doing it cuts against this goal, both in general and in specific. In practice, Wandering Thoughts sees frequent enough conditional GETs for syndication feeds that making conditional GETs refresh the cached feed would effectively collapse the two cache validity intervals into one, which I can already do without any code changes. So if this is my main goal for cache handling of unconditional GETs of my syndication feed, the current state is probably fine and there's nothing to fix.

(A very approximate number is that about 15% of the syndication feed requests to Wandering Thoughts are unconditional GETs. Some of the offenders should definitely know and do better, such as 'Slackbot 1.0'.)

Syndication feed fetchers and their behavior on HTTP 429 status responses

By: cks

For reasons outside of the scope of this entry, recently I've been looking at the behavior of syndication feed fetchers here on Wandering Thoughts (which are generally from syndication feed readers), and in the process I discovered some that were making repeated requests at a quite aggressive rate, such as every five minutes. Until recently there was some excuse for this, because I wasn't setting a 'Cache-Control: max-age=...' header (also), which is (theoretically) used to tell Atom feed fetchers how soon they should re-fetch. I feel there was not much of an excuse because no feed reader should default to fetching every five minutes, or even every fifteen, but after I set my max-age to an hour there definitely should be no excuse.

Since sometimes I get irritated with people like this, I arranged to start replying to such aggressive feed fetchers with a HTTP 429 "Too Many Requests" status response (the actual implementation is a hack because my entire software is more or less stateless, which makes true rate limiting hard). What I was hoping for is that most syndication feed fetching software would take this as a signal to slow down how often it tried to fetch the feed, and I'd see excessive sources move from one attempt every five minutes to (much) slower rates.

That basically didn't happen (perhaps this is no surprise). I'm sure there's good syndication feed fetching software that probably would behave that way on HTTP 429 responses, but whatever syndication feed software was poking me did not react that way. As far as I can tell from casually monitoring web access logs, almost no mis-behaving feed software paid any attention to the fact that it was specifically getting a response that normally means "you're doing this too fast". In some cases, it seems to have caused programs to try to fetch even more than before.

(Perhaps some of this is because I didn't add a 'Retry-After' header to my HTTP 429 responses until just now, but even without that, I'd expect clients to back off on their own, especially after they keep getting 429s when they retry.)
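For what it's worth, the client-side behavior I'd hope for is not complicated. Here is a minimal Python sketch of how a feed fetcher might pick its next polling delay; the function names, the doubling policy, and the one-day cap are assumptions of mine rather than anything a particular feed reader does.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return Retry-After as seconds (delta-seconds or HTTP-date), or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))


def next_delay(status, headers, current_delay, now=None, max_delay=86400):
    """Pick the delay (seconds) before the next fetch attempt.

    'headers' is assumed to be a plain dict with canonical header names.
    """
    if status != 429:
        return current_delay
    backoff = current_delay * 2  # back off on our own even without a hint
    hinted = parse_retry_after(headers.get("Retry-After"), now)
    if hinted is not None:
        backoff = max(backoff, hinted)
    return min(backoff, max_delay)
```

The important property is that a 429 never leaves the polling interval unchanged, whether or not the server supplied a Retry-After.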

Given the HTTP User-Agents presented by feed fetchers, some of this is more or less expected, for two reasons. First, some of the User-Agents are almost certainly deliberate lies, and if a feed crawler is going to actively lie about what it is there's no reason for it to respect HTTP 429s either. Second, some of the feed fetching is being done by stateless programs like curl, where the people building ad-hoc feed fetching systems around them would have to go (well) out of their way to do the right thing. However, a bunch of the aggressive feed fetching is being done by either real feed fetching software with a real user-agent (such as "RSS Bot" or the Universal Feed Parser) or by what look like browser addons running in basically current versions of Firefox. I'd expect both of these to respect HTTP 429s if they're programmed decently. But then, if they were programmed decently they probably wouldn't be trying every five minutes in the first place.

(Hopefully the ongoing feed reader behavior project by rachelbythebay will fix some of this in the long run; there are encouraging signs, as covered in eg the October 25th score report.)

Keeping your site accessible to old browsers is non-trivial

By: cks

One of the questions you could ask about whether or not to block HTTP/1.0 requests is what this does to old browsers and your site's accessibility to (or from) them (see eg the lobste.rs comments on my entry). The reason one might care about this is that old systems can usually only use old browsers, so to keep it possible to still use old systems you want to accommodate old browsers. Unfortunately the news there is not really great, and taking old browsers and old systems seriously has a lot of additional effects.

The first issue is that old systems generally can't handle modern TLS and don't recognize modern certificate authorities, like Let's Encrypt. This situation is only going to get worse over time, as websites increasingly require TLS 1.2 or better (and then in the future, TLS 1.3 or better). If you seriously care about keeping your site accessible to old browsers, you need to have a fully functional HTTP version. Increasingly, it seems that modern browsers won't like this, but so far they're willing to put up with it. I don't know if there's any good way to steer modern visitors to your HTTPS version instead of your HTTP version.

(This is one area where modern browsers preemptively trying HTTPS may help you.)

Next, old browsers obviously only support old versions of CSS, if they have very much CSS support at all (very old browsers probably won't). This can present a real conflict; you can have an increasingly basic site design that sticks within the bounds of what will render well on old browsers, or you can have one that looks good to what's probably the vast majority of your visitors and may or may not degrade gracefully on old browsers. Your CSS, if any, will probably also be harder to write, and it may be hard to test how well it actually works on old browsers. Some modern accessibility features, such as adjusting to screen sizes, may be (much) harder to get. If you want a multi-column layout or a sidebar, you're going to be back in the era of table based layouts (which this blog has never left, mostly because I'm lazy). And old browsers also mean old fonts, although with fonts it may be easier to degrade gracefully down to whatever default fonts the browser has.

(If you use images, there's the issue of image sizes and image formats. Old browsers are generally used on low resolution screens and aren't going to be the fastest or the best at scaling images down, if you can get them to do it at all. And you need to stick to image formats that they support.)

It's probably not impossible to do all of this, and you can test some of it by seeing how your site looks in text mode browsers like Lynx (which only really supports HTTP/1.0, as it turns out). But it's certainly constraining; you have to really care, and it will cut you off from some things that are important and useful.

PS: I'm assuming that if you intend to be as fully usable as possible by old browsers, you're not even going to try to have JavaScript on your site.

The question of whether to still allow HTTP/1.0 requests or block them

By: cks

Recently, I discovered something and noted it on the Fediverse:

There are still a small number of things making HTTP/1.0 requests to my techblog. Many of them claim to be 'Chrome/124.<something>'. You know, I don't think I believe you, and I'm not sure my techblog should still accept HTTP/1.0 requests if all or almost all of them are malicious and/or forged.

The pure, standards-compliant answer to this is that of course you should still allow HTTP/1.0 requests. It remains a valid standard, and apparently some things may still default to it, and one part of the web's strength is its backward compatibility.

The pragmatic answer starts with the observation that HTTP/1.1 is now 25 years old, and any software that is talking HTTPS to you is demonstrably able to deal with standards that are more recent than that (generally much more recent, as sites require TLS 1.2 or better). And as a practical matter, pure HTTP/1.0 clients can't talk to many websites because such websites are name-based virtual hosts where the web server software absolutely requires a HTTP Host header before it will serve the website to you. If you leave out the Host header, at best you will get some random default site, perhaps a stub site.

(In a HTTPS context, web servers will also require TLS SNI and some will give you errors if the HTTP Host doesn't match the TLS SNI or is missing entirely. These days this causes HTTP/0.9 requests to be not very useful.)

If HTTP/1.0 requests were merely somewhere between a partial lie (in that everything that worked was actually supplying a Host header too) and useless (for things that didn't supply a Host), you could simply leave them be, especially if the volume was low. But my examination suggests strongly that approximately everything that is making HTTP/1.0 requests to Wandering Thoughts is actually up to no good; at a minimum they're some form of badly coded stealth spiders, quite possibly from would-be comment spammers that are trawling for targets. On a spot check, this seems to be true of another web server as well.

(A lot of the IPs making HTTP/1.0 requests provide claimed User-Agent headers that include ' Not-A.Brand/99 ', which appears to have been a Chrome experiment in putting random stuff in the User-Agent header. I don't see that in modern real Chrome user-agent strings, so I believe it's been dropped or de-activated since then.)

My own answer is that for now at least, I've blocked HTTP/1.0 requests to Wandering Thoughts. I'm monitoring what User-Agents get blocked, partly so I can perhaps exempt some if I need to, and it's possible I'll rethink the block entirely.

(Before you do this, you should certainly look at your own logs. I wouldn't expect there to be very many real HTTP/1.0 clients still out there, but the web has surprised me before.)
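If you want a quick look at your own logs, something like this Python sketch will tally the claimed User-Agents of HTTP/1.0 requests. It assumes your web server writes the common 'combined' log format; if yours differs, the regular expression will need adjusting.

```python
import re
from collections import Counter

# Matches the request, status, size, referer, and user-agent portion of a
# 'combined' format access log line.
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) (?P<proto>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def http10_agents(lines):
    """Count User-Agent strings seen on HTTP/1.0 requests."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match and match.group("proto") == "HTTP/1.0":
            counts[match.group("agent")] += 1
    return counts
```

You could run it with something like http10_agents(open("access.log", encoding="utf-8", errors="replace")) and then look at counts.most_common(20).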

The importance of name-based virtual hosts (websites)

By: cks

I recently read Geoff Huston's The IPv6 Transition, which is actually about why that transition isn't happening. A large reason for that is that we've found ways to cope with the shortage of IPv4 addresses, and one of the things Huston points to here is the introduction of the TLS Server Name Indication (SNI) as drastically reducing the demand for IPv4 addresses for web servers. This is a nice story, but in actuality, TLS SNI was late to the party. The real hero (or villain) in taming what would otherwise have been a voracious demand for IPv4 addresses for websites is the HTTP Host header and the accompanying idea of name-based virtual hosts. TLS SNI only became important much later, when a mass movement to HTTPS hosts started to happen, partly due to various revelations about pervasive Internet surveillance.

In what is effectively the pre-history of the web, each website had to have its own IP(v4) address (an 'IP-based virtual host', or just your web server). If a single web server was going to support multiple websites, it needed a bunch of IP aliases, one per website. You can still do this today in web servers like Apache, but it has long since been superseded by name-based virtual hosts, which require the browser to send a Host: header with the other HTTP headers in the request. HTTP Host was officially added in HTTP/1.1, but I believe that back in the day basically everything accepted it even for HTTP/1.0 requests and various people patched it into otherwise HTTP/1.0 libraries and clients, possibly even before HTTP/1.1 was officially standardized.

(Since HTTP/1.1 dates from 1999 or so, all of this is ancient history by now.)
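The core of name-based virtual hosting is small enough to sketch in a few lines of Python. Everything here (the function names, the site table, the idea of a default stub site) is illustrative rather than how any particular web server implements it.

```python
def host_without_port(host):
    """Normalize a Host: header value to a lowercase name without the port."""
    host = host.strip().lower()
    if host.startswith("["):  # IPv6 literal such as [2001:db8::1]:8080
        end = host.find("]")
        return host[:end + 1] if end != -1 else host
    name, sep, port = host.rpartition(":")
    if sep and (port == "" or port.isdigit()):
        return name
    return host


def select_site(headers, sites, default_site):
    """Pick a website by Host header; with no usable Host, use the default.

    'headers' is assumed to be a plain dict with canonical header names.
    """
    host = host_without_port(headers.get("Host", ""))
    return sites.get(host, default_site)
```

A request with no Host header (a pure HTTP/1.0 client, say) lands on whatever the default site is, while every named site can share the same IP address.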

TLS SNI only came along much later. The Wikipedia timeline suggests the earliest you might have reasonably been able to use it was in 2009, and that would have required you to use a bleeding edge Apache; if you were using an Apache provided by your 'Long Term Support' Unix distribution, it would have taken years more. At the time that TLS SNI was introduced this was okay, because HTTPS (still) wasn't really seen as something that should be pervasive; instead, it was for occasional high-importance sites.

One result of this long delay for TLS SNI is that for years, you were forced to allocate extra IPv4 addresses and put extra IP aliases on your web servers in order to support multiple HTTPS websites, while you could support all of your plain-HTTP websites from a single IP. Naturally this served as a subtle extra disincentive to supporting HTTPS on what would otherwise be simple name-based virtual hosts; the only websites that it was really easy to support were ones that already had their own IPs (sometimes because they were on separate web servers, and sometimes for historical reasons if you'd been around long enough, as we had been).

(For years we had a mixed tangle of name-based and ip-based virtual hosts, and it was often difficult to recover the history of just why something was ip-based instead of name-based. We eventually managed to reform it down to only a few web servers and a few IP addresses, but it took a while. And even today we have a few virtual hosts that are deliberately ip-based for reasons.)

Syndication feed readers now seem to leave Last-Modified values alone

By: cks

A HTTP conditional GET is a way for web clients, such as syndication feed readers, to ask for a new copy of a URL only if the URL has changed since they last fetched it. This is obviously appealing for things, like syndication feed readers, that repeatedly poll URLs that mostly don't change, although syndication feed readers not infrequently get parts of this wrong. When a client makes a conditional GET, it can present an If-Modified-Since header, an If-None-Match header, or both. In theory, the client's If-None-Match value comes from the server's ETag, which is an opaque value, and the If-Modified-Since comes from the server's Last-Modified, which is officially a timestamp but which I maintain is hard to compare except literally.

I've long believed and said that many clients treat the If-Modified-Since header as a timestamp and so make up their own timestamp values; one historical example is Tiny Tiny RSS, and another is NextCloud-News. This belief led me to consider pragmatic handling of partial matches for HTTP conditional GET, and due to writing that entry, it also led me to actually instrument DWiki so I could see when syndication feed clients presented If-Modified-Since timestamps that were after my feed's Last-Modified. The result has surprised me. Out of the currently allowed feed fetchers, almost no syndication feed fetcher seems to present its own, later timestamp in requests, and on spot checks, most of them don't use too-old timestamps either.

(Even Tiny Tiny RSS may have changed its ways since I last looked at its behavior, although I'm keeping my special hack for it in place for now.)

Out of my reasonably well behaved, regular feed fetchers (other than Tiny Tiny RSS), only two uncommon ones regularly present timestamps after my Last-Modified value. And there are a lot of different User-Agents that managed to do a successful conditional GET of my syndication feed.

(There are, unfortunately, quite a lot of User-Agents that fetched my feed but didn't manage even a single successful conditional GET. But that's another matter, and some of them may have an extremely low polling interval. It would take me a lot more work to correlate this with which requests didn't even try any conditional GETs.)

This genuinely surprises me, and means I have to revise my belief that everyone mangles If-Modified-Since. Mostly they don't. As a corollary, parsing If-Modified-Since strings into timestamps and doing timestamp comparisons on them is probably not worth it, especially if Tiny Tiny RSS has genuinely changed.

(My preliminary data also suggests that almost no one has a different timestamp but a matching If-None-Match value, so my whole theory on pragmatic partial matches is irrelevant. As mentioned in an earlier entry, some feed readers get it wrong the other way around.)

PS: I believe that rachelbythebay's more systematic behavioral testing of feed readers has unearthed a variety of feed readers that have more varied If-Modified-Since behavior than I'm seeing; see eg this recent roundup. So actual results on your website may vary significantly depending on your readers and what they use.

Potential pragmatic handling of partial matches for HTTP conditional GET

By: cks

In HTTP, a conditional GET is a GET request that potentially can be replied with a HTTP '304 Not Modified' status; this is quite useful for polling relatively unchanging resources like syndication feeds (although syndication feed readers don't always do so well at it). Generally speaking, there are two potential validators for conditional GET requests; the If-None-Match header, validated against the ETag of the reply, and the If-Modified-Since header, validated against the Last-Modified of the reply. A HTTP client can remember and use either or both of your ETag and your Last-Modified values (assuming you provide both).

When a HTTP client sends both If-Modified-Since and If-None-Match, the fully correct, specifications compliant validation is to require both to match. This makes intuitive sense; both your ETag and your Last-Modified values are part of the state of what you're replying with, and if one doesn't match, the client has a different view of the URL's state than you do so you shouldn't claim it's 'not modified' from their state. Instead you should return the entire response so that they can update their view of your Last-Modified state.

In practice, two things potentially get in the way. First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified. Programs will put in what's probably some default time value, they'll use timestamps from internal events, and various other fun things. By contrast, your ETag value is opaque and has no meaning for programs to interpret, guess at, and make up; if a HTTP client sends an ETag, it's very likely to be one you provided (although this isn't certain). Second, it's not unusual for your ETag to be a much stronger validator than your Last-Modified; for example, your ETag may be a cryptographic hash of the contents and will definitely change if they do, while your Last-Modified is an imperfect approximation and may not change even if the content does.

In this situation, if a client presents an If-None-Match header that matches your current ETag and an If-Modified-Since that doesn't match your Last-Modified, it's extremely likely that they have your current content but have done one of the many things that make their 'timestamp' not match your Last-Modified. If you know you have a strong validator in your ETag and they're doing something like fetching your syndication feed (where it's very likely that they're going to do this a lot), it's pragmatically tempting to give them a HTTP 304 response even though you're technically not supposed to.
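The pragmatic variant could be sketched roughly as follows. This is an illustration and not the code that runs here; the function name, the `etag_is_strong` flag, and the dict of request headers are all hypothetical.

```python
def pragmatic_not_modified(request_headers, etag, last_modified,
                           etag_is_strong=True):
    """Decide on a 304, letting a strong ETag match override a mismatched
    If-Modified-Since. All names here are hypothetical."""
    inm = request_headers.get("If-None-Match")
    ims = request_headers.get("If-Modified-Since")
    if inm is not None:
        if inm.strip() != etag:
            return False  # the ETag differs, so the content really differs
        if etag_is_strong:
            return True  # trust the ETag and ignore whatever the IMS says
        # Weak ETag: fall back to also requiring a literal IMS match.
        return ims is None or ims.strip() == last_modified
    # No If-None-Match at all: only a literal If-Modified-Since match counts.
    return ims is not None and ims.strip() == last_modified
```

The design choice is that the ETag is never ignored; only a mismatching If-Modified-Since is forgiven, and only when the ETag is known to change whenever the content does.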

To reduce the temptation, you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp (if you can parse their value that way), and giving people a HTTP 304 response if their timestamp is equal to or after yours. This is what I'd do today given how people actually handle If-Modified-Since, and it would work around many of the bad things that people do with If-Modified-Since (since usually they'll create timestamps that are more recent than your Last-Modified, although not always).
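A sketch of that timestamp comparison, using the date parser from Python's standard library, might look like this. The helper names are hypothetical, and treating a value without usable timezone information as UTC is an assumption of this sketch.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse a date header value; return an aware datetime or None."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # not something we can read as a date
    if dt.tzinfo is None:
        # Assumption: no usable timezone information means UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def ims_not_modified(if_modified_since, last_modified):
    """True if the client's timestamp is equal to or after our Last-Modified."""
    if if_modified_since.strip() == last_modified:
        return True  # the client echoed our value back unchanged
    theirs = parse_http_date(if_modified_since)
    ours = parse_http_date(last_modified)
    if theirs is None or ours is None:
        return False  # unparseable: fall back to a full response
    return theirs >= ours
```

Anything that can't be parsed as a date is treated as a failed validation, so the worst case is an ordinary HTTP 200 response.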

Despite everything I've written above, I don't know if this happens all that often. It's entirely possible that syndication feed readers and other programs that invent things for their If-Modified-Since values are also not using If-None-Match and ETag values. I've recently added instrumentation to the software here so that I can tell, so maybe I'll have more to report soon.

(If I were an energetic person I would hunt through the data that rachelbythebay has accumulated in their feed reader behavioral testing project to see what it has to say about this. The most recent update is here; I don't know of an overall index, so see their archives. However, I'm not that energetic.)

Things syndication feed readers do with 'conditional GET'

By: cks

In HTTP, a conditional GET is a nice way of saving bandwidth (but not always work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often try fetches much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other things did conditional GETs to fetch feeds doesn't mean that they actually do it. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).
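For illustration, the client side of a well-behaved conditional GET can be sketched as two small functions: one that builds the request headers from whatever was stored last time, and one that updates the stored values after a response. The function names and the `cache` dict layout are hypothetical; the point is that both validators are stored as opaque strings and echoed back unchanged.

```python
def conditional_headers(cache):
    """Build conditional GET headers from previously stored validators.

    cache is a dict with optional 'etag' and 'last_modified' keys holding
    the exact strings the server sent (a layout assumed for this sketch).
    """
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        # Echo the server's own string; don't invent a timestamp.
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers


def update_cache(cache, status, response_headers):
    """After a fetch, remember the validators for next time."""
    if status == 200:
        # Always replace both values together on a full response.
        cache["etag"] = response_headers.get("ETag")
        cache["last_modified"] = response_headers.get("Last-Modified")
    # On a 304 the stored validators are left as they are.
    return cache
```

A feed reader built this way never sends a timestamp the server didn't hand out.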

For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.

  • Some or perhaps all versions of NextCloud-News send an If-Modified-Since header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is always going to fail validation and turn into a regular GET request, whether you compare If-Modified-Since values literally or consider them as a timestamp and do timestamp comparisons. NextCloud-News might as well not bother sending an If-Modified-Since header at all.

  • A number of feed readers appear to only update their stored ETag value for your feed if your Last-Modified value also changes. In practice there are a variety of things that can change the ETag without changing the Last-Modified value, and some of them regularly happen here on Wandering Thoughts, which causes these feed readers to effectively decay into doing unconditional GET requests the moment, for example, someone leaves a new comment.

  • One feed reader sends If-Modified-Since values that use a numeric time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is also not a reformatted version of a timestamp I've ever given out, and is after the current Last-Modified value at the time the request was made. This client reliably attempts to pull my feed three times a day, at 02:00, 08:00, and 20:00, and the times of the If-Modified-Since values for those fetches are reliably 00:00, 06:00, and 12:00 respectively.

    (I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)

  • Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were set after the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (ie, it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.
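To make the catalog a bit more concrete, here is a rough sketch of how a server could sort incoming If-Modified-Since values into these categories. The function name and category labels are hypothetical, and treating a value without usable timezone information as UTC is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def _parse(value):
    """Parse a date header value into an aware datetime, or None."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
    return dt


def classify_ims(if_modified_since, last_modified):
    """Label a client's If-Modified-Since relative to our Last-Modified."""
    if if_modified_since == last_modified:
        return "exact"  # echoed back as an opaque string
    theirs, ours = _parse(if_modified_since), _parse(last_modified)
    if theirs is None or ours is None:
        return "unparseable"
    if theirs == ours:
        return "reformatted"  # same instant, different spelling
    return "later" if theirs > ours else "earlier"
```

Given a made-up Last-Modified of 'Mon, 07 Oct 2024 09:30:00 GMT', the NextCloud-News value from 1800 is classified as "earlier" and the 'Mon, 07 Oct 2024 12:00:07 -0000' value as "later".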

All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which show a conditional GET succeeding). Probably there are more fun things lurking out there.

(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)
